[PATCH] drm/xe: Fix early wedge on GuC load failure

Fri Jun 6 23:33:40 UTC 2025

On Thu, Jun 05, 2025 at 03:22:57PM -0700, Daniele Ceraolo Spurio wrote:
>When the GuC fails to load we declare the device wedged. However, the
>very first GuC load attempt on GT0 (from xe_gt_init_hwconfig) is done
>before the GT1 GuC objects are initialized, so things go bad when the
>wedge code attempts to cleanup GT1. To fix this, check the initialization
>status in the functions called during wedge.
>
>Fixes: 7dbe8af13c18 ("drm/xe: Wedge the entire device")
>Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>
>Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
>Cc: Matthew Brost <matthew.brost at intel.com>
>Cc: Jonathan Cavitt <jonathan.cavitt at intel.com>
>Cc: Lucas De Marchi <lucas.demarchi at intel.com>
>Cc: Zhanjun Dong <zhanjun.dong at intel.com>
>---
>
>Note that this patch depends on another fix in the same area:
>	1e1981b16bb1 ("drm/xe: Fix taking invalid lock on wedge")
>
>which however was merged without a fixes tag. Not sure if we need to
>propagate that one through fixes manually before this one, or if we can
>just fix it when this one fails to build in the fixes tree.

Since this will then need to go to stable, we can Cc stable in this
patch and then mark that other one as required, so both go back to 6.12.
At the time of merging/writing that patch I didn't think a wedge on
early init would deserve a backport since it doesn't/shouldn't happen
outside of our dev systems. Better safe than sorry though.

Reviewed-by: Lucas De Marchi <lucas.demarchi at intel.com>

Lucas De Marchi

>
> drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c | 8 ++++++++
> drivers/gpu/drm/xe/xe_guc_ct.c              | 7 +++++--
> drivers/gpu/drm/xe/xe_guc_ct.h              | 5 +++++
> drivers/gpu/drm/xe/xe_guc_submit.c          | 3 +++
> 4 files changed, 21 insertions(+), 2 deletions(-)
>
>diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
>index 084cbdeba8ea..e1362e608146 100644
>--- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
>+++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
>@@ -137,6 +137,14 @@ void xe_gt_tlb_invalidation_reset(struct xe_gt *gt)
> 	struct xe_gt_tlb_invalidation_fence *fence, *next;
> 	int pending_seqno;
>
>+	/*
>+	 * we can get here before the CTs are even initialized if we're wedging
>+	 * very early, in which case there are not going to be any pending
>+	 * fences so we can bail immediately.
>+	 */
>+	if (!xe_guc_ct_initialized(&gt->uc.guc.ct))
>+		return;
>+
> 	/*
> 	 * CT channel is already disabled at this point. No new TLB requests can
> 	 * appear.
>diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
>index 822f4c33f730..e303fec18174 100644
>--- a/drivers/gpu/drm/xe/xe_guc_ct.c
>+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
>@@ -517,6 +517,9 @@ void xe_guc_ct_disable(struct xe_guc_ct *ct)
>  */
> void xe_guc_ct_stop(struct xe_guc_ct *ct)
> {
>+	if (!xe_guc_ct_initialized(ct))
>+		return;
>+
> 	xe_guc_ct_set_state(ct, XE_GUC_CT_STATE_STOPPED);
> 	stop_g2h_handler(ct);
> }
>@@ -788,7 +791,7 @@ static int __guc_ct_send_locked(struct xe_guc_ct *ct, const u32 *action,
> 	u16 seqno;
> 	int ret;
>
>-	xe_gt_assert(gt, ct->state != XE_GUC_CT_STATE_NOT_INITIALIZED);
>+	xe_gt_assert(gt, xe_guc_ct_initialized(ct));
> 	xe_gt_assert(gt, !g2h_len || !g2h_fence);
> 	xe_gt_assert(gt, !num_g2h || !g2h_fence);
> 	xe_gt_assert(gt, !g2h_len || num_g2h);
>@@ -1424,7 +1427,7 @@ static int g2h_read(struct xe_guc_ct *ct, u32 *msg, bool fast_path)
> 	u32 action;
> 	u32 *hxg;
>
>-	xe_gt_assert(gt, ct->state != XE_GUC_CT_STATE_NOT_INITIALIZED);
>+	xe_gt_assert(gt, xe_guc_ct_initialized(ct));
> 	lockdep_assert_held(&ct->fast_lock);
>
> 	if (ct->state == XE_GUC_CT_STATE_DISABLED)
>diff --git a/drivers/gpu/drm/xe/xe_guc_ct.h b/drivers/gpu/drm/xe/xe_guc_ct.h
>index 5649bda82823..99c5dec446f2 100644
>--- a/drivers/gpu/drm/xe/xe_guc_ct.h
>+++ b/drivers/gpu/drm/xe/xe_guc_ct.h
>@@ -24,6 +24,11 @@ void xe_guc_ct_print(struct xe_guc_ct *ct, struct drm_printer *p, bool want_ctb)
>
> void xe_guc_ct_fixup_messages_with_ggtt(struct xe_guc_ct *ct, s64 ggtt_shift);
>
>+static inline bool xe_guc_ct_initialized(struct xe_guc_ct *ct)
>+{
>+	return ct->state != XE_GUC_CT_STATE_NOT_INITIALIZED;
>+}
>+
> static inline bool xe_guc_ct_enabled(struct xe_guc_ct *ct)
> {
> 	return ct->state == XE_GUC_CT_STATE_ENABLED;
>diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
>index 26acc3d91e68..408eeeb78a0b 100644
>--- a/drivers/gpu/drm/xe/xe_guc_submit.c
>+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
>@@ -1765,6 +1765,9 @@ int xe_guc_submit_reset_prepare(struct xe_guc *guc)
> {
> 	int ret;
>
>+	if (!guc->submission_state.initialized)
>+		return 0;
>+
> 	/*
> 	 * Using an atomic here rather than submission_state.lock as this
> 	 * function can be called while holding the CT lock (engine reset
>-- 
>2.43.0
>