[PATCH] drm/xe: Fix early wedge on GuC load failure

Fri Jun 6 14:15:49 UTC 2025

-----Original Message-----
From: Ceraolo Spurio, Daniele <daniele.ceraolospurio at intel.com> 
Sent: Thursday, June 5, 2025 3:23 PM
To: intel-xe at lists.freedesktop.org
Cc: Ceraolo Spurio, Daniele <daniele.ceraolospurio at intel.com>; Vivi, Rodrigo <rodrigo.vivi at intel.com>; Brost, Matthew <matthew.brost at intel.com>; Cavitt, Jonathan <jonathan.cavitt at intel.com>; De Marchi, Lucas <lucas.demarchi at intel.com>; Dong, Zhanjun <zhanjun.dong at intel.com>
Subject: [PATCH] drm/xe: Fix early wedge on GuC load failure
> 
> When the GuC fails to load we declare the device wedged. However, the
> very first GuC load attempt on GT0 (from xe_gt_init_hwconfig) is done
> before the GT1 GuC objects are initialized, so things go bad when the
> wedge code attempts to cleanup GT1. To fix this, check the initialization
> status in the functions called during wedge.
> 
> Fixes: 7dbe8af13c18 ("drm/xe: Wedge the entire device")
> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> Cc: Matthew Brost <matthew.brost at intel.com>
> Cc: Jonathan Cavitt <jonathan.cavitt at intel.com>
> Cc: Lucas De Marchi <lucas.demarchi at intel.com>
> Cc: Zhanjun Dong <zhanjun.dong at intel.com>
> ---
> 
> Note that this patch depends on another fix in the same area:
> 	1e1981b16bb1 ("drm/xe: Fix taking invalid lock on wedge")
> 
> which however was merged without a fixes tag. Not sure if we need to
> propagate that one through fixes manually before this one, or if we can
> just fix it when this one fails to build in the fixes tree.
> 
>  drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c | 8 ++++++++
>  drivers/gpu/drm/xe/xe_guc_ct.c              | 7 +++++--
>  drivers/gpu/drm/xe/xe_guc_ct.h              | 5 +++++
>  drivers/gpu/drm/xe/xe_guc_submit.c          | 3 +++
>  4 files changed, 21 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> index 084cbdeba8ea..e1362e608146 100644
> --- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> +++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> @@ -137,6 +137,14 @@ void xe_gt_tlb_invalidation_reset(struct xe_gt *gt)
>  	struct xe_gt_tlb_invalidation_fence *fence, *next;
>  	int pending_seqno;
>  
> +	/*
> +	 * we can get here before the CTs are even initialized if we're wedging
> +	 * very early, in which case there are not going to be any pending
> +	 * fences so we can bail immediately.
> +	 */
> +	if (!xe_guc_ct_initialized(&gt->uc.guc.ct))
> +		return;
> +
>  	/*
>  	 * CT channel is already disabled at this point. No new TLB requests can
>  	 * appear.
> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> index 822f4c33f730..e303fec18174 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> @@ -517,6 +517,9 @@ void xe_guc_ct_disable(struct xe_guc_ct *ct)
>   */
>  void xe_guc_ct_stop(struct xe_guc_ct *ct)
>  {
> +	if (!xe_guc_ct_initialized(ct))
> +		return;
> +
>  	xe_guc_ct_set_state(ct, XE_GUC_CT_STATE_STOPPED);
>  	stop_g2h_handler(ct);
>  }
> @@ -788,7 +791,7 @@ static int __guc_ct_send_locked(struct xe_guc_ct *ct, const u32 *action,
>  	u16 seqno;
>  	int ret;
>  
> -	xe_gt_assert(gt, ct->state != XE_GUC_CT_STATE_NOT_INITIALIZED);
> +	xe_gt_assert(gt, xe_guc_ct_initialized(ct));
>  	xe_gt_assert(gt, !g2h_len || !g2h_fence);
>  	xe_gt_assert(gt, !num_g2h || !g2h_fence);
>  	xe_gt_assert(gt, !g2h_len || num_g2h);
> @@ -1424,7 +1427,7 @@ static int g2h_read(struct xe_guc_ct *ct, u32 *msg, bool fast_path)
>  	u32 action;
>  	u32 *hxg;
>  
> -	xe_gt_assert(gt, ct->state != XE_GUC_CT_STATE_NOT_INITIALIZED);
> +	xe_gt_assert(gt, xe_guc_ct_initialized(ct));
>  	lockdep_assert_held(&ct->fast_lock);
>  
>  	if (ct->state == XE_GUC_CT_STATE_DISABLED)
> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.h b/drivers/gpu/drm/xe/xe_guc_ct.h
> index 5649bda82823..99c5dec446f2 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ct.h
> +++ b/drivers/gpu/drm/xe/xe_guc_ct.h
> @@ -24,6 +24,11 @@ void xe_guc_ct_print(struct xe_guc_ct *ct, struct drm_printer *p, bool want_ctb)
>  
>  void xe_guc_ct_fixup_messages_with_ggtt(struct xe_guc_ct *ct, s64 ggtt_shift);
>  
> +static inline bool xe_guc_ct_initialized(struct xe_guc_ct *ct)
> +{
> +	return ct->state != XE_GUC_CT_STATE_NOT_INITIALIZED;
> +}
> +
>  static inline bool xe_guc_ct_enabled(struct xe_guc_ct *ct)
>  {
>  	return ct->state == XE_GUC_CT_STATE_ENABLED;
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 26acc3d91e68..408eeeb78a0b 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -1765,6 +1765,9 @@ int xe_guc_submit_reset_prepare(struct xe_guc *guc)
>  {
>  	int ret;
>  
> +	if (!guc->submission_state.initialized)
> +		return 0;
> +

This change here seems like a bit of an outlier compared to the CT checks above, but
given the purpose of this patch is to prevent us from attempting to reset/cleanup
uninitialized GuC components on wedge, I'd say this is correct here.

Reviewed-by: Jonathan Cavitt <jonathan.cavitt at intel.com>
-Jonathan Cavitt

>  	/*
>  	 * Using an atomic here rather than submission_state.lock as this
>  	 * function can be called while holding the CT lock (engine reset
> -- 
> 2.43.0
> 
>