[PATCH] drm/xe: Fix early wedge on GuC load failure
Cavitt, Jonathan
jonathan.cavitt at intel.com
Fri Jun 6 14:15:49 UTC 2025
-----Original Message-----
From: Ceraolo Spurio, Daniele <daniele.ceraolospurio at intel.com>
Sent: Thursday, June 5, 2025 3:23 PM
To: intel-xe at lists.freedesktop.org
Cc: Ceraolo Spurio, Daniele <daniele.ceraolospurio at intel.com>; Vivi, Rodrigo <rodrigo.vivi at intel.com>; Brost, Matthew <matthew.brost at intel.com>; Cavitt, Jonathan <jonathan.cavitt at intel.com>; De Marchi, Lucas <lucas.demarchi at intel.com>; Dong, Zhanjun <zhanjun.dong at intel.com>
Subject: [PATCH] drm/xe: Fix early wedge on GuC load failure
>
> When the GuC fails to load we declare the device wedged. However, the
> very first GuC load attempt on GT0 (from xe_gt_init_hwconfig) is done
> before the GT1 GuC objects are initialized, so things go bad when the
> wedge code attempts to cleanup GT1. To fix this, check the initialization
> status in the functions called during wedge.
>
> Fixes: 7dbe8af13c18 ("drm/xe: Wedge the entire device")
> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> Cc: Matthew Brost <matthew.brost at intel.com>
> Cc: Jonathan Cavitt <jonathan.cavitt at intel.com>
> Cc: Lucas De Marchi <lucas.demarchi at intel.com>
> Cc: Zhanjun Dong <zhanjun.dong at intel.com>
> ---
>
> Note that this patch depends on another fix in the same area:
> 1e1981b16bb1 ("drm/xe: Fix taking invalid lock on wedge")
>
> which however was merged without a fixes tag. Not sure if we need to
> propagate that one through fixes manually before this one, or if we can
> just fix it when this one fails to build in the fixes tree.
>
> drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c | 8 ++++++++
> drivers/gpu/drm/xe/xe_guc_ct.c | 7 +++++--
> drivers/gpu/drm/xe/xe_guc_ct.h | 5 +++++
> drivers/gpu/drm/xe/xe_guc_submit.c | 3 +++
> 4 files changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> index 084cbdeba8ea..e1362e608146 100644
> --- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> +++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> @@ -137,6 +137,14 @@ void xe_gt_tlb_invalidation_reset(struct xe_gt *gt)
> struct xe_gt_tlb_invalidation_fence *fence, *next;
> int pending_seqno;
>
> + /*
> + * we can get here before the CTs are even initialized if we're wedging
> + * very early, in which case there are not going to be any pending
> + * fences so we can bail immediately.
> + */
> + if (!xe_guc_ct_initialized(>->uc.guc.ct))
> + return;
> +
> /*
> * CT channel is already disabled at this point. No new TLB requests can
> * appear.
> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> index 822f4c33f730..e303fec18174 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> @@ -517,6 +517,9 @@ void xe_guc_ct_disable(struct xe_guc_ct *ct)
> */
> void xe_guc_ct_stop(struct xe_guc_ct *ct)
> {
> + if (!xe_guc_ct_initialized(ct))
> + return;
> +
> xe_guc_ct_set_state(ct, XE_GUC_CT_STATE_STOPPED);
> stop_g2h_handler(ct);
> }
> @@ -788,7 +791,7 @@ static int __guc_ct_send_locked(struct xe_guc_ct *ct, const u32 *action,
> u16 seqno;
> int ret;
>
> - xe_gt_assert(gt, ct->state != XE_GUC_CT_STATE_NOT_INITIALIZED);
> + xe_gt_assert(gt, xe_guc_ct_initialized(ct));
> xe_gt_assert(gt, !g2h_len || !g2h_fence);
> xe_gt_assert(gt, !num_g2h || !g2h_fence);
> xe_gt_assert(gt, !g2h_len || num_g2h);
> @@ -1424,7 +1427,7 @@ static int g2h_read(struct xe_guc_ct *ct, u32 *msg, bool fast_path)
> u32 action;
> u32 *hxg;
>
> - xe_gt_assert(gt, ct->state != XE_GUC_CT_STATE_NOT_INITIALIZED);
> + xe_gt_assert(gt, xe_guc_ct_initialized(ct));
> lockdep_assert_held(&ct->fast_lock);
>
> if (ct->state == XE_GUC_CT_STATE_DISABLED)
> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.h b/drivers/gpu/drm/xe/xe_guc_ct.h
> index 5649bda82823..99c5dec446f2 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ct.h
> +++ b/drivers/gpu/drm/xe/xe_guc_ct.h
> @@ -24,6 +24,11 @@ void xe_guc_ct_print(struct xe_guc_ct *ct, struct drm_printer *p, bool want_ctb)
>
> void xe_guc_ct_fixup_messages_with_ggtt(struct xe_guc_ct *ct, s64 ggtt_shift);
>
> +static inline bool xe_guc_ct_initialized(struct xe_guc_ct *ct)
> +{
> + return ct->state != XE_GUC_CT_STATE_NOT_INITIALIZED;
> +}
> +
> static inline bool xe_guc_ct_enabled(struct xe_guc_ct *ct)
> {
> return ct->state == XE_GUC_CT_STATE_ENABLED;
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 26acc3d91e68..408eeeb78a0b 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -1765,6 +1765,9 @@ int xe_guc_submit_reset_prepare(struct xe_guc *guc)
> {
> int ret;
>
> + if (!guc->submission_state.initialized)
> + return 0;
> +
This change here seems like a bit of an outlier compared to the CT checks above, but
given the purpose of this patch is to prevent us from attempting to reset/cleanup
uninitialized GuC components on wedge, I'd say this is correct here.
Reviewed-by: Jonathan Cavitt <jonathan.cavitt at intel.com>
-Jonathan Cavitt
> /*
> * Using an atomic here rather than submission_state.lock as this
> * function can be called while holding the CT lock (engine reset
> --
> 2.43.0
>
>
More information about the Intel-xe
mailing list