[PATCH v8] drm/xe/uc: Disable GuC communication on hardware initialization error.

Matthew Brost matthew.brost at intel.com
Mon Jul 7 18:30:07 UTC 2025


On Mon, Jul 07, 2025 at 07:36:29PM +0200, Michal Wajdeczko wrote:
> 
> 
> On 07.07.2025 15:56, Cavitt, Jonathan wrote:
> > -----Original Message-----
> > From: Dong, Zhanjun <zhanjun.dong at intel.com> 
> > Sent: Thursday, July 3, 2025 2:39 PM
> > To: intel-xe at lists.freedesktop.org
> > Cc: Dong, Zhanjun <zhanjun.dong at intel.com>; Wajdeczko, Michal <Michal.Wajdeczko at intel.com>; Summers, Stuart <stuart.summers at intel.com>; Cavitt, Jonathan <jonathan.cavitt at intel.com>
> > Subject: [PATCH v8] drm/xe/uc: Disable GuC communication on hardware initialization error.
> >>
> >> Disable GuC communication on Xe micro controller hardware initialization
> >> error.
> >>
> >> Signed-off-by: Zhanjun Dong <zhanjun.dong at intel.com>
> >> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4917
> >>
> >> ---
> >> Cc: Michal Wajdeczko <michal.wajdeczko at intel.com>
> >> Cc: Stuart Summers <stuart.summers at intel.com>
> >> Cc: Jonathan Cavitt <jonathan.cavitt at intel.com>
> >>
> >> Change list:
> >> v8: Fix kernel-doc style
> >>     Add error handling in vf_guc_min_load_for_hwconfig
> >> v7: Add kernel-doc for xe_guc_disable_communication
> >>     Unset submission_state.enabled as well
> >> v6: Skip disable ct on xe_guc_enable_communication error
> >> v5: Set wedge is excessive action, revert back to disable ct
> >> v4: Fix typo and add new line
> >> v3: v2 CI re-run
> >> v2: Remove unnecessary jump to err-out
> >>     Drop disable ct, switch to set wedge
> >> ---
> >>  drivers/gpu/drm/xe/xe_guc.c | 20 ++++++++++++++++++--
> >>  drivers/gpu/drm/xe/xe_guc.h |  1 +
> >>  drivers/gpu/drm/xe/xe_uc.c  | 19 ++++++++++++++-----
> >>  3 files changed, 33 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
> >> index 8573957facae..6643a2cb898b 100644
> >> --- a/drivers/gpu/drm/xe/xe_guc.c
> >> +++ b/drivers/gpu/drm/xe/xe_guc.c
> >> @@ -1219,13 +1219,17 @@ static int vf_guc_min_load_for_hwconfig(struct xe_guc *guc)
> >>  
> >>  	ret = xe_gt_sriov_vf_connect(gt);
> >>  	if (ret)
> >> -		return ret;
> >> +		goto err_out;
> >>  
> >>  	ret = xe_gt_sriov_vf_query_runtime(gt);
> >>  	if (ret)
> >> -		return ret;
> >> +		goto err_out;
> >>  
> >>  	return 0;
> >> +
> >> +err_out:
> >> +	xe_guc_disable_communication(guc);
> >> +	return ret;
> >>  }
> >>  
> >>  /**
> >> @@ -1337,6 +1341,18 @@ int xe_guc_enable_communication(struct xe_guc *guc)
> >>  	return 0;
> >>  }
> >>  
> >> +/**
> >> + * xe_guc_disable_communication() - Disable GuC communication
> >> + * @guc: The GuC object
> >> + *
> >> + * This function will disable the GuC communication.
> >> + */
> >> +void xe_guc_disable_communication(struct xe_guc *guc)
> >> +{
> >> +	guc->submission_state.enabled = false;
> >> +	xe_guc_ct_disable(&guc->ct);
> > 
> > Didn't this used to just be a wrapper for xe_guc_ct_disable?
> > 
> > ... Well, I suppose needing to set the submission state enabled to false
> > would also be a rather important step, so it's probably good that was added.
> 
> hmm, while it might be a required step for shutdown, putting it here
> looks like a random choice, it's unbalanced what "enable_communication"
> did and also looks like a violation of the layering, as guc_submit code
> has its own files and we shouldn't touch internals from the outside
> 
> + @Matt
> 

In general, the use of guc->submission_state.enabled in the existing
driver doesn't appear correct. It's accessed in multiple files without
any clear locking or layering rules. This falls into the category of "I
wrote this early in Xe without much thought or due to some ignorance,
and we never cleaned it up"—and there's quite a bit of that floating
around in Xe. So I agree with Michal—we shouldn't make this worse.

Rather than calling xe_guc_disable_communication here, maybe call
xe_guc_sanitize instead? It should achieve what you're aiming for
without further complicating the handling of
guc->submission_state.enabled.

Matt

> > 
> > Everything else is LGTM, so
> > Reviewed-by: Jonathan Cavitt <jonathan.cavitt at intel.com>
> > -Jonathan Cavitt
> > 
> >> +}
> >> +
> >>  int xe_guc_suspend(struct xe_guc *guc)
> >>  {
> >>  	struct xe_gt *gt = guc_to_gt(guc);
> >> diff --git a/drivers/gpu/drm/xe/xe_guc.h b/drivers/gpu/drm/xe/xe_guc.h
> >> index 22cf019a11bf..20823b821f7d 100644
> >> --- a/drivers/gpu/drm/xe/xe_guc.h
> >> +++ b/drivers/gpu/drm/xe/xe_guc.h
> >> @@ -34,6 +34,7 @@ int xe_guc_reset(struct xe_guc *guc);
> >>  int xe_guc_upload(struct xe_guc *guc);
> >>  int xe_guc_min_load_for_hwconfig(struct xe_guc *guc);
> >>  int xe_guc_enable_communication(struct xe_guc *guc);
> >> +void xe_guc_disable_communication(struct xe_guc *guc);
> >>  int xe_guc_opt_in_features_enable(struct xe_guc *guc);
> >>  int xe_guc_suspend(struct xe_guc *guc);
> >>  void xe_guc_notify(struct xe_guc *guc);
> >> diff --git a/drivers/gpu/drm/xe/xe_uc.c b/drivers/gpu/drm/xe/xe_uc.c
> >> index 6431ba3a2c53..1012fe84b379 100644
> >> --- a/drivers/gpu/drm/xe/xe_uc.c
> >> +++ b/drivers/gpu/drm/xe/xe_uc.c
> >> @@ -13,6 +13,7 @@
> >>  #include "xe_gt_printk.h"
> >>  #include "xe_gt_sriov_vf.h"
> >>  #include "xe_guc.h"
> >> +#include "xe_guc_ct.h"
> >>  #include "xe_guc_pc.h"
> >>  #include "xe_guc_engine_activity.h"
> >>  #include "xe_huc.h"
> >> @@ -158,7 +159,7 @@ static int vf_uc_load_hw(struct xe_uc *uc)
> >>  
> >>  	err = xe_gt_sriov_vf_connect(uc_to_gt(uc));
> >>  	if (err)
> >> -		return err;
> >> +		goto err_out;
> >>  
> >>  	uc->guc.submission_state.enabled = true;
> >>  
> >> @@ -168,9 +169,13 @@ static int vf_uc_load_hw(struct xe_uc *uc)
> >>  
> >>  	err = xe_gt_record_default_lrcs(uc_to_gt(uc));
> >>  	if (err)
> >> -		return err;
> >> +		goto err_out;
> >>  
> >>  	return 0;
> >> +
> >> +err_out:
> >> +	xe_guc_disable_communication(&uc->guc);
> >> +	return err;
> >>  }
> >>  
> >>  /*
> >> @@ -202,15 +207,15 @@ int xe_uc_load_hw(struct xe_uc *uc)
> >>  
> >>  	ret = xe_gt_record_default_lrcs(uc_to_gt(uc));
> >>  	if (ret)
> >> -		return ret;
> >> +		goto err_out;
> >>  
> >>  	ret = xe_guc_post_load_init(&uc->guc);
> >>  	if (ret)
> >> -		return ret;
> >> +		goto err_out;
> >>  
> >>  	ret = xe_guc_pc_start(&uc->guc.pc);
> >>  	if (ret)
> >> -		return ret;
> >> +		goto err_out;
> >>  
> >>  	xe_guc_engine_activity_enable_stats(&uc->guc);
> >>  
> >> @@ -222,6 +227,10 @@ int xe_uc_load_hw(struct xe_uc *uc)
> >>  	xe_gsc_load_start(&uc->gsc);
> >>  
> >>  	return 0;
> >> +
> >> +err_out:
> >> +	xe_guc_disable_communication(&uc->guc);
> >> +	return ret;
> >>  }
> >>  
> >>  int xe_uc_reset_prepare(struct xe_uc *uc)
> >> -- 
> >> 2.34.1
> >>
> >>
> 


More information about the Intel-xe mailing list