[PATCH v8] drm/xe/uc: Disable GuC communication on hardware initialization error.
Dong, Zhanjun
zhanjun.dong at intel.com
Mon Jul 7 21:05:25 UTC 2025
On 2025-07-07 2:34 p.m., Cavitt, Jonathan wrote:
> -----Original Message-----
> From: Brost, Matthew <matthew.brost at intel.com>
> Sent: Monday, July 7, 2025 11:30 AM
> To: Wajdeczko, Michal <Michal.Wajdeczko at intel.com>
> Cc: Cavitt, Jonathan <jonathan.cavitt at intel.com>; Dong, Zhanjun <zhanjun.dong at intel.com>; intel-xe at lists.freedesktop.org; Summers, Stuart <stuart.summers at intel.com>
> Subject: Re: [PATCH v8] drm/xe/uc: Disable GuC communication on hardware initialization error.
>>
>> On Mon, Jul 07, 2025 at 07:36:29PM +0200, Michal Wajdeczko wrote:
>>>
>>>
>>> On 07.07.2025 15:56, Cavitt, Jonathan wrote:
>>>> -----Original Message-----
>>>> From: Dong, Zhanjun <zhanjun.dong at intel.com>
>>>> Sent: Thursday, July 3, 2025 2:39 PM
>>>> To: intel-xe at lists.freedesktop.org
>>>> Cc: Dong, Zhanjun <zhanjun.dong at intel.com>; Wajdeczko, Michal <Michal.Wajdeczko at intel.com>; Summers, Stuart <stuart.summers at intel.com>; Cavitt, Jonathan <jonathan.cavitt at intel.com>
>>>> Subject: [PATCH v8] drm/xe/uc: Disable GuC communication on hardware initialization error.
>>>>>
>>>>> Disable GuC communication on Xe micro controller hardware initialization
>>>>> error.
>>>>>
>>>>> Signed-off-by: Zhanjun Dong <zhanjun.dong at intel.com>
>>>>> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4917
>>>>>
>>>>> ---
>>>>> Cc: Michal Wajdeczko <michal.wajdeczko at intel.com>
>>>>> Cc: Stuart Summers <stuart.summers at intel.com>
>>>>> Cc: Jonathan Cavitt <jonathan.cavitt at intel.com>
>>>>>
>>>>> Change list:
>>>>> v8: Fix kernel-doc style
>>>>> Add error handling in vf_guc_min_load_for_hwconfig
>>>>> v7: Add kernel-doc for xe_guc_disable_communication
>>>>> Unset submission_state.enabled as well
>>>>> v6: Skip disable ct on xe_guc_enable_communication error
>>>>> v5: Set wedge is excessive action, revert back to disable ct
>>>>> v4: Fix typo and add new line
>>>>> v3: v2 CI re-run
>>>>> v2: Remove unnecessary jump to err-out
>>>>> Drop disable ct, switch to set wedge
>>>>> ---
>>>>> drivers/gpu/drm/xe/xe_guc.c | 20 ++++++++++++++++++--
>>>>> drivers/gpu/drm/xe/xe_guc.h | 1 +
>>>>> drivers/gpu/drm/xe/xe_uc.c | 19 ++++++++++++++-----
>>>>> 3 files changed, 33 insertions(+), 7 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
>>>>> index 8573957facae..6643a2cb898b 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_guc.c
>>>>> +++ b/drivers/gpu/drm/xe/xe_guc.c
>>>>> @@ -1219,13 +1219,17 @@ static int vf_guc_min_load_for_hwconfig(struct xe_guc *guc)
>>>>>
>>>>> ret = xe_gt_sriov_vf_connect(gt);
>>>>> if (ret)
>>>>> - return ret;
>>>>> + goto err_out;
>>>>>
>>>>> ret = xe_gt_sriov_vf_query_runtime(gt);
>>>>> if (ret)
>>>>> - return ret;
>>>>> + goto err_out;
>>>>>
>>>>> return 0;
>>>>> +
>>>>> +err_out:
>>>>> + xe_guc_disable_communication(guc);
>>>>> + return ret;
>>>>> }
>>>>>
>>>>> /**
>>>>> @@ -1337,6 +1341,18 @@ int xe_guc_enable_communication(struct xe_guc *guc)
>>>>> return 0;
>>>>> }
>>>>>
>>>>> +/**
>>>>> + * xe_guc_disable_communication() - Disable GuC communication
>>>>> + * @guc: The GuC object
>>>>> + *
>>>>> + * This function will disable the GuC communication.
>>>>> + */
>>>>> +void xe_guc_disable_communication(struct xe_guc *guc)
>>>>> +{
>>>>> + guc->submission_state.enabled = false;
>>>>> + xe_guc_ct_disable(&guc->ct);
>>>>
>>>> Didn't this used to just be a wrapper for xe_guc_ct_disable?
>>>>
>>>> ... Well, I suppose needing to set the submission state enabled to false
>>>> would also be a rather important step, so it's probably good that was added.
>>>
>>> hmm, while it might be a required step for shutdown, putting it here
>>> looks like a random choice, it's unbalanced what "enable_communication"
>>> did and also looks like a violation of the layering, as guc_submit code
>>> has its own files and we shouldn't touch internals from the outside
>>>
>>> + @Matt
>>>
>>
>> In general, the use of guc->submission_state.enabled in the existing
>> driver doesn't appear correct. It's accessed in multiple files without
>> any clear locking or layering rules. This falls into the category of "I
>> wrote this early in Xe without much thought or due to some ignorance,
>> and we never cleaned it up"—and there's quite a bit of that floating
>> around in Xe. So I agree with Michal—we shouldn't make this worse.
>>
>> Rather than calling xe_guc_disable_communication here, maybe call
>> xe_guc_sanitize instead? It should achieve what you're aiming for
>> without further complicating the handling of
>> guc->submission_state.enabled.
>
> It seems like the only difference between xe_guc_sanitize and
> xe_guc_disable_communication is that the former calls xe_uc_fw_sanitize, and
> the order of operations is swapped around a bit. So, yes, I think in hindsight
> we should be using xe_guc_sanitize instead.
>
> Doing so would also eliminate the need for the xe_guc_disable_communication
> helper function altogether, so once all uses of xe_guc_disable_communication
> have been replaced with xe_guc_sanitize, we can safely remove that helper
> function from this patch.
That looks good, I will post another revison to do so.
Regards,
Zhanjun Dong>
> -Jonathan Cavitt
>
>>
>> Matt
>>
>>>>
>>>> Everything else is LGTM, so
>>>> Reviewed-by: Jonathan Cavitt <jonathan.cavitt at intel.com>
>>>> -Jonathan Cavitt
>>>>
>>>>> +}
>>>>> +
>>>>> int xe_guc_suspend(struct xe_guc *guc)
>>>>> {
>>>>> struct xe_gt *gt = guc_to_gt(guc);
>>>>> diff --git a/drivers/gpu/drm/xe/xe_guc.h b/drivers/gpu/drm/xe/xe_guc.h
>>>>> index 22cf019a11bf..20823b821f7d 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_guc.h
>>>>> +++ b/drivers/gpu/drm/xe/xe_guc.h
>>>>> @@ -34,6 +34,7 @@ int xe_guc_reset(struct xe_guc *guc);
>>>>> int xe_guc_upload(struct xe_guc *guc);
>>>>> int xe_guc_min_load_for_hwconfig(struct xe_guc *guc);
>>>>> int xe_guc_enable_communication(struct xe_guc *guc);
>>>>> +void xe_guc_disable_communication(struct xe_guc *guc);
>>>>> int xe_guc_opt_in_features_enable(struct xe_guc *guc);
>>>>> int xe_guc_suspend(struct xe_guc *guc);
>>>>> void xe_guc_notify(struct xe_guc *guc);
>>>>> diff --git a/drivers/gpu/drm/xe/xe_uc.c b/drivers/gpu/drm/xe/xe_uc.c
>>>>> index 6431ba3a2c53..1012fe84b379 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_uc.c
>>>>> +++ b/drivers/gpu/drm/xe/xe_uc.c
>>>>> @@ -13,6 +13,7 @@
>>>>> #include "xe_gt_printk.h"
>>>>> #include "xe_gt_sriov_vf.h"
>>>>> #include "xe_guc.h"
>>>>> +#include "xe_guc_ct.h"
>>>>> #include "xe_guc_pc.h"
>>>>> #include "xe_guc_engine_activity.h"
>>>>> #include "xe_huc.h"
>>>>> @@ -158,7 +159,7 @@ static int vf_uc_load_hw(struct xe_uc *uc)
>>>>>
>>>>> err = xe_gt_sriov_vf_connect(uc_to_gt(uc));
>>>>> if (err)
>>>>> - return err;
>>>>> + goto err_out;
>>>>>
>>>>> uc->guc.submission_state.enabled = true;
>>>>>
>>>>> @@ -168,9 +169,13 @@ static int vf_uc_load_hw(struct xe_uc *uc)
>>>>>
>>>>> err = xe_gt_record_default_lrcs(uc_to_gt(uc));
>>>>> if (err)
>>>>> - return err;
>>>>> + goto err_out;
>>>>>
>>>>> return 0;
>>>>> +
>>>>> +err_out:
>>>>> + xe_guc_disable_communication(&uc->guc);
>>>>> + return err;
>>>>> }
>>>>>
>>>>> /*
>>>>> @@ -202,15 +207,15 @@ int xe_uc_load_hw(struct xe_uc *uc)
>>>>>
>>>>> ret = xe_gt_record_default_lrcs(uc_to_gt(uc));
>>>>> if (ret)
>>>>> - return ret;
>>>>> + goto err_out;
>>>>>
>>>>> ret = xe_guc_post_load_init(&uc->guc);
>>>>> if (ret)
>>>>> - return ret;
>>>>> + goto err_out;
>>>>>
>>>>> ret = xe_guc_pc_start(&uc->guc.pc);
>>>>> if (ret)
>>>>> - return ret;
>>>>> + goto err_out;
>>>>>
>>>>> xe_guc_engine_activity_enable_stats(&uc->guc);
>>>>>
>>>>> @@ -222,6 +227,10 @@ int xe_uc_load_hw(struct xe_uc *uc)
>>>>> xe_gsc_load_start(&uc->gsc);
>>>>>
>>>>> return 0;
>>>>> +
>>>>> +err_out:
>>>>> + xe_guc_disable_communication(&uc->guc);
>>>>> + return ret;
>>>>> }
>>>>>
>>>>> int xe_uc_reset_prepare(struct xe_uc *uc)
>>>>> --
>>>>> 2.34.1
>>>>>
>>>>>
>>>
>>
More information about the Intel-xe
mailing list