[PATCH v4] drm/xe/guc: Set wedged on Xe micro hardware initialization error
Dong, Zhanjun
zhanjun.dong at intel.com
Fri Jun 13 16:30:56 UTC 2025
Please see my comments inline below.
Regards,
Zhanjun Dong
On 2025-06-13 10:32 a.m., Michal Wajdeczko wrote:
> Hi,
>
> maybe title should start with "drm/xe/uc:" instead?
Sure>
> On 13.06.2025 00:09, Zhanjun Dong wrote:
>
> the commit message should also say "why" not just "what", see [1]
>
> [1] https://docs.kernel.org/process/submitting-patches.html#explanation-body
>
>> Declare wedged on Xe micro controller hardware initialization
>> failed.
>
> failure ?
> error ?
Will change to error>
>>
>> Signed-off-by: Zhanjun Dong <zhanjun.dong at intel.com>
>> Reviewed-by: Stuart Summers <stuart.summers at intel.com>
>>
>
> no empty lines between tags!
>
>> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4917
>>
>> ---
>> Cc: Michal Wajdeczko <michal.wajdeczko at intel.com>
>> Change list:
>> v4: Fix typo and add new line
>> v3: v2 CI re-run
>> v2: Remove unnecessary jump to err-out
>> Drop disable ct, switch to set wedge
>> ---
>> drivers/gpu/drm/xe/xe_uc.c | 21 ++++++++++++++-------
>> 1 file changed, 14 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_uc.c b/drivers/gpu/drm/xe/xe_uc.c
>> index 3a8751a8b92d..77175938fc40 100644
>> --- a/drivers/gpu/drm/xe/xe_uc.c
>> +++ b/drivers/gpu/drm/xe/xe_uc.c
>> @@ -157,19 +157,22 @@ static int vf_uc_init_hw(struct xe_uc *uc)
>>
>> err = xe_guc_enable_communication(&uc->guc);
>> if (err)
>> - return err;
>> + goto err_out;
>>
>> err = xe_gt_sriov_vf_connect(uc_to_gt(uc));
>> if (err)
>> - return err;
>> + goto err_out;
>>
>> uc->guc.submission_state.enabled = true;
>>
>> err = xe_gt_record_default_lrcs(uc_to_gt(uc));
>> if (err)
>> - return err;
>> + goto err_out;
>>
>> return 0;
>
> add separation line here please
>
>> +err_out:
>> + xe_device_declare_wedged(uc_to_xe(uc));
>> + return err;
>> }
>>
>> /*
>> @@ -197,19 +200,19 @@ int xe_uc_init_hw(struct xe_uc *uc)
>>
>> ret = xe_guc_enable_communication(&uc->guc);
>> if (ret)
>> - return ret;
>> + goto err_out;
>>
>> ret = xe_gt_record_default_lrcs(uc_to_gt(uc));
>> if (ret)
>> - return ret;
>> + goto err_out;
>>
>> ret = xe_guc_post_load_init(&uc->guc);
>> if (ret)
>> - return ret;
>> + goto err_out;
>>
>> ret = xe_guc_pc_start(&uc->guc.pc);
>> if (ret)
>> - return ret;
>> + goto err_out;
>>
>> xe_guc_engine_activity_enable_stats(&uc->guc);
>>
>> @@ -221,6 +224,10 @@ int xe_uc_init_hw(struct xe_uc *uc)
>> xe_gsc_load_start(&uc->gsc);
>>
>> return 0;
>> +
>> +err_out:
>> + xe_device_declare_wedged(uc_to_xe(uc));
>
> hmm, but shouldn't this be really decided at the caller layer?
>
> this is already done in this way by xe_gt.c:gt_reset() and it looks that
> xe_gt.c:xe_gt_resume() or xe_device.c:xe_device_probe() are not
> handling that in the same way, so maybe fix should be there?
This function is on hardware initialization level, declare wedged on
hardware initialization errors is reasonable.
I understand reasons do it at caller level, while if the reason is
strong enough, then do it here might make it looks clean and straight
forward.
>
> and from the issue #4917 it looks that indeed the fix should be
> somewhere in xe_device_probe() or more specifically in the xe_gt_tlb
> code which missed to do proper unwind that would include explicit
> xe_gt_tlb_invalidation_reset() like done by the declare_wedged()
>
> without that we could see that delayed TLB work hit UAF
Yes, this patch is not a fix for the issue.
For this issue, the solution should be split into 2 parts:
1. This patch, declare wedge on _hw_init error
2. Fix on xe_gt_tlb code.
For part 2, I already sent email to kernel-core team offline, I will
forward that email to you guys, and might assign the issue (or create
new issue?) to kernel-core team later to track part 2.
>
> + Matt
>
>> + return ret;
>> }
>>
>> int xe_uc_fini_hw(struct xe_uc *uc)
>
>
More information about the Intel-xe
mailing list