[PATCH v4] drm/xe/guc: Set wedged on Xe micro hardware initialization error

Dong, Zhanjun zhanjun.dong at intel.com
Fri Jun 13 16:30:56 UTC 2025


Please see my comments inline below.

Regards,
Zhanjun Dong

On 2025-06-13 10:32 a.m., Michal Wajdeczko wrote:
> Hi,
> 
> maybe title should start with "drm/xe/uc:" instead?
Sure>
> On 13.06.2025 00:09, Zhanjun Dong wrote:
> 
> the commit message should also say "why" not just "what", see [1]
> 
> [1] https://docs.kernel.org/process/submitting-patches.html#explanation-body
> 
>> Declare wedged on Xe micro controller hardware initialization
>> failed.
> 
> failure ?
> error ?
Will change to error>
>>
>> Signed-off-by: Zhanjun Dong <zhanjun.dong at intel.com>
>> Reviewed-by: Stuart Summers <stuart.summers at intel.com>
>>
> 
> no empty lines between tags!
> 
>> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4917
>>
>> ---
>> Cc: Michal Wajdeczko <michal.wajdeczko at intel.com>
>> Change list:
>> v4: Fix typo and add new line
>> v3: v2 CI re-run
>> v2: Remove unnecessary jump to err-out
>>      Drop disable ct, switch to set wedge
>> ---
>>   drivers/gpu/drm/xe/xe_uc.c | 21 ++++++++++++++-------
>>   1 file changed, 14 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_uc.c b/drivers/gpu/drm/xe/xe_uc.c
>> index 3a8751a8b92d..77175938fc40 100644
>> --- a/drivers/gpu/drm/xe/xe_uc.c
>> +++ b/drivers/gpu/drm/xe/xe_uc.c
>> @@ -157,19 +157,22 @@ static int vf_uc_init_hw(struct xe_uc *uc)
>>   
>>   	err = xe_guc_enable_communication(&uc->guc);
>>   	if (err)
>> -		return err;
>> +		goto err_out;
>>   
>>   	err = xe_gt_sriov_vf_connect(uc_to_gt(uc));
>>   	if (err)
>> -		return err;
>> +		goto err_out;
>>   
>>   	uc->guc.submission_state.enabled = true;
>>   
>>   	err = xe_gt_record_default_lrcs(uc_to_gt(uc));
>>   	if (err)
>> -		return err;
>> +		goto err_out;
>>   
>>   	return 0;
> 
> add separation line here please
> 
>> +err_out:
>> +	xe_device_declare_wedged(uc_to_xe(uc));
>> +	return err;
>>   }
>>   
>>   /*
>> @@ -197,19 +200,19 @@ int xe_uc_init_hw(struct xe_uc *uc)
>>   
>>   	ret = xe_guc_enable_communication(&uc->guc);
>>   	if (ret)
>> -		return ret;
>> +		goto err_out;
>>   
>>   	ret = xe_gt_record_default_lrcs(uc_to_gt(uc));
>>   	if (ret)
>> -		return ret;
>> +		goto err_out;
>>   
>>   	ret = xe_guc_post_load_init(&uc->guc);
>>   	if (ret)
>> -		return ret;
>> +		goto err_out;
>>   
>>   	ret = xe_guc_pc_start(&uc->guc.pc);
>>   	if (ret)
>> -		return ret;
>> +		goto err_out;
>>   
>>   	xe_guc_engine_activity_enable_stats(&uc->guc);
>>   
>> @@ -221,6 +224,10 @@ int xe_uc_init_hw(struct xe_uc *uc)
>>   	xe_gsc_load_start(&uc->gsc);
>>   
>>   	return 0;
>> +
>> +err_out:
>> +	xe_device_declare_wedged(uc_to_xe(uc));
> 
> hmm, but shouldn't this be really decided at the caller layer?
> 
> this is already done in this way by xe_gt.c:gt_reset() and it looks that
>   xe_gt.c:xe_gt_resume() or xe_device.c:xe_device_probe() are not
> handling that in the same way, so maybe fix should be there?
This function is on hardware initialization level, declare wedged on 
hardware initialization errors is reasonable.
I understand reasons do it at caller level, while if the reason is 
strong enough, then do it here might make it looks clean and straight 
forward.

> 
> and from the issue #4917 it looks that indeed the fix should be
> somewhere in xe_device_probe() or more specifically in the xe_gt_tlb
> code which missed to do proper unwind that would include explicit
> xe_gt_tlb_invalidation_reset() like done by the declare_wedged()
> 
> without that we could see that delayed TLB work hit UAF

Yes, this patch is not a fix for the issue.
For this issue, the solution should be split into 2 parts:
1. This patch, declare wedge on _hw_init error
2. Fix on xe_gt_tlb code.

For part 2, I already sent email to kernel-core team offline, I will 
forward that email to you guys, and might assign the issue (or create 
new issue?) to kernel-core team later to track part 2.

> 
> + Matt
> 
>> +	return ret;
>>   }
>>   
>>   int xe_uc_fini_hw(struct xe_uc *uc)
> 
> 



More information about the Intel-xe mailing list