[PATCH v4] drm/xe/guc: Set wedged on Xe micro hardware initialization error

Michal Wajdeczko michal.wajdeczko at intel.com
Fri Jun 13 14:32:59 UTC 2025


Hi,

maybe title should start with "drm/xe/uc:" instead?

On 13.06.2025 00:09, Zhanjun Dong wrote:

the commit message should also say "why" not just "what", see [1]

[1] https://docs.kernel.org/process/submitting-patches.html#explanation-body

> Declare wedged on Xe micro controller hardware initialization
> failed.

failure ?
error ?

> 
> Signed-off-by: Zhanjun Dong <zhanjun.dong at intel.com>
> Reviewed-by: Stuart Summers <stuart.summers at intel.com>
> 

no empty lines between tags!

> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4917
> 
> ---
> Cc: Michal Wajdeczko <michal.wajdeczko at intel.com>
> Change list:
> v4: Fix typo and add new line
> v3: v2 CI re-run
> v2: Remove unnecessary jump to err-out
>     Drop disable ct, switch to set wedge
> ---
>  drivers/gpu/drm/xe/xe_uc.c | 21 ++++++++++++++-------
>  1 file changed, 14 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_uc.c b/drivers/gpu/drm/xe/xe_uc.c
> index 3a8751a8b92d..77175938fc40 100644
> --- a/drivers/gpu/drm/xe/xe_uc.c
> +++ b/drivers/gpu/drm/xe/xe_uc.c
> @@ -157,19 +157,22 @@ static int vf_uc_init_hw(struct xe_uc *uc)
>  
>  	err = xe_guc_enable_communication(&uc->guc);
>  	if (err)
> -		return err;
> +		goto err_out;
>  
>  	err = xe_gt_sriov_vf_connect(uc_to_gt(uc));
>  	if (err)
> -		return err;
> +		goto err_out;
>  
>  	uc->guc.submission_state.enabled = true;
>  
>  	err = xe_gt_record_default_lrcs(uc_to_gt(uc));
>  	if (err)
> -		return err;
> +		goto err_out;
>  
>  	return 0;

add separation line here please

> +err_out:
> +	xe_device_declare_wedged(uc_to_xe(uc));
> +	return err;
>  }
>  
>  /*
> @@ -197,19 +200,19 @@ int xe_uc_init_hw(struct xe_uc *uc)
>  
>  	ret = xe_guc_enable_communication(&uc->guc);
>  	if (ret)
> -		return ret;
> +		goto err_out;
>  
>  	ret = xe_gt_record_default_lrcs(uc_to_gt(uc));
>  	if (ret)
> -		return ret;
> +		goto err_out;
>  
>  	ret = xe_guc_post_load_init(&uc->guc);
>  	if (ret)
> -		return ret;
> +		goto err_out;
>  
>  	ret = xe_guc_pc_start(&uc->guc.pc);
>  	if (ret)
> -		return ret;
> +		goto err_out;
>  
>  	xe_guc_engine_activity_enable_stats(&uc->guc);
>  
> @@ -221,6 +224,10 @@ int xe_uc_init_hw(struct xe_uc *uc)
>  	xe_gsc_load_start(&uc->gsc);
>  
>  	return 0;
> +
> +err_out:
> +	xe_device_declare_wedged(uc_to_xe(uc));

hmm, but shouldn't this be really decided at the caller layer?

this is already done in this way by xe_gt.c:gt_reset() and it looks that
 xe_gt.c:xe_gt_resume() or xe_device.c:xe_device_probe() are not
handling that in the same way, so maybe fix should be there?

and from the issue #4917 it looks that indeed the fix should be
somewhere in xe_device_probe() or more specifically in the xe_gt_tlb
code which missed to do proper unwind that would include explicit
xe_gt_tlb_invalidation_reset() like done by the declare_wedged()

without that we could see that delayed TLB work hit UAF

+ Matt

> +	return ret;
>  }
>  
>  int xe_uc_fini_hw(struct xe_uc *uc)




More information about the Intel-xe mailing list