[PATCH] drm/i915: Set wedged if enable guc communication failed

Ceraolo Spurio, Daniele daniele.ceraolospurio at intel.com
Sat Mar 4 00:10:21 UTC 2023



On 3/2/2023 1:50 PM, Zhanjun Dong wrote:
> Add err code check for enable_communication on resume path. When resume failed, we can no longer use the GPU, marking the GPU as wedged.

This is slightly incorrect. If we fail to enable communication, the 
consequence is that we can't use the GuC. That translates to a failure 
to use the GT only if GuC submission is enabled; in execlists submission 
mode we can keep going, although we might end up losing HuC 
functionality (which is not considered a fatal error). Therefore, the 
code below should be updated to reflect this.

>
> Signed-off-by: Zhanjun Dong <zhanjun.dong at intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_gt_pm.c | 7 ++++++-
>   drivers/gpu/drm/i915/gt/uc/intel_uc.c | 9 +++++++--
>   2 files changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> index cef3d6f5c34e..42a7d75ce39e 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> @@ -401,8 +401,13 @@ int intel_gt_runtime_resume(struct intel_gt *gt)
>   	intel_ggtt_restore_fences(gt->ggtt);
>   
>   	ret = intel_uc_runtime_resume(&gt->uc);
> -	if (ret)
> +	if (ret) {
> +		/* Resume failed, we can no longer use the GPU, marking the GPU
> +		 * as wedged.
> +		 */
> +		intel_gt_set_wedged(gt);

intel_gt_set_wedged calls intel_runtime_pm_get, so it will deadlock if 
you call if from within the runtime_resume flow.

>   		return ret;
> +	}
>   
>   	return 0;
>   }
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc.c b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> index 6648691bd645..d4f428acf20a 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> @@ -698,8 +698,13 @@ static int __uc_resume(struct intel_uc *uc, bool enable_communication)
>   	/* Make sure we enable communication if and only if it's disabled */
>   	GEM_BUG_ON(enable_communication == intel_guc_ct_enabled(&guc->ct));
>   
> -	if (enable_communication)
> -		guc_enable_communication(guc);
> +	if (enable_communication) {
> +		err = guc_enable_communication(guc);
> +		if (err) {
> +			guc_dbg(guc, "Failed to resume, %pe", ERR_PTR(err));

nit: this isn't exactly a failure to resume because the GuC is running. 
It's more a failure to re-establish communication with the GuC.

Daniele

> +			return err;
> +		}
> +	}
>   
>   	/* If we are only resuming GuC communication but not reloading
>   	 * GuC, we need to ensure the ARAT timer interrupt is enabled



More information about the dri-devel mailing list