[PATCH] drm/amdgpu: Use -ENODATA for GPU job timeout queue recovery

Wed Jan 15 11:56:24 UTC 2025

On 15.01.25 at 07:52, Jesse.zhang at amd.com wrote:
> When a GPU job times out, the driver attempts to recover by restarting
> the scheduler. Previously, the scheduler was restarted with an error
> code of 0, which does not distinguish between a full GPU reset and a
> queue reset. This patch changes the error code to -ENODATA for queue
> resets, while -ECANCELED or -ETIME is used for full GPU resets.
>
> This change improves error handling by:
> 1. Clearly differentiating between queue resets and full GPU resets.
> 2. Providing more specific error codes for better debugging and recovery.
> 3. Aligning with kernel best practices for error reporting.
>
> The related commit "b2ef808786d93df3658" (drm/sched: add optional errno
> to drm_sched_start()) introduced support for passing an error code to
> drm_sched_start(), enabling this improvement.

I'm about to remove the scheduler stop/start for queue resets, which
would make this patch superfluous.

On the other hand, I'm not sure when I will be done with that work. It
could well take a while, in which case we should commit this in the
meantime.

Regards,
Christian.

>
> Signed-off-by: Vitaly Prosyak <vitaly.prosyak at amd.com>
> Signed-off-by: Jesse Zhang <jesse.zhang at amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 100f04475943..b18b316872a0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -148,7 +148,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>   			atomic_inc(&ring->adev->gpu_reset_counter);
>   			amdgpu_fence_driver_force_completion(ring);
>   			if (amdgpu_ring_sched_ready(ring))
> -				drm_sched_start(&ring->sched, 0);
> +				drm_sched_start(&ring->sched, -ENODATA);
>   			goto exit;
>   		}
>   		dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name);
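
The idea in the patch description can be illustrated with a minimal userspace sketch. It is not kernel code: every `mock_*` name below is hypothetical, and the real drm_sched_start() and its fence handling are not reproduced here. The sketch only assumes what the description states, namely that a queue reset passes -ENODATA to the restart call while a full GPU reset uses -ECANCELED or -ETIME, so that whoever inspects a completed fence can tell the two apart.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a job fence; NOT the kernel's struct dma_fence. */
struct mock_fence {
	int signaled;	/* non-zero once the job has completed */
	int error;	/* 0 on success, otherwise a negative errno */
};

enum mock_reset_kind {
	MOCK_NO_RESET,
	MOCK_QUEUE_RESET,
	MOCK_FULL_GPU_RESET,
	MOCK_UNKNOWN,
};

/* Choose the errno for the restart call, following the patch description. */
static int mock_reset_errno(enum mock_reset_kind kind)
{
	switch (kind) {
	case MOCK_QUEUE_RESET:
		return -ENODATA;
	case MOCK_FULL_GPU_RESET:
		return -ECANCELED;
	default:
		return 0;
	}
}

/*
 * Mock restart: complete every still-pending fence with the given error
 * and return how many fences were completed this way.
 */
static size_t mock_sched_start(struct mock_fence *pending, size_t n, int err)
{
	size_t completed = 0;

	assert(err <= 0);	/* errors are 0 or a negative errno */

	for (size_t i = 0; i < n; i++) {
		if (pending[i].signaled)
			continue;	/* already finished, keep its status */
		pending[i].error = err;
		pending[i].signaled = 1;
		completed++;
	}
	return completed;
}

/* Map a fence error back to the kind of reset that produced it. */
static enum mock_reset_kind mock_classify(int err)
{
	switch (err) {
	case 0:
		return MOCK_NO_RESET;
	case -ENODATA:
		return MOCK_QUEUE_RESET;
	case -ECANCELED:
	case -ETIME:
		return MOCK_FULL_GPU_RESET;
	default:
		return MOCK_UNKNOWN;
	}
}
```

With an error code of 0, as before the patch, mock_classify() could not separate a job that survived a queue reset from one that completed normally. That ambiguity is what the description says the change is meant to remove.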