[PATCH] drm/amdgpu: Use -ENODATA for GPU job timeout queue recovery
Christian König
christian.koenig at amd.com
Wed Jan 15 11:56:24 UTC 2025
On 15.01.25 at 07:52, Jesse.zhang at amd.com wrote:
> When a GPU job times out, the driver attempts to recover by restarting
> the scheduler. Previously, the scheduler was restarted with an error
> code of 0, which does not distinguish between a full GPU reset and a
> queue reset. This patch changes the error code to -ENODATA for queue
> resets, while -ECANCELED or -ETIME is used for full GPU resets.
>
> This change improves error handling by:
> 1. Clearly differentiating between queue resets and full GPU resets.
> 2. Providing more specific error codes for better debugging and recovery.
> 3. Aligning with kernel best practices for error reporting.
>
> The related commit "b2ef808786d93df3658" (drm/sched: add optional errno
> to drm_sched_start()) introduced support for passing an error code to
> drm_sched_start(), enabling this improvement.
I'm about to remove the scheduler stop/start for queue resets, which
would make this patch superfluous.
On the other hand, I'm not sure when I will be done with that work. So
it could be that this takes a while, and we should commit this in the
meantime.
Regards,
Christian.
>
> Signed-off-by: Vitaly Prosyak <vitaly.prosyak at amd.com>
> Signed-off-by: Jesse Zhang <jesse.zhang at amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 100f04475943..b18b316872a0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -148,7 +148,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>   			atomic_inc(&ring->adev->gpu_reset_counter);
>   			amdgpu_fence_driver_force_completion(ring);
>   			if (amdgpu_ring_sched_ready(ring))
> -				drm_sched_start(&ring->sched, 0);
> +				drm_sched_start(&ring->sched, -ENODATA);
>   			goto exit;
>   		}
>   		dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name);
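
For illustration only, the distinction drawn in the commit message could be consumed roughly as sketched below. This is a userspace sketch, not driver code: `reset_kind` and `classify_sched_error()` are hypothetical names that do not exist in amdgpu or the DRM scheduler, and the mapping is taken solely from the patch description (0 as the previous value, -ENODATA for queue resets, -ECANCELED or -ETIME for full GPU resets).

```c
#include <errno.h>

/* Hypothetical classification; these names are not part of the kernel. */
enum reset_kind {
	RESET_NONE,     /* 0: no error recorded (the value passed before the patch) */
	RESET_QUEUE,    /* -ENODATA: queue reset, per the patch description */
	RESET_FULL_GPU, /* -ECANCELED or -ETIME: full GPU reset */
	RESET_UNKNOWN   /* any other error code */
};

/* Map an errno-style value, as handed to drm_sched_start(), to a reset kind. */
static enum reset_kind classify_sched_error(int err)
{
	switch (err) {
	case 0:
		return RESET_NONE;
	case -ENODATA:
		return RESET_QUEUE;
	case -ECANCELED:
	case -ETIME:
		return RESET_FULL_GPU;
	default:
		return RESET_UNKNOWN;
	}
}
```

With an error code of 0, as before the patch, such a caller would have no way to tell that a queue reset had happened at all.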