[PATCH] drm/amdgpu: Use -ENODATA for GPU job timeout queue recovery
Zhang, Jesse(Jie)
Jesse.Zhang at amd.com
Thu Jan 16 09:20:15 UTC 2025
-----Original Message-----
From: Koenig, Christian <Christian.Koenig at amd.com>
Sent: Wednesday, January 15, 2025 7:56 PM
To: Zhang, Jesse(Jie) <Jesse.Zhang at amd.com>; amd-gfx at lists.freedesktop.org
Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Huang, Tim <Tim.Huang at amd.com>; Prosyak, Vitaly <Vitaly.Prosyak at amd.com>
Subject: Re: [PATCH] drm/amdgpu: Use -ENODATA for GPU job timeout queue recovery
On 15.01.25 at 07:52, Jesse.zhang at amd.com wrote:
> When a GPU job times out, the driver attempts to recover by restarting
> the scheduler. Previously, the scheduler was restarted with an error
> code of 0, which does not distinguish between a full GPU reset and a
> queue reset. This patch changes the error code to -ENODATA for queue
> resets, while -ECANCELED or -ETIME is used for full GPU resets.
>
> This change improves error handling by:
> 1. Clearly differentiating between queue resets and full GPU resets.
> 2. Providing more specific error codes for better debugging and recovery.
> 3. Aligning with kernel best practices for error reporting.
>
> The related commit "b2ef808786d93df3658" (drm/sched: add optional
> errno to drm_sched_start()) introduced support for passing an error
> code to drm_sched_start(), enabling this improvement.
I'm about to remove the scheduler stop/start for queue resets, which would make this patch superfluous.
On the other hand, I'm not sure when I will be done with that work. It could take a while, in which case we should commit this in the meantime.
Thanks Christian, I'll hold this patch until you finish that work.
Thanks
Jesse
Regards,
Christian.
>
> Signed-off-by: Vitaly Prosyak <vitaly.prosyak at amd.com>
> Signed-off-by: Jesse Zhang <jesse.zhang at amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 100f04475943..b18b316872a0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -148,7 +148,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>  			atomic_inc(&ring->adev->gpu_reset_counter);
>  			amdgpu_fence_driver_force_completion(ring);
>  			if (amdgpu_ring_sched_ready(ring))
> -				drm_sched_start(&ring->sched, 0);
> +				drm_sched_start(&ring->sched, -ENODATA);
>  			goto exit;
>  		}
>  		dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name);
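
For illustration only, here is a minimal userspace sketch of how code that sees the resulting error code might tell the two recovery paths apart, using the mapping given in the commit message (-ENODATA for a queue reset, -ECANCELED or -ETIME for a full GPU reset). The enum reset_kind and the helper classify_sched_errno() are hypothetical names invented for this example; they are not part of amdgpu or the DRM scheduler, and treating 0 as "no reset" is an assumption.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical classification; not an amdgpu or drm_sched type. */
enum reset_kind {
	RESET_NONE,	/* assumed: 0 means no error was recorded */
	RESET_QUEUE,	/* -ENODATA, per the commit message */
	RESET_FULL_GPU,	/* -ECANCELED or -ETIME, per the commit message */
	RESET_UNKNOWN,	/* any other negative errno */
};

/* Map a kernel-style error code (zero or negative) to a reset kind. */
static enum reset_kind classify_sched_errno(int err)
{
	/* Kernel-style error codes are never positive. */
	assert(err <= 0);

	switch (err) {
	case 0:
		return RESET_NONE;
	case -ENODATA:
		return RESET_QUEUE;
	case -ECANCELED:
	case -ETIME:
		return RESET_FULL_GPU;
	default:
		return RESET_UNKNOWN;
	}
}
```

With the old value of 0, the queue-reset case would fall into the same bucket as "no error" in a scheme like this, which is the ambiguity the patch description points out.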