[PATCH V7 5/9] drm/amdgpu: Update amdgpu_job_timedout to check if the ring is guilty

Wed Feb 19 23:26:45 UTC 2025

[Public]

> -----Original Message-----
> From: jesse.zhang at amd.com <jesse.zhang at amd.com>
> Sent: Thursday, February 13, 2025 12:47 AM
> To: amd-gfx at lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Kuehling, Felix
> <Felix.Kuehling at amd.com>; Kim, Jonathan <Jonathan.Kim at amd.com>; Zhu,
> Jiadong <Jiadong.Zhu at amd.com>; Zhang, Jesse(Jie) <Jesse.Zhang at amd.com>;
> Deucher, Alexander <Alexander.Deucher at amd.com>; Zhang, Jesse(Jie)
> <Jesse.Zhang at amd.com>
> Subject: [PATCH V7 5/9] drm/amdgpu: Update amdgpu_job_timedout to check if the
> ring is guilty
>
> From: "Jesse.zhang at amd.com" <Jesse.zhang at amd.com>
>
> This patch updates the `amdgpu_job_timedout` function to check if the ring is
> actually guilty of causing the timeout. If not, it skips error handling and fence
> completion.
>

Thinking about this more, I'm not sure if this is the right approach.  If we detect a timeout on a kernel ring, we still want to do the reset, but we don't want to kill the job if it's not guilty.  This approach makes sense if we have all kernel rings as we'll eventually get the timeout on the other ring and the reset will eventually get triggered.  But if the hang is on a KFD queue, it won't get noticed until we attempt to preempt the user queues for some other reason which may take a while.  How about the following instead.  We move the is_guilty check down into the queue reset area.  Something like this:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 100f044759435..48350d1030612 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -130,8 +130,6 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
                amdgpu_vm_put_task_info(ti);
        }

-       dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
-
        /* attempt a per ring reset */
        if (amdgpu_gpu_recovery &&
            ring->funcs->reset) {
@@ -146,13 +144,22 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
                        if (amdgpu_ring_sched_ready(ring))
                                drm_sched_stop(&ring->sched, s_job);
                        atomic_inc(&ring->adev->gpu_reset_counter);
-                       amdgpu_fence_driver_force_completion(ring);
+                       if (ring->funcs->is_guilty) {
+                               if (ring->funcs->is_guilty(ring)) {
+                                       dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
+                                       amdgpu_fence_driver_force_completion(ring);
+                               }
+                       } else {
+                               amdgpu_fence_driver_force_completion(ring);
+                               dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
+                       }
                        if (amdgpu_ring_sched_ready(ring))
                                drm_sched_start(&ring->sched, 0);
                        goto exit;
                }
                dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name);
        }
+       dma_fence_set_error(&s_job->s_fence->finished, -ETIME);

        if (amdgpu_device_should_recover_gpu(ring->adev)) {
                struct amdgpu_reset_context reset_context;



> Suggested-by: Alex Deucher <alexander.deucher at amd.com>
> Signed-off-by: Jesse Zhang <jesse.zhang at amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 100f04475943..f94c876db72b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -101,6 +101,16 @@ static enum drm_gpu_sched_stat
> amdgpu_job_timedout(struct drm_sched_job *s_job)
>               /* Effectively the job is aborted as the device is gone */
>               return DRM_GPU_SCHED_STAT_ENODEV;
>       }
> +     /* Check if the ring is actually guilty of causing the timeout.
> +     * If not, skip error handling and fence completion.
> +     */
> +     if (amdgpu_gpu_recovery && ring->funcs->is_guilty) {
> +             if (!ring->funcs->is_guilty(ring)) {
> +                     dev_err(adev->dev, "ring %s timeout, but not guilty\n",
> +                             s_job->sched->name);
> +                     goto exit;
> +             }
> +     }
>
>       /*
>        * Do the coredump immediately after a job timeout to get a very
> --
> 2.25.1