[PATCH V7 5/9] drm/amdgpu: Update amdgpu_job_timedout to check if the ring is guilty
Zhang, Jesse(Jie)
Jesse.Zhang at amd.com
Fri Feb 21 02:25:41 UTC 2025
[Public]
-----Original Message-----
From: Deucher, Alexander <Alexander.Deucher at amd.com>
Sent: Thursday, February 20, 2025 10:22 PM
To: Zhang, Jesse(Jie) <Jesse.Zhang at amd.com>; amd-gfx at lists.freedesktop.org
Cc: Kuehling, Felix <Felix.Kuehling at amd.com>; Kim, Jonathan <Jonathan.Kim at amd.com>; Zhu, Jiadong <Jiadong.Zhu at amd.com>; Zhang, Jesse(Jie) <Jesse.Zhang at amd.com>
Subject: RE: [PATCH V7 5/9] drm/amdgpu: Update amdgpu_job_timedout to check if the ring is guilty
> -----Original Message-----
> From: Deucher, Alexander
> Sent: Wednesday, February 19, 2025 6:27 PM
> To: jesse.zhang at amd.com; amd-gfx at lists.freedesktop.org
> Cc: Kuehling, Felix <Felix.Kuehling at amd.com>; Kim, Jonathan
> <Jonathan.Kim at amd.com>; Zhu, Jiadong <Jiadong.Zhu at amd.com>; Zhang,
> Jesse(Jie) <Jesse.Zhang at amd.com>; Zhang, Jesse(Jie)
> <Jesse.Zhang at amd.com>
> Subject: RE: [PATCH V7 5/9] drm/amdgpu: Update amdgpu_job_timedout to
> check if the ring is guilty
>
> > -----Original Message-----
> > From: jesse.zhang at amd.com <jesse.zhang at amd.com>
> > Sent: Thursday, February 13, 2025 12:47 AM
> > To: amd-gfx at lists.freedesktop.org
> > Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Kuehling, Felix
> > <Felix.Kuehling at amd.com>; Kim, Jonathan <Jonathan.Kim at amd.com>; Zhu,
> > Jiadong <Jiadong.Zhu at amd.com>; Zhang, Jesse(Jie)
> > <Jesse.Zhang at amd.com>; Deucher, Alexander
> <Alexander.Deucher at amd.com>;
> > Zhang, Jesse(Jie) <Jesse.Zhang at amd.com>
> > Subject: [PATCH V7 5/9] drm/amdgpu: Update amdgpu_job_timedout to
> > check if the ring is guilty
> >
> > From: "Jesse.zhang at amd.com" <Jesse.zhang at amd.com>
> >
> > This patch updates the `amdgpu_job_timedout` function to check if
> > the ring is actually guilty of causing the timeout. If not, it skips
> > error handling and fence completion.
> >
>
> Thinking about this more, I'm not sure this is the right approach.
> If we detect a timeout on a kernel ring, we still want to do the
> reset, but we don't want to kill the job if it's not guilty. This
> approach makes sense if all rings are kernel rings, since we'll
> eventually get the timeout on the other ring and the reset will
> eventually be triggered. But if the hang is on a KFD queue, it won't
> get noticed until we attempt to preempt the user queues for some
> other reason, which may take a while. How about the following
> instead: move the is_guilty check down into the queue reset path.
> Something like this:
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 100f044759435..48350d1030612 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -130,8 +130,6 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>  		amdgpu_vm_put_task_info(ti);
>  	}
>
> -	dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
> -
>  	/* attempt a per ring reset */
>  	if (amdgpu_gpu_recovery &&
>  	    ring->funcs->reset) {
> @@ -146,13 +144,22 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>  			if (amdgpu_ring_sched_ready(ring))
>  				drm_sched_stop(&ring->sched, s_job);
>  			atomic_inc(&ring->adev->gpu_reset_counter);
> -			amdgpu_fence_driver_force_completion(ring);
> +			if (ring->funcs->is_guilty) {
> +				if (ring->funcs->is_guilty(ring)) {
> +					dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
> +					amdgpu_fence_driver_force_completion(ring);
> +				}
> +			} else {
> +				amdgpu_fence_driver_force_completion(ring);
> +				dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
> +			}
>  			if (amdgpu_ring_sched_ready(ring))
>  				drm_sched_start(&ring->sched, 0);
>  			goto exit;
>  		}
>  		dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name);
>  	}
> +	dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
>
>  	if (amdgpu_device_should_recover_gpu(ring->adev)) {
>  		struct amdgpu_reset_context reset_context;
>
Actually, something like the attached patch; we need to call is_guilty before the reset.
Thanks Alex, I will update this before pushing to drm-next.
Jesse
Alex
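The split Alex is proposing can be sketched in plain C with stubbed types. This is only an illustration of the decision logic, not the real driver code: `struct ring`, `struct job`, and `handle_ring_reset` are hypothetical stand-ins for `amdgpu_ring`, `drm_sched_job`, and the post-reset accounting in `amdgpu_job_timedout`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ETIME 62

/* Hypothetical stand-in for amdgpu_ring; is_guilty may be NULL,
 * mirroring an unimplemented ring->funcs->is_guilty callback. */
struct ring {
	bool (*is_guilty)(struct ring *r);
	bool hung;		/* stub: does this ring hold the hang? */
};

/* Hypothetical stand-in for the job's fence state. */
struct job {
	int fence_error;	/* models dma_fence_set_error(..., -ETIME) */
	bool force_completed;	/* models amdgpu_fence_driver_force_completion() */
};

static bool ring_is_guilty(struct ring *r)
{
	return r->hung;
}

/*
 * Post-reset accounting per the proposal: the per-ring reset always
 * runs, but the job is only marked -ETIME (and its fences
 * force-completed) when the ring reports itself guilty, or when it
 * cannot tell because no is_guilty callback exists.
 */
static void handle_ring_reset(struct ring *r, struct job *j)
{
	/* ... the per-ring reset itself has already run here ... */
	if (r->is_guilty) {
		if (r->is_guilty(r)) {
			j->fence_error = -ETIME;
			j->force_completed = true;
		}
		/* innocent ring: the job survives, no fence error */
	} else {
		/* no way to tell; preserve the old behavior */
		j->force_completed = true;
		j->fence_error = -ETIME;
	}
}
```

The point of the structure is that an innocent kernel ring still gets reset (keeping the KFD-queue hang from going unnoticed), while only a guilty or indeterminate ring has its job killed.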
>
>
> > Suggested-by: Alex Deucher <alexander.deucher at amd.com>
> > Signed-off-by: Jesse Zhang <jesse.zhang at amd.com>
> > ---
> > drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 10 ++++++++++
> > 1 file changed, 10 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > index 100f04475943..f94c876db72b 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > @@ -101,6 +101,16 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> >  		/* Effectively the job is aborted as the device is gone */
> >  		return DRM_GPU_SCHED_STAT_ENODEV;
> >  	}
> > +	/* Check if the ring is actually guilty of causing the timeout.
> > +	 * If not, skip error handling and fence completion.
> > +	 */
> > +	if (amdgpu_gpu_recovery && ring->funcs->is_guilty) {
> > +		if (!ring->funcs->is_guilty(ring)) {
> > +			dev_err(adev->dev, "ring %s timeout, but not guilty\n",
> > +				s_job->sched->name);
> > +			goto exit;
> > +		}
> > +	}
> >
> > /*
> > * Do the coredump immediately after a job timeout to get a very
> > --
> > 2.25.1