[PATCH 2/8] drm/amdgpu: rework queue reset scheduler interaction

Tue May 20 16:38:56 UTC 2025

On Tue, May 20, 2025 at 9:49 AM Christian König
<christian.koenig at amd.com> wrote:
>
> On 5/20/25 15:09, Alex Deucher wrote:
> > On Mon, May 19, 2025 at 2:30 PM Alex Deucher <alexander.deucher at amd.com> wrote:
> >>
> >> From: Christian König <ckoenig.leichtzumerken at gmail.com>
> >>
> >> Stopping the scheduler for queue reset is generally a good idea because
> >> it prevents any worker from touching the ring buffer.
> >>
> >> But using amdgpu_fence_driver_force_completion() before restarting it was
> >> a really bad idea because it marked fences as failed while the work was
> >> potentially still running.
> >>
> >> Stop doing that and cleanup the comment a bit.
> >>
> >> Signed-off-by: Christian König <christian.koenig at amd.com>
> >> Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
> >> ---
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 27 ++++++++++++-------------
> >>  1 file changed, 13 insertions(+), 14 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >> index acb21fc8b3ce5..a0fab947143b5 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >> @@ -136,10 +136,12 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> >>         } else if (amdgpu_gpu_recovery && ring->funcs->reset) {
> >>                 bool is_guilty;
> >>
> >> -               dev_err(adev->dev, "Starting %s ring reset\n", s_job->sched->name);
> >> -               /* stop the scheduler, but don't mess with the
> >> -                * bad job yet because if ring reset fails
> >> -                * we'll fall back to full GPU reset.
> >> +               dev_err(adev->dev, "Starting %s ring reset\n",
> >> +                       s_job->sched->name);
> >> +
> >> +               /*
> >> +                * Stop the scheduler to prevent anybody else from touching the
> >> +                * ring buffer.
> >>                  */
> >>                 drm_sched_wqueue_stop(&ring->sched);
> >>
> >> @@ -157,19 +159,16 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> >>
> >>                 r = amdgpu_ring_reset(ring, job->vmid);
> >>                 if (!r) {
> >> -                       if (amdgpu_ring_sched_ready(ring))
> >> -                               drm_sched_stop(&ring->sched, s_job);
> >> -                       if (is_guilty) {
> >> +                       if (is_guilty)
> >>                                 atomic_inc(&ring->adev->gpu_reset_counter);
> >> -                               amdgpu_fence_driver_force_completion(ring);
> >
> > Do we still need this in the case of rings where we reset the entire
> > queue?  E.g., compute or VCN?  In which case we should move this to
> > the ring->reset callback.
>
> No, it shouldn't be necessary in the first place as long as the rings still execute their fence packages.
>
> And that should be the case at least for both graphics and compute.
>
> Forcing completion only makes sense when the whole ASIC was resetted and nothing executed any more.

This seems to result in a deadlock if you reset the entire queue
rather than just the vmid.   I.e., if you test with just this patch
and not any of the following patches.  In that case, the queue is
reset so none of the fences are signaled.

Alex

>
> Regards,
> Christian.
>
>
> >
> > Alex
> >
> >> -                       }
> >> -                       if (amdgpu_ring_sched_ready(ring))
> >> -                               drm_sched_start(&ring->sched, 0);
> >> -                       dev_err(adev->dev, "Ring %s reset succeeded\n", ring->sched.name);
> >> -                       drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
> >> +                       drm_sched_wqueue_start(&ring->sched);
> >> +                       dev_err(adev->dev, "Ring %s reset succeeded\n",
> >> +                               ring->sched.name);
> >> +                       drm_dev_wedged_event(adev_to_drm(adev),
> >> +                                            DRM_WEDGE_RECOVERY_NONE);
> >>                         goto exit;
> >>                 }
> >> -               dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name);
> >> +               dev_err(adev->dev, "Ring %s reset failed\n", ring->sched.name);
> >>         }
> >>         dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
> >>
> >> --
> >> 2.49.0
> >>
>