[PATCH V5 00/28] Reset improvements for GC10+

Alex Deucher alexdeucher at gmail.com
Sat May 31 17:18:12 UTC 2025


On Thu, May 29, 2025 at 4:54 PM Alex Deucher <alexdeucher at gmail.com> wrote:
>
> On Thu, May 29, 2025 at 4:08 PM Alex Deucher <alexander.deucher at amd.com> wrote:
> >
> > This set improves per queue reset support for GC10+.
> > When we reset the queue, the queue is lost so we need
> > to re-emit the unprocessed state from subsequent submissions.
> > To that end, in order to make sure we actually restore
> > unprocessed state, we need to enable legacy enforce isolation
> > so that we can safely re-emit the unprocessed state.  If
> > we don't, multiple jobs can run in parallel and we may not
> > end up resetting the correct one.  This is similar to how
> > Windows handles queues.  This also gives us correct guilty
> > tracking for GC.
> >
> > Tested on GC 10 and 11 chips with a game running and
> > then running hang tests.  The game pauses when the
> > hang happens, then continues after the queue reset.
> >
> > I tried this same approach on GC8 and 9, but it
> > was not as reliable as soft recovery.  As such, I've dropped
> > the KGQ reset code for pre-GC10.
> >
> > The same approach is extended to SDMA and VCN.
> > They don't need enforce isolation because those engines
> > are single threaded so they always operate serially.
> >
> > Rework re-emit to signal the seq number of the bad job and
> > check that fence to verify that the reset worked, then re-emit the
> > rest of the non-guilty state.  This way we are not waiting on
> > the rest of the state to complete, and if the subsequent state
> > also contains a bad job, we'll end up in queue reset again rather
> > than adapter reset.
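
The re-emit sequence described above can be sketched as a small
user-space toy model.  This is not amdgpu code: struct toy_ring,
toy_ring_run(), toy_ring_reset() and the reset_works flag are all
hypothetical names made up for illustration.  The sketch only assumes
what the cover letter states: after a queue reset, signal the seq of
the bad job, check that it landed, then re-emit the remaining jobs, and
take another queue reset if a later job is also bad.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_JOBS 8

enum toy_result { TOY_DONE, TOY_HUNG, TOY_NEED_ADAPTER_RESET };

/* Hypothetical job: a fence seq number plus a flag saying whether it hangs. */
struct toy_job {
	uint32_t seq;
	bool hangs;
};

/* Hypothetical ring: jobs in submission order plus simple hardware state. */
struct toy_ring {
	struct toy_job jobs[TOY_MAX_JOBS];
	size_t count;              /* number of submitted jobs */
	size_t next;               /* index of the first unprocessed job */
	uint32_t signaled_seq;     /* last fence seq that was signaled */
	bool alive;                /* false once the queue has hung */
	bool reset_works;          /* whether a queue reset revives the queue */
	unsigned int queue_resets; /* number of per-queue resets performed */
};

/* Process jobs serially until all are done or one of them hangs. */
static enum toy_result toy_ring_run(struct toy_ring *ring)
{
	while (ring->next < ring->count) {
		const struct toy_job *job = &ring->jobs[ring->next];

		if (!ring->alive)
			return TOY_HUNG;
		if (job->hangs) {
			ring->alive = false;
			return TOY_HUNG;
		}
		ring->signaled_seq = job->seq;
		ring->next++;
	}
	return TOY_DONE;
}

/*
 * Reset the queue after the job at ring->next hung:
 *  1. reset the queue,
 *  2. signal only the guilty job's seq and check that it landed,
 *  3. skip the guilty job and re-emit everything after it.
 * A second bad job makes this return TOY_HUNG again, so the caller can
 * do another queue reset instead of escalating to an adapter reset.
 */
static enum toy_result toy_ring_reset(struct toy_ring *ring)
{
	const struct toy_job *guilty;

	assert(ring->next < ring->count);
	guilty = &ring->jobs[ring->next];

	ring->queue_resets++;
	ring->alive = ring->reset_works;

	/* The fence write only lands if the queue came back. */
	if (ring->alive)
		ring->signaled_seq = guilty->seq;
	if (ring->signaled_seq != guilty->seq)
		return TOY_NEED_ADAPTER_RESET;

	ring->next++;
	return toy_ring_run(ring);
}
```

A caller would loop on toy_ring_reset() while toy_ring_run() or the
previous reset returns TOY_HUNG, and fall back to a full reset only on
TOY_NEED_ADAPTER_RESET.  Checking just the guilty seq before re-emitting
is what keeps the reset path from waiting on the remaining jobs.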
>
> git tree available here:
> https://gitlab.freedesktop.org/agd5f/linux/-/commits/kq_resets?ref_type=heads

I've pushed several fixes since I last sent this and will continue to
push updates.

Alex

>
> Alex
>
> >
> > v4: Drop explicit padding patches
> >     Drop new timeout macro
> >     Rework re-emit sequence
> > v5: Add a helper for reemit
> >     Convert VCN, JPEG, SDMA to use new helpers
> >
> > Alex Deucher (27):
> >   drm/amdgpu: enable legacy enforce isolation by default
> >   drm/amdgpu/gfx7: drop reset_kgq
> >   drm/amdgpu/gfx8: drop reset_kgq
> >   drm/amdgpu/gfx9: drop reset_kgq
> >   drm/amdgpu: move force completion into ring resets
> >   drm/amdgpu: track ring state associated with a job
> >   drm/amdgpu/gfx10: re-emit unprocessed state on ring reset
> >   drm/amdgpu/gfx11: re-emit unprocessed state on ring reset
> >   drm/amdgpu/gfx12: re-emit unprocessed state on ring reset
> >   drm/amdgpu/gfx9: re-emit unprocessed state on kcq reset
> >   drm/amdgpu/gfx9.4.3: re-emit unprocessed state on kcq reset
> >   drm/amdgpu/sdma5: re-emit unprocessed state on ring reset
> >   drm/amdgpu/sdma5.2: re-emit unprocessed state on ring reset
> >   drm/amdgpu/sdma6: re-emit unprocessed state on ring reset
> >   drm/amdgpu/sdma7: re-emit unprocessed state on ring reset
> >   drm/amdgpu/jpeg2: re-emit unprocessed state on ring reset
> >   drm/amdgpu/jpeg2.5: re-emit unprocessed state on ring reset
> >   drm/amdgpu/jpeg3: re-emit unprocessed state on ring reset
> >   drm/amdgpu/jpeg4: re-emit unprocessed state on ring reset
> >   drm/amdgpu/jpeg4.0.3: re-emit unprocessed state on ring reset
> >   drm/amdgpu/jpeg5.0.0: add queue reset
> >   drm/amdgpu/jpeg5: re-emit unprocessed state on ring reset
> >   drm/amdgpu/jpeg5.0.1: re-emit unprocessed state on ring reset
> >   drm/amdgpu/vcn4: re-emit unprocessed state on ring reset
> >   drm/amdgpu/vcn4.0.3: re-emit unprocessed state on ring reset
> >   drm/amdgpu/vcn4.0.5: re-emit unprocessed state on ring reset
> >   drm/amdgpu/vcn5: re-emit unprocessed state on ring reset
> >
> > Christian König (1):
> >   drm/amdgpu: rework queue reset scheduler interaction
> >
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 +-
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 12 ++++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c     |  6 ++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 32 +++++-----
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_job.h    |  2 +
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c   | 46 ++++++++++++++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  8 +++
> >  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 31 ++--------
> >  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c     | 21 +------
> >  drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c     | 21 +------
> >  drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c      | 71 ----------------------
> >  drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c      | 71 ----------------------
> >  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 51 +---------------
> >  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c    |  6 +-
> >  drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c     |  3 +-
> >  drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c     |  3 +-
> >  drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c     |  3 +-
> >  drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c     |  3 +-
> >  drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c   |  3 +-
> >  drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_0.c   | 12 ++++
> >  drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c   |  3 +-
> >  drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c   |  4 ++
> >  drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c     |  7 ++-
> >  drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c     |  7 ++-
> >  drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c     |  6 +-
> >  drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c     |  6 +-
> >  drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c      |  2 +-
> >  drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c    |  3 +-
> >  drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c    |  2 +-
> >  drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c    |  2 +-
> >  30 files changed, 162 insertions(+), 289 deletions(-)
> >
> > --
> > 2.49.0
> >

