[PATCH V5 00/28] Reset improvements for GC10+
Alex Deucher
alexdeucher at gmail.com
Sat May 31 17:18:12 UTC 2025
On Thu, May 29, 2025 at 4:54 PM Alex Deucher <alexdeucher at gmail.com> wrote:
>
> On Thu, May 29, 2025 at 4:08 PM Alex Deucher <alexander.deucher at amd.com> wrote:
> >
> > This set improves per queue reset support for GC10+.
> > When we reset the queue, the queue is lost so we need
> > to re-emit the unprocessed state from subsequent submissions.
> > To make sure we actually restore that state, we need to
> > enable legacy enforce isolation so that the re-emit is
> > safe. If we don't, multiple jobs can run in parallel and
> > we may not end up resetting the correct one. This is similar
> > to how Windows handles queues. This also gives us correct
> > guilty tracking for GC.
> >
> > Tested on GC 10 and 11 chips with a game running and
> > then running hang tests. The game pauses when the
> > hang happens, then continues after the queue reset.
> >
> > I tried this same approach on GC8 and 9, but it
> > was not as reliable as soft recovery. As such, I've dropped
> > the KGQ reset code for pre-GC10.
> >
> > The same approach is extended to SDMA and VCN.
> > They don't need enforce isolation because those engines
> > are single threaded so they always operate serially.
> >
> > Rework re-emit to signal the seq number of the bad job and
> > verify that to confirm the reset worked, then re-emit the
> > rest of the non-guilty state. This way we are not waiting on
> > the rest of the state to complete, and if the subsequent state
> > also contains a bad job, we'll end up in queue reset again rather
> > than adapter reset.
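For illustration only, the reworked re-emit sequence described above can be modeled roughly as follows. This is a toy userspace sketch, not the driver code: every name in it (sketch_job, sketch_ring, sketch_run, sketch_reset_queue, sketch_submit_all) is hypothetical, and the hardware is reduced to a per-job "hangs" flag and a "reset_works" flag on the ring.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model; none of these names come from amdgpu. */
struct sketch_job {
	uint32_t seq;	/* fence sequence number of this submission */
	bool hangs;	/* true if executing this job hangs the queue */
};

struct sketch_ring {
	struct sketch_job jobs[8];
	size_t count;		/* number of submitted jobs */
	bool reset_works;	/* assumption: whether a queue reset succeeds */
	uint32_t signaled_seq;	/* last fence seq that was signaled */
	unsigned int queue_resets; /* per-queue resets attempted so far */
};

/*
 * Execute jobs in order starting at index start. Returns the index of
 * the first job that hangs, or ring->count if everything completed.
 */
static size_t sketch_run(struct sketch_ring *ring, size_t start)
{
	for (size_t i = start; i < ring->count; i++) {
		if (ring->jobs[i].hangs)
			return i;
		ring->signaled_seq = ring->jobs[i].seq;
	}
	return ring->count;
}

/*
 * Reset the queue after the job at index guilty hung. Only the guilty
 * job's seq is signaled, and that seq is checked to confirm the reset
 * worked before anything else is re-emitted.
 */
static bool sketch_reset_queue(struct sketch_ring *ring, size_t guilty)
{
	assert(guilty < ring->count);
	ring->queue_resets++;
	if (ring->reset_works)
		ring->signaled_seq = ring->jobs[guilty].seq;
	return ring->signaled_seq == ring->jobs[guilty].seq;
}

/*
 * Returns true if all hangs were handled with per-queue resets, false
 * if a reset could not be verified (where a real driver would fall
 * back to a heavier reset).
 */
static bool sketch_submit_all(struct sketch_ring *ring)
{
	size_t next = 0;

	while ((next = sketch_run(ring, next)) < ring->count) {
		if (!sketch_reset_queue(ring, next))
			return false;
		/* Re-emit the non-guilty state that follows the bad job. */
		next++;
	}
	return true;
}
```

In this model the reset path never waits for the re-emitted jobs to finish. If one of them is also bad, the loop simply goes through sketch_reset_queue() a second time, which mirrors the "queue reset again rather than adapter reset" behavior.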
>
> git tree available here:
> https://gitlab.freedesktop.org/agd5f/linux/-/commits/kq_resets?ref_type=heads
I've pushed several fixes since I last sent this and will continue to
push updates.
Alex
>
> Alex
>
> >
> > v4: Drop explicit padding patches
> >     Drop new timeout macro
> >     Rework re-emit sequence
> > v5: Add a helper for reemit
> >     Convert VCN, JPEG, SDMA to use new helpers
> >
> > Alex Deucher (27):
> > drm/amdgpu: enable legacy enforce isolation by default
> > drm/amdgpu/gfx7: drop reset_kgq
> > drm/amdgpu/gfx8: drop reset_kgq
> > drm/amdgpu/gfx9: drop reset_kgq
> > drm/amdgpu: move force completion into ring resets
> > drm/amdgpu: track ring state associated with a job
> > drm/amdgpu/gfx10: re-emit unprocessed state on ring reset
> > drm/amdgpu/gfx11: re-emit unprocessed state on ring reset
> > drm/amdgpu/gfx12: re-emit unprocessed state on ring reset
> > drm/amdgpu/gfx9: re-emit unprocessed state on kcq reset
> > drm/amdgpu/gfx9.4.3: re-emit unprocessed state on kcq reset
> > drm/amdgpu/sdma5: re-emit unprocessed state on ring reset
> > drm/amdgpu/sdma5.2: re-emit unprocessed state on ring reset
> > drm/amdgpu/sdma6: re-emit unprocessed state on ring reset
> > drm/amdgpu/sdma7: re-emit unprocessed state on ring reset
> > drm/amdgpu/jpeg2: re-emit unprocessed state on ring reset
> > drm/amdgpu/jpeg2.5: re-emit unprocessed state on ring reset
> > drm/amdgpu/jpeg3: re-emit unprocessed state on ring reset
> > drm/amdgpu/jpeg4: re-emit unprocessed state on ring reset
> > drm/amdgpu/jpeg4.0.3: re-emit unprocessed state on ring reset
> > drm/amdgpu/jpeg5.0.0: add queue reset
> > drm/amdgpu/jpeg5: re-emit unprocessed state on ring reset
> > drm/amdgpu/jpeg5.0.1: re-emit unprocessed state on ring reset
> > drm/amdgpu/vcn4: re-emit unprocessed state on ring reset
> > drm/amdgpu/vcn4.0.3: re-emit unprocessed state on ring reset
> > drm/amdgpu/vcn4.0.5: re-emit unprocessed state on ring reset
> > drm/amdgpu/vcn5: re-emit unprocessed state on ring reset
> >
> > Christian König (1):
> > drm/amdgpu: rework queue reset scheduler interaction
> >
> > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +-
> > drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 12 ++++
> > drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c | 6 ++
> > drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 32 +++++-----
> > drivers/gpu/drm/amd/amdgpu/amdgpu_job.h | 2 +
> > drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 46 ++++++++++++++
> > drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 8 +++
> > drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 31 ++--------
> > drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 21 +------
> > drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 21 +------
> > drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c | 71 ----------------------
> > drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c | 71 ----------------------
> > drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 51 +---------------
> > drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 6 +-
> > drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c | 3 +-
> > drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c | 3 +-
> > drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c | 3 +-
> > drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c | 3 +-
> > drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c | 3 +-
> > drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_0.c | 12 ++++
> > drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c | 3 +-
> > drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 4 ++
> > drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c | 7 ++-
> > drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 7 ++-
> > drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c | 6 +-
> > drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c | 6 +-
> > drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c | 2 +-
> > drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c | 3 +-
> > drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c | 2 +-
> > drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c | 2 +-
> > 30 files changed, 162 insertions(+), 289 deletions(-)
> >
> > --
> > 2.49.0
> >