[PATCH V13 00/28] Reset improvements

Alex Deucher alexander.deucher at amd.com
Tue Jul 1 18:44:23 UTC 2025


This set improves per-queue reset support for a number of IPs.
When we reset a queue, its contents are lost, so we need
to re-emit the unprocessed state from subsequent submissions.
This is handled in gfx/compute queues via switch buffer and
pipeline sync packets.  However, you can still end up with
parallel execution across queues.  For correctness in that
case, enforce isolation needs to be enabled.  That can
impact certain use cases, however, and in most cases the
guilty job is correctly identified even without enforce isolation.

Tested on GC 10 and 11 chips with a game running and
then running hang tests.  The game pauses when the
hang happens, then continues after the queue reset.

The same approach is extended to SDMA and VCN.
They don't need enforce isolation because those engines
are single-threaded, so they always operate serially.

Rework re-emit to signal the seq number of the bad job and
check it to verify that the reset worked, then re-emit the
rest of the non-guilty state.  This way we are not waiting on
the rest of the state to complete, and if the subsequent state
also contains a bad job, we'll end up in queue reset again rather
than adapter reset.
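
As a rough illustration of the ordering described above, here is a minimal
userspace sketch in C.  It is not the driver code: the sketch_* structs and
the function are hypothetical names made up for this example, the
"reset_works" flag stands in for the real hardware reset, and skipping later
jobs from the guilty context is an assumption based on the v6 changelog entry.
It only models the three steps: signal the bad job's seq number, verify it,
then re-emit the remaining non-guilty work without waiting on it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a submitted job; not an amdgpu structure. */
struct sketch_job {
	uint32_t seq;	/* fence sequence number of the job */
	int ctx;	/* id of the submitting context */
};

/* Hypothetical model of a ring: only the last signaled seq is tracked. */
struct sketch_ring {
	uint32_t signaled_seq;
};

/*
 * Returns -1 if the reset could not be verified (the caller would fall
 * back to a heavier reset), otherwise the number of jobs re-emitted.
 * The seq numbers of re-emitted jobs are written to reemitted[].
 */
int sketch_reset_and_reemit(struct sketch_ring *ring,
			    const struct sketch_job *jobs, size_t n,
			    size_t guilty, bool reset_works,
			    uint32_t *reemitted, size_t cap)
{
	size_t count = 0;
	size_t i;

	assert(guilty < n);

	/* Step 1: after the queue reset, signal only the bad job's seq. */
	if (reset_works)
		ring->signaled_seq = jobs[guilty].seq;

	/* Step 2: verify that seq before touching anything else. */
	if (ring->signaled_seq != jobs[guilty].seq)
		return -1;

	/*
	 * Step 3: re-emit later jobs without waiting for them to finish.
	 * Jobs from the guilty context are skipped (assumption, see above).
	 */
	for (i = guilty + 1; i < n && count < cap; i++) {
		if (jobs[i].ctx == jobs[guilty].ctx)
			continue;
		reemitted[count++] = jobs[i].seq;
	}

	return (int)count;
}
```

Because step 3 does not wait for completion, a second bad job in the
re-emitted work would simply time out on its own later and go through the
same sequence again.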

Patches apply to the amd-staging-drm-next or drm-next branches in my
git tree.

Git tree:
https://gitlab.freedesktop.org/agd5f/linux/-/commits/kq_resets?ref_type=heads

The IGT deadlock tests need the following fixes to properly handle -ETIME fences:
https://patchwork.freedesktop.org/series/150724/

v4: Drop explicit padding patches
    Drop new timeout macro
    Rework re-emit sequence
v5: Add a helper for reemit
    Convert VCN, JPEG, SDMA to use new helpers
v6: Update SDMA 4.4.2 to use new helpers
    Move ptr tracking to amdgpu_fence
    Skip all jobs from the bad context on the ring
v7: Rework the backup logic
    Move and clean up the guilty logic for engine resets
    Integrate suggestions from Christian
    Add JPEG 4.0.5 support
v8: Add non-guilty ring backup handling
    Clean up new function signatures
    Reorder some bug fixes to the start of the series
v9: Clean up fence_emit
    SDMA 5.x fixes
    Add new reset helpers
    sched wqueue stop/start cleanup
    Add support for VCNs without unified queues
v10: Drop enforce isolation default change
     Add more documentation
     Clean up ring backup logic
v11: SDMA6/7 fixes
v12: Ring backup and reemit fixes
     SDMA cleanups
     SDMA5.x reemit support
     GFX10 KGQ reset fix
v13: Drop SDMA cleanups; they caused regressions in some IGT tests

Alex Deucher (28):
  drm/amdgpu/sdma: consolidate engine reset handling
  drm/amdgpu/sdma: allow caller to handle kernel rings in engine reset
  drm/amdgpu: track ring state associated with a fence
  drm/amdgpu/gfx9: re-emit unprocessed state on kcq reset
  drm/amdgpu/gfx9.4.3: re-emit unprocessed state on kcq reset
  drm/amdgpu/gfx10: re-emit unprocessed state on ring reset
  drm/amdgpu/gfx11: re-emit unprocessed state on ring reset
  drm/amdgpu/gfx12: re-emit unprocessed state on ring reset
  drm/amdgpu/sdma5: re-emit unprocessed state on ring reset
  drm/amdgpu/sdma5.2: re-emit unprocessed state on ring reset
  drm/amdgpu/sdma6: re-emit unprocessed state on ring reset
  drm/amdgpu/sdma7: re-emit unprocessed state on ring reset
  drm/amdgpu/jpeg2: re-emit unprocessed state on ring reset
  drm/amdgpu/jpeg2.5: re-emit unprocessed state on ring reset
  drm/amdgpu/jpeg3: re-emit unprocessed state on ring reset
  drm/amdgpu/jpeg4: re-emit unprocessed state on ring reset
  drm/amdgpu/jpeg4.0.3: re-emit unprocessed state on ring reset
  drm/amdgpu/jpeg4.0.5: add queue reset
  drm/amdgpu/jpeg5: add queue reset
  drm/amdgpu/jpeg5.0.1: re-emit unprocessed state on ring reset
  drm/amdgpu/vcn4: re-emit unprocessed state on ring reset
  drm/amdgpu/vcn4.0.3: re-emit unprocessed state on ring reset
  drm/amdgpu/vcn4.0.5: re-emit unprocessed state on ring reset
  drm/amdgpu/vcn5: re-emit unprocessed state on ring reset
  drm/amdgpu/vcn: add a helper framework for engine resets
  drm/amdgpu/vcn2: implement ring reset
  drm/amdgpu/vcn2.5: implement ring reset
  drm/amdgpu/vcn3: implement ring reset

 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c     | 90 +++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c        | 15 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |  4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c      | 67 ++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h      | 18 ++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c      | 43 +++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h      |  3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c       | 76 ++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h       |  6 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |  4 +
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c        | 41 ++-------
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c        | 35 +-------
 drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c        | 35 +-------
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c         | 12 +--
 drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c       | 12 +--
 drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c        | 11 +--
 drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c        | 11 +--
 drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c        | 11 +--
 drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c        | 11 +--
 drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c      | 11 +--
 drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_5.c      | 11 +++
 drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_0.c      | 14 +++
 drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c      | 11 +--
 drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c      | 19 +---
 drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c        | 23 +++--
 drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c        | 23 +++--
 drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c        | 18 ++--
 drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c        | 18 ++--
 drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c         | 12 +++
 drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c         | 11 +++
 drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c         | 13 +++
 drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c         | 11 +--
 drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c       | 10 +--
 drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c       | 11 +--
 drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c       | 11 +--
 .../drm/amd/amdkfd/kfd_device_queue_manager.c |  2 +-
 36 files changed, 454 insertions(+), 280 deletions(-)

-- 
2.50.0


