[PATCH V13 00/28] Reset improvements

Rodrigo Siqueira siqueira at igalia.com
Sun Jul 6 15:05:31 UTC 2025


On 07/01, Alex Deucher wrote:
> This set improves per queue reset support for a number of IPs.
> When we reset the queue, the queue is lost so we need
> to re-emit the unprocessed state from subsequent submissions.
> This is handled in gfx/compute queues via switch buffer and
> pipeline sync packets.  However, you can still end up with
> parallel execution across queues.  For correctness in that
> case, enforce isolation needs to be enabled.  That can
> impact certain use cases, however, and in most cases the
> guilty job is correctly identified even without enforce isolation.
> 
> Tested on GC 10 and 11 chips with a game running and
> then running hang tests.  The game pauses when the

Hi Alex,

Which hang test did you run?

Thanks

> hang happens, then continues after the queue reset.
> 
> The same approach is extended to SDMA and VCN.
> They don't need enforce isolation because those engines
> are single threaded so they always operate serially.
> 
> Rework re-emit to signal the seq number of the bad job and
> check it to verify that the reset worked, then re-emit the
> rest of the non-guilty state.  This way we are not waiting on
> the rest of the state to complete, and if the subsequent state
> also contains a bad job, we'll end up in queue reset again rather
> than adapter reset.
> 
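
To check my reading of the reworked sequence, here is a minimal user-space
sketch of it. It is only a model, not code from the series: every name
(model_ring, model_job, model_emit_fence, model_reset_and_reemit) is
hypothetical, the use of -ETIME for the guilty job is an assumption based on
the IGT note below, and the "alive" flag merely stands in for whether the
queue processes packets again after the reset.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model types; none of these are amdgpu symbols. */
struct model_ring {
	bool     alive;        /* queue processes packets after the reset */
	uint32_t signaled_seq; /* last fence sequence number written back */
};

struct model_job {
	uint32_t seq;       /* fence sequence number of the job */
	uint32_t ctx;       /* id of the submitting context */
	int      error;     /* 0, or a negative errno for a failed job */
	bool     reemitted; /* job was queued again after the reset */
	bool     skipped;   /* job was dropped (same context as guilty job) */
};

/* A fence write only lands if the reset actually revived the queue. */
static void model_emit_fence(struct model_ring *ring, uint32_t seq)
{
	if (ring->alive)
		ring->signaled_seq = seq;
}

/*
 * Model of the sequence after a queue reset:
 *  1. signal the guilty job's sequence number (with an error attached),
 *  2. check that the sequence number landed, i.e. the reset worked,
 *  3. re-emit later jobs, skipping those from the guilty context.
 * Returns the number of re-emitted jobs, or -1 if step 2 fails and the
 * caller would have to escalate.
 */
int model_reset_and_reemit(struct model_ring *ring, struct model_job *jobs,
			   size_t n, size_t guilty)
{
	int count = 0;
	size_t i;

	assert(ring != NULL && jobs != NULL && guilty < n);

	jobs[guilty].error = -ETIME; /* assumption: timeout error code */
	model_emit_fence(ring, jobs[guilty].seq);

	if (ring->signaled_seq != jobs[guilty].seq)
		return -1;

	for (i = guilty + 1; i < n; i++) {
		if (jobs[i].ctx == jobs[guilty].ctx) {
			jobs[i].skipped = true;
			continue;
		}
		jobs[i].reemitted = true;
		count++;
	}
	return count;
}
```

In this model nothing waits for the re-emitted jobs to complete; if one of
them hangs as well, it would be handled by another pass through the same
function rather than by a heavier reset.
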
> Patches apply to the amd-staging-drm-next or drm-next branches in my
> git tree.
> 
> Git tree:
> https://gitlab.freedesktop.org/agd5f/linux/-/commits/kq_resets?ref_type=heads
> 
> The IGT deadlock tests need the following fixes to properly handle -ETIME fences:
> https://patchwork.freedesktop.org/series/150724/
> 
> v4: Drop explicit padding patches
>     Drop new timeout macro
>     Rework re-emit sequence
> v5: Add a helper for reemit
>     Convert VCN, JPEG, SDMA to use new helpers
> v6: Update SDMA 4.4.2 to use new helpers
>     Move ptr tracking to amdgpu_fence
>     Skip all jobs from the bad context on the ring
> v7: Rework the backup logic
>     Move and clean up the guilty logic for engine resets
>     Integrate suggestions from Christian
>     Add JPEG 4.0.5 support
> v8: Add non-guilty ring backup handling
>     Clean up new function signatures
>     Reorder some bug fixes to the start of the series
> v9: Clean up fence_emit
>     SDMA 5.x fixes
>     Add new reset helpers
>     sched wqueue stop/start cleanup
>     Add support for VCNs without unified queues
> v10: Drop enforce isolation default change
>      Add more documentation
>      Clean up ring backup logic
> v11: SDMA6/7 fixes
> v12: Ring backup and reemit fixes
>      SDMA cleanups
>      SDMA5.x reemit support
>      GFX10 KGQ reset fix
> v13: drop SDMA cleanups, they caused regressions in some IGT tests
> 
> Alex Deucher (28):
>   drm/amdgpu/sdma: consolidate engine reset handling
>   drm/amdgpu/sdma: allow caller to handle kernel rings in engine reset
>   drm/amdgpu: track ring state associated with a fence
>   drm/amdgpu/gfx9: re-emit unprocessed state on kcq reset
>   drm/amdgpu/gfx9.4.3: re-emit unprocessed state on kcq reset
>   drm/amdgpu/gfx10: re-emit unprocessed state on ring reset
>   drm/amdgpu/gfx11: re-emit unprocessed state on ring reset
>   drm/amdgpu/gfx12: re-emit unprocessed state on ring reset
>   drm/amdgpu/sdma5: re-emit unprocessed state on ring reset
>   drm/amdgpu/sdma5.2: re-emit unprocessed state on ring reset
>   drm/amdgpu/sdma6: re-emit unprocessed state on ring reset
>   drm/amdgpu/sdma7: re-emit unprocessed state on ring reset
>   drm/amdgpu/jpeg2: re-emit unprocessed state on ring reset
>   drm/amdgpu/jpeg2.5: re-emit unprocessed state on ring reset
>   drm/amdgpu/jpeg3: re-emit unprocessed state on ring reset
>   drm/amdgpu/jpeg4: re-emit unprocessed state on ring reset
>   drm/amdgpu/jpeg4.0.3: re-emit unprocessed state on ring reset
>   drm/amdgpu/jpeg4.0.5: add queue reset
>   drm/amdgpu/jpeg5: add queue reset
>   drm/amdgpu/jpeg5.0.1: re-emit unprocessed state on ring reset
>   drm/amdgpu/vcn4: re-emit unprocessed state on ring reset
>   drm/amdgpu/vcn4.0.3: re-emit unprocessed state on ring reset
>   drm/amdgpu/vcn4.0.5: re-emit unprocessed state on ring reset
>   drm/amdgpu/vcn5: re-emit unprocessed state on ring reset
>   drm/amdgpu/vcn: add a helper framework for engine resets
>   drm/amdgpu/vcn2: implement ring reset
>   drm/amdgpu/vcn2.5: implement ring reset
>   drm/amdgpu/vcn3: implement ring reset
> 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c     | 90 +++++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c        | 15 +++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |  4 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c      | 67 ++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h      | 18 ++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c      | 43 +++++----
>  drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h      |  3 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c       | 76 ++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h       |  6 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |  4 +
>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c        | 41 ++-------
>  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c        | 35 +-------
>  drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c        | 35 +-------
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c         | 12 +--
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c       | 12 +--
>  drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c        | 11 +--
>  drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c        | 11 +--
>  drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c        | 11 +--
>  drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c        | 11 +--
>  drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c      | 11 +--
>  drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_5.c      | 11 +++
>  drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_0.c      | 14 +++
>  drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c      | 11 +--
>  drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c      | 19 +---
>  drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c        | 23 +++--
>  drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c        | 23 +++--
>  drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c        | 18 ++--
>  drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c        | 18 ++--
>  drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c         | 12 +++
>  drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c         | 11 +++
>  drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c         | 13 +++
>  drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c         | 11 +--
>  drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c       | 10 +--
>  drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c       | 11 +--
>  drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c       | 11 +--
>  .../drm/amd/amdkfd/kfd_device_queue_manager.c |  2 +-
>  36 files changed, 454 insertions(+), 280 deletions(-)
> 
> -- 
> 2.50.0
> 

-- 
Rodrigo Siqueira

