[PATCH V5 00/28] Reset improvements for GC10+

Alex Deucher alexdeucher at gmail.com
Thu May 29 20:54:51 UTC 2025


On Thu, May 29, 2025 at 4:08 PM Alex Deucher <alexander.deucher at amd.com> wrote:
>
> This set improves per queue reset support for GC10+.
> When we reset the queue, the queue is lost so we need
> to re-emit the unprocessed state from subsequent submissions.
> To do that safely, we need to enable legacy enforce
> isolation.  If we don't, multiple jobs can run in parallel
> and we may not end up resetting the correct one.  This is
> similar to how Windows handles queues.  This also gives us
> correct guilty tracking for GC.
>
> Tested on GC 10 and 11 chips with a game running and
> then running hang tests.  The game pauses when the
> hang happens, then continues after the queue reset.
>
> I tried this same approach on GC8 and 9, but it
> was not as reliable as soft recovery.  As such, I've dropped
> the KGQ reset code for pre-GC10.
>
> The same approach is extended to SDMA and VCN.
> They don't need enforce isolation because those engines
> are single threaded so they always operate serially.
>
> Rework re-emit to signal the seq number of the bad job and
> use that to verify that the reset worked, then re-emit the
> rest of the non-guilty state.  This way we are not waiting on
> the rest of the state to complete, and if the subsequent state
> also contains a bad job, we'll end up in queue reset again rather
> than adapter reset.
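
For readers following the re-emit ordering described above, here is a
minimal userspace sketch of the sequence: signal the bad job's seq,
check that it landed, then re-emit whatever was submitted after it
without waiting on it.  This is not the amdgpu code.  The names
sketch_ring, sketch_reset_and_reemit, MAX_JOBS and the reset_ok flag
are all hypothetical, and seq number wraparound is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_JOBS 8

/* Hypothetical model of a ring; not the driver's struct amdgpu_ring. */
struct sketch_ring {
	uint32_t pending_seq[MAX_JOBS]; /* submitted, not completed, in order */
	int count;                      /* valid entries in pending_seq */
	uint32_t signaled_seq;          /* last fence seq seen as signaled */
	bool reset_ok;                  /* simulates whether the queue reset took */
};

/*
 * Returns the number of seqs written to reemitted[], or -1 if the reset
 * could not be verified (the caller would then escalate to a heavier
 * reset).  reemitted must have room for MAX_JOBS entries.
 */
int sketch_reset_and_reemit(struct sketch_ring *ring, uint32_t bad_seq,
			    uint32_t *reemitted)
{
	int n = 0;

	assert(ring->count >= 0 && ring->count <= MAX_JOBS);

	/* Step 1: after the reset, signal only the bad job's fence. */
	if (ring->reset_ok)
		ring->signaled_seq = bad_seq;

	/* Step 2: the fence landing is the evidence that the reset worked. */
	if (ring->signaled_seq != bad_seq)
		return -1;

	/* Step 3: re-emit everything after the bad job; do not wait on it. */
	for (int i = 0; i < ring->count; i++) {
		if (ring->pending_seq[i] > bad_seq)
			reemitted[n++] = ring->pending_seq[i];
	}
	return n;
}
```

Since step 3 does not wait for the re-emitted work, a second bad job
in that work would time out on its own later and go through this same
path again, which matches the behaviour described in the cover letter.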

git tree available here:
https://gitlab.freedesktop.org/agd5f/linux/-/commits/kq_resets?ref_type=heads

Alex

>
> v4: Drop explicit padding patches
>     Drop new timeout macro
>     Rework re-emit sequence
> v5: Add a helper for reemit
>     Convert VCN, JPEG, SDMA to use new helpers
>
> Alex Deucher (27):
>   drm/amdgpu: enable legacy enforce isolation by default
>   drm/amdgpu/gfx7: drop reset_kgq
>   drm/amdgpu/gfx8: drop reset_kgq
>   drm/amdgpu/gfx9: drop reset_kgq
>   drm/amdgpu: move force completion into ring resets
>   drm/amdgpu: track ring state associated with a job
>   drm/amdgpu/gfx10: re-emit unprocessed state on ring reset
>   drm/amdgpu/gfx11: re-emit unprocessed state on ring reset
>   drm/amdgpu/gfx12: re-emit unprocessed state on ring reset
>   drm/amdgpu/gfx9: re-emit unprocessed state on kcq reset
>   drm/amdgpu/gfx9.4.3: re-emit unprocessed state on kcq reset
>   drm/amdgpu/sdma5: re-emit unprocessed state on ring reset
>   drm/amdgpu/sdma5.2: re-emit unprocessed state on ring reset
>   drm/amdgpu/sdma6: re-emit unprocessed state on ring reset
>   drm/amdgpu/sdma7: re-emit unprocessed state on ring reset
>   drm/amdgpu/jpeg2: re-emit unprocessed state on ring reset
>   drm/amdgpu/jpeg2.5: re-emit unprocessed state on ring reset
>   drm/amdgpu/jpeg3: re-emit unprocessed state on ring reset
>   drm/amdgpu/jpeg4: re-emit unprocessed state on ring reset
>   drm/amdgpu/jpeg4.0.3: re-emit unprocessed state on ring reset
>   drm/amdgpu/jpeg5.0.0: add queue reset
>   drm/amdgpu/jpeg5: re-emit unprocessed state on ring reset
>   drm/amdgpu/jpeg5.0.1: re-emit unprocessed state on ring reset
>   drm/amdgpu/vcn4: re-emit unprocessed state on ring reset
>   drm/amdgpu/vcn4.0.3: re-emit unprocessed state on ring reset
>   drm/amdgpu/vcn4.0.5: re-emit unprocessed state on ring reset
>   drm/amdgpu/vcn5: re-emit unprocessed state on ring reset
>
> Christian König (1):
>   drm/amdgpu: rework queue reset scheduler interaction
>
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 12 ++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c     |  6 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 32 +++++-----
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.h    |  2 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c   | 46 ++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  8 +++
>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 31 ++--------
>  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c     | 21 +------
>  drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c     | 21 +------
>  drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c      | 71 ----------------------
>  drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c      | 71 ----------------------
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 51 +---------------
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c    |  6 +-
>  drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c     |  3 +-
>  drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c     |  3 +-
>  drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c     |  3 +-
>  drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c     |  3 +-
>  drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c   |  3 +-
>  drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_0.c   | 12 ++++
>  drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c   |  3 +-
>  drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c   |  4 ++
>  drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c     |  7 ++-
>  drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c     |  7 ++-
>  drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c     |  6 +-
>  drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c     |  6 +-
>  drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c      |  2 +-
>  drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c    |  3 +-
>  drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c    |  2 +-
>  drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c    |  2 +-
>  30 files changed, 162 insertions(+), 289 deletions(-)
>
> --
> 2.49.0
>
