[PATCH V2 00/10] Reset improvements for GC10+

Alex Deucher alexdeucher at gmail.com
Fri May 23 13:58:02 UTC 2025


On Fri, May 23, 2025 at 9:27 AM Christian König
<christian.koenig at amd.com> wrote:
>
> On 5/23/25 05:04, Alex Deucher wrote:
> > On Thu, May 22, 2025 at 5:57 PM Alex Deucher <alexander.deucher at amd.com> wrote:
> >>
> >> This set improves per queue reset support for GC10+.
> >> This uses vmid resets for GFX.  GFX resets all state
> >> associated with a vmid and then continues where it
> >> left off.  Since only the IB uses the vmid, only
> >> the IB is reset and execution continues after the IB.
> >> Tested on GC 10 and 11 chips with a game running and
> >> then running hang tests.  The game pauses when the
> >> hang happens, then continues after the queue reset.
> >
> > After further investigation, this appears to work as expected, but
> > only by chance.  The ring is reset, but any pipelined content in the
> > ring after the job is lost.  We either need to limit the ring to one
> > job or patch in the subsequent packets after resetting.
>
> Yeah, I feared that this wouldn't work.
>
> Any idea why the VMID based reset isn't working?

I think it works similarly to the preemption sequence (e.g., see
gfx_v9_0_ring_preempt_ib()), but with a reset rather than a
preemption.  I don't think this will be easily portable to gfx11 and
newer, though, as they no longer have direct access to the HWS.

>
> On the other hand we could just restart from the ring RPTR again.

I think that's probably the best option.  I was thinking we could
mirror the ring frames for each gang and after a reset, we submit the
unprocessed frames again.  That way we can still do a ring test to
make sure the ring is functional after the reset and then submit the
unprocessed work.
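
To make the mirroring idea concrete, here is a rough userspace sketch,
not amdgpu code.  Every name in it (struct ring_mirror, mirror_submit(),
mirror_unprocessed(), the fixed frame size) is hypothetical.  It assumes
frames complete in order, ignores sequence number wraparound, and
assumes the frame that hung is skipped rather than replayed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MIRROR_SLOTS 8	/* hypothetical: max frames in flight */
#define FRAME_DWORDS 4	/* hypothetical: fixed frame size for the sketch */

struct frame {
	uint32_t seq;			/* fence sequence number of this frame */
	uint32_t pkt[FRAME_DWORDS];	/* copy of the packets written to the ring */
};

struct ring_mirror {
	struct frame slot[MIRROR_SLOTS];
	uint32_t next_seq;	/* seq assigned to the next submission */
	uint32_t signaled_seq;	/* last seq the hardware completed */
};

static void mirror_init(struct ring_mirror *m)
{
	memset(m, 0, sizeof(*m));
	m->next_seq = 1;	/* 0 is reserved to mean "no frame" */
}

/* Record a copy of a frame as it is submitted; returns its seq, 0 if full. */
static uint32_t mirror_submit(struct ring_mirror *m, const uint32_t *pkt)
{
	uint32_t in_flight = m->next_seq - 1 - m->signaled_seq;
	struct frame *f;

	if (in_flight >= MIRROR_SLOTS)
		return 0;

	f = &m->slot[m->next_seq % MIRROR_SLOTS];
	f->seq = m->next_seq;
	memcpy(f->pkt, pkt, sizeof(f->pkt));
	return m->next_seq++;
}

/* Called when the fence for seq signals; older mirror slots become free. */
static void mirror_complete(struct ring_mirror *m, uint32_t seq)
{
	assert(seq > m->signaled_seq && seq < m->next_seq);
	m->signaled_seq = seq;
}

/*
 * After a reset, collect the frames that never completed, in submission
 * order, skipping guilty_seq (pass 0 to skip nothing).  The caller would
 * re-emit these to the ring once the ring test has passed.
 */
static size_t mirror_unprocessed(const struct ring_mirror *m,
				 uint32_t guilty_seq,
				 const struct frame **out, size_t max)
{
	size_t n = 0;
	uint32_t seq;

	for (seq = m->signaled_seq + 1; seq != m->next_seq && n < max; seq++) {
		if (seq == guilty_seq)
			continue;
		out[n++] = &m->slot[seq % MIRROR_SLOTS];
	}
	return n;
}
```

The intended order after a hang would be: reset the ring, run the ring
test, then walk the result of mirror_unprocessed() and write those
frames back to the ring.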

Alex

>
> Regards,
> Christian.
>
> >
> > Alex
> >
> >>
> >> I tried this same approach on GC8 and 9, but it
> >> was not as reliable as soft recovery.  I also compared
> >> this to Christian's reset patches, but I was not
> >> able to make them work as reliably as this series.
> >>
> >> Alex Deucher (9):
> >>   Revert "drm/amd/amdgpu: add pipe1 hardware support"
> >>   drm/amdgpu: add AMDGPU_QUEUE_RESET_TIMEOUT
> >>   drm/amdgpu: set the exec flag on the IB fence
> >>   drm/amdgpu/gfx11: adjust ring reset sequences
> >>   drm/amdgpu/gfx11: drop soft recovery
> >>   drm/amdgpu/gfx12: adjust ring reset sequences
> >>   drm/amdgpu/gfx12: drop soft recovery
> >>   drm/amdgpu/gfx10: adjust ring reset sequences
> >>   drm/amdgpu/gfx10: drop soft recovery
> >>
> >> Christian König (1):
> >>   drm/amdgpu: rework queue reset scheduler interaction
> >>
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  1 +
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c  |  3 +-
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 26 ++++++++--------
> >>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c  | 41 ++++++++-----------------
> >>  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c  | 35 ++++++---------------
> >>  drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c  | 35 ++++++---------------
> >>  drivers/gpu/drm/amd/amdgpu/nvd.h        |  1 +
> >>  7 files changed, 50 insertions(+), 92 deletions(-)
> >>
> >> --
> >> 2.49.0
> >>
>


More information about the amd-gfx mailing list