[PATCH V2 00/10] Reset improvements for GC10+
Alex Deucher
alexdeucher at gmail.com
Fri May 23 13:58:02 UTC 2025
On Fri, May 23, 2025 at 9:27 AM Christian König
<christian.koenig at amd.com> wrote:
>
> On 5/23/25 05:04, Alex Deucher wrote:
> > On Thu, May 22, 2025 at 5:57 PM Alex Deucher <alexander.deucher at amd.com> wrote:
> >>
> >> This set improves per queue reset support for GC10+.
> >> This uses vmid resets for GFX. GFX resets all state
> >> associated with a vmid and then continues where it
> >> left off. Since only the IB uses the vmid, only
> >> the IB is reset and execution continues after the IB.
> >> Tested on GC 10 and 11 chips with a game running and
> >> then running hang tests. The game pauses when the
> >> hang happens, then continues after the queue reset.
> >
> > After further investigation, this appears to work as expected, but
> > only by chance. The ring is reset, but any pipelined content in the
> > ring after the job is lost. We either need to limit the ring to one
> > job or patch in the subsequent packets after resetting.
>
> Yeah, I feared that this wouldn't work.
>
> Any idea why the VMID based reset isn't working?
I think it works similarly to the preemption sequence (see, e.g.,
gfx_v9_0_ring_preempt_ib()), but with a reset rather than a preemption.
I don't think this will be easily portable to gfx11 and newer, though,
as they no longer have direct access to the HWS.
>
> On the other hand we could just restart from the ring RPTR again.
I think that's probably the best option. I was thinking we could
mirror the ring frames for each gang and after a reset, we submit the
unprocessed frames again. That way we can still do a ring test to
make sure the ring is functional after the reset and then submit the
unprocessed work.
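
A rough sketch of the mirroring idea, not driver code. All names here
(ring_mirror, mirror_record, mirror_retire, mirror_replay, fake_ring) are
hypothetical and exist only for illustration. It assumes one fixed-size
mirror per ring, fence sequence numbers that do not wrap, and that the
frame of the guilty job is skipped on replay, which is an assumption on
top of what is described above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MIRROR_MAX_FRAMES 8   /* hypothetical limits for the sketch */
#define FRAME_MAX_DW      16
#define FAKE_RING_DW      64

/* One submitted frame: its fence sequence number and a copy of its dwords. */
struct mirror_frame {
	uint32_t seq;
	uint32_t ndw;
	uint32_t dw[FRAME_MAX_DW];
};

/* FIFO of frames that were submitted but whose fences have not signaled. */
struct ring_mirror {
	struct mirror_frame frames[MIRROR_MAX_FRAMES];
	unsigned int head;    /* index of the oldest unprocessed frame */
	unsigned int count;   /* number of unprocessed frames */
};

/* Stand-in for a hardware ring so the sketch is self-contained. */
struct fake_ring {
	uint32_t dw[FAKE_RING_DW];
	uint32_t wptr;
};

typedef void (*emit_fn)(void *ring, const uint32_t *dw, uint32_t ndw);

static void fake_ring_emit(void *ring, const uint32_t *dw, uint32_t ndw)
{
	struct fake_ring *r = ring;
	uint32_t i;

	assert(ndw <= FAKE_RING_DW);
	for (i = 0; i < ndw; i++) {
		r->dw[r->wptr] = dw[i];
		r->wptr = (r->wptr + 1) % FAKE_RING_DW;
	}
}

/* Copy a frame into the mirror at submit time; -1 if full or too large. */
static int mirror_record(struct ring_mirror *m, uint32_t seq,
			 const uint32_t *dw, uint32_t ndw)
{
	struct mirror_frame *f;

	if (m->count == MIRROR_MAX_FRAMES || ndw > FRAME_MAX_DW)
		return -1;
	f = &m->frames[(m->head + m->count) % MIRROR_MAX_FRAMES];
	f->seq = seq;
	f->ndw = ndw;
	memcpy(f->dw, dw, ndw * sizeof(*dw));
	m->count++;
	return 0;
}

/* Drop frames whose fence has signaled (seq <= last_signaled, no wrap). */
static void mirror_retire(struct ring_mirror *m, uint32_t last_signaled)
{
	while (m->count && m->frames[m->head].seq <= last_signaled) {
		m->head = (m->head + 1) % MIRROR_MAX_FRAMES;
		m->count--;
	}
}

/*
 * After a reset and a successful ring test, re-emit every unprocessed
 * frame except the guilty one (assumption). The mirror is left untouched;
 * frames leave it through mirror_retire() once their fences signal.
 * Returns the number of frames replayed.
 */
static unsigned int mirror_replay(const struct ring_mirror *m,
				  uint32_t guilty_seq,
				  emit_fn emit, void *ring)
{
	unsigned int i, replayed = 0;

	for (i = 0; i < m->count; i++) {
		const struct mirror_frame *f =
			&m->frames[(m->head + i) % MIRROR_MAX_FRAMES];

		if (f->seq == guilty_seq)
			continue;
		emit(ring, f->dw, f->ndw);
		replayed++;
	}
	return replayed;
}
```

The order would be: retire up to the last signaled fence, reset the
queue, run the ring test, then call mirror_replay() with the sequence
number of the hung job.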
Alex
>
> Regards,
> Christian.
>
> >
> > Alex
> >
> >>
> >> I tried this same approach on GC8 and 9, but it
> >> was not as reliable as soft recovery. I also compared
> >> this to Christian's reset patches, but I was not
> >> able to make them work as reliably as this series.
> >>
> >> Alex Deucher (9):
> >>   Revert "drm/amd/amdgpu: add pipe1 hardware support"
> >>   drm/amdgpu: add AMDGPU_QUEUE_RESET_TIMEOUT
> >>   drm/amdgpu: set the exec flag on the IB fence
> >>   drm/amdgpu/gfx11: adjust ring reset sequences
> >>   drm/amdgpu/gfx11: drop soft recovery
> >>   drm/amdgpu/gfx12: adjust ring reset sequences
> >>   drm/amdgpu/gfx12: drop soft recovery
> >>   drm/amdgpu/gfx10: adjust ring reset sequences
> >>   drm/amdgpu/gfx10: drop soft recovery
> >>
> >> Christian König (1):
> >>   drm/amdgpu: rework queue reset scheduler interaction
> >>
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  1 +
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c  |  3 +-
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 26 ++++++++--------
> >>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c  | 41 ++++++++-----------------
> >>  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c  | 35 ++++++---------------
> >>  drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c  | 35 ++++++---------------
> >>  drivers/gpu/drm/amd/amdgpu/nvd.h        |  1 +
> >>  7 files changed, 50 insertions(+), 92 deletions(-)
> >>
> >> --
> >> 2.49.0
> >>
>