[PATCH 12/19] drm/amdgpu/gfx10: re-emit unprocessed state on kgq reset
Alex Deucher
alexdeucher at gmail.com
Wed May 28 13:57:10 UTC 2025
On Wed, May 28, 2025 at 9:48 AM Christian König
<christian.koenig at amd.com> wrote:
>
> On 5/28/25 15:38, Alex Deucher wrote:
> > On Wed, May 28, 2025 at 7:40 AM Christian König
> > <christian.koenig at amd.com> wrote:
> >>
> >> On 5/28/25 06:19, Alex Deucher wrote:
> >>> Re-emit the unprocessed state after resetting the queue.
> >>>
> >>> Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
> >>> ---
> >>> drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 19 ++++++++++++-------
> >>> 1 file changed, 12 insertions(+), 7 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> >>> index 3193eb88b6889..f6e04cf21abcc 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> >>> @@ -9537,6 +9537,7 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> >>> struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
> >>> struct amdgpu_ring *kiq_ring = &kiq->ring;
> >>> unsigned long flags;
> >>> + unsigned int i;
> >>> u32 tmp;
> >>> u64 addr;
> >>> int r;
> >>> @@ -9571,10 +9572,8 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> >>> SOC15_REG_OFFSET(GC, 0, mmCP_VMID_RESET), 0, 0xffffffff);
> >>> kiq->pmf->kiq_map_queues(kiq_ring, ring);
> >>> amdgpu_ring_commit(kiq_ring);
> >>> -
> >>> - spin_unlock_irqrestore(&kiq->ring_lock, flags);
> >>> -
> >>> r = amdgpu_ring_test_ring(kiq_ring);
> >>
> >> I don't think we should do a ring test on the KIQ here. That basically doesn't tell us much and might cause additional problems.
> >
> > We need some way to wait for the KIQ submission to complete. This is
> > a simple way to accomplish that.
>
> Ok, that makes sense.
>
> >
> >>
> >>> + spin_unlock_irqrestore(&kiq->ring_lock, flags);
> >>> if (r)
> >>> return r;
> >>>
> >>> @@ -9584,7 +9583,15 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> >>> return r;
> >>> }
> >>>
> >>> - return amdgpu_ring_test_ring(ring);
> >>> + if (amdgpu_ring_alloc(ring, 8 + ring->ring_backup_entries_to_copy))
> >>> + return -ENOMEM;
> >>> + amdgpu_ring_emit_fence(ring, ring->fence_drv.gpu_addr,
> >>> + ring->ring_backup_seq, 0);
> >>> + for (i = 0; i < ring->ring_backup_entries_to_copy; i++)
> >>> + amdgpu_ring_write(ring, ring->ring_backup[i]);
> >>> + amdgpu_ring_commit(ring);
> >>
> >> I'm not sure if the commands are always relocatable. We should probably just instruct the ring to re-start with the original RPTR/WPTR.
> >>
> >> That would also avoid the need to save/restore the ring content with the CPU.
> >
> > I tried that originally, but I couldn't make it work for a few reasons:
> > 1. We need to have enforce isolation enabled, otherwise we almost
> > always reset the wrong VMID, and then when we execute the rest of
> > the pipeline, we hang again.
> > 2. When enforce isolation is enabled, we need to signal the fence
> > associated with the guilty job first, otherwise we get stuck waiting
> > on the pipeline sync when we execute the rest of the pipeline.
>
> So we execute the guilty job submission again? What prevents that one from running?
Without enforce isolation we usually end up resetting the wrong job. I
re-emit everything from the ring after the bad job. What usually
happens is that a game is running and then you start an app which
hangs the GPU (e.g., an infinite loop in a shader). Most of the time
we reset the game rather than the bad app: the scheduler times out on
the app first, then we reset the queue and re-emit the following
jobs, one of which is the bad app. The queue reset fails and we fall
back to a full adapter reset. With enforce isolation, we reset the
bad app instead.
Alex
>
> Christian.
>
> >
> > Alex
> >
> >>
> >> Regards,
> >> Christian.
> >>
> >>> +
> >>> + return gfx_v10_0_ring_test_ib(ring, AMDGPU_QUEUE_RESET_TIMEOUT);
> >>> }
> >>>
> >>> static int gfx_v10_0_reset_kcq(struct amdgpu_ring *ring,
> >>> @@ -9612,9 +9619,8 @@ static int gfx_v10_0_reset_kcq(struct amdgpu_ring *ring,
> >>> kiq->pmf->kiq_unmap_queues(kiq_ring, ring, RESET_QUEUES,
> >>> 0, 0);
> >>> amdgpu_ring_commit(kiq_ring);
> >>> - spin_unlock_irqrestore(&kiq->ring_lock, flags);
> >>> -
> >>> r = amdgpu_ring_test_ring(kiq_ring);
> >>> + spin_unlock_irqrestore(&kiq->ring_lock, flags);
> >>> if (r)
> >>> return r;
> >>>
> >>> @@ -9891,7 +9897,6 @@ static const struct amdgpu_ring_funcs gfx_v10_0_ring_funcs_gfx = {
> >>> .emit_wreg = gfx_v10_0_ring_emit_wreg,
> >>> .emit_reg_wait = gfx_v10_0_ring_emit_reg_wait,
> >>> .emit_reg_write_reg_wait = gfx_v10_0_ring_emit_reg_write_reg_wait,
> >>> - .soft_recovery = gfx_v10_0_ring_soft_recovery,
> >>> .emit_mem_sync = gfx_v10_0_emit_mem_sync,
> >>> .reset = gfx_v10_0_reset_kgq,
> >>> .emit_cleaner_shader = gfx_v10_0_ring_emit_cleaner_shader,
> >>
>
More information about the amd-gfx mailing list