[PATCH 12/19] drm/amdgpu/gfx10: re-emit unprocessed state on kgq reset

Alex Deucher alexdeucher at gmail.com
Wed May 28 14:26:11 UTC 2025


On Wed, May 28, 2025 at 9:57 AM Alex Deucher <alexdeucher at gmail.com> wrote:
>
> On Wed, May 28, 2025 at 9:48 AM Christian König
> <christian.koenig at amd.com> wrote:
> >
> > On 5/28/25 15:38, Alex Deucher wrote:
> > > On Wed, May 28, 2025 at 7:40 AM Christian König
> > > <christian.koenig at amd.com> wrote:
> > >>
> > >> On 5/28/25 06:19, Alex Deucher wrote:
> > >>> Re-emit the unprocessed state after resetting the queue.
> > >>>
> > >>> Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
> > >>> ---
> > >>>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 19 ++++++++++++-------
> > >>>  1 file changed, 12 insertions(+), 7 deletions(-)
> > >>>
> > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> > >>> index 3193eb88b6889..f6e04cf21abcc 100644
> > >>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> > >>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> > >>> @@ -9537,6 +9537,7 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> > >>>       struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
> > >>>       struct amdgpu_ring *kiq_ring = &kiq->ring;
> > >>>       unsigned long flags;
> > >>> +     unsigned int i;
> > >>>       u32 tmp;
> > >>>       u64 addr;
> > >>>       int r;
> > >>> @@ -9571,10 +9572,8 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> > >>>                                    SOC15_REG_OFFSET(GC, 0, mmCP_VMID_RESET), 0, 0xffffffff);
> > >>>       kiq->pmf->kiq_map_queues(kiq_ring, ring);
> > >>>       amdgpu_ring_commit(kiq_ring);
> > >>> -
> > >>> -     spin_unlock_irqrestore(&kiq->ring_lock, flags);
> > >>> -
> > >>>       r = amdgpu_ring_test_ring(kiq_ring);
> > >>
> > >> I don't think we should do a ring test on the KIQ here. That basically doesn't tell us much and might cause additional problems.
> > >
> > > We need some way to wait for the KIQ submission to complete.  This is
> > > a simple way to accomplish that.
> >
> > Ok, that makes sense.
> >
> > >
> > >>
> > >>> +     spin_unlock_irqrestore(&kiq->ring_lock, flags);
> > >>>       if (r)
> > >>>               return r;
> > >>>
> > >>> @@ -9584,7 +9583,15 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> > >>>               return r;
> > >>>       }
> > >>>
> > >>> -     return amdgpu_ring_test_ring(ring);
> > >>> +     if (amdgpu_ring_alloc(ring, 8 + ring->ring_backup_entries_to_copy))
> > >>> +             return -ENOMEM;
> > >>> +     amdgpu_ring_emit_fence(ring, ring->fence_drv.gpu_addr,
> > >>> +                            ring->ring_backup_seq, 0);
> > >>> +     for (i = 0; i < ring->ring_backup_entries_to_copy; i++)
> > >>> +             amdgpu_ring_write(ring, ring->ring_backup[i]);
> > >>> +     amdgpu_ring_commit(ring);
> > >>
> > >> I'm not sure if the commands are always relocatable. We should probably just instruct the ring to re-start with the original RPTR/WPTR.
> > >>
> > >> That would also avoid the need to save/restore the ring content with the CPU.
> > >
> > > I tried that originally, but I couldn't make it work for a few reasons:
> > > 1. We need to have enforce isolation enabled; otherwise we almost
> > > always reset the wrong VMID, and then when we execute the rest of the
> > > pipeline, we hang again.
> > > 2. When enforce isolation is enabled, we need to signal the fence
> > > associated with the guilty job first; otherwise we get stuck waiting
> > > on the pipeline sync when we execute the rest of the pipeline.
> >
> > So we execute the guilty job submission again? What prevents that one from running?
>
> Without enforce isolation we usually end up resetting the wrong job.  I
> re-emit everything from the ring after the bad job.  What usually
> happens is that a game is running and then you start an app which
> hangs the GPU (e.g., an infinite loop in a shader).  Most of the time
> we reset the game rather than the bad app: the scheduler times out on
> the game's job first, so we reset the queue and re-emit the following
> jobs, one of which is the bad app's.  The queue reset then fails and
> we fall back to a full adapter reset.  With enforce isolation, we
> reset the bad app instead.

Thinking about it more, we'll always end up in a full adapter reset if
we have back-to-back bad apps.  I think what we need to do is emit the
seq number from the bad app, verify that it completes, and use that to
determine success in the reset function; then re-emit the follow-on
work and exit the function.  That way the reset succeeds, and if a
follow-on job times out later, we can repeat the process cleanly.
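
Something roughly like the following is what I have in mind (untested
sketch; it assumes the ring_backup_* fields added earlier in this
series, that amdgpu_fence_wait_polling() is usable in this context,
and it hand-waves the timeout units):

	if (amdgpu_ring_alloc(ring, 8))
		return -ENOMEM;
	/* Signal the guilty job's fence first so the follow-on work
	 * doesn't stall on the pipeline sync. */
	amdgpu_ring_emit_fence(ring, ring->fence_drv.gpu_addr,
			       ring->ring_backup_seq, 0);
	amdgpu_ring_commit(ring);

	/* The reset only counts as successful if the guilty seq signals. */
	if (!amdgpu_fence_wait_polling(ring, ring->ring_backup_seq,
				       AMDGPU_QUEUE_RESET_TIMEOUT))
		return -ETIMEDOUT;

	/* Re-emit the unprocessed work and return.  If one of those jobs
	 * also hangs, its scheduler timeout fires and we come back through
	 * this path instead of escalating straight to an adapter reset. */
	if (amdgpu_ring_alloc(ring, ring->ring_backup_entries_to_copy))
		return -ENOMEM;
	for (i = 0; i < ring->ring_backup_entries_to_copy; i++)
		amdgpu_ring_write(ring, ring->ring_backup[i]);
	amdgpu_ring_commit(ring);

	return 0;

The error would then bubble up from the reset callback, and we'd only
escalate to a full adapter reset when the guilty seq never signals.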

Alex

>
> Alex
>
> >
> > Christian.
> >
> > >
> > > Alex
> > >
> > >>
> > >> Regards,
> > >> Christian.
> > >>
> > >>> +
> > >>> +     return gfx_v10_0_ring_test_ib(ring, AMDGPU_QUEUE_RESET_TIMEOUT);
> > >>>  }
> > >>>
> > >>>  static int gfx_v10_0_reset_kcq(struct amdgpu_ring *ring,
> > >>> @@ -9612,9 +9619,8 @@ static int gfx_v10_0_reset_kcq(struct amdgpu_ring *ring,
> > >>>       kiq->pmf->kiq_unmap_queues(kiq_ring, ring, RESET_QUEUES,
> > >>>                                  0, 0);
> > >>>       amdgpu_ring_commit(kiq_ring);
> > >>> -     spin_unlock_irqrestore(&kiq->ring_lock, flags);
> > >>> -
> > >>>       r = amdgpu_ring_test_ring(kiq_ring);
> > >>> +     spin_unlock_irqrestore(&kiq->ring_lock, flags);
> > >>>       if (r)
> > >>>               return r;
> > >>>
> > >>> @@ -9891,7 +9897,6 @@ static const struct amdgpu_ring_funcs gfx_v10_0_ring_funcs_gfx = {
> > >>>       .emit_wreg = gfx_v10_0_ring_emit_wreg,
> > >>>       .emit_reg_wait = gfx_v10_0_ring_emit_reg_wait,
> > >>>       .emit_reg_write_reg_wait = gfx_v10_0_ring_emit_reg_write_reg_wait,
> > >>> -     .soft_recovery = gfx_v10_0_ring_soft_recovery,
> > >>>       .emit_mem_sync = gfx_v10_0_emit_mem_sync,
> > >>>       .reset = gfx_v10_0_reset_kgq,
> > >>>       .emit_cleaner_shader = gfx_v10_0_ring_emit_cleaner_shader,
> > >>
> >

