[PATCH 2/5] drm/amdgpu: add ring soft recovery v2

Thu Aug 23 13:03:05 UTC 2018

On Thu, Aug 23, 2018 at 05:21:53PM +0800, Christian König wrote:
> On 23.08.2018 at 09:17, Huang Rui wrote:
> > On Wed, Aug 22, 2018 at 12:55:43PM -0400, Alex Deucher wrote:
> >> On Wed, Aug 22, 2018 at 6:05 AM Christian König
> >> <ckoenig.leichtzumerken at gmail.com> wrote:
> >>> Instead of hammering hard on the GPU try a soft recovery first.
> >>>
> >>> v2: reorder code a bit
> >>>
> >>> Signed-off-by: Christian König <christian.koenig at amd.com>
> >>> ---
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  |  6 ++++++
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 24 ++++++++++++++++++++++++
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |  4 ++++
> >>>   3 files changed, 34 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>> index 265ff90f4e01..d93e31a5c4e7 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>> @@ -33,6 +33,12 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
> >>>          struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
> >>>          struct amdgpu_job *job = to_amdgpu_job(s_job);
> >>>
> >>> +       if (amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
> >>> +               DRM_ERROR("ring %s timeout, but soft recovered\n",
> >>> +                         s_job->sched->name);
> >>> +               return;
> >>> +       }
> >> I think we should still bubble up the error to userspace even if we
> >> can recover.  Data is lost when the wave is killed.  We should treat
> >> it like a GPU reset.
> >>
> > May I ask what a wavefront is? Why can we do a "light" recovery
> > rather than a reset here?
> 
> Wavefront means a running shader in the SQ.
> 
> Basically this only covers the case when the application sends down a 
> shader with an endless loop to the hardware. Here we just kill the 
> shader and try to continue.
> 
> When you run into a hang because of a corrupted resource descriptor, you 
> usually need a full ASIC reset to get out of that again.
> 

Good to know this, thank you.
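
To check my understanding, the bounded retry loop in
amdgpu_ring_soft_recovery() can be modelled in plain userspace C roughly
as below. This is only a sketch under assumptions: struct model_fence,
struct model_ring, kill_wave_hook() and the budget_ns parameter are
hypothetical stand-ins that are not part of the patch, and
clock_gettime() takes the place of ktime_get().

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for struct dma_fence: only tracks "signaled". */
struct model_fence {
	bool signaled;
};

/* Hypothetical stand-in for struct amdgpu_ring and its funcs table. */
struct model_ring {
	/* Optional hook, like ring->funcs->soft_recovery in the patch. */
	void (*soft_recovery)(struct model_ring *ring, unsigned int vmid);
	struct model_fence *stuck;    /* fence the hook may unblock */
	unsigned int attempts;        /* hook invocations so far */
	unsigned int attempts_needed; /* invocations until the fence signals */
};

/* Monotonic clock in nanoseconds, standing in for ktime_get(). */
static long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Same shape as the patch: no hook means no soft recovery; otherwise keep
 * calling the hook until the fence signals or the time budget runs out,
 * then report whether the fence signaled.
 */
static bool model_soft_recovery(struct model_ring *ring, unsigned int vmid,
				struct model_fence *fence, long long budget_ns)
{
	long long deadline;

	assert(ring != NULL && fence != NULL);

	if (!ring->soft_recovery)
		return false;

	deadline = now_ns() + budget_ns;
	while (!fence->signaled && now_ns() < deadline)
		ring->soft_recovery(ring, vmid);

	return fence->signaled;
}

/* Example hook: pretend the wave dies after attempts_needed requests. */
static void kill_wave_hook(struct model_ring *ring, unsigned int vmid)
{
	(void)vmid;
	if (++ring->attempts >= ring->attempts_needed)
		ring->stuck->signaled = true;
}
```

As in the patch, a ring without the hook returns false right away, so the
caller falls through to its existing timeout handling.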

Series is Acked-by: Huang Rui <ray.huang at amd.com>

> Regards,
> Christian.
> 
> >
> > Thanks,
> > Ray
> >
> >> Alex
> >>
> >>> +
> >>>          DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n",
> >>>                    job->base.sched->name, atomic_read(&ring->fence_drv.last_seq),
> >>>                    ring->fence_drv.sync_seq);
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> >>> index 5dfd26be1eec..c045a4e38ad1 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> >>> @@ -383,6 +383,30 @@ void amdgpu_ring_emit_reg_write_reg_wait_helper(struct amdgpu_ring *ring,
> >>>          amdgpu_ring_emit_reg_wait(ring, reg1, mask, mask);
> >>>   }
> >>>
> >>> +/**
> >>> + * amdgpu_ring_soft_recovery - try to soft recover a ring lockup
> >>> + *
> >>> + * @ring: ring to try the recovery on
> >>> + * @vmid: VMID we try to get going again
> >>> + * @fence: timedout fence
> >>> + *
> >>> + * Tries to get a ring proceeding again when it is stuck.
> >>> + */
> >>> +bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
> >>> +                              struct dma_fence *fence)
> >>> +{
> >>> +       ktime_t deadline = ktime_add_us(ktime_get(), 1000);
> >>> +
> >>> +       if (!ring->funcs->soft_recovery)
> >>> +               return false;
> >>> +
> >>> +       while (!dma_fence_is_signaled(fence) &&
> >>> +              ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
> >>> +               ring->funcs->soft_recovery(ring, vmid);
> >>> +
> >>> +       return dma_fence_is_signaled(fence);
> >>> +}
> >>> +
> >>>   /*
> >>>    * Debugfs info
> >>>    */
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> >>> index 409fdd9b9710..9cc239968e40 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> >>> @@ -168,6 +168,8 @@ struct amdgpu_ring_funcs {
> >>>          /* priority functions */
> >>>          void (*set_priority) (struct amdgpu_ring *ring,
> >>>                                enum drm_sched_priority priority);
> >>> +       /* Try to soft recover the ring to make the fence signal */
> >>> +       void (*soft_recovery)(struct amdgpu_ring *ring, unsigned vmid);
> >>>   };
> >>>
> >>>   struct amdgpu_ring {
> >>> @@ -260,6 +262,8 @@ void amdgpu_ring_fini(struct amdgpu_ring *ring);
> >>>   void amdgpu_ring_emit_reg_write_reg_wait_helper(struct amdgpu_ring *ring,
> >>>                                                  uint32_t reg0, uint32_t val0,
> >>>                                                  uint32_t reg1, uint32_t val1);
> >>> +bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
> >>> +                              struct dma_fence *fence);
> >>>
> >>>   static inline void amdgpu_ring_clear_ring(struct amdgpu_ring *ring)
> >>>   {
> >>> --
> >>> 2.14.1
> >>>
> >>> _______________________________________________
> >>> amd-gfx mailing list
> >>> amd-gfx at lists.freedesktop.org
> >>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx