[RFC 0/6] Define and use reset domain for GPU recovery in amdgpu

Daniel Vetter daniel at ffwll.ch
Mon Dec 20 09:43:38 UTC 2021


On Mon, Dec 20, 2021 at 08:25:05AM +0100, Christian König wrote:
> Am 17.12.21 um 23:27 schrieb Andrey Grodzovsky:
> > This patchset is based on earlier work by Boris[1] that made it possible to
> > have an ordered workqueue at the driver level, used by the different
> > schedulers to queue their timeout work. On top of that I also serialized
> > any GPU reset we trigger from within amdgpu code to go through the same
> > ordered wq. This somewhat simplifies our GPU reset code, since we no longer
> > need to protect against concurrent GPU reset triggers such as TDR on one
> > hand and a sysfs or RAS trigger on the other.
> > 
> > As advised by Christian and Daniel I defined a reset_domain struct such that
> > all the entities that go through reset together will be serialized one against
> > another.
> > 
> > TDRs raised by multiple entities within the same domain for the same reason
> > will not each trigger a reset, as the first such reset cancels all the
> > pending ones. This applies only to TDR timers and not to resets triggered
> > from RAS or sysfs; those will still run after the in-flight reset finishes.
> > 
> > [1] https://patchwork.kernel.org/project/dri-devel/patch/20210629073510.2764391-3-boris.brezillon@collabora.com/
> > 
> > P.S. Going through drm-misc-next and not amd-staging-drm-next as Boris's work hasn't landed there yet.
> 
> Patches #1, #5 and #6 are Reviewed-by: Christian König
> <christian.koenig at amd.com>
> 
> Some minor comments on the rest, but in general absolutely looks like the
> way we want to go.

I only scrolled through quickly, but yeah, I concur.
-Daniel
> 
> Regards,
> Christian.
> 
> > 
> > Andrey Grodzovsky (6):
> >    drm/amdgpu: Init GPU reset single threaded wq
> >    drm/amdgpu: Move scheduler init to after XGMI is ready
> >    drm/amdgpu: Fix crash on modprobe
> >    drm/amdgpu: Serialize non TDR gpu recovery with TDRs
> >    drm/amdgpu: Drop hive->in_reset
> >    drm/amdgpu: Drop concurrent GPU reset protection for device
> > 
> >   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   9 +
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 206 +++++++++++----------
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  36 +---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |   2 +-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   |  10 +-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |   3 +-
> >   7 files changed, 132 insertions(+), 136 deletions(-)
> > 
> 
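
The serialization scheme described in the cover letter can be illustrated
with a small userspace C model. This is only a sketch under stated
assumptions, not the amdgpu code: apart from the reset_domain name, every
identifier below (reset_work, reset_domain_queue, reset_domain_run, the
fixed-size array standing in for the ordered workqueue) is hypothetical, and
the model assumes that any reset that actually runs cancels the TDR work
still pending in the same domain.

```c
#include <stdbool.h>
#include <stddef.h>

/* Who asked for the reset (names are illustrative only). */
enum reset_source { RESET_TDR, RESET_SYSFS, RESET_RAS };

struct reset_work {
    enum reset_source source;
    bool pending;               /* queued and not yet run or cancelled */
};

#define MAX_WORK 8

/*
 * Hypothetical stand-in for a reset domain: one FIFO shared by every
 * entity that must be reset together. Draining it from a single loop
 * models an ordered (single-threaded) workqueue. The array is not
 * recycled, so at most MAX_WORK items can be queued over its lifetime.
 */
struct reset_domain {
    struct reset_work *queue[MAX_WORK];
    size_t head, tail;
    int resets_done;
};

/* Queue a reset request; refuse duplicates and overflow. */
static bool reset_domain_queue(struct reset_domain *d, struct reset_work *w)
{
    if (w->pending || d->tail == MAX_WORK)
        return false;
    w->pending = true;
    d->queue[d->tail++] = w;
    return true;
}

/* Perform one reset and cancel the TDR requests still waiting behind it. */
static void do_reset(struct reset_domain *d)
{
    for (size_t i = d->head; i < d->tail; i++)
        if (d->queue[i]->source == RESET_TDR)
            d->queue[i]->pending = false;
    d->resets_done++;
}

/* Drain the queue strictly in order; return the number of resets run. */
static int reset_domain_run(struct reset_domain *d)
{
    while (d->head < d->tail) {
        struct reset_work *w = d->queue[d->head++];

        if (!w->pending)
            continue;           /* cancelled by an earlier reset */
        w->pending = false;
        do_reset(d);
    }
    return d->resets_done;
}
```

In this model, queueing two TDR requests, one sysfs request and a third TDR
request, then draining the domain, performs two resets: the first TDR reset
cancels the other TDR requests, while the sysfs request still runs once the
in-flight reset has finished.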

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

