[RFC v3 00/12] Define and use reset domain for GPU recovery in amdgpu

Andrey Grodzovsky andrey.grodzovsky at amd.com
Wed Feb 9 16:08:30 UTC 2022


Thanks a lot!

Andrey

On 2022-02-09 01:06, JingWen Chen wrote:
> Hi Andrey,
>
> I have been testing your patch and it seems fine so far.
>
> Best Regards,
>
> Jingwen Chen
>
> On 2022/2/3 2:57 AM, Andrey Grodzovsky wrote:
>> Just another ping. With Shyun's help I was able to do some smoke testing on an XGMI SRIOV system (booting and triggering a hive reset),
>> and so far it looks good.
>>
>> Andrey
>>
>> On 2022-01-28 14:36, Andrey Grodzovsky wrote:
>>> Just a gentle ping in case people have more comments on this patch set, especially the last 5 patches,
>>> as the first 7 are exactly the same as V2 and we have already mostly gone over them.
>>>
>>> Andrey
>>>
>>> On 2022-01-25 17:37, Andrey Grodzovsky wrote:
>>>> This patchset is based on earlier work by Boris [1] that allowed having an
>>>> ordered workqueue at the driver level, to be used by the different
>>>> schedulers to queue their timeout work. On top of that I also serialized
>>>> any GPU reset we trigger from within amdgpu code through the same
>>>> ordered wq, which somewhat simplifies our GPU reset code since we no longer
>>>> need to protect against concurrency between multiple GPU reset triggers such as
>>>> TDR on one hand and sysfs or RAS triggers on the other.
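>>>>
>>>> A minimal sketch of the idea (illustrative, not the exact driver code):
>>>> one ordered workqueue per reset domain, which every reset source then
>>>> funnels through:
>>>>
>>>>     #include <linux/workqueue.h>
>>>>
>>>>     /* Ordered (single-threaded) wq: all reset/timeout work queued
>>>>      * here runs strictly one at a time, which is what serializes
>>>>      * TDR, sysfs and RAS triggered resets against each other. */
>>>>     static struct workqueue_struct *reset_domain_wq;
>>>>
>>>>     static int reset_domain_wq_init(void)
>>>>     {
>>>>             reset_domain_wq = alloc_ordered_workqueue("amdgpu-reset-dom", 0);
>>>>             return reset_domain_wq ? 0 : -ENOMEM;
>>>>     }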
>>>>
>>>> As advised by Christian and Daniel, I defined a reset_domain struct such that
>>>> all the entities that go through reset together are serialized against
>>>> one another.
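>>>>
>>>> Roughly, the struct looks like this (the field set is illustrative; the
>>>> sem/in_gpu_reset/refcount members are the v3 additions described below):
>>>>
>>>>     struct amdgpu_reset_domain {
>>>>             struct kref refcount;           /* v3: devices hold refs */
>>>>             struct workqueue_struct *wq;    /* ordered wq from above */
>>>>             struct rw_semaphore sem;        /* v3: was adev->reset_sem */
>>>>             atomic_t in_gpu_reset;          /* v3: was adev->in_gpu_reset */
>>>>     };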
>>>>
>>>> A TDR triggered by multiple entities within the same domain for the same reason
>>>> will run only once, since the first such reset cancels all the pending ones. This
>>>> applies only to TDR timers and not to triggered resets coming from RAS or sysfs;
>>>> those will still happen after the in-flight resets finish.
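>>>>
>>>> A hypothetical sketch of how the first reset swallows the duplicates
>>>> (the helper name and exact cancellation call are illustrative):
>>>>
>>>>     /* Called by the reset that won the race: any TDR work still
>>>>      * sitting in the ordered queue behind us is simply cancelled,
>>>>      * since we are about to recover all schedulers anyway. */
>>>>     static void stop_pending_timeouts(struct amdgpu_device *adev)
>>>>     {
>>>>             int i;
>>>>
>>>>             for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>                     struct amdgpu_ring *ring = adev->rings[i];
>>>>
>>>>                     if (!ring || !ring->sched.thread)
>>>>                             continue;
>>>>                     cancel_delayed_work(&ring->sched.work_tdr);
>>>>             }
>>>>     }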
>>>>
>>>> v2:
>>>> Add handling for the SRIOV configuration: the reset notification coming from the host
>>>> already triggers a driver work queue to handle the reset, so drop this
>>>> intermediate wq and send the work directly to the timeout wq. (Shaoyun)
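>>>>
>>>> The v2 change for SRIOV boils down to queuing the host's FLR work
>>>> straight onto the domain's timeout wq (sketch, reusing the wq above):
>>>>
>>>>     /* No intermediate wq anymore: the FLR handler now serializes
>>>>      * with TDRs automatically via the same ordered queue. */
>>>>     queue_work(reset_domain_wq, &adev->virt.flr_work);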
>>>>
>>>> v3:
>>>> Lijo suggested putting 'adev->in_gpu_reset' in the amdgpu_reset_domain struct.
>>>> I followed his advice and also moved adev->reset_sem into the same place. This
>>>> in turn required some follow-up refactoring of the original patches,
>>>> where I decoupled the amdgpu_reset_domain life cycle from the XGMI hive, because the hive is
>>>> destroyed and reconstructed when resetting the devices in the XGMI hive during probe for SRIOV (see [2]),
>>>> while we need the reset sem and gpu_reset flag to always be present. This was attained
>>>> by adding a refcount to amdgpu_reset_domain, so each device can safely point to it for as long as
>>>> it needs to.
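>>>>
>>>> The refcounting itself is plain kref; a sketch with illustrative
>>>> helper names:
>>>>
>>>>     static void reset_domain_free(struct kref *ref)
>>>>     {
>>>>             struct amdgpu_reset_domain *domain =
>>>>                     container_of(ref, struct amdgpu_reset_domain, refcount);
>>>>
>>>>             destroy_workqueue(domain->wq);
>>>>             kfree(domain);
>>>>     }
>>>>
>>>>     /* Each device takes a reference when it starts pointing at the
>>>>      * domain and drops it on teardown, so the domain survives the
>>>>      * XGMI hive being destroyed and rebuilt. */
>>>>     void reset_domain_put(struct amdgpu_reset_domain *domain)
>>>>     {
>>>>             kref_put(&domain->refcount, reset_domain_free);
>>>>     }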
>>>>
>>>>
>>>> [1] https://patchwork.kernel.org/project/dri-devel/patch/20210629073510.2764391-3-boris.brezillon@collabora.com/
>>>> [2] https://www.spinics.net/lists/amd-gfx/msg58836.html
>>>>
>>>> P.S. This goes through drm-misc-next and not amd-staging-drm-next, as Boris's work hasn't landed there yet.
>>>>
>>>> P.P.S. Patches 8-12 are the refactoring on top of the original V2 patchset.
>>>>
>>>> P.P.P.S. I haven't yet been able to test the reworked code on an XGMI SRIOV system because drm-misc-next fails to load there.
>>>> I would appreciate it if jingwech could try it on his system as he did with V2.
>>>>
>>>> Andrey Grodzovsky (12):
>>>>     drm/amdgpu: Introduce reset domain
>>>>     drm/amdgpu: Move scheduler init to after XGMI is ready
>>>>     drm/amdgpu: Fix crash on modprobe
>>>>     drm/amdgpu: Serialize non TDR gpu recovery with TDRs
>>>>     drm/amd/virt: For SRIOV send GPU reset directly to TDR queue.
>>>>     drm/amdgpu: Drop hive->in_reset
>>>>     drm/amdgpu: Drop concurrent GPU reset protection for device
>>>>     drm/amdgpu: Rework reset domain to be refcounted.
>>>>     drm/amdgpu: Move reset sem into reset_domain
>>>>     drm/amdgpu: Move in_gpu_reset into reset_domain
>>>>     drm/amdgpu: Rework amdgpu_device_lock_adev
>>>>     Revert 'drm/amdgpu: annotate a false positive recursive locking'
>>>>
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  15 +-
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c   |  10 +-
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 275 ++++++++++--------
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c     |  43 +--
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |   2 +-
>>>>    .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c    |  18 +-
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c     |  39 +++
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h     |  12 +
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h      |   2 +
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c      |  24 +-
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h      |   3 +-
>>>>    drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c        |   6 +-
>>>>    drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c         |  14 +-
>>>>    drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c         |  19 +-
>>>>    drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c         |  19 +-
>>>>    drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c         |  11 +-
>>>>    16 files changed, 313 insertions(+), 199 deletions(-)
>>>>

