[RFC v3 00/12] Define and use reset domain for GPU recovery in amdgpu
Andrey Grodzovsky
andrey.grodzovsky at amd.com
Wed Feb 2 18:57:44 UTC 2022
Just another ping. With Shyun's help I was able to do some smoke testing
on an XGMI SRIOV system (booting and triggering a hive reset),
and for now it looks good.
Andrey
On 2022-01-28 14:36, Andrey Grodzovsky wrote:
> Just a gentle ping in case people have more comments on this patch set.
> Especially on the last 5 patches,
> as the first 7 are exactly the same as in V2 and we already went over them mostly.
>
> Andrey
>
> On 2022-01-25 17:37, Andrey Grodzovsky wrote:
>> This patchset is based on earlier work by Boris [1] that allowed
>> having an ordered workqueue at the driver level that will be used by the
>> different schedulers to queue their timeout work. On top of that I also
>> serialized any GPU reset we trigger from within amdgpu code to also go
>> through the same ordered wq, which somewhat simplifies our GPU reset code:
>> we no longer need to protect against concurrency between multiple GPU
>> reset triggers, such as TDR on one hand and a sysfs trigger or RAS
>> trigger on the other.
>>
>> As advised by Christian and Daniel, I defined a reset_domain struct such
>> that all the entities that go through reset together are serialized
>> against one another.
>>
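To make the serialization idea concrete, here is a minimal userspace sketch,
not the amdgpu code. It assumes a single FIFO per domain standing in for the
ordered wq; every name in it (demo_reset_domain, demo_domain_queue,
demo_domain_drain) is hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_WORK 8

/* Who asked for the reset (illustrative names only). */
enum demo_reset_source { DEMO_SRC_TDR, DEMO_SRC_SYSFS, DEMO_SRC_RAS };

struct demo_reset_work {
    enum demo_reset_source source;
    int device_id;
};

/* Hypothetical stand-in for a reset domain: one FIFO shared by all members. */
struct demo_reset_domain {
    struct demo_reset_work pending[DEMO_MAX_WORK];
    size_t count;   /* number of queued requests */
    int in_reset;   /* set while a reset handler runs */
    int completed;  /* resets executed so far */
};

/* Queue a reset request; returns 0 on success, -1 if the queue is full. */
static int demo_domain_queue(struct demo_reset_domain *d,
                             enum demo_reset_source src, int device_id)
{
    if (d->count == DEMO_MAX_WORK)
        return -1;
    d->pending[d->count].source = src;
    d->pending[d->count].device_id = device_id;
    d->count++;
    return 0;
}

/*
 * Run all queued requests strictly one after another, in arrival order.
 * order_out must have room for DEMO_MAX_WORK entries; it receives the
 * source of each executed request. Returns the number executed.
 */
static size_t demo_domain_drain(struct demo_reset_domain *d,
                                enum demo_reset_source *order_out)
{
    size_t i, n = d->count;

    for (i = 0; i < n; i++) {
        assert(!d->in_reset);                 /* never two resets at once */
        d->in_reset = 1;
        order_out[i] = d->pending[i].source;  /* "perform" the reset */
        d->completed++;
        d->in_reset = 0;
    }
    d->count = 0;
    return n;
}
```

Because every trigger goes through the same queue, the handler itself needs
no extra locking against other triggers in this model; the ordering does the
work.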
>> A TDR triggered by multiple entities within the same domain for the same
>> reason will not run more than once, as the first such reset cancels all
>> the pending resets. This is relevant only to TDR timers and not to
>> triggered resets coming from RAS or SYSFS; those will still happen after
>> the in-flight reset finishes.
>>
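The cancellation rule can be modelled roughly as follows. This is a userspace
sketch under the assumption that only a TDR-triggered reset drops the other
pending TDR entries; the helper names (demo_cancel_pending_tdr,
demo_process) are hypothetical and are not functions from the patches.

```c
#include <assert.h>
#include <stddef.h>

enum demo_trigger { DEMO_TRIG_TDR, DEMO_TRIG_SYSFS, DEMO_TRIG_RAS };

/*
 * Drop every pending TDR entry, keeping sysfs/RAS requests in their
 * original order. Returns the new number of pending entries.
 */
static size_t demo_cancel_pending_tdr(enum demo_trigger *pending, size_t n)
{
    size_t i, kept = 0;

    for (i = 0; i < n; i++)
        if (pending[i] != DEMO_TRIG_TDR)
            pending[kept++] = pending[i];
    return kept;
}

/*
 * Process the pending list front to back. When the request being handled
 * is a TDR, the remaining TDR entries are cancelled (assumption of this
 * model). Returns how many resets were actually performed.
 */
static size_t demo_process(enum demo_trigger *pending, size_t n)
{
    size_t performed = 0;

    while (n > 0) {
        enum demo_trigger cur = pending[0];
        size_t i;

        /* Pop the head of the list. */
        for (i = 1; i < n; i++)
            pending[i - 1] = pending[i];
        n--;

        if (cur == DEMO_TRIG_TDR)
            n = demo_cancel_pending_tdr(pending, n);
        performed++;
    }
    return performed;
}
```

With pending requests TDR, TDR, SYSFS, TDR, RAS, this model performs three
resets: the first TDR, then the sysfs and RAS requests.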
>> v2:
>> Add handling of the SRIOV configuration: the reset notify coming from the
>> host and the driver already triggers a work queue to handle the reset, so
>> drop this intermediate wq and send directly to the timeout wq. (Shaoyun)
>>
>> v3:
>> Lijo suggested putting 'adev->in_gpu_reset' in the amdgpu_reset_domain
>> struct. I followed his advice and also moved adev->reset_sem into the same
>> place. This in turn required some follow-up refactoring of the original
>> patches, where I decoupled the amdgpu_reset_domain life cycle from the
>> XGMI hive: the hive is destroyed and reconstructed when the devices in
>> the XGMI hive are reset during probe for SRIOV (see [2]), while we need
>> the reset sem and the gpu_reset flag to always be present. This was
>> achieved by adding a refcount to amdgpu_reset_domain so each device can
>> safely point to it for as long as it needs.
>>
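The lifetime decoupling can be illustrated with a small userspace sketch. It
uses C11 atomics rather than any kernel primitive, and all names
(demo_refcounted_domain, demo_domain_get, demo_domain_put,
demo_domains_freed) are hypothetical. It only shows why a refcount lets the
domain outlive the hive that created it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical refcounted domain; in_gpu_reset stands in for shared state. */
struct demo_refcounted_domain {
    atomic_int refcount;
    int in_gpu_reset;
};

/* Counts freed domains so the example can observe destruction. */
static int demo_domains_freed;

/* Allocate a domain holding one reference for its creator (e.g. a hive). */
static struct demo_refcounted_domain *demo_domain_create(void)
{
    struct demo_refcounted_domain *d = calloc(1, sizeof(*d));

    if (d)
        atomic_init(&d->refcount, 1);
    return d;
}

/* Take an additional reference, e.g. when a device starts pointing at it. */
static void demo_domain_get(struct demo_refcounted_domain *d)
{
    atomic_fetch_add(&d->refcount, 1);
}

/* Drop a reference; the last holder frees the domain. */
static void demo_domain_put(struct demo_refcounted_domain *d)
{
    if (atomic_fetch_sub(&d->refcount, 1) == 1) {
        free(d);
        demo_domains_freed++;
    }
}
```

In this model the creator can drop its reference (the hive being torn down)
while devices still hold theirs, and the flag inside the domain stays valid
until the last device calls demo_domain_put().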
>>
>> [1]
>> https://patchwork.kernel.org/project/dri-devel/patch/20210629073510.2764391-3-boris.brezillon@collabora.com/
>> [2] https://www.spinics.net/lists/amd-gfx/msg58836.html
>>
>> P.S. Going through drm-misc-next and not amd-staging-drm-next, as Boris's
>> work hasn't landed there yet.
>>
>> P.P.S Patches 8-12 are the refactor on top of the original V2 patchset.
>>
>> P.P.P.S I haven't yet been able to test the reworked code on an XGMI SRIOV
>> system because drm-misc-next fails to load there.
>> I would appreciate it if jingwech could try it on his system, like he
>> tested V2.
>>
>> Andrey Grodzovsky (12):
>> drm/amdgpu: Introduce reset domain
>> drm/amdgpu: Move scheduler init to after XGMI is ready
>> drm/amdgpu: Fix crash on modprobe
>> drm/amdgpu: Serialize non TDR gpu recovery with TDRs
>> drm/amd/virt: For SRIOV send GPU reset directly to TDR queue.
>> drm/amdgpu: Drop hive->in_reset
>> drm/amdgpu: Drop concurrent GPU reset protection for device
>> drm/amdgpu: Rework reset domain to be refcounted.
>> drm/amdgpu: Move reset sem into reset_domain
>> drm/amdgpu: Move in_gpu_reset into reset_domain
>> drm/amdgpu: Rework amdgpu_device_lock_adev
>> Revert 'drm/amdgpu: annotate a false positive recursive locking'
>>
>> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 15 +-
>> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 10 +-
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 275 ++++++++++--------
>> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 43 +--
>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
>> .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 18 +-
>> drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 39 +++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 12 +
>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 2 +
>> drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 24 +-
>> drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h | 3 +-
>> drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 6 +-
>> drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 14 +-
>> drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 19 +-
>> drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 19 +-
>> drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c | 11 +-
>> 16 files changed, 313 insertions(+), 199 deletions(-)
>>