[PATCH v3 0/7] Fix multiple GPU resets in XGMI hive.
Andrey Grodzovsky
andrey.grodzovsky at amd.com
Wed May 25 19:04:40 UTC 2022
Problem:
During a hive reset caused by a command timing out on a ring,
extra resets are triggered by KFD, which is unable to access
registers on the resetting ASIC.
Fix: Rework GPU reset to actively stop any pending reset
works while another reset is in progress.
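The idea can be illustrated with a small user-space model. This is only a
sketch under stated assumptions, not the driver code from this series: every
name in it (toy_work, toy_reset_domain, toy_queue_reset, toy_cancel_work,
toy_run_reset, toy_drain) is hypothetical, and the "pending" flag merely
imitates what the kernel's cancel_work() reports for a real work_struct.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reset sources, loosely named after those in this series. */
enum reset_source { SRC_JOB_TIMEOUT, SRC_KFD, SRC_RAS, SRC_DEBUGFS, SRC_COUNT };

/* Toy stand-in for a queued work item; NOT the kernel's struct work_struct. */
struct toy_work {
	bool pending;
};

/* Toy reset domain: one work item per source plus a reset counter. */
struct toy_reset_domain {
	struct toy_work work[SRC_COUNT];
	int resets_done;
};

/* Queue a reset request; returns false if this source already has one pending. */
static bool toy_queue_reset(struct toy_reset_domain *d, enum reset_source s)
{
	assert(s < SRC_COUNT);
	if (d->work[s].pending)
		return false;
	d->work[s].pending = true;
	return true;
}

/* Drop a pending item and report whether it was pending (assumed semantics). */
static bool toy_cancel_work(struct toy_work *w)
{
	bool was_pending = w->pending;

	w->pending = false;
	return was_pending;
}

/*
 * Run the reset requested by source s, then stop every other pending request
 * so that it cannot trigger a redundant reset afterwards.
 * Returns the number of cancelled requests, or -1 if s had nothing pending.
 */
static int toy_run_reset(struct toy_reset_domain *d, enum reset_source s)
{
	int cancelled = 0;

	assert(s < SRC_COUNT);
	if (!d->work[s].pending)
		return -1;

	d->work[s].pending = false;
	d->resets_done++;

	for (int i = 0; i < SRC_COUNT; i++)
		if (toy_cancel_work(&d->work[i]))
			cancelled++;

	return cancelled;
}

/* Process the queue in source order; returns the total number of resets run. */
static int toy_drain(struct toy_reset_domain *d)
{
	for (int i = 0; i < SRC_COUNT; i++)
		if (d->work[i].pending)
			toy_run_reset(d, (enum reset_source)i);

	return d->resets_done;
}
```

In this model, queuing requests from several sources and then draining the
queue performs a single reset, because the first reset that runs cancels the
rest. Without the cancel loop, each queued source would run its own reset.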
v2: Switch from the generic list used in v1 [1] to explicit
stopping of each reset request from each reset source,
per request submitter.
v3: Switch back to work_struct from delayed_work (Christian)
[1] - https://lore.kernel.org/all/20220504161841.24669-1-andrey.grodzovsky@amd.com/
Andrey Grodzovsky (7):
  Revert "workqueue: remove unused cancel_work()"
  drm/amdgpu: Cache result of last reset at reset domain level.
  drm/admgpu: Serialize RAS recovery work directly into reset domain
    queue.
  drm/amdgpu: Add work_struct for GPU reset from debugfs
  drm/amdgpu: Add work_struct for GPU reset from kfd.
  drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to
    amdgpu_device_gpu_recover
  drm/amdgpu: Stop any pending reset if another in progress.
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 4 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 +++++-
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 1 +
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 62 +++++++++++-----------
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 19 ++++++-
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 4 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 1 +
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 1 +
drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 2 +-
drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 2 +-
drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c | 2 +-
include/linux/workqueue.h | 1 +
kernel/workqueue.c | 9 ++++
14 files changed, 84 insertions(+), 41 deletions(-)
--
2.25.1