[PATCH v2 0/7] Fix multiple GPU resets in XGMI hive.

Christian König christian.koenig at amd.com
Wed May 18 06:07:10 UTC 2022


On 17.05.22 at 21:20, Andrey Grodzovsky wrote:
> Problem:
> During a hive reset caused by a command timing out on a ring,
> extra resets are triggered by KFD, which is
> unable to access registers on the resetting ASIC.
>
> Fix: Rework GPU reset to actively stop any pending reset
> works while another reset is in progress.
>
> v2: Switch from the generic list used in v1[1] to explicit
> stopping of each reset request from each reset source,
> per request submitter.

Looks mostly good to me.

Apart from the naming nitpick on patch #1, the only thing I couldn't
figure out offhand is why you are using a delayed work everywhere instead
of just a work item.

That needs a bit more explanation of what's happening here.

Christian.

>
> [1] - https://lore.kernel.org/all/20220504161841.24669-1-andrey.grodzovsky@amd.com/
>
> Andrey Grodzovsky (7):
>    drm/amdgpu: Cache result of last reset at reset domain level.
>    drm/amdgpu: Switch to delayed work from work_struct.
>    drm/admgpu: Serialize RAS recovery work directly into reset domain
>      queue.
>    drm/amdgpu: Add delayed work for GPU reset from debugfs
>    drm/amdgpu: Add delayed work for GPU reset from kfd.
>    drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to
>      amdgpu_device_gpu_recover
>    drm/amdgpu: Stop any pending reset if another in progress.
>
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  4 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 +++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 62 +++++++++++-----------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 19 ++++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 10 ++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h    |  2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  |  1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  |  5 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  2 +-
>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |  6 +--
>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      |  6 +--
>   drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |  6 +--
>   14 files changed, 87 insertions(+), 54 deletions(-)
>


