[PATCH 0/5] Add work pool to reset domain

Lijo Lazar lijo.lazar at amd.com
Fri Aug 11 06:02:29 UTC 2023


Presently, reset has multiple clients: RAS, job timeout, KFD hang detection
and the debug method, and each client maintains its own work item. This series
moves the reset work pool into the reset domain instead. When a client makes a
recovery request, the reset domain allocates a work item and queues it for
execution. For job timeout, each ring has its own TDR queue to which TDR work
is scheduled. From there, the work is queued further to a reset domain chosen
based on the device configuration.
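
To illustrate the idea of a domain-owned work pool, here is a minimal
userspace sketch. It is not the code from this series: every name in it
(struct sketch_reset_domain, sketch_schedule_reset(), SKETCH_POOL_SIZE, and so
on) is hypothetical, and it is single-threaded, whereas a kernel implementation
would presumably rely on workqueues and locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_POOL_SIZE 4 /* hypothetical pool capacity */

/* Clients named in the cover letter; the enum itself is an assumption. */
enum sketch_reset_source { SRC_RAS, SRC_JOB_TIMEOUT, SRC_KFD_HANG, SRC_DEBUG };

struct sketch_reset_work {
	bool in_use;
	enum sketch_reset_source source;
	void (*handler)(struct sketch_reset_work *work);
};

struct sketch_reset_domain {
	struct sketch_reset_work pool[SKETCH_POOL_SIZE];   /* owned by the domain */
	struct sketch_reset_work *queue[SKETCH_POOL_SIZE]; /* FIFO of pending items */
	size_t head, count;
};

/* Claim a free item from the domain's pool and queue it.
 * Returns false when every item in the pool is already in use. */
static bool sketch_schedule_reset(struct sketch_reset_domain *d,
				  enum sketch_reset_source src,
				  void (*handler)(struct sketch_reset_work *))
{
	assert(d && handler);
	for (size_t i = 0; i < SKETCH_POOL_SIZE; i++) {
		struct sketch_reset_work *w = &d->pool[i];

		if (w->in_use)
			continue;
		w->in_use = true;
		w->source = src;
		w->handler = handler;
		d->queue[(d->head + d->count) % SKETCH_POOL_SIZE] = w;
		d->count++;
		return true;
	}
	return false;
}

/* Run all queued items in FIFO order and return them to the pool. */
static int sketch_run_pending(struct sketch_reset_domain *d)
{
	int ran = 0;

	while (d->count > 0) {
		struct sketch_reset_work *w = d->queue[d->head];

		d->head = (d->head + 1) % SKETCH_POOL_SIZE;
		d->count--;
		w->handler(w);
		w->in_use = false;
		ran++;
	}
	return ran;
}

/* Example handler: only counts how many resets were performed. */
static int sketch_resets_done;

static void sketch_count_reset(struct sketch_reset_work *w)
{
	(void)w;
	sketch_resets_done++;
}
```

In this sketch a client calls sketch_schedule_reset() with its source and
handler rather than embedding a work item of its own; the item goes back to the
pool once sketch_run_pending() has executed it.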

This allows flexibility to have multiple reset domains. For example, when
there are partitions, each partition can maintain its own reset domain and a job
timeout on one partition doesn't affect jobs on the other partition (when the
jobs don't have any interdependency). The reset logic will select the
appropriate reset domain based on the current device configuration.
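
The domain selection described above could look roughly like the following
userspace sketch. It assumes one device-wide domain when the device is
unpartitioned and one domain per partition otherwise; the structures and
function names (struct sketch_device, sketch_select_domain(),
sketch_report_job_timeout()) are hypothetical and do not come from the patches.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_PARTITIONS 4 /* hypothetical limit */

struct sketch_domain {
	int pending_resets; /* stands in for queued reset work */
};

struct sketch_device {
	int num_partitions;                 /* 0 means unpartitioned */
	struct sketch_domain device_domain; /* used when there are no partitions */
	struct sketch_domain partition_domain[SKETCH_MAX_PARTITIONS];
};

/* Pick the reset domain from the current device configuration. */
static struct sketch_domain *sketch_select_domain(struct sketch_device *dev,
						  int partition)
{
	assert(dev);
	if (dev->num_partitions == 0)
		return &dev->device_domain;
	assert(partition >= 0 && partition < dev->num_partitions);
	return &dev->partition_domain[partition];
}

/* A job timeout is recorded only in the domain selected for its partition. */
static void sketch_report_job_timeout(struct sketch_device *dev, int partition)
{
	sketch_select_domain(dev, partition)->pending_resets++;
}
```

With two partitions configured, a timeout reported for partition 0 leaves the
domain of partition 1 untouched, which is the isolation the cover letter
describes for jobs without interdependency.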

Lijo Lazar (5):
  drm/amdgpu: Add work pool to reset domain
  drm/amdgpu: Move to reset_schedule_work
  drm/amdgpu: Set flags to cancel all pending resets
  drm/amdgpu: Add API to queue and do reset work
  drm/amdgpu: Add TDR queue for ring

 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   2 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  32 +++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |   1 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  24 +---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  40 +++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  16 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    |  71 ++++++------
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  | 122 ++++++++++++++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  |  32 +++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c   |   5 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |   1 -
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |  38 +++----
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      |  44 ++++----
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |  33 +++---
 15 files changed, 285 insertions(+), 177 deletions(-)

-- 
2.25.1


