[PATCH 00/10] Support XGMI reset on init

Lijo Lazar lijo.lazar at amd.com
Mon Sep 2 07:34:07 UTC 2024


There are cases where a device needs to be reset before it is fully
initialized. One example is a driver reinstallation with a different version
of PSP TOS. In such a case, if the device supports a reset in which PSP TOS is
unloaded, the driver needs to reset the device first and then load the new
firmware components.

For devices in an XGMI hive, the reset needs to be sent to all devices in the
hive. Thus the driver should first discover the devices that belong to a hive
with PSP support.

There is an existing delayed reset handler; however, it has the following
limitations:
1) It doesn't discover the devices in the hive. Instead, it tries to do an XGMI
reset for all devices registered to the mgpu struct. The mgpu struct may hold
devices other than the ones that belong to a hive. Also, it doesn't work if
there is more than one hive.
2) It doesn't take a reset lock, and since this is a delayed reset, that could
result in unwanted hardware accesses during a reset.
3) It doesn't initialize RAS properly (left as a TODO).
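
To illustrate points 1) and 2), here is a minimal user-space sketch of a
per-hive reset: it selects only the devices whose hive ID matches and resets
them while a lock is held. All names (gpu_dev, hive_lock, collect_hive,
reset_hive, MAX_HIVE_DEVS) are hypothetical and are not the identifiers used
in this series; the lock is a plain flag standing in for a real reset lock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_HIVE_DEVS 8 /* hypothetical upper bound for this sketch */

/* Hypothetical device record; hive_id == 0 means "not part of a hive". */
struct gpu_dev {
	uint64_t hive_id;
	int reset_count;
};

/* Stand-in for a real reset lock; only tracks whether it is held. */
struct hive_lock {
	int held;
};

/* Collect pointers to the devices that belong to the given hive. */
static size_t collect_hive(struct gpu_dev *all, size_t n, uint64_t hive_id,
			   struct gpu_dev **out, size_t out_cap)
{
	size_t found = 0;

	assert(hive_id != 0); /* devices outside a hive are never grouped */
	for (size_t i = 0; i < n; i++) {
		if (all[i].hive_id == hive_id && found < out_cap)
			out[found++] = &all[i];
	}
	return found;
}

/* Reset every member of one hive while the lock is held. */
static size_t reset_hive(struct gpu_dev *all, size_t n, uint64_t hive_id,
			 struct hive_lock *lock)
{
	struct gpu_dev *members[MAX_HIVE_DEVS];
	size_t count = collect_hive(all, n, hive_id, members, MAX_HIVE_DEVS);

	lock->held = 1; /* block other hardware access during the reset */
	for (size_t i = 0; i < count; i++)
		members[i]->reset_count++; /* placeholder for the real reset */
	lock->held = 0;

	return count;
}
```

Because the grouping is keyed on the hive ID, calling reset_hive() once per
distinct ID would cover a system with more than one hive, and devices that are
not in any hive are left untouched.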

This series overcomes the above limitations. Instead of marking a pending reset,
it defines init levels that specify how far a device is initialized. When a
reset is pending, only specific hardware blocks are initialized.
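
The init level idea can be sketched as a table that maps each level to a mask
of hardware blocks to bring up. This is a stand-alone illustration under
assumptions: the enum values, the struct layout, the function names, and the
particular set of blocks in the minimal level are all hypothetical and are not
taken from the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hardware block list for this sketch. */
enum hw_block {
	BLK_COMMON,
	BLK_GMC,
	BLK_IH,
	BLK_PSP,
	BLK_SMU,
	BLK_GFX,
	BLK_SDMA,
	BLK_COUNT
};

#define BLK_BIT(b) (1u << (b))

enum init_level_id {
	INIT_LEVEL_DEFAULT, /* full initialization */
	INIT_LEVEL_MINIMAL  /* just enough to perform a reset */
};

struct init_level {
	enum init_level_id id;
	unsigned int hw_block_mask; /* bit n set: initialize block n */
};

static const struct init_level init_levels[] = {
	{ INIT_LEVEL_DEFAULT, (1u << BLK_COUNT) - 1 },
	/* Assumed minimal set; the real choice is up to the driver. */
	{ INIT_LEVEL_MINIMAL, BLK_BIT(BLK_COMMON) | BLK_BIT(BLK_GMC) |
			      BLK_BIT(BLK_IH) | BLK_BIT(BLK_PSP) |
			      BLK_BIT(BLK_SMU) },
};

/* Pick the level: minimal if a reset is pending, default otherwise. */
static const struct init_level *select_init_level(bool reset_pending)
{
	enum init_level_id want =
		reset_pending ? INIT_LEVEL_MINIMAL : INIT_LEVEL_DEFAULT;

	for (size_t i = 0; i < sizeof(init_levels) / sizeof(init_levels[0]); i++) {
		if (init_levels[i].id == want)
			return &init_levels[i];
	}
	return &init_levels[0]; /* fall back to full initialization */
}

/* Should this block be initialized at the given level? */
static bool should_init_block(const struct init_level *lvl, enum hw_block blk)
{
	assert(blk < BLK_COUNT);
	return (lvl->hw_block_mask & BLK_BIT(blk)) != 0;
}
```

Compared with a single pending-reset flag, a level with a block mask leaves
room for more levels or finer control later without adding another flag for
each case.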

Further work (not done in this series) may add fine-grained control over init
levels, for example skipping feature enablement such as DPM, or skipping the
loading of specific firmware that is not required in a minimal init scenario
where the device is going to be reset.

The series adds an API interface to check if a PSP TOS reload is required.


Lijo Lazar (10):
  drm/amdgpu: Add init levels
  drm/amdgpu: Use init level for pending_reset flag
  drm/amdgpu: Separate reinitialization after reset
  drm/amdgpu: Add reset on init handler for XGMI
  drm/amdgpu: Add helper to initialize badpage info
  drm/amdgpu: Refactor XGMI reset on init handling
  drm/amdgpu: Drop delayed reset work handler
  drm/amdgpu: Support reset-on-init on select SOCs
  drm/amdgpu: Add interface for TOS reload cases
  drm/amdgpu: Add PSP reload case to reset-on-init

 drivers/gpu/drm/amd/amdgpu/aldebaran.c        |   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  21 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 245 +++++++++++-------
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |  81 ------
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h       |   1 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c       |  13 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h       |   3 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c       |  62 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h       |   4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c     | 148 +++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h     |   4 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c      |  72 ++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h      |   2 +
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c         |  14 +-
 drivers/gpu/drm/amd/amdgpu/psp_v13_0.c        |  25 ++
 drivers/gpu/drm/amd/amdgpu/soc15.c            |   7 +
 .../gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c    |   3 +-
 17 files changed, 492 insertions(+), 214 deletions(-)

-- 
2.25.1


