[PATCH 00/10] Support XGMI reset on init

Xu, Feifei Feifei.Xu at amd.com
Thu Sep 5 09:18:37 UTC 2024


[AMD Official Use Only - AMD Internal Distribution Only]

Patches 3-10:

Reviewed-by: Feifei Xu <Feifei.Xu at amd.com>

-----Original Message-----
From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Lijo Lazar
Sent: Monday, September 2, 2024 3:34 PM
To: amd-gfx at lists.freedesktop.org
Cc: Zhang, Hawking <Hawking.Zhang at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>
Subject: [PATCH 00/10] Support XGMI reset on init

There are cases where a device needs to be reset before it is fully initialized. One example is a driver reinstallation with a different version of PSP TOS. In such a case, if the device supports a reset in which PSP TOS is unloaded, the driver needs to reset the device first and then load the new firmware components.

For devices in an XGMI hive, the reset needs to be sent to all devices in the hive. Thus the driver should first discover the devices that belong to a hive with PSP support.
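The hive-wide requirement can be illustrated with a small user-space sketch. It is not the amdgpu code: `struct sketch_dev`, its `hive_id` field, and `sketch_mark_hive_for_reset()` are hypothetical names, and the convention that a hive ID of 0 means "not in a hive" is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a GPU device; not the amdgpu structure. */
struct sketch_dev {
	uint64_t hive_id;  /* assumed convention: 0 = not part of any hive */
	int needs_reset;   /* set by sketch_mark_hive_for_reset() */
};

/*
 * Mark the trigger device and every device sharing its hive as needing
 * a reset. Devices in other hives, or in no hive, are left untouched.
 * Returns the number of devices marked.
 */
static size_t sketch_mark_hive_for_reset(struct sketch_dev *devs, size_t n,
					 const struct sketch_dev *trigger)
{
	size_t marked = 0;

	assert(devs != NULL && trigger != NULL);

	for (size_t i = 0; i < n; i++) {
		/* A standalone trigger (hive_id == 0) matches only itself. */
		int same_hive = trigger->hive_id != 0 &&
				devs[i].hive_id == trigger->hive_id;

		if (same_hive || &devs[i] == trigger) {
			devs[i].needs_reset = 1;
			marked++;
		}
	}
	return marked;
}
```

Selecting by hive membership, rather than walking every registered device, is what keeps devices from a second hive or standalone devices out of the reset.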

There is an existing delayed reset handler, but it has the following limitations:
1) It doesn't discover devices in the hive; instead it tries to do an XGMI reset for all devices registered to the mgpu struct. The mgpu struct may contain devices other than those that belong to a hive. Also, it doesn't work if there is more than one hive.
2) It doesn't take a reset lock, and since this is a delayed reset, that could result in unwanted hardware accesses during a reset.
3) It doesn't initialize RAS properly (left as TODO).

This series overcomes the above limitations. Instead of marking a pending reset, it defines init levels that specify how far a device is initialized. In case of a pending reset, only specific hardware blocks may be initialized.

Further work (not done in this series) may add finer-grained control over init levels, for example skipping features such as DPM enablement, or skipping the loading of specific firmware that won't be required in a minimal init scenario where the device is going to be reset.

The series also adds an interface to check whether a PSP TOS reload is required.
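Such a check could take the shape of an optional per-generation callback, roughly as sketched below. This is a guess at the general pattern, not the interface added by the patches: the `sketch_*` names, the version fields, and the rule that a running version of 0 means "no TOS loaded" are all assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_psp;

/* Hypothetical per-generation callback table. */
struct sketch_psp_funcs {
	/* Optional: NULL when a generation provides no such check. */
	bool (*is_reload_needed)(const struct sketch_psp *psp);
};

struct sketch_psp {
	const struct sketch_psp_funcs *funcs;
	uint32_t running_tos_version; /* assumed: 0 = no TOS loaded */
	uint32_t driver_tos_version;  /* assumed: version the driver ships */
};

/* Example callback: a loaded TOS that differs from the driver's copy. */
static bool sketch_version_mismatch(const struct sketch_psp *psp)
{
	return psp->running_tos_version != 0 &&
	       psp->running_tos_version != psp->driver_tos_version;
}

static const struct sketch_psp_funcs sketch_funcs_with_check = {
	.is_reload_needed = sketch_version_mismatch,
};

/* Wrapper: report "no reload needed" when the callback is absent. */
static bool sketch_psp_is_reload_needed(const struct sketch_psp *psp)
{
	assert(psp != NULL);

	if (!psp->funcs || !psp->funcs->is_reload_needed)
		return false;
	return psp->funcs->is_reload_needed(psp);
}
```

With a wrapper of this kind, the reset-on-init path can ask one question regardless of hardware generation, and generations without the callback simply never request the reset.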


Lijo Lazar (10):
  drm/amdgpu: Add init levels
  drm/amdgpu: Use init level for pending_reset flag
  drm/amdgpu: Separate reinitialization after reset
  drm/amdgpu: Add reset on init handler for XGMI
  drm/amdgpu: Add helper to initialize badpage info
  drm/amdgpu: Refactor XGMI reset on init handling
  drm/amdgpu: Drop delayed reset work handler
  drm/amdgpu: Support reset-on-init on select SOCs
  drm/amdgpu: Add interface for TOS reload cases
  drm/amdgpu: Add PSP reload case to reset-on-init

 drivers/gpu/drm/amd/amdgpu/aldebaran.c        |   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  21 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 245 +++++++++++-------
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |  81 ------
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h       |   1 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c       |  13 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h       |   3 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c       |  62 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h       |   4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c     | 148 +++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h     |   4 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c      |  72 ++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h      |   2 +
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c         |  14 +-
 drivers/gpu/drm/amd/amdgpu/psp_v13_0.c        |  25 ++
 drivers/gpu/drm/amd/amdgpu/soc15.c            |   7 +
 .../gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c    |   3 +-
 17 files changed, 492 insertions(+), 214 deletions(-)

--
2.25.1


