[PATCH 00/10] Support XGMI reset on init

Christian König ckoenig.leichtzumerken at gmail.com
Wed Sep 4 12:47:59 UTC 2024


On 02.09.24 at 09:34, Lijo Lazar wrote:
> There are cases where a device needs to be reset before it is fully
> initialized. One example is a driver reinstallation with a different version
> of PSP TOS. In such a case, if the device supports a reset in which PSP TOS is
> unloaded, the driver needs to reset the device first and then load the new
> firmware components.
>
> For devices in an XGMI hive, a reset needs to be sent to all devices in the
> hive. Thus the driver should first discover the devices that belong to a hive
> with PSP support.
>
> There is an existing delayed reset handler; however, it has the following
> limitations:
> 1) It doesn't discover devices in the hive; instead it tries to do an XGMI
> reset for all devices registered to the mgpu struct. The mgpu struct may hold
> devices other than the ones that belong to a hive. Also, if there is more than
> one hive, it doesn't work.
> 2) It doesn't take a reset lock and since this is a delayed reset, that could
> result in unwanted hardware accesses during a reset.
> 3) It doesn't initialize RAS properly (left as TODO)
>
> This series overcomes the above limitations. Instead of marking a pending reset,
> it defines init levels that specify how far a device is initialized. In
> case of a pending reset, only specific hardware blocks may be initialized.
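
A minimal sketch of the init-level idea described above, for readers
following the thread. Every identifier here (hw_block, init_level_id,
init_level_needs_block) and the choice of blocks in the minimal level are
assumptions made for illustration; they are not taken from the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical hardware block IDs; not the driver's real IP block types. */
enum hw_block {
	HW_BLOCK_COMMON = 0,
	HW_BLOCK_GMC,
	HW_BLOCK_IH,
	HW_BLOCK_PSP,
	HW_BLOCK_SMU,
	HW_BLOCK_GFX,
	HW_BLOCK_SDMA,
	HW_BLOCK_COUNT
};

/* Hypothetical init levels: full init, or just enough to issue a reset. */
enum init_level_id {
	INIT_LEVEL_DEFAULT = 0,
	INIT_LEVEL_MINIMAL_FOR_RESET,
	INIT_LEVEL_COUNT
};

#define BLOCK_BIT(b) (1u << (b))

struct init_level {
	enum init_level_id id;
	uint32_t hw_block_mask;	/* blocks that get hardware init at this level */
};

static const struct init_level init_levels[INIT_LEVEL_COUNT] = {
	[INIT_LEVEL_DEFAULT] = {
		.id = INIT_LEVEL_DEFAULT,
		/* All blocks: every bit below HW_BLOCK_COUNT is set. */
		.hw_block_mask = BLOCK_BIT(HW_BLOCK_COUNT) - 1u,
	},
	[INIT_LEVEL_MINIMAL_FOR_RESET] = {
		.id = INIT_LEVEL_MINIMAL_FOR_RESET,
		/* Assumed subset; which blocks a reset really needs is device specific. */
		.hw_block_mask = BLOCK_BIT(HW_BLOCK_COMMON) | BLOCK_BIT(HW_BLOCK_GMC) |
				 BLOCK_BIT(HW_BLOCK_IH) | BLOCK_BIT(HW_BLOCK_PSP) |
				 BLOCK_BIT(HW_BLOCK_SMU),
	},
};

/* Returns true if the given block should be initialized at the given level. */
static bool init_level_needs_block(enum init_level_id id, enum hw_block block)
{
	assert(id < INIT_LEVEL_COUNT && block < HW_BLOCK_COUNT);
	return (init_levels[id].hw_block_mask & BLOCK_BIT(block)) != 0;
}
```

With a table like this, an init loop would consult init_level_needs_block()
for each block instead of testing a single pending-reset flag, so adding a
further level only means adding another table entry.
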
>
> Further work (not done in this series) may add fine-grained control over
> init levels, for example skipping features such as DPM enablement or skipping
> the loading of specific firmware, since these won't be required during a
> minimal init scenario where the device is going to be reset.
>
> The series adds an API interface to check if a PSP TOS reload is required.

At least at a high level that sounds totally sane, but I have no
idea where to find the time to review the details.

I need to discuss that with Alex and/or Tim. Maybe I can delegate some 
more work.

Christian.

>
>
> Lijo Lazar (10):
>    drm/amdgpu: Add init levels
>    drm/amdgpu: Use init level for pending_reset flag
>    drm/amdgpu: Separate reinitialization after reset
>    drm/amdgpu: Add reset on init handler for XGMI
>    drm/amdgpu: Add helper to initialize badpage info
>    drm/amdgpu: Refactor XGMI reset on init handling
>    drm/amdgpu: Drop delayed reset work handler
>    drm/amdgpu: Support reset-on-init on select SOCs
>    drm/amdgpu: Add interface for TOS reload cases
>    drm/amdgpu: Add PSP reload case to reset-on-init
>
>   drivers/gpu/drm/amd/amdgpu/aldebaran.c        |   1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  21 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 245 +++++++++++-------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |  81 ------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h       |   1 -
>   drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c       |  13 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h       |   3 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c       |  62 +++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h       |   4 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c     | 148 +++++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h     |   4 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c      |  72 ++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h      |   2 +
>   drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c         |  14 +-
>   drivers/gpu/drm/amd/amdgpu/psp_v13_0.c        |  25 ++
>   drivers/gpu/drm/amd/amdgpu/soc15.c            |   7 +
>   .../gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c    |   3 +-
>   17 files changed, 492 insertions(+), 214 deletions(-)
>


