[PATCH v4 0/4] Support for multi-GPU interconnection to trigger DPC recovery

Fri Mar 21 03:24:29 UTC 2025

This section describes the DPC high-level workflow for Multi-GPU configurations.
The GPUs are connected to root ports or switches using PCIe, and there are xGMI
links between the GPUs.
Multi-GPU DPC Workflow:
1.When an uncorrectable AER error occurs (assuming GPU-a encountered the error),
the error is reported to the root port or switch downstream port

2.The root port or switch downstream port disables the PCIe link to GPU-a and
sends an interrupt to the OS kernel

3.In GPU-a:
  (a) PMFW receives PMI from PCIe link down event
  (b) PMFW logs the DPC event
  (c) PMFW runs the link reset sequence, which resets PCIe and related IP blocks
  (d) PMFW runs mode-1 reset flow

4.In the OS kernel and driver:
  (a) The DPC error handler notifies the GPU driver through its
      pre-registered callback function
  (b) The driver stops or halts any uncompleted activities, and triggers SW-UP
      link reset on other GPUs in the same hive.
  (c) The DPC recovery sequence starts and releases the PCIe link; link
      training then begins
  (d) PCIe error callback notifies GPU driver to wait for GPU recovery

5.In other GPUs:
  (a) PMFW runs SW-UP link reset sequence
  (b) PMFW runs mode-1 reset flow

6.After the PCIe link is up and the mode-1 reset has completed, the GPU driver
re-initializes the device and returns it to its normal state.
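
The ordering in steps 3 to 6 can be sketched as a small stand-alone model.
This is only an illustration of the sequence described above, not driver
code: struct sim_gpu, enum link_reset_kind and sim_hive_dpc_recover() are
hypothetical names invented for this sketch and do not exist in amdgpu.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types for illustration; not amdgpu structures. */
enum link_reset_kind {
	RESET_NONE,          /* no link reset has run yet */
	RESET_DPC_LINK_DOWN, /* step 3(c): reset triggered by the PCIe link-down PMI */
	RESET_SW_UP,         /* step 5(a): SW-UP link reset requested by the driver */
};

struct sim_gpu {
	enum link_reset_kind link_reset;
	bool mode1_done;     /* steps 3(d) and 5(b) */
	bool initialized;    /* step 6 */
};

/*
 * Walk a hive of n GPUs through the workflow, with hive[faulted] being the
 * GPU that hit the uncorrectable error. Returns the number of GPUs that
 * were brought back, or 0 if the faulted index is out of range.
 */
size_t sim_hive_dpc_recover(struct sim_gpu *hive, size_t n, size_t faulted)
{
	size_t i, recovered = 0;

	if (faulted >= n)
		return 0;

	for (i = 0; i < n; i++) {
		/* Step 4(b): the driver halts outstanding work on every GPU. */
		hive[i].initialized = false;

		/* The faulted GPU resets on link down; its peers get SW-UP. */
		hive[i].link_reset = (i == faulted) ? RESET_DPC_LINK_DOWN
						    : RESET_SW_UP;

		/* Steps 3(d) and 5(b): every GPU then runs mode-1 reset. */
		hive[i].mode1_done = true;
	}

	/* Step 6: re-initialize only after link reset and mode-1 reset. */
	for (i = 0; i < n; i++) {
		if (hive[i].link_reset != RESET_NONE && hive[i].mode1_done) {
			hive[i].initialized = true;
			recovered++;
		}
	}

	return recovered;
}
```

For a four-GPU hive where GPU 1 reports the error, the model marks GPU 1 with
the link-down reset and the other three with the SW-UP reset, and all four
are re-initialized at the end.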

Problem: the following page fault from the VCN client is seen with DPC recovery:
[  188.024875] amdgpu 0000:38:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:158 vmid:0 pasid:0)
[  188.035550] amdgpu 0000:38:00.0: amdgpu:   in page starting at address 0x0000000000000000 from IH client 0x12 (VMC)
[  188.047297] amdgpu 0000:38:00.0: amdgpu:   cookie node_id 4 fault from die AID1
[  188.055531] amdgpu 0000:38:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x0000DB3D
[  188.063957] amdgpu 0000:38:00.0: amdgpu:      Faulty UTCL2 client ID: VCNU0 (0x6d)
[  188.072188] amdgpu 0000:38:00.0: amdgpu:      MORE_FAULTS: 0x1
[  188.078458] amdgpu 0000:38:00.0: amdgpu:      WALKER_ERROR: 0x6
[  188.084835] amdgpu 0000:38:00.0: amdgpu:      PERMISSION_FAULTS: 0x3
[  188.091698] amdgpu 0000:38:00.0: amdgpu:      MAPPING_ERROR: 0x1
[  188.098172] amdgpu 0000:38:00.0: amdgpu:      RW: 0x0

This problem is solved by patch 4.
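
For readers decoding the log above by hand, the sketch below splits the
VM_L2_PROTECTION_FAULT_STATUS value into the fields that the kernel printed.
The bit positions and widths are assumptions inferred from the logged values
(0x0000DB3D together with the per-field lines); the driver itself extracts
them with REG_GET_FIELD() from the per-ASIC register headers, which remain
the authoritative source. struct l2_fault_status and
decode_l2_fault_status() are hypothetical names used only here.

```c
#include <stdint.h>

/* Hypothetical container for the decoded fields; not an amdgpu type. */
struct l2_fault_status {
	unsigned int more_faults;
	unsigned int walker_error;
	unsigned int permission_faults;
	unsigned int mapping_error;
	unsigned int cid;  /* faulty UTCL2 client ID */
	unsigned int rw;
};

/* Field layout assumed from the log: it reproduces every printed value. */
struct l2_fault_status decode_l2_fault_status(uint32_t status)
{
	struct l2_fault_status f;

	f.more_faults       = status & 0x1;          /* bit 0 */
	f.walker_error      = (status >> 1) & 0x7;   /* bits 3:1 */
	f.permission_faults = (status >> 4) & 0xf;   /* bits 7:4 */
	f.mapping_error     = (status >> 8) & 0x1;   /* bit 8 */
	f.cid               = (status >> 9) & 0x1ff; /* bits 17:9 */
	f.rw                = (status >> 18) & 0x1;  /* bit 18 */

	return f;
}
```

Passing 0x0000DB3D yields MORE_FAULTS 0x1, WALKER_ERROR 0x6,
PERMISSION_FAULTS 0x3, MAPPING_ERROR 0x1, client ID 0x6d and RW 0x0, which
matches the log lines above.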

Ce Sun (4):
  drm/amd/pm: Add link reset for SMU 13.0.6
  drm/amdgpu: refactor amdgpu_device_gpu_recover
  drm/amdgpu: Multi-GPU DPC recovery support
  drm/amdgpu/vcn: during dpc recovery will corrupt VCPU buffer

 drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  11 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 411 +++++++++++-------
 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c       |   4 +-
 drivers/gpu/drm/amd/amdgpu/soc15.c            |   5 +
 drivers/gpu/drm/amd/pm/amdgpu_dpm.c           |  28 ++
 drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h       |   2 +
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c     |  26 ++
 drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h |  23 +-
 .../drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c  |  22 +
 9 files changed, 377 insertions(+), 155 deletions(-)

-- 
2.34.1