[PATCH] drm/amdgpu: fix OLAND card ip_init failed during kdump caputrue kernel boot

Lu Yao yaolu at kylinos.cn
Thu Aug 29 08:11:41 UTC 2024


On 2024/8/22 22:05, Mario Limonciello wrote:
> On 7/23/2024 04:42, Lu Yao wrote:
>> [Why]
>> When running kdump test on a machine with R7340 card, a hang is caused due
>> to the failure of 'amdgpu_device_ip_init()', error message as follows:
>>
>>    '[drm:amdgpu_device_ip_init [amdgpu]] *ERROR* hw_init of IP block <si_dpm> failed -22'
>>    '[drm:uvd_v3_1_hw_init [amdgpu]] *ERROR* amdgpu: UVD Firmware validate fail (-22).'
>>    '[drm:amdgpu_device_ip_init [amdgpu]] *ERROR* hw_init of IP block <uvd_v3_1> failed -22'
>>    'amdgpu 0000:01:00.0: amdgpu: amdgpu_device_ip_init failed'
>>    'amdgpu 0000:01:00.0: amdgpu: Fatal error during GPU init'
>>
>> This is because the caputrue kernel does not power off when it starts, 
>
> Presumably you mean:
> s/caputrue/capture/ 
Oh, you're right. It's a mistake.
>
>> cause hardware status does not reset.
>>
>> [How]
>> Add 'is_kdump_kernel()' judgment.
>> For 'si_dpm' block, use disable and then enable.
>> For 'uvd_v3_1' block, skip loading during the initialization phase.
>>
>> Signed-off-by: Lu Yao <yaolu at kylinos.cn>
>> ---
>> During test, I first modified the 'amdgpu_device_ip_hw_init_phase*', make
>> it does not end directly when a block hw_init failed.
>>
>> After analysis, 'si_dpm' block failed at 'si_dpm_enable()->
>> amdgpu_si_is_smc_running()', calling 'si_dpm_disable()' before can resolve.
>> 'uvd_v3_1' block failed at 'uvd_v3_1_hw_init()->uvd_v3_1_fw_validate()',
>> read mmUVD_FW_STATUS value is 0x27220102, I didn't find out why. But for
>> caputrue kernel, UVD is not required. Therefore, don't added this block. 
>
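For reference, the per-block handling described in [How] can be modelled in user space roughly as below. This is only a sketch and not the actual diff: every `mock_*` name is hypothetical, the `-22` return value is taken from the error log above, and the assumption that enabling fails while the SMC is still running comes from my analysis of 'si_dpm_enable()'.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the si_dpm state left behind by the crashed kernel. */
struct mock_dpm {
    bool smc_running;   /* SMC still running because the card was never reset */
    bool enabled;
};

/* Models the observed failure: enabling is refused (-22) while the SMC runs. */
static int mock_dpm_enable(struct mock_dpm *d)
{
    if (d->smc_running)
        return -22;
    d->smc_running = true;
    d->enabled = true;
    return 0;
}

static void mock_dpm_disable(struct mock_dpm *d)
{
    d->smc_running = false;
    d->enabled = false;
}

/* In the capture kernel, disable first so the stale state is cleared. */
static int mock_dpm_hw_init(struct mock_dpm *d, bool is_kdump)
{
    assert(d != NULL);
    if (is_kdump)
        mock_dpm_disable(d);
    return mock_dpm_enable(d);
}

/* UVD is not needed for capturing a dump, so the block is not added at all. */
static bool mock_should_add_uvd(bool is_kdump)
{
    return !is_kdump;
}
```

With stale state (`smc_running` set), `mock_dpm_enable()` alone returns -22, while `mock_dpm_hw_init(&d, true)` succeeds because the disable step runs first.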
> Hmm, a few thoughts.
>
> 1) Although you used this for the R7340, these concepts you're identifying probably make sense on most AMD GPUs.  Such checks might be better to uplevel to earlier in IP discovery code.
>
> 2) I'd actually argue we don't want to have the kdump capture kernel do ANY hardware init.  You're going to lose hardware state which "could" be valuable information for debugging a problem that caused a panic.
>
So maybe we should skip all the ip_block hw_init functions when running as the kdump capture kernel?
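
To make the question concrete, here is a rough user-space sketch of what "skip every hw_init in the capture kernel" could look like. It is not amdgpu code: the `mock_*` names are hypothetical, and `mock_is_kdump_kernel()` only stands in for the kernel's 'is_kdump_kernel()' helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for is_kdump_kernel(); toggled by hand here. */
static bool mock_is_kdump = false;

static bool mock_is_kdump_kernel(void)
{
    return mock_is_kdump;
}

/* Hypothetical, heavily simplified IP block. */
struct mock_ip_block {
    const char *name;
    int (*hw_init)(void);
    bool hw_ready;
};

static int mock_hw_init_ok(void)
{
    return 0;
}

/*
 * Returns the number of blocks whose hw_init ran, or the first negative
 * error. In the capture kernel no hardware is touched, so whatever state
 * the crashed kernel left behind is preserved.
 */
static int mock_ip_init(struct mock_ip_block *blocks, size_t n)
{
    int done = 0;

    assert(blocks != NULL);
    for (size_t i = 0; i < n; i++) {
        if (mock_is_kdump_kernel()) {
            blocks[i].hw_ready = false;
            continue;
        }
        int r = blocks[i].hw_init();
        if (r)
            return r;
        blocks[i].hw_ready = true;
        done++;
    }
    return done;
}
```

In this model a normal boot initializes every block, while a capture-kernel boot returns 0 and leaves every block marked not ready. Whether a display can still be driven in that state is exactly the open question below.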
> That being said, I'm not really sure what framebuffer can drive the display across a kexec if you don't load amdgpu.  What actually happens if you blacklist amdgpu in the capture kernel?
>
> What happens with your patch in place?
>
> At least for me I'd like to see a kernel log from both cases.
>

After adding 'initcall_blacklist=amdgpu_init' to KDUMP_CMDLINE_APPEND, the kernel log is as follows:

[    4.085602][ 0]   nvme0n1: p1 p2 p3 p4 p5 p6
[    4.157927][ 0]  [drm] radeon kernel modesetting enabled.
[    4.163383][ 0]  radeon 0000:01:00.0: SI support disabled by module param
[    5.387012][ 0]  initcall amdgpu_init blacklisted
[    6.613733][ 0]  initcall amdgpu_init blacklisted
[    7.859320][ 0]  mtsnd build info: e3fc429
[    8.687512][ 0]  EXT4-fs (nvme0n1p3): orphan cleanup on readonly fs
[    8.694035][ 0]  EXT4-fs (nvme0n1p3): mounted filesystem 75c1e96b-cef8-4ed3-86ea-45010c7b859c ro with ordered data mode. Quota mode: none.
[    9.309862][ 0]  device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measurements will not be recorded in the IMA log.
[    9.325236][ 0]  device-mapper: uevent: version 1.0.3
[    9.330946][ 0]  systemd[1]: Starting modprobe@fuse.service - Load Kernel Module fuse...
[    9.341512][ 0]  device-mapper: ioctl: 4.48.0-ioctl (2023-03-01) initialised: dm-devel at redhat.com
[    9.380944][ 0]  fuse: init (API version 7.39)
[    9.390196][ 0]  loop: module loaded
[    9.486957][ 0]  lp: driver loaded but no devices found
[    9.494904][ 0]  EXT4-fs (nvme0n1p3): re-mounted 75c1e96b-cef8-4ed3-86ea-45010c7b859c r/w. Quota mode: none.
[    9.505931][ 0]  systemd[1]: Starting systemd-udev-trigger.service - Coldplug All udev Devices...
[    9.518899][ 0]  ppdev: user-space parallel port driver
[    9.524908][ 0]  systemd[1]: Started systemd-journald.service - Journal Service.
[    9.574209][ 0]  systemd-journald[350]: Received client request to flush runtime journal.
[   10.118484][ 0]  snd_hda_intel 0000:00:1f.3: Unknown capability 0
[   11.590124][ 0]  hdaudio hdaudioC0D2: Unable to configure, disabling
[   23.892640][ 0]  reboot: Restarting system

With my patch in place:

[    4.074629][ 0]   nvme0n1: p1 p2 p3 p4 p5 p6
[    4.146956][ 0]  [drm] radeon kernel modesetting enabled.
[    4.152409][ 0]  radeon 0000:01:00.0: SI support disabled by module param
[    5.379207][ 0]  [drm] amdgpu kernel modesetting enabled.
[    5.384909][ 0]  amdgpu: Virtual CRAT table created for CPU
[    5.390514][ 0]  amdgpu: Topology: Add CPU node
[    5.395225][ 0]  [drm] initializing kernel modesetting (OLAND 0x1002:0x6611 0x1642:0x1869 0x87).
[    5.404040][ 0]  [drm] register mmio base: 0xA1600000
[    5.409118][ 0]  [drm] register mmio size: 262144
[    5.413864][ 0]  [drm] add ip block number 0 <si_common>
[    5.419207][ 0]  [drm] add ip block number 1 <gmc_v6_0>
[    5.424448][ 0]  [drm] add ip block number 2 <si_ih>
[    5.429427][ 0]  [drm] add ip block number 3 <gfx_v6_0>
[    5.434668][ 0]  [drm] add ip block number 4 <si_dma>
[    5.439733][ 0]  [drm] add ip block number 5 <si_dpm>
[    5.444803][ 0]  [drm] add ip block number 6 <dce_v6_0>
[    5.450051][ 0]  amdgpu 0000:01:00.0: amdgpu: Fetched VBIOS from VFCT
[    5.456517][ 0]  amdgpu: ATOM BIOS: 113-RADEONI6910-B03-BT
[    5.462023][ 0]  kfd kfd: amdgpu: OLAND  not supported in kfd
[    5.467857][ 0]  amdgpu 0000:01:00.0: vgaarb: deactivate vga console
[    5.474239][ 0]  amdgpu 0000:01:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[    5.482781][ 0]  amdgpu 0000:01:00.0: amdgpu: PCIE atomic ops is not supported
[    5.490242][ 0]  [drm] PCIE gen 3 link speeds already enabled
[    5.496017][ 0]  [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[    5.504778][ 0]  amdgpu 0000:01:00.0: amdgpu: VRAM: 1024M 0x000000F400000000 - 0x000000F43FFFFFFF (1024M used)
[    5.514812][ 0]  amdgpu 0000:01:00.0: amdgpu: GART: 1024M 0x000000FF00000000 - 0x000000FF3FFFFFFF
[    5.523710][ 0]  [drm] Detected VRAM RAM=1024M, BAR=1024M
[    5.529133][ 0]  [drm] RAM width 32bits GDDR5
[    5.533532][ 0]  [drm] amdgpu: 1024M of VRAM memory ready
[    5.538963][ 0]  [drm] amdgpu: 225M of GTT memory ready.
[    5.544293][ 0]  [drm] GART: num cpu pages 262144, num gpu pages 262144
[    5.550950][ 0]  amdgpu 0000:01:00.0: amdgpu: PCIE GART of 1024M enabled (table at 0x000000F400E00000).
[    5.560859][ 0]  [drm] Internal thermal controller with fan control
[    5.567163][ 0]  [drm] amdgpu: dpm initialized
[    5.571642][ 0]  [drm] AMDGPU Display Connectors
[    5.576278][ 0]  [drm] Connector 0:
[    5.579782][ 0]  [drm]   HDMI-A-1
[    5.583108][ 0]  [drm]   HPD2  
[    5.586088][ 0]  [drm]   DDC: 0x1950 0x1950 0x1951 0x1951 0x1952 0x1952 0x1953 0x1953
[    5.593937][ 0]  [drm]   Encoders:
[    5.597353][ 0]  [drm]     DFP1: INTERNAL_UNIPHY
[    5.601985][ 0]  [drm] Connector 1:
[    5.605488][ 0]  [drm]   VGA-1
[    5.608553][ 0]  [drm]   DDC: 0x194c 0x194c 0x194d 0x194d 0x194e 0x194e 0x194f 0x194f
[    5.616400][ 0]  [drm]   Encoders:
[    5.619807][ 0]  [drm]     CRT1: INTERNAL_KLDSCP_DAC1
[    5.985857][ 0]  amdgpu 0000:01:00.0: amdgpu: SE 1, SH per SE 1, CU per SH 6, active_cu_number 6
[    6.346743][ 0]  [drm] Initialized amdgpu 3.54.0 20150101 for 0000:01:00.0 on minor 0
[    6.433683][ 0]  fbcon: amdgpudrmfb (fb0) is primary device
[    6.439260][ 0]  Console: switching to colour frame buffer device 240x67
[    6.454578][ 0]  amdgpu 0000:01:00.0: [drm] fb0: amdgpudrmfb frame buffer device
[    6.816426][ 0]  mtsnd build info: e3fc429
[    7.827506][ 0]  EXT4-fs (nvme0n1p3): orphan cleanup on readonly fs
[    7.834021][ 0]  EXT4-fs (nvme0n1p3): mounted filesystem 75c1e96b-cef8-4ed3-86ea-45010c7b859c ro with ordered data mode. Quota mode: none.
[    8.502847][ 0]  device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measurements will not be recorded in the IMA log.
[    8.517899][ 0]  systemd[1]: Starting modprobe@fuse.service - Load Kernel Module fuse...
[    8.526044][ 0]  device-mapper: uevent: version 1.0.3
[    8.531923][ 0]  systemd[1]: Starting modprobe@loop.service - Load Kernel Module loop...
[    8.545910][ 0]  systemd[1]: systemd-fsck-root.service - File System Check on Root Device was skipped because of an unmet condition check (ConditionPathExists=!/run/initramfs/fsck-root).
[    8.564367][ 0]  fuse: init (API version 7.39)
[    8.568872][ 0]  device-mapper: ioctl: 4.48.0-ioctl (2023-03-01) initialised: dm-devel at redhat.com
[    8.581889][ 0]  systemd[1]: Starting systemd-journald.service - Journal Service...
[    8.591857][ 0]  loop: module loaded
[    8.639020][ 0]  lp: driver loaded but no devices found
[    8.662288][ 0]  systemd[1]: systemd-tpm2-setup-early.service - TPM2 SRK Setup (Early) was skipped because of an unmet condition check (ConditionSecurity=measured-uki).
[    8.685851][ 0]  ppdev: user-space parallel port driver
[    8.697866][ 0]  EXT4-fs (nvme0n1p3): re-mounted 75c1e96b-cef8-4ed3-86ea-45010c7b859c r/w. Quota mode: none.
[    9.362160][ 0]  snd_hda_intel 0000:00:1f.3: Unknown capability 0
[    9.716497][ 0]  hdaudio hdaudioC0D2: Unable to configure, disabling
[   20.101499][ 0]  reboot: Restarting system

Compared with the blacklist method, amdgpu driver initialization completes with the patch applied.
From external observation, more of the startup animation is shown (of course, this is meaningless, because the system restarts immediately).

