amdgpu vs kexec
Lazar, Lijo
lijo.lazar at amd.com
Mon Jun 16 14:02:40 UTC 2025
On 6/16/2025 3:09 PM, Peter Zijlstra wrote:
> Hi guys,
>
> My (Intel Sapphire Rapids) workstation has a RX 7800 XT and when I kexec
> a bunch of times, the amdgpu driver gets upset and barfs on boot.
>
> It starts like so:
>
> [ 16.926489] amdgpu 0000:19:00.0: amdgpu: Found VCN firmware Version ENC: 1.23 DEC: 9 VEP: 0 Revision: 16
> [ 16.980590] amdgpu 0000:19:00.0: amdgpu: reserve 0xa700000 from 0x83e0000000 for PSP TMR
> [ 19.204585] amdgpu 0000:19:00.0: amdgpu: failed to load ucode SMC(0x32)
> [ 19.227333] amdgpu 0000:19:00.0: amdgpu: psp gfx command LOAD_IP_FW(0x6) failed and response status is (0x0)
> [ 19.256420] amdgpu 0000:19:00.0: amdgpu: PSP load smu failed!
> [ 19.467875] [drm:psp_v13_0_ring_destroy [amdgpu]] *ERROR* Fail to stop psp ring
> [ 19.491771] amdgpu 0000:19:00.0: amdgpu: PSP firmware loading failed
> [ 19.513372] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
> [ 19.540397] amdgpu 0000:19:00.0: amdgpu: amdgpu_device_ip_init failed
> [ 19.562177] amdgpu 0000:19:00.0: amdgpu: Fatal error during GPU init
> [ 19.583785] amdgpu 0000:19:00.0: amdgpu: amdgpu: finishing device.
> [ 19.605474] ------------[ cut here ]------------
> [ 19.615370] WARNING: CPU: 0 PID: 704 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:631 amdgpu_irq_put+0x46/0x70 [amdgpu]
> [ 19.638375] Modules linked in: rndis_host hid_generic cdc_ether usbhid usbnet mii hid amdgpu(+) amdxcp gpu_sched drm_panel_backlight_quirks drm_buddy drm_ttm_helper ttm video wmi drm_exec drm_suballoc_helper drm_display_helper ast ahci cec rc_core iTCO_wdt libahci drm_shmem_helper xhci_pci drm_client_lib intel_pmc_bxt libata xhci_hcd iTCO_vendor_support igb nvme watchdog drm_kms_helper idxd atlantic intel_lpss_pci usbcore scsi_mod i2c_algo_bit drm nvme_core i2c_i801 idxd_bus crc16 intel_lpss macsec dca i2c_smbus idma64 scsi_common ucsi_acpi typec_ucsi typec roles usb_common pinctrl_alderlake button efivarfs
> [ 19.754852] CPU: 0 UID: 0 PID: 704 Comm: kworker/0:5 Not tainted 6.15.0-dirty #51 PREEMPT(full)
> [ 19.773770] Hardware name: Supermicro SYS-531A-I/X13SRA-TF, BIOS 1.1b 08/01/2023
> [ 19.789693] Workqueue: events work_for_cpu_fn
> [ 19.799066] RIP: 0010:amdgpu_irq_put+0x46/0x70 [amdgpu]
> [ 19.810480] Code: c0 74 33 48 8b 4e 10 48 83 39 00 74 29 89 d1 48 8d 04 88 8b 08 85 c9 74 11 f0 ff 08 74 07 31 c0 c3 cc cc cc cc e9 5a fd ff ff <0f> 0b b8 ea ff ff ff c3 cc cc cc cc b8 ea ff ff ff c3 cc cc cc cc
> [ 19.851066] RSP: 0018:ff55eefd81aafd48 EFLAGS: 00010246
> [ 19.862314] RAX: ff466ca3653aac00 RBX: ff466ca2d7f98b40 RCX: 0000000000000000
> [ 19.877675] RDX: 0000000000000000 RSI: ff466ca2d7fa5990 RDI: ff466ca2d7f80000
> [ 19.893037] RBP: ff466ca2d7f90388 R08: 0000000000000000 R09: ff55eefd81aafb10
> [ 19.908401] R10: ff466cc1ffcd2fa8 R11: 0000000000000003 R12: ff466ca2d7f90830
> [ 19.923763] R13: ff466ca2d7f80010 R14: ff466ca2d7f80000 R15: ff466ca2d7fa5990
> [ 19.939132] FS: 0000000000000000(0000) GS:ff466cc1db2ee000(0000) knlGS:0000000000000000
> [ 19.956551] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 19.968920] CR2: 00007f45f54e3de8 CR3: 000000207e624003 CR4: 0000000000f71ef0
> [ 19.984282] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 19.999645] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> [ 20.015010] PKRU: 55555554
> [ 20.020820] Call Trace:
> [ 20.026075] <TASK>
> [ 20.030581] amdgpu_fence_driver_hw_fini+0xfc/0x130 [amdgpu]
> [ 20.042894] amdgpu_device_fini_hw+0xb7/0x2c6 [amdgpu]
> [ 20.054152] amdgpu_driver_load_kms.cold+0x18/0x2e [amdgpu]
> [ 20.066323] amdgpu_pci_probe+0x1cf/0x470 [amdgpu]
> [ 20.076775] local_pci_probe+0x42/0x90
> [ 20.084839] work_for_cpu_fn+0x17/0x30
> [ 20.092899] process_one_work+0x188/0x340
> [ 20.101523] worker_thread+0x256/0x3a0
> [ 20.109584] ? __pfx_worker_thread+0x10/0x10
> [ 20.118767] kthread+0xf9/0x240
> [ 20.125519] ? __pfx_kthread+0x10/0x10
> [ 20.133578] ret_from_fork+0x31/0x50
> [ 20.141268] ? __pfx_kthread+0x10/0x10
> [ 20.149326] ret_from_fork_asm+0x1a/0x30
> [ 20.157765] </TASK>
> [ 20.162457] ---[ end trace 0000000000000000 ]---
>
> and then continues to barf for a while longer. Full dmesg attached.
>
> When I do a full power cycle it's okay again for a few kexecs, but will
> ultimately go unhappy again.
>
> I'm doing a 'normal' systemctl kexec, which I figure should more or less
> shut things down normally. It's not like a crash-kexec -- which is a
> whole other story and can be expected to cause trouble.
From the log -
[ 16.512201] amdgpu 0000:19:00.0: amdgpu: GPU mode1 reset
[ 16.531581] amdgpu 0000:19:00.0: amdgpu: SMU: valid command, bad prerequisites: index:2 param:0x00000000 message:GetSmuVersion
[ 16.564138] amdgpu 0000:19:00.0: amdgpu: GPU psp mode1 reset
It looks like the PMFW, which is responsible for the reset, is not in good
shape, and the driver then takes an unexpected code path which breaks PSP as well.
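To check whether other boots show the same ordering (mode1 reset, then the SMU complaint, then the PSP mode1 reset, then the SMC load failure), something like the sketch below could be run over a saved dmesg. It is only an illustration: the marker strings are copied from the messages quoted in this thread and are not an exhaustive list, and the function name `find_reset_events` is made up for the example.

```python
import re

# Marker strings copied from log lines quoted in this thread.
# Assumption: these substrings are enough to tell the events apart;
# this is not an exhaustive list of amdgpu reset messages.
PATTERNS = {
    "mode1_reset": re.compile(r"amdgpu: GPU mode1 reset"),
    "smu_bad_prereq": re.compile(r"SMU: valid command, bad"),
    "psp_mode1_reset": re.compile(r"amdgpu: GPU psp mode1 reset"),
    "smc_load_failed": re.compile(r"failed to load ucode SMC"),
}

# dmesg timestamps look like "[   16.512201]".
TIMESTAMP = re.compile(r"\[\s*(\d+\.\d+)\]")


def find_reset_events(dmesg_text):
    """Return (timestamp, label) pairs, in log order, for known markers."""
    events = []
    for line in dmesg_text.splitlines():
        stamp = TIMESTAMP.search(line)
        if stamp is None:
            continue  # skip lines without a timestamp (e.g. wrapped text)
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                events.append((float(stamp.group(1)), label))
    return events
```

Calling `find_reset_events(open("dmesg.txt").read())` on a saved log (the file name is just a placeholder) gives the sequence of events, which makes it easy to compare a good boot against a bad one.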
Kenneth, any thoughts?
Thanks,
Lijo