amdgpu vs kexec
Peter Zijlstra
peterz at infradead.org
Mon Jun 16 09:39:45 UTC 2025
Hi guys,
My (Intel Sapphire Rapids) workstation has a RX 7800 XT and when I kexec
a bunch of times, the amdgpu driver gets upset and barfs on boot.
It starts like so:
[ 16.926489] amdgpu 0000:19:00.0: amdgpu: Found VCN firmware Version ENC: 1.23 DEC: 9 VEP: 0 Revision: 16
[ 16.980590] amdgpu 0000:19:00.0: amdgpu: reserve 0xa700000 from 0x83e0000000 for PSP TMR
[ 19.204585] amdgpu 0000:19:00.0: amdgpu: failed to load ucode SMC(0x32)
[ 19.227333] amdgpu 0000:19:00.0: amdgpu: psp gfx command LOAD_IP_FW(0x6) failed and response status is (0x0)
[ 19.256420] amdgpu 0000:19:00.0: amdgpu: PSP load smu failed!
[ 19.467875] [drm:psp_v13_0_ring_destroy [amdgpu]] *ERROR* Fail to stop psp ring
[ 19.491771] amdgpu 0000:19:00.0: amdgpu: PSP firmware loading failed
[ 19.513372] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
[ 19.540397] amdgpu 0000:19:00.0: amdgpu: amdgpu_device_ip_init failed
[ 19.562177] amdgpu 0000:19:00.0: amdgpu: Fatal error during GPU init
[ 19.583785] amdgpu 0000:19:00.0: amdgpu: amdgpu: finishing device.
[ 19.605474] ------------[ cut here ]------------
[ 19.615370] WARNING: CPU: 0 PID: 704 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:631 amdgpu_irq_put+0x46/0x70 [amdgpu]
[ 19.638375] Modules linked in: rndis_host hid_generic cdc_ether usbhid usbnet mii hid amdgpu(+) amdxcp gpu_sched drm_panel_backlight_quirks drm_buddy drm_ttm_helper ttm video wmi drm_exec drm_suballoc_helper drm_display_helper ast ahci cec rc_core iTCO_wdt libahci drm_shmem_helper xhci_pci drm_client_lib intel_pmc_bxt libata xhci_hcd iTCO_vendor_support igb nvme watchdog drm_kms_helper idxd atlantic intel_lpss_pci usbcore scsi_mod i2c_algo_bit drm nvme_core i2c_i801 idxd_bus crc16 intel_lpss macsec dca i2c_smbus idma64 scsi_common ucsi_acpi typec_ucsi typec roles usb_common pinctrl_alderlake button efivarfs
[ 19.754852] CPU: 0 UID: 0 PID: 704 Comm: kworker/0:5 Not tainted 6.15.0-dirty #51 PREEMPT(full)
[ 19.773770] Hardware name: Supermicro SYS-531A-I/X13SRA-TF, BIOS 1.1b 08/01/2023
[ 19.789693] Workqueue: events work_for_cpu_fn
[ 19.799066] RIP: 0010:amdgpu_irq_put+0x46/0x70 [amdgpu]
[ 19.810480] Code: c0 74 33 48 8b 4e 10 48 83 39 00 74 29 89 d1 48 8d 04 88 8b 08 85 c9 74 11 f0 ff 08 74 07 31 c0 c3 cc cc cc cc e9 5a fd ff ff <0f> 0b b8 ea ff ff ff c3 cc cc cc cc b8 ea ff ff ff c3 cc cc cc cc
[ 19.851066] RSP: 0018:ff55eefd81aafd48 EFLAGS: 00010246
[ 19.862314] RAX: ff466ca3653aac00 RBX: ff466ca2d7f98b40 RCX: 0000000000000000
[ 19.877675] RDX: 0000000000000000 RSI: ff466ca2d7fa5990 RDI: ff466ca2d7f80000
[ 19.893037] RBP: ff466ca2d7f90388 R08: 0000000000000000 R09: ff55eefd81aafb10
[ 19.908401] R10: ff466cc1ffcd2fa8 R11: 0000000000000003 R12: ff466ca2d7f90830
[ 19.923763] R13: ff466ca2d7f80010 R14: ff466ca2d7f80000 R15: ff466ca2d7fa5990
[ 19.939132] FS: 0000000000000000(0000) GS:ff466cc1db2ee000(0000) knlGS:0000000000000000
[ 19.956551] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 19.968920] CR2: 00007f45f54e3de8 CR3: 000000207e624003 CR4: 0000000000f71ef0
[ 19.984282] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 19.999645] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 20.015010] PKRU: 55555554
[ 20.020820] Call Trace:
[ 20.026075] <TASK>
[ 20.030581] amdgpu_fence_driver_hw_fini+0xfc/0x130 [amdgpu]
[ 20.042894] amdgpu_device_fini_hw+0xb7/0x2c6 [amdgpu]
[ 20.054152] amdgpu_driver_load_kms.cold+0x18/0x2e [amdgpu]
[ 20.066323] amdgpu_pci_probe+0x1cf/0x470 [amdgpu]
[ 20.076775] local_pci_probe+0x42/0x90
[ 20.084839] work_for_cpu_fn+0x17/0x30
[ 20.092899] process_one_work+0x188/0x340
[ 20.101523] worker_thread+0x256/0x3a0
[ 20.109584] ? __pfx_worker_thread+0x10/0x10
[ 20.118767] kthread+0xf9/0x240
[ 20.125519] ? __pfx_kthread+0x10/0x10
[ 20.133578] ret_from_fork+0x31/0x50
[ 20.141268] ? __pfx_kthread+0x10/0x10
[ 20.149326] ret_from_fork_asm+0x1a/0x30
[ 20.157765] </TASK>
[ 20.162457] ---[ end trace 0000000000000000 ]---
and then continues to barf for a while longer. Full dmesg attached.
When I do a full power cycle it's okay again for a few kexecs, but it
will ultimately get unhappy again.
I'm doing a 'normal' systemctl kexec, which I figure should more or less
shut things down normally. It's not like a crash-kexec -- which is a
whole other story and can be expected to cause trouble.
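
For anyone wanting to check whether a given boot hit the same failure, here
is a minimal sketch that scans dmesg output for the PSP/SMU error strings
quoted in the log above. The signature list is copied from that log; the
function name, the regex and the script layout are illustrative assumptions
and not part of amdgpu or any existing tool.

```python
import re
import sys

# Failure strings taken verbatim from the log in this report. Treating them
# as "the" signature of this bug is an assumption made for illustration.
SIGNATURES = (
    "failed to load ucode SMC",
    "PSP load smu failed",
    "PSP firmware loading failed",
    "Fatal error during GPU init",
)

# dmesg lines look like "[   19.204585] message text"
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def find_psp_failures(dmesg_text):
    """Return (timestamp, message) pairs for amdgpu lines matching a signature."""
    hits = []
    for line in dmesg_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines without a dmesg timestamp
        ts, msg = float(m.group(1)), m.group(2)
        if "amdgpu" in msg and any(sig in msg for sig in SIGNATURES):
            hits.append((ts, msg))
    return hits


if __name__ == "__main__":
    # Reads dmesg text from stdin and exits non-zero if the failure was seen.
    failures = find_psp_failures(sys.stdin.read())
    for ts, msg in failures:
        print(f"{ts:12.6f}  {msg}")
    sys.exit(1 if failures else 0)
```

Saved under a hypothetical name such as check_psp.py, it could be run after
each kexec as `dmesg | python3 check_psp.py` to record which iteration first
fails.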
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dmesg.gz
Type: application/gzip
Size: 35326 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20250616/bc79a675/attachment-0001.gz>