amdgpu: Reproducible soft lockups when playing games

Borislav Petkov bp at alien8.de
Tue Apr 29 13:09:29 UTC 2025


+ amdgpu folks

On Tue, Apr 29, 2025 at 02:51:56PM +0200, Marcus Rückert wrote:
> Hardware: 
> - ASUS ROG Swift OLED PG27AQDP @ 480 Hz
> - LG 27GL850-B @ 144Hz
> - XFX Mercury Radeon RX 9070 XT OC Gaming Edition with RGB, 16GB GDDR6, HDMI, 3x DP RX-97TRGBBB9
> - Ryzen 9 9950X3D on ASUS ProArt X870E-Creator WiFi
> - be quiet! Dark Power 13 850W ATX 3.0
> 
> Software:
> - kernel-default-6.15~rc4-1.1.g62ec7c7.x86_64 from https://build.opensuse.org/project/show/Kernel:HEAD
> - Mesa-25.1+git442.5841d44f9-1747.1.x86_64 from https://build.opensuse.org/package/show/home:darix:playground/Mesa
> - GE-Proton 9-27 https://github.com/GloriousEggroll/proton-ge-custom/releases/tag/GE-Proton9-27
> - Overwatch via steam
> 
> ```
> [Mon Apr 28 23:10:56 2025] [  T10460] amdgpu 0000:03:00.0: amdgpu: Dumping IP State
> [Mon Apr 28 23:10:56 2025] [  T10460] amdgpu 0000:03:00.0: amdgpu: Dumping IP State Completed
> [Mon Apr 28 23:10:56 2025] [  T10460] amdgpu 0000:03:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
> [Mon Apr 28 23:10:56 2025] [  T10460] amdgpu 0000:03:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
> [Mon Apr 28 23:10:56 2025] [  T10460] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> [Mon Apr 28 23:11:07 2025] [  T10460] amdgpu 0000:03:00.0: amdgpu: Dumping IP State
> [Mon Apr 28 23:11:07 2025] [  T10460] amdgpu 0000:03:00.0: amdgpu: Dumping IP State Completed
> [Mon Apr 28 23:11:07 2025] [  T10460] amdgpu 0000:03:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
> [Mon Apr 28 23:11:07 2025] [  T10460] amdgpu 0000:03:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
> [Mon Apr 28 23:11:07 2025] [  T10460] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> ```
> 
> Usually this happens about once a day. But yesterday it was especially bad:
> 
> ```
> Apr 28 21:46:57 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> Apr 28 21:47:08 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> Apr 28 21:47:18 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> Apr 28 21:47:28 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> Apr 28 21:54:34 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> Apr 28 22:00:40 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> Apr 28 22:00:50 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> Apr 28 22:01:00 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> Apr 28 23:10:56 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> Apr 28 23:11:07 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> ```
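
The spacing of those timeouts can be checked mechanically. Below is a minimal sketch, assuming journal lines that start with a `Mon DD HH:MM:SS` stamp as in the excerpt above; the helper names `parse_timeouts` and `gaps_seconds` and the hard-coded year are made up for illustration and are not part of any amdgpu tooling.

```python
from datetime import datetime

# Substring that identifies the soft-recovery message in the log above.
MARKER = "timeout, but soft recovered"


def parse_timeouts(lines, year=2025):
    """Return a datetime for every line reporting a soft-recovered timeout.

    Assumes the line starts with "Mon DD HH:MM:SS" (journalctl short
    format, C/English locale). The year is not in the log, so it is
    passed in as an assumption.
    """
    stamps = []
    for line in lines:
        if MARKER not in line:
            continue
        # Drop an email quote prefix ("> ") if the log was pasted from a mail.
        text = line.lstrip("> ").strip()
        month, day, clock = text.split()[:3]
        stamps.append(
            datetime.strptime(f"{year} {month} {day} {clock}",
                              "%Y %b %d %H:%M:%S")
        )
    return stamps


def gaps_seconds(stamps):
    """Seconds between each pair of consecutive timestamps."""
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


SAMPLE = """\
Apr 28 21:46:57 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
Apr 28 21:47:08 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
Apr 28 21:47:18 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
Apr 28 21:47:28 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
Apr 28 21:54:34 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
""".splitlines()

if __name__ == "__main__":
    print(gaps_seconds(parse_timeouts(SAMPLE)))  # [11, 10, 10, 426]
```

On the first five lines of the excerpt this gives gaps of 11, 10, 10 and 426 seconds, i.e. the timeouts come in bursts roughly 10 seconds apart. What that cadence corresponds to in the driver is not something this sketch can tell.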
> 
> Together with my coworker Patrik Jakobsson and Takashi Iwai we already chased down a few other issues (like the dreaded flip_done).
> But this last issue remains. We first backported some fixes to our 6.14.x kernel for flip_done. To get even more fixes I switched to the 6.15~rc kernels.
> 
> I then also switched to Mesa 25.1~rc, which didn't fix it, so now I am running a snapshot package of main.
> 
> Some observations: while gaming, I started running https://github.com/Umio-Yasuno/amdgpu_top on the 2nd monitor to see whether overheating might be an issue.
> 
> However, the memory temps are at around 82 and the GPU core itself is usually lower.
> Another observation: the card is supposed to have a boost clock of 3100MHz, but amdgpu_top shows it boosting above 3200MHz. I tried both onboard BIOSes and the behavior is the same.
> 
> Currently I run both my Wayland session and my game with RADV_DEBUG=nohiz, but that didn't provide more details. Adding nodcc drops the performance from 480-500Hz (the card could go faster, but I limit the game to 500) to 330-360.
> 
> Please let me know if I can provide more details.
> 
>    darix
> 
> 
> ```
> -- 
> Always remember:
>   Never accept the world as it appears to be.
>     Dare to see it for what it could be.
>       The world can always use more heroes.
> 
> 
> 
> 
> ```
> 

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

