amdgpu: Reproducible soft lockups when playing games

Alex Deucher alexdeucher at gmail.com
Wed Apr 30 13:55:20 UTC 2025


On Wed, Apr 30, 2025 at 3:55 AM Borislav Petkov <bp at alien8.de> wrote:
>
> + amdgpu folks
>
> On Tue, Apr 29, 2025 at 02:51:56PM +0200, Marcus Rückert wrote:
> > Hardware:
> > - ASUS ROG Swift OLED PG27AQDP @ 480 Hz
> > - LG 27GL850-B @ 144Hz
> > - XFX Mercury Radeon RX 9070 XT OC Gaming Edition with RGB, 16GB GDDR6, HDMI, 3x DP RX-97TRGBBB9
> > - Ryzen 9 9950X3D on ASUS ProArt X870E-Creator WiFi
> > - be quiet! Dark Power 13 850W ATX 3.0
> >
> > Software:
> > - kernel-default-6.15~rc4-1.1.g62ec7c7.x86_64 from https://build.opensuse.org/project/show/Kernel:HEAD
> > - Mesa-25.1+git442.5841d44f9-1747.1.x86_64 from https://build.opensuse.org/package/show/home:darix:playground/Mesa
> > - GE-Proton 9-27 https://github.com/GloriousEggroll/proton-ge-custom/releases/tag/GE-Proton9-27
> > - Overwatch via steam
> >
> > ```
> > [Mon Apr 28 23:10:56 2025] [  T10460] amdgpu 0000:03:00.0: amdgpu: Dumping IP State
> > [Mon Apr 28 23:10:56 2025] [  T10460] amdgpu 0000:03:00.0: amdgpu: Dumping IP State Completed
> > [Mon Apr 28 23:10:56 2025] [  T10460] amdgpu 0000:03:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
> > [Mon Apr 28 23:10:56 2025] [  T10460] amdgpu 0000:03:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
> > [Mon Apr 28 23:10:56 2025] [  T10460] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> > [Mon Apr 28 23:11:07 2025] [  T10460] amdgpu 0000:03:00.0: amdgpu: Dumping IP State
> > [Mon Apr 28 23:11:07 2025] [  T10460] amdgpu 0000:03:00.0: amdgpu: Dumping IP State Completed
> > [Mon Apr 28 23:11:07 2025] [  T10460] amdgpu 0000:03:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
> > [Mon Apr 28 23:11:07 2025] [  T10460] amdgpu 0000:03:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
> > [Mon Apr 28 23:11:07 2025] [  T10460] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> > ```
> >
> > Usually I have that like once a day or so. But yesterday it was especially bad:
> >
> > ```
> > Apr 28 21:46:57 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> > Apr 28 21:47:08 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> > Apr 28 21:47:18 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> > Apr 28 21:47:28 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> > Apr 28 21:54:34 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> > Apr 28 22:00:40 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> > Apr 28 22:00:50 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> > Apr 28 22:01:00 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> > Apr 28 23:10:56 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> > Apr 28 23:11:07 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered
> > ```
> >
> > Together with my coworker Patrik Jakobsson and Takashi Iwai we already chased down a few other issues (like the dreaded flip_done).
> > But this last issue remains. We first backported some fixes to our 6.14.x kernel for flip_done. To get even more fixes I switched to the 6.15~rc kernels.
> >
> > I then also moved to Mesa 25.1~rc, which didn't fix it, so now I am running a snapshot package of main.
> >
> > Some observations: while gaming, I started running https://github.com/Umio-Yasuno/amdgpu_top on the 2nd monitor to see if overheating might be an issue.
> >
> > But the memory temps are at around 82 and the GPU core itself is usually lower.
> > One observation is that the card is supposed to have a boost clock of 3100MHz, but amdgpu_top sees it boost over 3200. I tried both onboard BIOSes and the behavior is the same.
> >
> > Currently I run both my Wayland session and my game with RADV_DEBUG=nohiz, but that didn't provide more details. Adding nodcc drops the performance from 480-500Hz (the card could go faster, but I limit the game to 500) to 330-360.
> >
> > Please let me know if I can provide more details.

Please make sure your kernel has these three patches:
https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4408b59eeacfea777aae397177f49748cadde5ce
https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=afcdf51d97cd58dd7a2e0aa8acbaea5108fa6826
https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=366e77cd4923c3aa45341e15dcaf3377af9b042f
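One way to check for the three commits above, sketched in Python under the assumption that you have a git checkout of the kernel source your build is based on. The helper names (`is_ancestor`, `missing_commits`) are made up for illustration; only `git merge-base --is-ancestor` is a real git command. Note that a distribution or stable kernel that backported these fixes as cherry-picks will carry them under different commit IDs, so a "missing" result here is not conclusive for such trees.

```python
import subprocess

# The three commit IDs from the links above.
REQUIRED_COMMITS = [
    "4408b59eeacfea777aae397177f49748cadde5ce",
    "afcdf51d97cd58dd7a2e0aa8acbaea5108fa6826",
    "366e77cd4923c3aa45341e15dcaf3377af9b042f",
]


def is_ancestor(commit, ref="HEAD", repo=".", run=subprocess.run):
    """Return True if `commit` is reachable from `ref` in the git tree at `repo`.

    `git merge-base --is-ancestor` exits with 0 when the commit is an
    ancestor; any other exit status (not an ancestor, or unknown commit)
    is treated as "not present" here.
    """
    result = run(
        ["git", "-C", repo, "merge-base", "--is-ancestor", commit, ref],
        capture_output=True,
    )
    return result.returncode == 0


def missing_commits(commits=REQUIRED_COMMITS, ref="HEAD", repo=".",
                    run=subprocess.run):
    """Return the subset of `commits` that is not reachable from `ref`."""
    return [c for c in commits if not is_ancestor(c, ref, repo, run)]


if __name__ == "__main__":
    # Assumes the current directory is the kernel git checkout.
    for commit in missing_commits():
        print("not found in HEAD:", commit)
```

The `run` parameter exists only so the check can be exercised without a real repository; in normal use the default `subprocess.run` is what you want.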

Soft recovery kills stuck shaders, so I'd suggest trying a newer
version of Mesa and LLVM.  If that doesn't help, please file a ticket
here:
https://gitlab.freedesktop.org/drm/amd/-/issues/
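If a ticket does get filed, a summary of how often the soft recoveries happen may be useful alongside the raw log. The following is a minimal sketch, assuming journal-style lines in the same shape as the ones quoted earlier in this thread ("Apr 28 21:46:57 kernel: ... ring gfx_0.0.0 timeout, but soft recovered"). The function names and the default `year` are assumptions for illustration, since that log format carries no year.

```python
import re
from datetime import datetime

# Matches e.g.
# "Apr 28 21:46:57 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, but soft recovered"
PATTERN = re.compile(
    r"(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) kernel: "
    r".*ring (?P<ring>\S+) timeout, but soft recovered"
)


def soft_recoveries(lines, year=2025):
    """Return (timestamp, ring name) for every soft-recovery line found."""
    events = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            ts = datetime.strptime(
                f"{year} {match.group('ts')}", "%Y %b %d %H:%M:%S"
            )
            events.append((ts, match.group("ring")))
    return events


def gaps_in_seconds(events):
    """Return the time between consecutive soft recoveries, in seconds."""
    times = [ts for ts, _ in events]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    import sys

    # Usage (assumed): journalctl -k | python3 this_script.py
    found = soft_recoveries(sys.stdin)
    print(len(found), "soft recoveries; gaps (s):", gaps_in_seconds(found))
```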

Alex



> >
> >    darix
> >
> >
> > ```
> > --
> > Always remember:
> >   Never accept the world as it appears to be.
> >     Dare to see it for what it could be.
> >       The world can always use more heroes.
> >
> >
> >
> >
> > ```
> >
>
> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

