amdgpu: Reproducible soft lockups when playing games

Sat May 3 02:57:17 UTC 2025

On Thu, 2025-05-01 at 09:32 -0400, Alex Deucher wrote:
> On Wed, Apr 30, 2025 at 7:28 PM Marcus Rückert <amd at nordisch.org>
> wrote:
> > 
> > On Wed, 2025-04-30 at 09:55 -0400, Alex Deucher wrote:
> > > please make sure your kernel has these three patches:
> > > https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4408b59eeacfea777aae397177f49748cadde5ce
> > > https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=afcdf51d97cd58dd7a2e0aa8acbaea5108fa6826
> > > https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=366e77cd4923c3aa45341e15dcaf3377af9b042f
> > 
> > I am kinda sure that's the patches Takashi backported into our
> > 6.14.3.
> > They are already part of 6.15.rc4 no?
> 
> Yes, I think so.

FWIW: I could trigger another flip_done timeout.

https://gitlab.freedesktop.org/drm/amd/-/issues/4201

A video stream (possibly using hardware decoding) seems to be a good
trigger for this. I think most of my flip_done issues happened while
Twitch was running and I was doing something else.

> > > soft recover kills stuck shaders, so I'd suggest trying a newer
> > > version of mesa and LLVM.  If that doesn't help, please file a
> > > ticket here:
> > 
> > Newer Mesa is building although I didnt see anything radv related.
> > 
> > I am curious in
> > https://gitlab.freedesktop.org/drm/amd/-/issues/4192
> > there is a lot more details about the crash than what I see. with
> > what kind of flags/environment variables do I have to run to get
> > the same?
> > 
> 
> That issue is directly related to suspend and resume.  I.e., the
> issues only happen after a suspend cycle.  Is that also what you are
> seeing?

Nope. I am just referencing it as it contains more details than I see,
and I wonder what I have to do to get the same amount of extra details
to provide more useful information for you.

> > An observation from my latest crash:
> > 
> > ```
> > May 01 01:05:59 steam[223306]: radv/amdgpu: The CS has been cancelled because the context is lost. This context is guilty of a soft recovery.
> > May 01 01:06:05 steam[223306]: Game Recording - game stopped [gameid=2357570]
> > May 01 01:06:05 steam[223306]: Removing process 352353 for gameID 2357570
> > ```
> > 
> > Is the game launched by steam inheriting that context or could it
> > really be the steam process triggering it? As 223306 would be
> 
> The kernel driver stops accepting commands from a process if it
> caused a hang unless the process recreates its context.  I'm not
> really sure what's going on here based on the limited context, but I
> suspect the game causes a GPU hang so the recording process stopped
> because of that.

On the ring timeout front: I saw that DXVK had at least one issue
involving RDNA4 and ring timeouts.

https://github.com/doitsujin/dxvk/issues/4756

So I switched from GloriousEggroll's build to Proton Experimental
from Valve. I have not seen any more ring timeouts since.

Which made me wonder why a Steam binary is shown as the owner of the
context and not the wine/game process underneath, and whether this
could be improved.

hth

   darix

-- 
Always remember:
  Never accept the world as it appears to be.
    Dare to see it for what it could be.
      The world can always use more heroes.