amdgpu: Reproducible soft lockups when playing games
Marcus Rückert
amd at nordisch.org
Sat May 3 02:57:17 UTC 2025
On Thu, 2025-05-01 at 09:32 -0400, Alex Deucher wrote:
> On Wed, Apr 30, 2025 at 7:28 PM Marcus Rückert <amd at nordisch.org>
> wrote:
> >
> > On Wed, 2025-04-30 at 09:55 -0400, Alex Deucher wrote:
> > > please make sure your kernel has these three patches:
> > > https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4408b59eeacfea777aae397177f49748cadde5ce
> > > https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=afcdf51d97cd58dd7a2e0aa8acbaea5108fa6826
> > > https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=366e77cd4923c3aa45341e15dcaf3377af9b042f
> >
> > I am kinda sure that's the patches Takashi backported into our
> > 6.14.3.
> > They are already part of 6.15.rc4 no?
>
> Yes, I think so.
FWIW: I could trigger another flip_done timeout.
https://gitlab.freedesktop.org/drm/amd/-/issues/4201
A video stream (possibly with hardware decoding) seems to be a good
trigger for this. I think most of my flip_done issues happened while
Twitch was running alongside something else.
> > > soft recover kills stuck shaders, so I'd suggest trying a newer
> > > version of mesa and LLVM. If that doesn't help, please file a
> > > ticket here:
> >
> > Newer Mesa is building although I didnt see anything radv related.
> >
> > I am curious in
> > https://gitlab.freedesktop.org/drm/amd/-/issues/4192
> > there is a lot more details about the crash than what I see. With
> > what kind of flags/environment variables do I have to run to get
> > the same?
> >
>
> That issue is directly related to suspend and resume. I.e., the
> issues only happen after a suspend cycle. Is that also what you are
> seeing?
Nope. I am only referencing it because it contains more detail than my
own logs show, and I wonder what I have to do to get the same level of
detail so I can provide more useful information for you.
> > An observation from my latest crash:
> >
> > ```
> > May 01 01:05:59 steam[223306]: radv/amdgpu: The CS has been cancelled because the context is lost. This context is guilty of a soft recovery.
> > May 01 01:06:05 steam[223306]: Game Recording - game stopped [gameid=2357570]
> > May 01 01:06:05 steam[223306]: Removing process 352353 for gameID 2357570
> > ```
> >
> > Is the game launched by steam inheriting that context or could it
> > really be the steam process triggering it? As 223306 would be
>
> The kernel driver stops accepting commands from a process if it
> caused a hang unless the process recreates its context. I'm not
> really sure what's going on here based on the limited context, but I
> suspect the game causes a GPU hang so the recording process stopped
> because of that.
On the ring timeout front: I saw that dxvk had at least one issue
involving RDNA4 and ring timeouts.
https://github.com/doitsujin/dxvk/issues/4756
So I switched from GloriousEggroll's Proton build to Proton Experimental
from Valve. I have not seen any more ring timeouts since.
That made me wonder why the log shows a steam binary as the owner of
the context and not the wine/game process underneath, and whether this
could be improved.
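To make the mismatch concrete, here is a minimal sketch that pulls the
reporting PID and the game PID out of the journal lines quoted above.
It assumes the exact line layout from my excerpt (timestamp, then
identifier[pid], no hostname column); the function name `summarize` and
the regexes are made up for illustration and are not part of any tool.

```python
import re

# Journal lines copied from the excerpt earlier in this mail, unwrapped.
SAMPLE_LOG = """\
May 01 01:05:59 steam[223306]: radv/amdgpu: The CS has been cancelled because the context is lost. This context is guilty of a soft recovery.
May 01 01:06:05 steam[223306]: Game Recording - game stopped [gameid=2357570]
May 01 01:06:05 steam[223306]: Removing process 352353 for gameID 2357570
"""

# Assumed layout: "<Mon DD HH:MM:SS> <identifier>[<pid>]: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<ident>[^\[\s]+)\[(?P<pid>\d+)\]: (?P<msg>.*)$"
)
REMOVED_RE = re.compile(r"Removing process (?P<pid>\d+) for gameID (?P<gameid>\d+)")


def summarize(log_text):
    """Return (guilty, games).

    guilty: list of (identifier, pid) the "guilty" message was logged under.
    games:  list of (pid, gameid) taken from "Removing process" lines.
    """
    guilty = []
    games = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # skip anything that does not match the assumed layout
        msg = m["msg"]
        if "guilty of a soft recovery" in msg:
            guilty.append((m["ident"], int(m["pid"])))
        removed = REMOVED_RE.search(msg)
        if removed:
            games.append((int(removed["pid"]), int(removed["gameid"])))
    return guilty, games


if __name__ == "__main__":
    guilty, games = summarize(SAMPLE_LOG)
    print("guilty message logged under:", guilty)
    print("game processes removed:", games)
```

On the excerpt this reports the guilty message under steam/223306, while
the game process that got removed is 352353, which is the discrepancy I
am asking about.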
hth
darix
--
Always remember:
Never accept the world as it appears to be.
Dare to see it for what it could be.
The world can always use more heroes.