[Bug 111481] AMD Navi GPU frequent freezes on both Manjaro/Ubuntu with kernel 5.3 and mesa 19.2 -git/llvm9

bugzilla-daemon at freedesktop.org
Wed Oct 23 17:18:44 UTC 2019


--- Comment #128 from yamagi at yamagi.org ---
(In reply to yamagi from comment #124)
> Interestingly I've got the problem the other way round. My 5700XT was
> running fine since I got it about two weeks ago. This is Arch Linux, I've
> run Mesa 19.2.1 and llvm-libs 9.0.0 since day one. The card was stable with
> 5.4-RC2 and 5.4-RC3, not a single hang in about 10 hours The Witcher 3 under
> wine + dxvk and Yamagi Quake II with OpenGL 3.2 renderer. After I upgraded
> to 5.4-RC4 I've seen several GPU hangs. The last one, and the only one
> that's still in the logs was:
> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled
> seq=85270, emitted seq=85272
> That one was in Yamagi Quake II, but I had hangs on the desktop and in The
> Witcher 3. I have no umr reports so far. I've just compiled the tool and
> will see if I can get some.

As promised, some more information:

For me the crash is fairly easy to reproduce with Linux 5.4-RC4. All it takes
is Yamagi Quake II (Revision 1232289, can be found at
https://github.com/yquake2/yquake2) with the OpenGL 3.2 renderer. The old
OpenGL 1.4 renderer doesn't trigger it. Start the game, preferably with
timedemo mode set to 1, and just let it cycle through the demo loop until it
crashes. I used './quake +set timedemo 1 +set vid_renderer gl3'. I had never
experienced this crash in the wild with Linux 5.4-RC3 until I learned that I
can trigger it with the Quake II demo loop. With Linux 5.4-RC3 it usually takes
somewhere between 20 and 30 cycles through the loop to trigger, with 5.4-RC4
only 5 to 10 cycles. So something changed between RC3 and RC4 that made it more
likely.

I suspect some kind of timing issue. The demo loop is deterministic, it
generates exactly the same API calls each time it's run. While the crash always
happens while the loading screen is up, it never occurs in the same iteration.
Sometimes it's in the fifth iteration, the next time in the 12th and so on.
Running the game under apitrace (which adds some latency!) makes the crash much
less likely to occur, to the point that I thought it was a heisenbug. The same
goes for cycling through the loop without timedemo mode enabled (~20 FPS in
normal mode, ~1000 FPS in timedemo mode).

I made an apitrace for easier reproduction. It's a bit too big for bugzilla,
so I've uploaded it here: https://deponie.yamagi.org/temp/quake2.trace.xz
Replaying it usually triggers the crash during the first or second run.
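
A minimal Python sketch for counting how many replays it takes, not part of
the original test setup. It assumes glretrace from apitrace is in PATH and that
the trace was unpacked to quake2.trace. The function name, the run limit and
the timeout are made up for illustration. With GPU recovery disabled a replay
may stop making progress instead of exiting, so a run that exceeds the timeout
is counted as a failure.

```python
import subprocess


def runs_until_failure(cmd, max_runs=20, timeout_s=300, run=subprocess.run):
    """Run cmd repeatedly; return the 1-based number of the first run that
    times out or exits non-zero, or None if all max_runs runs succeed."""
    for attempt in range(1, max_runs + 1):
        try:
            result = run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # The replay did not finish in time; treat it as a hang.
            return attempt
        if result.returncode != 0:
            return attempt
    return None


# Hypothetical invocation (file name and limits are assumptions):
# print(runs_until_failure(["glretrace", "quake2.trace"]))
```

One caveat: if the process is stuck inside the kernel it may not react to
being killed after the timeout, in which case the script blocks as well and
the machine needs a reboot anyway.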

The exact software versions were:
* Linux 5.4-RC4 with https://bugzilla.freedesktop.org/attachment.cgi?id=145323
and https://bugzilla.freedesktop.org/attachment.cgi?id=145734 applied.
* Mesa 19.2.1-2
* LLVM 9.0.0

dmesg output after a crash in Quake II's demo loop is:
[  122.294181] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout,
signaled seq=177737, emitted seq=177739
[  122.294256] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information:
process glretrace pid 1302 thread glretrace:cs0 pid 1303
[  122.294257] [drm] GPU recovery disabled.

dmesg output after a crash by replaying the apitrace is:
[  266.695388] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout,
signaled seq=27598, emitted seq=27600
[  266.695463] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information:
process glretrace pid 1372 thread glretrace:cs0 pid 1373
[  266.695465] [drm] GPU recovery disabled.
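
To compare such timeout messages across crashes, a small Python sketch can
pull out the ring name and the two sequence numbers. The regular expression is
an assumption based only on the message format quoted in this report, and the
names parse_ring_timeouts and "pending" are made up for illustration. It
tolerates the line wrapping that mail clients introduce.

```python
import re

# Assumed format, taken from the messages quoted in this report.
TIMEOUT_RE = re.compile(
    r"\*ERROR\* ring (?P<ring>\S+) timeout,\s+"
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeouts(log_text):
    """Return one dict per ring timeout message found in log_text."""
    # Collapse all whitespace so messages wrapped across lines still match.
    flat = " ".join(log_text.split())
    results = []
    for m in TIMEOUT_RE.finditer(flat):
        signaled = int(m.group("signaled"))
        emitted = int(m.group("emitted"))
        results.append({
            "ring": m.group("ring"),
            "signaled": signaled,
            "emitted": emitted,
            # Difference between emitted and signaled sequence numbers.
            "pending": emitted - signaled,
        })
    return results
```

For all three timeouts quoted in this mail the ring is sdma0 and the emitted
sequence number is exactly 2 ahead of the signaled one.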

I'm attaching the state of sdma0 in both cases.

I hope this helps to find the root cause of this. If I can provide more
information, don't hesitate to ask.
