[Bug 107572] Unrecoverable GPU hang with IP block:gfx_v8_0 is hung

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Tue Aug 14 23:45:33 UTC 2018


https://bugs.freedesktop.org/show_bug.cgi?id=107572

            Bug ID: 107572
           Summary: Unrecoverable GPU hang with IP block:gfx_v8_0 is hung
           Product: DRI
           Version: unspecified
          Hardware: x86-64 (AMD64)
                OS: Linux (All)
            Status: NEW
          Severity: normal
          Priority: medium
         Component: DRM/AMDgpu
          Assignee: dri-devel at lists.freedesktop.org
          Reporter: madcatx at atlas.cz

Hello,

I have been experiencing a worrying amount of these ever since I got my RX 570
a few months ago. I can reproduce the hang quite reliably by with some 3D
workloads, for instance the Unigine Superposition run on High quality or
Witcher 3 (through WINE) crash the GPU quite reliably within minutes.

Once that happens I can always SSH into the machine and try to get at least
some debugging information. Unfortunately, there does not seem to be much to go
on.

dmesg does not tell me more than this:
[  254.704581] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout,
last signaled seq=103742, last emitted seq=103745
[  254.704586] [drm] IP block:gfx_v8_0 is hung!
[  254.704629] [drm] GPU recovery disabled.

Here are a few things I have tried so far:
- Boot with amdgpu.dc=0
- Boot with amdgpu.vm_update_mode=3
- Force the GPU to max power state
- Disable IOMMU (both by iommu=off and by disabling VT-d in BIOS)
- Boot with amdgpu.gpu_recovery=1 (does not produce any additional info)

I grabbed the umr tool to try to get the state of the GPU when in crashes but
it does not seem to be able to read anything. Running:

umr -R gfx[.]

Leaves me with:

[ERROR]: Could not open ring debugfs file#  

I check that entries in /sys/kernel/debug/amdgpu that look relevant are there,
cat'ing them gives me "Operation not permitted". Yes, I am doing it as root.

Once this happens the only way out is a hard reboot.

I am running up-to-date Fedora 28, kernel 4.17.2, Mesa 18.0 series, LLVM 6.0.1.

Is there anything else I can do?

Thanks.

-- 
You are receiving this mail because:
You are the assignee for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/dri-devel/attachments/20180814/da1968e4/attachment.html>


More information about the dri-devel mailing list