[Bug 105733] Amdgpu randomly hangs and only ssh works. Mouse cursor moves sometimes but does nothing. Keyboard stops working.

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Sun Apr 1 14:15:35 UTC 2018


https://bugs.freedesktop.org/show_bug.cgi?id=105733

--- Comment #6 from Allan <allan4229 at gmail.com> ---
TL;DR: I have no idea what is happening. The errors aren't clear, I haven't
found a reliable way to reproduce it, and I need help.

That's exactly the problem... This issue is driving me crazy.

I've been trying to understand what is happening for weeks...

So... I'll give you a brief (long) description:

I had been running an RX 580. Sometimes the system would freeze like this, and
I started to suspect that the card was faulty.

Then I got an RX 480, and I was planning to sell the RX 580.

I compiled a kernel with the Polaris binaries, etc. It went very well until a
system upgrade.

Then "here we go again"... same problems, and now the RX 480 seems to fail
twice as fast as the RX 580.

If you are asking yourself "what kind of failures?", I'll summarize: code 147,
code 146, chrome_dthread libxul.so (for both firefox and chromium), and a big
call trace about amdgpu being blocked for more than 120 seconds. All of this
shows up after the screen freezes and ignores the keyboard and mouse clicks;
the only thing that still works is the mouse cursor moving.
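
Since ssh keeps working during a hang, the "blocked for more than 120 seconds"
traces can be pulled out of the kernel log from a remote session. The following
is only a minimal sketch, assuming the log has been saved as text (for example
the output of dmesg) and that the hung-task lines follow the usual
"task NAME:PID blocked for more than N seconds" wording. The function name
summarize is hypothetical and is not part of any existing tool.

```python
import re

# Assumed wording of the kernel hung-task warning, e.g.
# "INFO: task kworker/u16:3:1234 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"task (?P<name>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def summarize(lines):
    """Split kernel log lines into hung-task records and amdgpu mentions.

    Returns (hung, gpu): hung is a list of (task name, pid, seconds) tuples,
    gpu is the list of lines that mention amdgpu, with newlines stripped.
    """
    hung = []
    gpu = []
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            hung.append(
                (match.group("name"), int(match.group("pid")), int(match.group("secs")))
            )
        if "amdgpu" in line.lower():
            gpu.append(line.rstrip("\n"))
    return hung, gpu
```

It could be called as summarize(open("dmesg.txt")) on a log captured over ssh
right after a freeze; the file name here is only an example.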

When does it happen? After a few minutes of running YouTube or Unigine Valley,
or after some random time (from minutes to several hours) running an OpenCL
task, for example.

Then I started to think about the other components...
- RAM? Checked and working. When the screen hangs, tests over ssh run fine.
- CPU? Never had a problem with it as far as I remember. Tests over ssh run
fine.
- MOBO? I really don't know. Here's why:
---- I had been hearing some sound crackling, which suggests that some power
management could be faulty.
---- I noticed that disabling IOMMU decreased the number of crashes
significantly, but unfortunately after updating the BIOS/EFI the option to
enable/disable it was simply removed. I'll be contacting the manufacturer. So
I can't confirm that it was the cause.
---- I started to think that something nasty was going on with the power
supply.
- POWER SUPPLY? I bet that it is not.
---- I have a 5-year-old Aerocool 80 Plus Silver 800W power supply. It had
always been a very good PSU, powering an HD7970GHz (290W TDP) most of the time
without a single problem.
---- But okay... maybe the capacitors were faulty (as the mobo manufacturer
said when I asked about the sound). So I bought an AX860i. If there is any
better PSU than this in the 800W range, I'd like to know. It is 80 Plus
Platinum certified, even though the certification is not re-verified for years
(so it is almost irrelevant, to be honest). I already had a Corsair HX600
before and it was outstanding, and an AX is better than an HX, so only a
Titanium unit that costs more than my mobo and CPU together would be better.
---- Guess what? The same problems. Actually, it now even shuts down sometimes.
- KERNEL? I thought the problem was 4.15 because it is something like 5x more
likely to fail. But it also occurs with the very stable 4.13. Maybe I'll try
other kernels, but the further back we go in kernel versions, the fewer
features amdgpu has, AFAIK.
---- Also, with the RX 480 the video output started to fail when I configure
the DisplayPort output for 144Hz. My screen can handle 160Hz with adaptive
sync, but that has never worked with amdgpu.
---- DisplayPort/HDMI sound with DC/DAL support in 4.15 is a myth and NEVER
works. If I set amdgpu.dc=1, the RX 580 simply produces no sound, and with the
RX 480 the system hangs when starting pavucontrol. When forcing the output to
HDMI/DP there is no sound in either case (but pavucontrol shows that something
is supposed to be playing).
---- While running on a tty the chance of crashing is very low. But it still
happens when running an OpenCL application, after some random time as said
before.
---- When using RX 580 + 1070 or RX 480 + 1070 for vfio, I noticed that
unbinding the nvidia card extended the working time before a crash (which was
also one reason for me to think that the PSU was faulty).

Now the "best" part: running a single GPU leads to the same problems... :/


I'm not sure about anything right now. I'll try only the 1070 for some time to
make sure that amdgpu is the only problem here.

I have never touched the amdgpu code, but it seems to me that either I sell the
cards or I fix it by hand, because I'm not finding anything related.
