[Bug 105733] Amdgpu randomly hangs and only ssh works. Mouse cursor moves sometimes but does nothing. Keyboard stops working.

bugzilla-daemon at freedesktop.org
Fri Apr 27 12:41:52 UTC 2018


https://bugs.freedesktop.org/show_bug.cgi?id=105733

--- Comment #12 from Allan <allan4229 at gmail.com> ---
My system has started to power down for no apparent reason at times, even when
using the GTX1070 (nvidia or nouveau).
So I installed a Windows image to check whether the kernel was the problem.

Well, for now it *SEEMS* that it isn't *ONLY* the driver/kernel:
- The RX480 was freezing in the same way, so I sent it in under warranty.
- The RX580 ran with problems; I almost always got a message like "DX11 :
device disconnected" or "Mantle : Device lost".
- The GTX1070 ran fine for 1 day, then it started behaving like the RX580 and,
to my bad luck, the system began powering down after a random time (roughly
5 min to 2 hours).

For sure the driver/kernel (amdgpu/linux) has its faults here, and here's why:
- On Windows, the only card that sometimes hung the system was the RX480,
because it really was broken.
- In other cases, when a failure happened (with Nvidia or AMD), the system was
able to regain control of the device.
 - Maybe by doing a soft reset?
 - Maybe just by killing the driver and starting it again?
 - Maybe just by stopping the processes that were using the GPU, to avoid a
long chain of follow-on problems?
- Neither the RX580 nor the GTX1070 has a dual BIOS AFAIK. Maybe the RX480
does, but I did not test it.

Then:
- Checked and replaced the PCIe power cables: OK.
- Tested the power supply (luckily the AX860i has a self-test): OK.
- Cleaned all slots with a brush: OK.
- Tested the CPU and RAM again: OK.

But I must be having very bad luck: the problems persisted.

I've sent the motherboard in under warranty. I'm waiting for the diagnosis and
a fix.

I'll report back here as soon as I can.

Thoughts in the meantime:
- Not being able to kill the processes *is* a problem specific to amdgpu, and
it lies either in the driver itself (most likely) or in the kernel.
- The driver is not capable of regaining control of the device.
- It is impossible to kill child processes when something hangs while using
amdgpu.
- Yes, it happened once or twice with the proprietary nvidia driver too, but
that was probably caused by the faulty motherboard that I'm waiting to have
fixed.
- Using nouveau was the happiest path, but unfortunately nouveau does not fully
support Pascal yet. It keeps the card at the minimum clock (300 or 400MHz) and
it is not yet possible to raise the clock speed. So it is not a workable
option.
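
To help narrow down the "cannot kill the processes" symptom, here is a minimal
sketch that lists tasks in uninterruptible sleep (state "D" in
/proc/<pid>/stat), run over ssh while the desktop is hung. It rests on an
assumption: this report does not say which state the stuck processes were in,
but tasks in state "D" do not act on signals, SIGKILL included, until the
kernel wait they are blocked in completes. The function names parse_stat and
uninterruptible_tasks are made up for this illustration.

```python
import os


def parse_stat(line):
    """Return (pid, comm, state) parsed from the text of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the last ')' instead of splitting on whitespace.
    lparen = line.index("(")
    rparen = line.rindex(")")
    pid = int(line[:lparen].strip())
    comm = line[lparen + 1:rparen]
    state = line[rparen + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, comm) for every process currently in state 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-read, or unexpected format
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

If the hung GPU clients show up in this list, the problem is that they are
blocked inside the kernel, and that would fit the driver not recovering the
device. If they do not show up, this assumption does not explain the symptom.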
