[Bug 105733] Amdgpu randomly hangs and only ssh works. Mouse cursor moves sometimes but does nothing. Keyboard stops working.

bugzilla-daemon at freedesktop.org
Tue Aug 14 19:46:52 UTC 2018


https://bugs.freedesktop.org/show_bug.cgi?id=105733

--- Comment #25 from Andrey Grodzovsky <andrey.grodzovsky at amd.com> ---
(In reply to Allan from comment #12)
> My system started to power down for nothing sometimes, even using the
> GTX1070 (nvidia|nouveau) .
> Then I installed a Windows image just to be sure if the kernel was the
> problem.
> 
> Well, for now it *SEEMS* that isn't *ONLY* the driver/kernel :
> - The RX480 was freezing in the same way, then I sent it for warranty.
> - RX580 run problematically, almost always I got a message like : "DX11 :
> device disconnected" or "Mantle : Device lost".
> - GTX1070 was running fine for 1 day, then it became the same as the RX580
> and for my bad luck the system started to power down after a random time
> (5min to 2 hours +/-).
> 
> For sure the driver/kernel (amdgpu/linux) has its faults here, and here's
> why:
> - At Windows, the only card that stuck the system was RX480 sometimes
> because it was really broken.
> - In other cases, when a failure happened (with Nvidia or AMD), the system
> was able to retake the control over the device.
>  - Maybe doing a soft-reset?
>  - Maybe just killing the driver and starting again?
>  - Maybe just by stopping the process that were using the GPU to avoid a big
> chain of resulting problems?
> - Neither the RX580 nor GTX1070 has dual-bios AFAIK. Maybe RX480, but I did
> not test it.
> 
> Then :
> - Revised and changed the PCI-Ex power lines : OK.
> - Tested power supply (lucky for me AX860i has a self test) : OK.
> - Cleaned all slots with a brush : OK.
> - Tested again CPU and RAM : OK.
> 
> But , I must be in a very bad luck, the problems persisted.
> 
> I've sent the Motherboard for warranty. I'm waiting for its diagnostic and
> solution.
> 
> I'll inform here as soon as it becomes possible.
> 
> Thoughts for the while :
> - Not being able to kill the processes *is* a problem that concerns only
> amdgpu and it is either a problem of the driver itself (most likely to be)
> or of the kernel.

We recently fixed the issue where a process could not be killed while it was
stuck, as yours was, in kernel mode waiting for a fence to signal.

Can you build the latest kernel (4.18), grab the latest firmware again, and
retry?
Links to kernel and firmware:
https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/ 
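
Before retesting, it may help to confirm that the machine actually booted the
newly built kernel. The following is only a minimal sketch, not something from
the bug report: the helper name kernel_at_least is hypothetical, and it assumes
the release string starts with "major.minor" (as in "4.18.0-rc8+").

```python
import platform
import re


def kernel_at_least(release, minimum=(4, 18)):
    """Return True if a release string such as '4.18.0-rc8+' is >= minimum.

    Hypothetical helper: only the leading major.minor pair is compared.
    """
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


if __name__ == "__main__":
    # platform.release() reports the running kernel, like `uname -r`.
    release = platform.release()
    status = "is 4.18 or newer" if kernel_at_least(release) else "is older than 4.18"
    print(f"{release} {status}")
```

This only checks the version number; it does not verify that the fence-wait
fix or the updated firmware is actually present.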

> - The driver is not capable of retaking control of the device.
> - It is impossible to kill children pids when something hung using amdgpu.
> - Yes, it occurred once or twice using nvidia proprietary too, but it was
> probably caused because of the faulty motherboard that I'm waiting to be
> fixed.
> - Using nouveau was the most happy path , but unfortunately nouveau does not
> support Pascal at all yet. It keeps the card at the min clock (300 or
> 400MHz) and it is not possible yet to increase the speed of the card. So it
> is not a valid working way.

-- 
You are receiving this mail because:
You are the assignee for the bug.