[Bug 105733] Amdgpu randomly hangs and only ssh works. Mouse cursor moves sometimes but does nothing. Keyboard stops working.
bugzilla-daemon at freedesktop.org
Tue Aug 14 19:46:52 UTC 2018
https://bugs.freedesktop.org/show_bug.cgi?id=105733
--- Comment #25 from Andrey Grodzovsky <andrey.grodzovsky at amd.com> ---
(In reply to Allan from comment #12)
> My system sometimes started to power down for no reason, even when using
> the GTX1070 (nvidia|nouveau).
> Then I installed a Windows image just to check whether the kernel was the
> problem.
>
> Well, for now it *SEEMS* that it isn't *ONLY* the driver/kernel:
> - The RX480 was freezing in the same way, then I sent it for warranty.
> - The RX580 ran with problems; almost always I got a message like "DX11 :
> device disconnected" or "Mantle : Device lost".
> - The GTX1070 ran fine for 1 day, then it behaved the same as the RX580,
> and to my bad luck the system started to power down after a random time
> (roughly 5 min to 2 hours).
>
> For sure the driver/kernel (amdgpu/linux) has its faults here, and here's
> why:
> - On Windows, the only card that hung the system was the RX480, sometimes,
> because it was really broken.
> - In other cases, when a failure happened (with Nvidia or AMD), the system
> was able to retake control of the device.
> - Maybe doing a soft-reset?
> - Maybe just killing the driver and starting again?
> - Maybe just by stopping the processes that were using the GPU, to avoid a
> big chain of resulting problems?
> - Neither the RX580 nor GTX1070 has dual-bios AFAIK. Maybe RX480, but I did
> not test it.
>
> Then :
> - Checked and changed the PCI-Ex power lines : OK.
> - Tested power supply (lucky for me AX860i has a self test) : OK.
> - Cleaned all slots with a brush : OK.
> - Tested again CPU and RAM : OK.
>
> But I must have very bad luck: the problems persisted.
>
> I've sent the motherboard in under warranty. I'm waiting for its diagnosis
> and a fix.
>
> I'll report back here as soon as possible.
>
> Thoughts for the time being:
> - Not being able to kill the processes *is* a problem that concerns only
> amdgpu and it is either a problem of the driver itself (most likely to be)
> or of the kernel.
We recently fixed the issue where a process stuck waiting for a fence signal
in kernel mode, as yours was, could not be killed.
Can you build the latest kernel (4.18), grab the latest firmware, and try
again?
Links to kernel and firmware:
https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/
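Before retesting, it may help to confirm that the machine actually booted the
newly built kernel rather than the old one. The following is only a minimal
sketch: the helper name kernel_at_least is hypothetical, and it assumes the
release string starts with "major.minor", as `uname -r` output normally does.

```python
import platform
import re


def kernel_at_least(release, minimum=(4, 18)):
    """Return True if a release string such as '4.18.0-rc3' is >= minimum.

    Only the leading major.minor pair is compared; any suffix is ignored.
    """
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


if __name__ == "__main__":
    # platform.release() reports the running kernel, like `uname -r` on Linux.
    release = platform.release()
    status = "4.18 or newer" if kernel_at_least(release) else "older than 4.18"
    print(f"{release}: {status}")
```

This checks only the kernel version; it does not verify which firmware files
were loaded.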
> - The driver is not capable of retaking control of the device.
> - It is impossible to kill children pids when something hung using amdgpu.
> - Yes, it occurred once or twice using nvidia proprietary too, but it was
> probably caused by the faulty motherboard that I'm waiting to have
> fixed.
> - Using nouveau was the happiest path, but unfortunately nouveau does not
> support Pascal at all yet. It keeps the card at the min clock (300 or
> 400MHz) and it is not possible yet to increase the speed of the card. So it
> is not a workable option.