<html>
<head>
<base href="https://bugs.freedesktop.org/">
</head>
<body>
<p>
<div>
<b><a class="bz_bug_link
bz_status_NEW "
title="NEW - Amdgpu randomly hangs and only ssh works. Mouse cursor moves sometimes but does nothing. Keyboard stops working."
href="https://bugs.freedesktop.org/show_bug.cgi?id=105733#c25">Comment # 25</a>
on <a class="bz_bug_link
bz_status_NEW "
title="NEW - Amdgpu randomly hangs and only ssh works. Mouse cursor moves sometimes but does nothing. Keyboard stops working."
href="https://bugs.freedesktop.org/show_bug.cgi?id=105733">bug 105733</a>
from <span class="vcard"><a class="email" href="mailto:andrey.grodzovsky@amd.com" title="Andrey Grodzovsky <andrey.grodzovsky@amd.com>"> <span class="fn">Andrey Grodzovsky</span></a>
</span></b>
<pre>(In reply to Allan from <a href="show_bug.cgi?id=105733#c12">comment #12</a>)
<span class="quote">> My system started to power down for nothing sometimes, even using the
> GTX1070 (nvidia|nouveau) .
> Then I installed a Windows image just to be sure if the kernel was the
> problem.
>
> Well, for now it *SEEMS* that isn't *ONLY* the driver/kernel :
> - The RX480 was freezing in the same way, then I sent it for warranty.
> - RX580 run problematically, almost always I got a message like : "DX11 :
> device disconnected" or "Mantle : Device lost".
> - GTX1070 was running fine for 1 day, then it became the same as the RX580
> and for my bad luck the system started to power down after a random time
> (5min to 2 hours +/-).
>
> For sure the driver/kernel (amdgpu/linux) has its faults here, and here's
> why:
> - At Windows, the only card that stuck the system was RX480 sometimes
> because it was really broken.
> - In other cases, when a failure happened (with Nvidia or AMD), the system
> was able to retake the control over the device.
> - Maybe doing a soft-reset?
> - Maybe just killing the driver and starting again?
> - Maybe just by stopping the process that were using the GPU to avoid a big
> chain of resulting problems?
> - Neither the RX580 nor GTX1070 has dual-bios AFAIK. Maybe RX480, but I did
> not test it.
>
> Then :
> - Revised and changed the PCI-Ex power lines : OK.
> - Tested power supply (lucky for me AX860i has a self test) : OK.
> - Cleaned all slots with a brush : OK.
> - Tested again CPU and RAM : OK.
>
> But , I must be in a very bad luck, the problems persisted.
>
> I've sent the Motherboard for warranty. I'm waiting for its diagnostic and
> solution.
>
> I'll inform here as soon as it becomes possible.
>
> Thoughts for the while :
> - Not being able to kill the processes *is* a problem that concerns only
> amdgpu and it is either a problem of the driver itself (most likely to be)
> or of the kernel.</span >
We recently fixed the issue of not being able to kill a process stuck like your
process in wait for fence signal in kernel mode.
Can you build latest kernel (4.18) and grab again latest firmware and try again
?
Links to kernel and firmware:
<a href="https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next">https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next</a>
<a href="https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/">https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/</a>
<span class="quote">> - The driver is not capable of retaking control of the device.
> - It is impossible to kill children pids when something hung using amdgpu.
> - Yes, it occurred once or twice using nvidia proprietary too, but it was
> probably caused because of the faulty motherboard that I'm waiting to be
> fixed.
> - Using nouveau was the most happy path , but unfortunately nouveau does not
> support Pascal at all yet. It keeps the card at the min clock (300 or
> 400MHz) and it is not possible yet to increase the speed of the card. So it
> is not a valid working way.</span ></pre>
</div>
</p>
<hr>
<span>You are receiving this mail because:</span>
<ul>
<li>You are the assignee for the bug.</li>
</ul>
</body>
</html>