Hard lockups with ROCM
Kuehling, Felix
Felix.Kuehling at amd.com
Thu May 16 19:03:34 UTC 2019
Hi Daniel,
On 2019-05-12 9:44 p.m., Daniel Kasak wrote:
> Hi all. I had version 2.2.0 of the ROCM stack running on a 5.0.x and
> 5.1.0 kernel. Things were going great with various boinc GPU tasks.
> But there is a setiathome GPU task which reliably gives me a hard
> lockup within about 30 minutes of running. I actually had to do *two*
> emergency re-installs over the past week.
Sorry to hear about your trouble. Do you have a second computer you can
use to log in to your system remotely? Chances are that it's still
responsive and only the screen is frozen.
Also, you could try booting in console mode (without an X server). The
console usually still works even when the GPU compute units or SDMA
engines are hanging.
If you manage to do an emergency reboot with sysrq (remount-RO and
reboot), you should see the kernel log of your previous session in
/var/log. On Ubuntu it's in /var/log/kern.log. Not sure where it is on
Gentoo. There is a good chance the log contains helpful information
(e.g. if the driver detected a hang but failed to reset the GPU, or
maybe a driver bug that leads to a deadlock or kernel panic).
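
To pull the relevant lines out of that log, something like the following
Python sketch may help. The keyword list is only my guess at a useful
starting point, not an exhaustive set of driver messages, and the
function names are made up for illustration. The default path is the
Ubuntu one mentioned above.

```python
import re

# Assumption: a rough starting set of keywords, not an exhaustive list.
KEYWORDS = re.compile(r"amdgpu|kfd|hung task|lockup|BUG:|Oops|panic",
                      re.IGNORECASE)

def interesting_lines(lines):
    """Return the lines that mention the GPU driver or a hang/crash keyword."""
    return [line.rstrip("\n") for line in lines if KEYWORDS.search(line)]

def scan_log(path="/var/log/kern.log"):
    """Filter a kernel log file; the path is the Ubuntu default."""
    # errors="replace" because a log cut short by a crash may contain
    # bytes that do not decode cleanly.
    with open(path, errors="replace") as f:
        return interesting_lines(f)
```

Calling scan_log() and printing each returned line gives a much shorter
log to read through or attach to a bug report.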
> Perhaps part of this was my fault ( running btrfs with lzo compression
> on my root partition ... ). But absolutely part of this was the hard
> lockups. I've tested all kinds of other things ( eg rebuilding lots of
> stuff under Gentoo ) ... I don't have a general stability issue even
> under hours of high load. But after restarting boinc with that same
> setiathome task ... <bang>!
>
> If someone wants me to sacrifice another installation, they can point
> me to instructions for trying to gather more information.
If you want to risk another installation, it may be a good idea to do it
on a spare hard drive, or a spare partition on your existing hard drive.
Also, use a more conventional choice of file system. A simple ext4 is
pretty robust in my experience. We get hard lockups all the time, and I
usually only reinstall my system for big OS upgrades or if I'm stupid
and mess something up myself.
Which GPU are you using?
There are some things you could try to narrow down the cause of your
problem.
1. Monitor GPU temperature while running setiathome.
2. If you're building your own kernel, enable kernel debug options
   that can provide useful diagnostic info: lock debugging, memory
   debugging, lockup/hang debugging.
3. Try running with lower GPU clocks (rocm-smi --setperflevel low). If
   that fixes it, you may have inadequate cooling or power supply.
4. Try running in console mode (without an X server or other graphical
   UI running). If that fixes it, there may be a bad interaction
   between graphics and compute.
5. Try updating your firmware. The DKMS package included in our ROCm
   releases includes the latest firmware. You should be able to
   extract it from there and drop it into /lib/firmware/amdgpu.
6. Try to find a regression point. Is there any known version of ROCm
   or the kernel where it worked correctly?
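
For item 1, here is a minimal Python sketch that logs the GPU
temperature from the amdgpu hwmon sysfs file. It assumes the GPU is
card0 and that temp1_input is the sensor you care about, so adjust the
path for your system. The function names and the log file name are
made up for illustration. Each reading is synced to disk so the last
values have a chance of surviving a hard lockup.

```python
import glob
import os
import time

# Assumption: the GPU is card0; change this if you have several cards.
HWMON_GLOB = "/sys/class/drm/card0/device/hwmon/hwmon*/temp1_input"

def millidegrees_to_celsius(raw):
    """hwmon reports temperatures as integer millidegrees Celsius."""
    return int(raw.strip()) / 1000.0

def read_gpu_temp(pattern=HWMON_GLOB):
    """Return the temperature in degrees C, or None if no sensor file exists."""
    paths = glob.glob(pattern)
    if not paths:
        return None
    with open(paths[0]) as f:
        return millidegrees_to_celsius(f.read())

def monitor(interval=5.0, logfile="gpu_temp.log"):
    """Append a timestamped reading every `interval` seconds until interrupted."""
    while True:
        temp = read_gpu_temp()
        with open(logfile, "a") as log:
            log.write(f"{time.strftime('%H:%M:%S')} {temp}\n")
            log.flush()
            os.fsync(log.fileno())  # push the reading to disk right away
        time.sleep(interval)
```

Start monitor() in a second terminal or over ssh before launching the
setiathome task, then check the last lines of the log file after the
reboot.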
Regards,
Felix
>
> Anyway ... perhaps more work around detecting and recovering from GPU
> lockups is in order?
>
> Dan
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx