Hard lockups with ROCM

Thu May 16 19:03:34 UTC 2019

Hi Daniel,

On 2019-05-12 9:44 p.m., Daniel Kasak wrote:
> Hi all. I had version 2.2.0 of the ROCM stack running on a 5.0.x and 
> 5.1.0 kernel. Things were going great with various boinc GPU tasks. 
> But there is a setiathome GPU task which reliably gives me a hard 
> lockup within about 30 minutes of running. I actually had to do *two* 
> emergency re-installs over the past week.

Sorry to hear about your trouble. Do you have a second computer you can
use to log in to your system remotely? Chances are that it's still
responsive and only the screen is frozen.

Also, you could try booting in console mode (without an X server). The
console usually still works even when the GPU compute units or SDMA
engines are hanging.

If you manage to do an emergency reboot with sysrq (remount-RO and 
reboot), you should see the kernel log of your previous session in 
/var/log. On Ubuntu it's in /var/log/kern.log. Not sure where it is on 
Gentoo. There is a good chance the log contains helpful information 
(e.g. if the driver detected a hang but failed to reset the GPU, or 
maybe a driver bug that leads to a deadlock or kernel panic).
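
To pull the interesting lines out of a large kernel log, something like
the following rough sketch may help. The keyword list is only my guess
at useful search terms, not an official list of driver messages, and
the function names are made up for illustration. Pass whatever log path
your distribution uses.

```python
import re
from pathlib import Path

# Assumption: illustrative search terms only, not an official list of
# amdgpu/kfd messages. Extend or narrow this for your own log.
KEYWORDS = re.compile(r"amdgpu|kfd|timeout|hung|panic|\bBUG:", re.IGNORECASE)


def suspicious_lines(lines, pattern=KEYWORDS):
    """Return the lines matching the pattern, in their original order."""
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


def scan_log(path):
    """Scan one log file; return [] if the file does not exist."""
    log = Path(path)
    if not log.exists():
        return []
    # errors="replace" keeps going if the log contains stray bytes
    with log.open(errors="replace") as fh:
        return suspicious_lines(fh)
```

For example, `print("\n".join(scan_log("/var/log/kern.log")))` on
Ubuntu. Reading the log usually requires root or membership in the
appropriate group.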

> Perhaps part of this was my fault ( running btrfs with lzo compression 
> on my root partition ... ). But absolutely part of this was the hard 
> lockups. I've tested all kinds of other things ( eg rebuilding lots of 
> stuff under Gentoo ) ... I don't have a general stability issue even 
> under hours of high load. But after restarting boinc with that same 
> setiathome task ... <bang>!
>
> If someone wants me to sacrifice another installation, they can point 
> me to instructions for trying to gather more information.

If you want to risk another installation, it may be a good idea to do it
on a spare hard drive or a spare partition on your existing hard drive.
Also, use a more conventional file system. Plain ext4 is pretty robust
in my experience. We get hard lockups all the time, and I usually only
reinstall my system for big OS upgrades or if I'm stupid and mess
something up myself.

Which GPU are you using?

There are some things you could try to narrow down the cause of your 
problem.

 1. Monitor GPU temperature while running setiathome
 2. If you're building your own kernel, enable kernel debug options
    that can provide useful diagnostic info: lock debugging, memory
    debugging, lockup/hang debugging
 3. Try running with lower GPU clocks (rocm-smi --setperflevel low). If
    that fixes it, you may have inadequate cooling or power supply
 4. Try running in console mode (without an X server or other graphical
    UI running). If that fixes it, there may be a bad interaction
    between graphics and compute
 5. Try updating your firmware. The DKMS package included in our ROCm
    releases includes the latest firmware. You should be able to extract
    it from there and drop it into /lib/firmware/amdgpu
 6. Try to find a regression point. Is there any known version of ROCm
    or the kernel where it worked correctly?
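
For item 1, a minimal sketch of a temperature logger is below. It
assumes the GPU is card0 and that the driver exposes its sensor through
the standard hwmon sysfs file temp1_input (millidegrees Celsius). The
path and function names are assumptions for illustration. Each reading
is synced to disk so the last values before a lockup have a chance of
surviving the reboot.

```python
import glob
import os
import time

# Assumption: the GPU is card0; adjust the index on multi-GPU systems.
HWMON_GLOB = "/sys/class/drm/card0/device/hwmon/hwmon*/temp1_input"


def read_temp_celsius(path):
    """Read one hwmon temperature file (millidegrees C) as degrees C."""
    with open(path) as fh:
        return int(fh.read().strip()) / 1000.0


def log_temperature(sensor_path, log_path, samples, interval=1.0):
    """Append timestamped readings, flushing each one to disk."""
    with open(log_path, "a") as log:
        for i in range(samples):
            temp = read_temp_celsius(sensor_path)
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"{stamp} {temp:.1f}\n")
            log.flush()
            os.fsync(log.fileno())  # push the line out before a possible hang
            if i + 1 < samples:
                time.sleep(interval)
```

To use it, pick the first match of `glob.glob(HWMON_GLOB)` as
`sensor_path`, start the logger in a second terminal or over ssh, then
start the setiathome task and check the tail of the log after the
lockup.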

Regards,
   Felix

>
> Anyway ... perhaps more work around detecting and recovering from GPU 
> lockups is in order?
>
> Dan
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx