Hard lockups with ROCM
Paul Menzel
pmenzel+amd-gfx at molgen.mpg.de
Thu May 16 15:56:45 UTC 2019
Dear Daniel,
On 05/16/2019 01:52 PM, Daniel Kasak wrote:
> On Thu, May 16, 2019 at 11:43 AM Alex Deucher <alexdeucher at gmail.com> wrote:
>
>> On Wed, May 15, 2019 at 8:33 PM Daniel Kasak <d.j.kasak.dk at gmail.com>
>> wrote:
>>>
>>> On Mon, May 13, 2019 at 11:44 AM Daniel Kasak <d.j.kasak.dk at gmail.com>
>>> wrote:
>>>>
>>>> Hi all. I had version 2.2.0 of the ROCM stack running on a 5.0.x and
>>>> 5.1.0 kernel. Things were going great with various boinc GPU tasks. But
>>>> there is a setiathome GPU task which reliably gives me a hard lockup within
>>>> about 30 minutes of running. I actually had to do *two* emergency
>>>> re-installs over the past week. Perhaps part of this was my fault ( running
>>>> btrfs with lzo compression on my root partition ... ). But absolutely part
>>>> of this was the hard lockups. I've tested all kinds of other things ( eg
>>>> rebuilding lots of stuff under Gentoo ) ... I don't have a general
>>>> stability issue even under hours of high load. But after restarting boinc
>>>> with that same setiathome task ... <bang>!
>>>>
>>>> If someone wants me to sacrifice another installation, they can point
>>>> me to instructions for trying to gather more information.
>>>>
>>>> Anyway ... perhaps more work around detecting and recovering from GPU
>>>> lockups is in order?
>>> <sigh>
>>>
>>> That's what I was afraid of :(
>>
>> Not sure what you were afraid of. I don't think anyone has looked at
>> setiathome on ROCm. I'd suggest filing a bug
>> (https://bugs.freedesktop.org) and attaching your dmesg output and
>> xorg log (if using X). If there is a GPU reset, note that you will
>> need to restart your desktop environment because currently neither
>> glamor nor any compositors support GL robustness extensions to reset
>> their contexts after a GPU reset.
> Hi Alex. dmesg output is not available ... this is a *hard* lockup. I need
> to power-cycle after it happens ( ALT + SysRq + { S , U , B } doesn't even
> work ). That's why I asked for instructions to possibly gather more info. I
> did check the xorg log after I did an emergency export of my filesystem ...
> nothing of interest in there. It seems like I currently don't really have
> enough info to make a bug report worthwhile.
Does your board have a serial port? If yes, please use the serial console to
gather the messages on another system.
netconsole [1] is sometimes also able to send the last kernel messages out
over the network.
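On the receiving machine, any UDP listener should do for netconsole. Below is
a minimal sketch in Python, assuming the locked-up machine is configured (as
described in [1]) to send to this host on UDP port 6666, netconsole's default
target port. The names open_listener and log_messages are made up for
illustration and are not part of any existing tool.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket. Port 6666 is assumed here because it is
    netconsole's default target port; adjust to match the sender."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def log_messages(sock, out, max_packets=None):
    """Copy each received datagram to `out`, prefixed with the sender's IP.

    `out` is flushed after every packet: the sender may lock up at any
    moment, and the last lines before the hang are the interesting ones.
    With max_packets=None this loops until interrupted.
    """
    received = 0
    while max_packets is None or received < max_packets:
        data, (sender_ip, _sender_port) = sock.recvfrom(65535)
        text = data.decode("utf-8", errors="replace")
        out.write(f"[{sender_ip}] {text}")
        if not text.endswith("\n"):
            out.write("\n")
        out.flush()
        received += 1
    return received
```

To use it, call log_messages(open_listener(), open("netconsole.log", "a")) on
the second machine before starting the setiathome task, and stop it with
Ctrl-C after the lockup. Whether anything arrives depends on how hard the
hang is; if the kernel can no longer send packets, the log simply stops.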
Kind regards,
Paul
[1]: https://www.kernel.org/doc/Documentation/networking/netconsole.txt