After Vega 56/64 GPU hang I unable reboot system
StDenis, Tom
Tom.StDenis at amd.com
Thu Dec 20 14:19:40 UTC 2018
On 2018-12-20 9:08 a.m., Tom St Denis wrote:
> On 2018-12-20 9:06 a.m., Tom St Denis wrote:
>> On 2018-12-20 6:45 a.m., Mikhail Gavrilov wrote:
>>> On Thu, 20 Dec 2018 at 16:17, StDenis, Tom <Tom.StDenis at amd.com> wrote:
>>>>
>>>> Well yup the kernel is not letting you open the files:
>>>>
>>>>
>>>> As sudo/root you should be able to open these files with umr. What
>>>> happens if you just open a shell as root and run it?
>>>>
>>>
>>> [root at localhost ~]# touch /sys/kernel/debug/dri/0/amdgpu_ring_gfx
>>> [root at localhost ~]# cat /sys/kernel/debug/dri/0/amdgpu_ring_gfx
>>> cat: /sys/kernel/debug/dri/0/amdgpu_ring_gfx: Operation not permitted
>>> [root at localhost ~]# ls -laZ /sys/kernel/debug/dri/0/amdgpu_ring_gfx
>>> -r--r--r--. 1 root root system_u:object_r:debugfs_t:s0 8204 Dec 20
>>> 16:31 /sys/kernel/debug/dri/0/amdgpu_ring_gfx
>>> [root at localhost ~]# getenforce
>>> Permissive
>>> [root at localhost ~]# /home/mikhail/packaging-work/umr/build/src/app/umr
>>> -O verbose,halt_waves -wa
>>> Cannot seek to MMIO address: Bad file descriptor
>>> [ERROR]: Could not open ring debugfs fileSegmentation fault (core
>>> dumped)
>>>
>>> I have already tried launching `umr` as the root user, but the kernel
>>> still doesn't let it open `amdgpu_ring_gfx`.
>>>
>>> What other kernel options should I check?
>>>
>>> I have also attached the current kernel config to this message.
>>
>> I can replicate this by doing
>>
>> chmod u+s umr
>> sudo ./umr -R gfx[.]
>>
>> You need to remove the u+s bit; you are literally not running umr as root!
>
> Actually disregard that. I'm confused at this point.
>
> I run umr 100s of times a day on my devel box just fine as root.
>
> Let me fiddle and see if I can sort this out.
Ya I was right. With a plain build I can access the files just fine.
tom at fx8:~/stuff/public/umr/src/app $ stat ./umr
File: ./umr
Size: 89204248 Blocks: 174240 IO Block: 4096 regular file
Device: fd01h/64769d Inode: 14946407 Links: 1
Access: (0775/-rwxrwxr-x) Uid: ( 1000/ tom) Gid: ( 1000/ tom)
Access: 2018-12-20 09:15:03.348320256 -0500
Modify: 2018-12-20 09:05:48.148724423 -0500
Change: 2018-12-20 09:14:43.964948557 -0500
Birth: -
tom at fx8:~/stuff/public/umr/src/app $ sudo ./umr -R gfx[.]
raven1.gfx.rptr == 768
raven1.gfx.wptr == 768
raven1.gfx.drv_wptr == 768
raven1.gfx.ring[ 737] == 0xffff1000 ...
raven1.gfx.ring[ 738] == 0xffff1000 ...
raven1.gfx.ring[ 739] == 0xffff1000 ...
raven1.gfx.ring[ 740] == 0xffff1000 ...
raven1.gfx.ring[ 741] == 0xffff1000 ...
raven1.gfx.ring[ 742] == 0xffff1000 ...
raven1.gfx.ring[ 743] == 0xffff1000 ...
raven1.gfx.ring[ 744] == 0xffff1000 ...
raven1.gfx.ring[ 745] == 0xffff1000 ...
raven1.gfx.ring[ 746] == 0xffff1000 ...
raven1.gfx.ring[ 747] == 0xffff1000 ...
raven1.gfx.ring[ 748] == 0xffff1000 ...
raven1.gfx.ring[ 749] == 0xffff1000 ...
raven1.gfx.ring[ 750] == 0xffff1000 ...
raven1.gfx.ring[ 751] == 0xffff1000 ...
raven1.gfx.ring[ 752] == 0xffff1000 ...
raven1.gfx.ring[ 753] == 0xffff1000 ...
raven1.gfx.ring[ 754] == 0xffff1000 ...
raven1.gfx.ring[ 755] == 0xffff1000 ...
raven1.gfx.ring[ 756] == 0xffff1000 ...
raven1.gfx.ring[ 757] == 0xffff1000 ...
raven1.gfx.ring[ 758] == 0xffff1000 ...
raven1.gfx.ring[ 759] == 0xffff1000 ...
raven1.gfx.ring[ 760] == 0xffff1000 ...
raven1.gfx.ring[ 761] == 0xffff1000 ...
raven1.gfx.ring[ 762] == 0xffff1000 ...
raven1.gfx.ring[ 763] == 0xffff1000 ...
raven1.gfx.ring[ 764] == 0xffff1000 ...
raven1.gfx.ring[ 765] == 0xffff1000 ...
raven1.gfx.ring[ 766] == 0xffff1000 ...
raven1.gfx.ring[ 767] == 0xffff1000 ...
raven1.gfx.ring[ 768] == 0xc0032200 rwD
I did manage to get into a weird shell where I couldn't cat
amdgpu_gca_config from bash, though after a reboot (I had updates
pending) it works fine.
If you can't cat those files then neither can umr.
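The two checks discussed in this thread (is the setuid bit set on the umr binary, and can the debugfs file be read with cat at all) can be sketched as a small shell helper. This is only an illustration, not part of umr: the function names `has_setuid`, `can_cat`, and `diagnose` are hypothetical, and the paths passed in are whatever binary and debugfs file you want to test.

```shell
# Hypothetical helpers, not part of umr.

# Succeeds if the file carries the setuid bit (as set by chmod u+s).
has_setuid() {
    [ -u "$1" ]
}

# Succeeds only if the file can be read the same way cat reads it.
can_cat() {
    cat "$1" >/dev/null 2>&1
}

# Print a short diagnosis for a binary ($1) and a debugfs file ($2).
diagnose() {
    if has_setuid "$1"; then
        echo "setuid bit set on $1"
    fi
    if can_cat "$2"; then
        echo "readable: $2"
    else
        echo "not readable: $2"
    fi
}
```

Run from a root shell, `diagnose ./umr /sys/kernel/debug/dri/0/amdgpu_ring_gfx` would report "not readable" in the situation described above, which points at the kernel or environment rather than at umr.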
So NOTABUG :-)
Tom