After Vega 56/64 GPU hang I am unable to reboot the system

StDenis, Tom Tom.StDenis at amd.com
Thu Dec 20 14:19:40 UTC 2018


On 2018-12-20 9:08 a.m., Tom St Denis wrote:
> On 2018-12-20 9:06 a.m., Tom St Denis wrote:
>> On 2018-12-20 6:45 a.m., Mikhail Gavrilov wrote:
>>> On Thu, 20 Dec 2018 at 16:17, StDenis, Tom <Tom.StDenis at amd.com> wrote:
>>>>
>>>> Well yup the kernel is not letting you open the files:
>>>>
>>>>
>>>> As sudo/root you should be able to open these files with umr.  What
>>>> happens if you just open a shell as root and run it?
>>>>
>>>
>>> [root@localhost ~]# touch /sys/kernel/debug/dri/0/amdgpu_ring_gfx
>>> [root@localhost ~]# cat /sys/kernel/debug/dri/0/amdgpu_ring_gfx
>>> cat: /sys/kernel/debug/dri/0/amdgpu_ring_gfx: Operation not permitted
>>> [root@localhost ~]# ls -laZ /sys/kernel/debug/dri/0/amdgpu_ring_gfx
>>> -r--r--r--. 1 root root system_u:object_r:debugfs_t:s0 8204 Dec 20 16:31 /sys/kernel/debug/dri/0/amdgpu_ring_gfx
>>> [root@localhost ~]# getenforce
>>> Permissive
>>> [root@localhost ~]# /home/mikhail/packaging-work/umr/build/src/app/umr -O verbose,halt_waves -wa
>>> Cannot seek to MMIO address: Bad file descriptor
>>> [ERROR]: Could not open ring debugfs file
>>> Segmentation fault (core dumped)
>>>
>>> I already tried launching `umr` as the root user, but the kernel still
>>> won't let it open `amdgpu_ring_gfx`.
>>>
>>> What other kernel options should I check?
>>>
>>> I have also attached the current kernel config to this message.
>>
>> I can replicate this by doing
>>
>> chmod u+s umr
>> sudo ./umr -R gfx[.]
>>
>> You need to remove the u+s bit; with it set you are literally not
>> running umr as root!
> 
> Actually disregard that.  I'm confused at this point.
> 
> I run umr 100s of times a day on my devel box just fine as root.
> 
> Let me fiddle and see if I can sort this out.


Ya I was right.  With a plain build I can access the files just fine.

tom@fx8:~/stuff/public/umr/src/app $ stat ./umr
   File: ./umr
   Size: 89204248  	Blocks: 174240     IO Block: 4096   regular file
Device: fd01h/64769d	Inode: 14946407    Links: 1
Access: (0775/-rwxrwxr-x)  Uid: ( 1000/     tom)   Gid: ( 1000/     tom)
Access: 2018-12-20 09:15:03.348320256 -0500
Modify: 2018-12-20 09:05:48.148724423 -0500
Change: 2018-12-20 09:14:43.964948557 -0500
  Birth: -
tom@fx8:~/stuff/public/umr/src/app $ sudo ./umr -R gfx[.]

raven1.gfx.rptr == 768
raven1.gfx.wptr == 768
raven1.gfx.drv_wptr == 768
raven1.gfx.ring[ 737] == 0xffff1000    ...
raven1.gfx.ring[ 738] == 0xffff1000    ...
raven1.gfx.ring[ 739] == 0xffff1000    ...
raven1.gfx.ring[ 740] == 0xffff1000    ...
raven1.gfx.ring[ 741] == 0xffff1000    ...
raven1.gfx.ring[ 742] == 0xffff1000    ...
raven1.gfx.ring[ 743] == 0xffff1000    ...
raven1.gfx.ring[ 744] == 0xffff1000    ...
raven1.gfx.ring[ 745] == 0xffff1000    ...
raven1.gfx.ring[ 746] == 0xffff1000    ...
raven1.gfx.ring[ 747] == 0xffff1000    ...
raven1.gfx.ring[ 748] == 0xffff1000    ...
raven1.gfx.ring[ 749] == 0xffff1000    ...
raven1.gfx.ring[ 750] == 0xffff1000    ...
raven1.gfx.ring[ 751] == 0xffff1000    ...
raven1.gfx.ring[ 752] == 0xffff1000    ...
raven1.gfx.ring[ 753] == 0xffff1000    ...
raven1.gfx.ring[ 754] == 0xffff1000    ...
raven1.gfx.ring[ 755] == 0xffff1000    ...
raven1.gfx.ring[ 756] == 0xffff1000    ...
raven1.gfx.ring[ 757] == 0xffff1000    ...
raven1.gfx.ring[ 758] == 0xffff1000    ...
raven1.gfx.ring[ 759] == 0xffff1000    ...
raven1.gfx.ring[ 760] == 0xffff1000    ...
raven1.gfx.ring[ 761] == 0xffff1000    ...
raven1.gfx.ring[ 762] == 0xffff1000    ...
raven1.gfx.ring[ 763] == 0xffff1000    ...
raven1.gfx.ring[ 764] == 0xffff1000    ...
raven1.gfx.ring[ 765] == 0xffff1000    ...
raven1.gfx.ring[ 766] == 0xffff1000    ...
raven1.gfx.ring[ 767] == 0xffff1000    ...
raven1.gfx.ring[ 768] == 0xc0032200    rwD
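
For reference, "plain build" here means the mode shown by stat above (0775, no setuid bit). If you want to check that on your own binary from a script, something like the following sketch should do. The helper name check_setuid is made up for this example and is not part of umr; it only relies on the standard `-u` file test.

```sh
# Minimal sketch: report whether a file has the set-user-ID bit set.
# check_setuid is a hypothetical helper name, not something umr provides.
check_setuid() {
    # [ -u FILE ] is true when FILE exists and its set-user-ID bit is set.
    if [ -u "$1" ]; then
        echo "setuid"
    else
        echo "plain"
    fi
}

# Example: check_setuid ./umr
# If this reports "setuid", clear the bit with: chmod u-s ./umr
```

On a binary built and left alone this should report "plain", which is the case that works for me.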


I did manage to get into a weird shell where I couldn't cat
amdgpu_gca_config from bash, though after a reboot (I had updates
pending) it works fine.

If you can't cat those files then neither can umr.
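
A rough way to run that check before involving umr at all is sketched below. It is only an illustration: can_read is a name invented for this example, and the path in the comment is the one from your transcript. It uses dd so that both a refused open and a refused read count as a failure.

```sh
# Minimal sketch: report whether the current user can actually read a file.
# can_read is a hypothetical helper name, not part of umr.
can_read() {
    # dd exits non-zero if either the open or the read is refused.
    if dd if="$1" of=/dev/null bs=1 count=1 2>/dev/null; then
        echo "readable"
    else
        echo "blocked"
    fi
}

# Example (run from a root shell):
#   can_read /sys/kernel/debug/dri/0/amdgpu_ring_gfx
```

If that reports "blocked" from a root shell, the problem is between the shell and the kernel, and umr is just the messenger.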

So NOTABUG :-)

Tom

