[amd-gfx] AMD Carrizo - GPU fault detected: 146 0x0842b714

Sat Jun 18 12:30:38 UTC 2016

On 18.06.2016 13:56, Mads wrote:
> I removed the global env R600_DEBUG=nodma before this test, didn't seem
> to matter anyway...
>
> On 2016-06-18 13:36, Nicolai Hähnle wrote:
>
>> A sanity check is `grep radeonsi /proc/$pid/maps` -- if something
>> shows up, the driver was loaded into the process.
>
> dolphin has pid 560:
>
> $ grep radeonsi /proc/560/maps
> 7f7e70906000-7f7e7100a000 r-xp 00000000 00:0e 2125313
> /usr/lib64/mesa/radeonsi_dri.so
> 7f7e7100a000-7f7e71043000 rw-p 00703000 00:0e 2125313
> /usr/lib64/mesa/radeonsi_dri.so
>
> So that's something, I guess...
>
> So, newly compiled mesa from git with assertions/debug enabled:
>
> $ XAUTHORITY=.Xauthority DISPLAY=:0 LIBGL_DEBUG=verbose dolphin
> libGL: pci id for fd 9: 1002:9874, driver radeonsi
> libGL: OpenDriver: trying /usr/lib64/dri/tls/radeonsi_dri.so
> libGL: OpenDriver: trying /usr/lib64/dri/radeonsi_dri.so
> libGL: Using DRI3 for screen 0
> Trying to convert empty KLocalizedString to QString.
> Cannot creat accessible child interface for object:
> PlacesView(0xb7adc0)  index:  4
> QPixmap::scaled: Pixmap is a null pixmap
> QPixmap::scaled: Pixmap is a null pixmap
> (... repeating a few times, guessing there's missing icons in the
> themeset or something. dolphin itself does not crash...)

Okay. Since dolphin also uses OpenGL for rendering, the next step is to 
figure out whether the VM fault comes from dolphin or from the 
compositor.

There are two approaches. The first one is to just try your luck and 
capture an apitrace of dolphin, and then see whether playing that 
apitrace back also produces VM faults. If it does, great - upload the 
apitrace somewhere, and we can hopefully get it fixed.
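
A minimal sketch of that first approach, assuming apitrace is installed 
and using its `trace` and `replay` subcommands. The helper names and the 
DRY_RUN switch are hypothetical and only there for illustration; 
dolphin.trace is just an example file name.

```shell
# Hypothetical helper: print the command when DRY_RUN is set, else run it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Record the GL calls of a program (plus its arguments) into <program>.trace.
capture_trace() {
    run apitrace trace -o "$1.trace" "$@"
}

# Play a trace back; afterwards, check dmesg for new "GPU fault detected" lines.
replay_trace() {
    run apitrace replay "$1"
}
```

Usage would be `capture_trace dolphin`, reproduce the fault, quit dolphin, 
then `replay_trace dolphin.trace` while watching `dmesg`.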

The second approach is to correlate the VM ID in

> dmesg:
> [   78.873577] amdgpu 0000:00:01.0: GPU fault detected: 146 0x08e2b714
> [   78.873590] amdgpu 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR
>   0x0010151C
> [   78.873592] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS
> 0x0D0B7014
> [   78.873595] VM fault (0x14, vmid 6) at page 1053980, write from
> 'SDM0' (0x53444d30) (183)

with the running processes. This can be done via tracing. As root:

echo 1 > /sys/kernel/debug/tracing/events/amdgpu/amdgpu_cs_ioctl/enable
echo 1 > /sys/kernel/debug/tracing/events/gpu_sched/amd_sched_job/enable
echo 1 > /sys/kernel/debug/tracing/events/amdgpu/amdgpu_sched_run_job/enable
echo 1 > /sys/kernel/debug/tracing/events/amdgpu/amdgpu_vm_grab_id/enable
cat /sys/kernel/debug/tracing/trace_pipe

You'll get *lots* of output of the form

           compiz-2065  [000] .... 14927.891778: amdgpu_cs_ioctl: adev=ffff88022fe70000, sched_job=ffff880110dab2a0, first ib=ffff8800923e0200, sched fence=ffff880068509b80, ring name:gfx, num_ibs:1
           compiz-2065  [000] .... 14927.891782: amd_sched_job: entity=ffff88023258f030, sched job=ffff880110dab2a0, fence=ffff880068509b80, ring=gfx, job count:0, hw job count:0
              gfx-172   [002] .... 14927.891802: amdgpu_sched_run_job: adev=ffff88022fe70000, sched_job=ffff880110dab2a0, first ib=ffff8800923e0200, sched fence=ffff880068509b80, ring name:gfx, num_ibs:1
              gfx-172   [002] .... 14927.891809: amdgpu_vm_grab_id: vmid=5, ring=0

In this particular case, compiz submitted a CS (command stream), which 
was then asynchronously sent and processed on the gfx ring with vmid=5.
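
The matching can be automated roughly as follows. This is only a sketch: 
`vmid_owners` is a hypothetical helper, it expects one trace record per 
line (as trace_pipe emits them), it assumes task names without spaces, 
and it assumes that amdgpu_vm_grab_id follows amdgpu_sched_run_job on 
the same kernel thread, as in the sample above.

```shell
# Hypothetical helper: read trace records on stdin and print
# "<timestamp> vmid=<n> <submitting task>" for every amdgpu_vm_grab_id event.
vmid_owners() {
    awk '
    function field(re, skip) {
        # Return the text matched by re minus its first skip characters.
        if (match($0, re)) return substr($0, RSTART + skip, RLENGTH - skip)
        return ""
    }
    # Submission: remember which task created this sched_job.
    $5 == "amdgpu_cs_ioctl:"      { owner[field("sched_job=[0-9a-f]+", 10)] = $1 }
    # Execution: remember which job the ring thread picked up last.
    $5 == "amdgpu_sched_run_job:" { running[$1] = field("sched_job=[0-9a-f]+", 10) }
    # VM ID assignment: attribute it to the job running on this thread.
    $5 == "amdgpu_vm_grab_id:"    {
        job = running[$1]
        who = (job in owner) ? owner[job] : "unknown"
        ts = $4; sub(/:$/, "", ts)
        print ts, "vmid=" field("vmid=[0-9]+", 5), who
    }'
}
```

For example, `vmid_owners < trace.log` on a saved capture; for the sample 
above it prints `14927.891809 vmid=5 compiz-2065`.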

The idea is to correlate the timestamps with those of the VM fault to 
see which process was using the faulting VM ID (vmid 6 in your dmesg 
output) at that time. If you do this, please send a bit more log 
context in attachments, because asynchronous execution can 
occasionally make the logs difficult to interpret.
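
As a cross-check when matching against a fault, the vmid and page number 
in the decoded dmesg line can be recomputed from the two raw register 
values. `decode_fault` is a hypothetical helper, and the bit positions 
in it are an assumption on my part: they reproduce the decoded line in 
your log, but I have not checked them against the register headers here.

```shell
# Hypothetical helper: decode_fault <FAULT_STATUS> <FAULT_ADDR>
decode_fault() {
    status=$(( $1 ))
    addr=$(( $2 ))

    # Assumed field layout of VM_CONTEXT1_PROTECTION_FAULT_STATUS.
    protections=$(( status & 0xFF ))       # bits 0-7
    mc_id=$(( (status >> 12) & 0xFF ))     # bits 12-19, memory client id
    is_write=$(( (status >> 24) & 1 ))     # bit 24, 1 = write
    vmid=$(( (status >> 25) & 0xF ))       # bits 25-28

    if [ "$is_write" -eq 1 ]; then rw=write; else rw=read; fi

    # The fault address register is printed as a decimal page number.
    printf 'VM fault (0x%02x, vmid %d) at page %d, %s (%d)\n' \
        "$protections" "$vmid" "$addr" "$rw" "$mc_id"
}
```

With the values from your log, `decode_fault 0x0D0B7014 0x0010151C` 
prints `VM fault (0x14, vmid 6) at page 1053980, write (183)`, so vmid 6 
is the one to look for in the trace.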

Cheers,
Nicolai

> [   78.873598] amdgpu 0000:00:01.0: GPU fault detected: 146 0x08eab714
> [   78.873600] amdgpu 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR
>   0x0010151C
> [   78.873602] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS
> 0x0D0B7014
> [   78.873604] VM fault (0x14, vmid 6) at page 1053980, write from
> 'SDM0' (0x53444d30) (183)
> [   78.874141] amdgpu 0000:00:01.0: GPU fault detected: 146 0x08e2b714
> [   78.874148] amdgpu 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR
>   0x0010151C
> [   78.874150] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS
> 0x0D0B7014
> [   78.874154] VM fault (0x14, vmid 6) at page 1053980, write from
> 'SDM0' (0x53444d30) (183)
> [   78.874158] amdgpu 0000:00:01.0: GPU fault detected: 146 0x08eab714
> [   78.874160] amdgpu 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR
>   0x0010151C
> [   78.874162] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS
> 0x0D0B7014
> [   78.874164] VM fault (0x14, vmid 6) at page 1053980, write from
> 'SDM0' (0x53444d30) (183)
>
> - Mads