[amd-gfx] AMD Carrizo - GPU fault detected: 146 0x0842b714
Nicolai Hähnle
nhaehnle at gmail.com
Sat Jun 18 12:30:38 UTC 2016
On 18.06.2016 13:56, Mads wrote:
> I removed the global env R600_DEBUG=nodma before this test, didn't seem
> to matter anyway...
>
> On 2016-06-18 13:36, Nicolai Hähnle wrote:
>
>> A sanity check is `grep radeonsi /proc/$pid/maps` -- if something
>> shows up, the driver was loaded into the process.
>
> dolphin has pid 560:
>
> $ grep radeonsi /proc/560/maps
> 7f7e70906000-7f7e7100a000 r-xp 00000000 00:0e 2125313
> /usr/lib64/mesa/radeonsi_dri.so
> 7f7e7100a000-7f7e71043000 rw-p 00703000 00:0e 2125313
> /usr/lib64/mesa/radeonsi_dri.so
>
> So that's something, I guess...
>
> So, newly compiled mesa from git with assertions/debug enabled:
>
> $ XAUTHORITY=.Xauthority DISPLAY=:0 LIBGL_DEBUG=verbose dolphin
> libGL: pci id for fd 9: 1002:9874, driver radeonsi
> libGL: OpenDriver: trying /usr/lib64/dri/tls/radeonsi_dri.so
> libGL: OpenDriver: trying /usr/lib64/dri/radeonsi_dri.so
> libGL: Using DRI3 for screen 0
> Trying to convert empty KLocalizedString to QString.
> Cannot creat accessible child interface for object:
> PlacesView(0xb7adc0) index: 4
> QPixmap::scaled: Pixmap is a null pixmap
> QPixmap::scaled: Pixmap is a null pixmap
> (... repeating a few times, guessing there's missing icons in the
> themeset or something. dolphin itself does not crash...)

Okay, so since dolphin uses OpenGL for rendering as well, the problem
now is to figure out whether the VM fault comes from dolphin or from the
compositor.

There are two approaches. The first one is to just try your luck and
capture an apitrace of dolphin, and then see whether playing that
apitrace back also produces VM faults. If it does, great - upload the
apitrace somewhere, and we can hopefully get it fixed.

The second approach is to correlate the VM ID in
> dmesg:
> [ 78.873577] amdgpu 0000:00:01.0: GPU fault detected: 146 0x08e2b714
> [ 78.873590] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR
> 0x0010151C
> [ 78.873592] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS
> 0x0D0B7014
> [ 78.873595] VM fault (0x14, vmid 6) at page 1053980, write from
> 'SDM0' (0x53444d30) (183)
with the running processes. This can be done via tracing. As root:
echo 1 > /sys/kernel/debug/tracing/events/amdgpu/amdgpu_cs_ioctl/enable
echo 1 > /sys/kernel/debug/tracing/events/gpu_sched/amd_sched_job/enable
echo 1 > /sys/kernel/debug/tracing/events/amdgpu/amdgpu_sched_run_job/enable
echo 1 > /sys/kernel/debug/tracing/events/amdgpu/amdgpu_vm_grab_id/enable
cat /sys/kernel/debug/tracing/trace_pipe
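
Before wading into the trace, it helps to pin down which VM ID to look
for. As a rough sketch in Python (not part of the driver or of any
existing tool), the "VM fault" line from the dmesg quoted above can be
parsed, and the STATUS register decoded by hand as a cross-check. The
function names are made up for illustration. The bit positions in
decode_status() are an assumption: they reproduce the values printed in
this log (0x14, vmid 6, client 183, write), but should be checked against
the kernel's register headers before relying on them.

```python
import re

# Matches the human-readable summary line that follows the register dump.
# The line is assumed to be unwrapped, i.e. as dmesg prints it.
FAULT_RE = re.compile(
    r"VM fault \((0x[0-9a-fA-F]+), vmid (\d+)\) at page (\d+), "
    r"(read|write) from '([^']*)' \((0x[0-9a-fA-F]+)\) \((\d+)\)"
)


def parse_vm_fault(line):
    """Return the fields of a 'VM fault' dmesg line, or None if no match."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    prot, vmid, page, rw, client, tag, client_id = m.groups()
    return {
        "protections": int(prot, 16),
        "vmid": int(vmid),
        "page": int(page),
        "write": rw == "write",
        "client": client,
        "client_tag": int(tag, 16),
        "client_id": int(client_id),
    }


def decode_status(status):
    """Split VM_CONTEXT1_PROTECTION_FAULT_STATUS into fields.

    ASSUMED layout (consistent with this log, not verified elsewhere):
    bits 0-7 protections, 12-19 client id, 24 read/write, 25-28 vmid.
    """
    return {
        "protections": status & 0xFF,
        "client_id": (status >> 12) & 0xFF,
        "write": bool((status >> 24) & 1),
        "vmid": (status >> 25) & 0xF,
    }


if __name__ == "__main__":
    line = ("[   78.873595] VM fault (0x14, vmid 6) at page 1053980, "
            "write from 'SDM0' (0x53444d30) (183)")
    print(parse_vm_fault(line))
    print(decode_status(0x0D0B7014))
    # The page number is VM_CONTEXT1_PROTECTION_FAULT_ADDR in decimal,
    # and the hex tag is the client name in ASCII.
    print(int("0x0010151C", 16), bytes.fromhex("53444d30").decode("ascii"))
```

Both routes give vmid 6 for this fault, so that is the ID to look for in
the trace output.
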
You'll get *lots* of output of the form
compiz-2065 [000] .... 14927.891778: amdgpu_cs_ioctl:
adev=ffff88022fe70000, sched_job=ffff880110dab2a0, first
ib=ffff8800923e0200, sched fence=ffff880068509b80, ring name:gfx, num_ibs:1
compiz-2065 [000] .... 14927.891782: amd_sched_job:
entity=ffff88023258f030, sched job=ffff880110dab2a0,
fence=ffff880068509b80, ring=gfx, job count:0, hw job count:0
gfx-172 [002] .... 14927.891802: amdgpu_sched_run_job:
adev=ffff88022fe70000, sched_job=ffff880110dab2a0, first
ib=ffff8800923e0200, sched fence=ffff880068509b80, ring name:gfx, num_ibs:1
gfx-172 [002] .... 14927.891809: amdgpu_vm_grab_id:
vmid=5, ring=0
In this particular case, compiz submitted a CS (command stream), which
was then asynchronously sent and processed on the gfx ring with vmid=5.
The idea is to correlate the timestamps and VM IDs in the trace with
those of the VM fault (vmid 6 in your dmesg) to see which process is at
fault. If you do this, please send a bit more log context as
attachments, because asynchronous execution can occasionally make the
logs difficult to interpret.
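
The bookkeeping for that correlation can be sketched in Python. This is
only an illustration under stated assumptions, not an existing tool: it
assumes each trace record is on a single line (as trace_pipe emits it,
unlike the wrapped sample above), and it attributes an amdgpu_vm_grab_id
event to the job most recently started by amdgpu_sched_run_job on the
same scheduler thread, which is the ordering seen in the sample. The
function name vmid_owners is hypothetical.

```python
import re

# task-pid [cpu] flags timestamp: event: payload
RECORD_RE = re.compile(
    r"^\s*(?P<task>.+)-(?P<pid>\d+)\s+\[\d+\]\s+\S+\s+"
    r"(?P<ts>\d+\.\d+):\s+(?P<event>\w+):\s*(?P<rest>.*)$"
)
# The events spell the field either "sched_job=" or "sched job=".
JOB_RE = re.compile(r"sched[_ ]job=([0-9a-f]+)")
VMID_RE = re.compile(r"vmid=(\d+)")


def vmid_owners(lines):
    """Return (timestamp, vmid, task, pid) for every VM ID grab that can
    be traced back to the process which submitted the job."""
    submitter = {}  # sched_job pointer -> (task, pid) of the CS ioctl caller
    running = {}    # scheduler thread pid -> sched_job it started last
    owners = []
    for line in lines:
        m = RECORD_RE.match(line)
        if m is None:
            continue
        pid, event, rest = int(m["pid"]), m["event"], m["rest"]
        job = JOB_RE.search(rest)
        if event == "amdgpu_cs_ioctl" and job:
            submitter[job.group(1)] = (m["task"], pid)
        elif event == "amdgpu_sched_run_job" and job:
            running[pid] = job.group(1)
        elif event == "amdgpu_vm_grab_id":
            vmid = VMID_RE.search(rest)
            current = running.get(pid)
            if vmid and current in submitter:
                task, submit_pid = submitter[current]
                owners.append(
                    (float(m["ts"]), int(vmid.group(1)), task, submit_pid)
                )
    return owners


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): python3 vmid_owners.py < trace.log
    faulting_vmid = 6  # taken from the "VM fault" line in dmesg
    for ts, vmid, task, pid in vmid_owners(sys.stdin):
        if vmid == faulting_vmid:
            print(f"{ts:.6f} vmid={vmid} {task} (pid {pid})")
```

The VM ID alone may not settle the question if the same ID shows up for
more than one process over the length of the trace, so the entries
closest in time before the fault are the interesting ones. The raw log
is still worth attaching, since this heuristic can misattribute jobs
when execution is heavily interleaved.
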
Cheers,
Nicolai
> [ 78.873598] amdgpu 0000:00:01.0: GPU fault detected: 146 0x08eab714
> [ 78.873600] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR
> 0x0010151C
> [ 78.873602] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS
> 0x0D0B7014
> [ 78.873604] VM fault (0x14, vmid 6) at page 1053980, write from
> 'SDM0' (0x53444d30) (183)
> [ 78.874141] amdgpu 0000:00:01.0: GPU fault detected: 146 0x08e2b714
> [ 78.874148] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR
> 0x0010151C
> [ 78.874150] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS
> 0x0D0B7014
> [ 78.874154] VM fault (0x14, vmid 6) at page 1053980, write from
> 'SDM0' (0x53444d30) (183)
> [ 78.874158] amdgpu 0000:00:01.0: GPU fault detected: 146 0x08eab714
> [ 78.874160] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR
> 0x0010151C
> [ 78.874162] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS
> 0x0D0B7014
> [ 78.874164] VM fault (0x14, vmid 6) at page 1053980, write from
> 'SDM0' (0x53444d30) (183)
>
> - Mads