[amdgpu] Errors with amdgpu-pro 17.50 running on GX-424CC SOC
Will Wagner
willw at carallon.com
Fri Mar 9 11:45:21 UTC 2018
Apologies if this is not the right list for this question. The kernel
MAINTAINERS file suggests it is, but please let me know if I should
repost elsewhere.
I have a custom OpenCL application running under Ubuntu 16.04.4 with the
HWE kernel 4.13 and the amdgpu-pro 17.50 drivers. This is running on a
Fujitsu D3313-S6 industrial mainboard
(http://www.fujitsu.com/fts/products/computing/peripheral/mainboards/industrial-mainboards/d3313s.html).
After a period of running, anywhere from 5 minutes to 48 hours, we begin
to see these kernel traces. At some point after these errors appear, the
application fails.
[ 99.348774] amdgpu 0000:00:01.0: GPU fault detected: 146 0x08492014
[ 99.355041] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00103042
[ 99.362509] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x09020014
[ 99.369980] VM fault (0x14, vmid 4) at page 1060930, write from 'TC0' (0x54433000) (32)
[ 100.437547] amdgpu 0000:00:01.0: GPU fault detected: 146 0x08492014
[ 100.443811] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00103042
[ 100.451288] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x09020014
[ 100.458758] VM fault (0x14, vmid 4) at page 1060930, write from 'TC0' (0x54433000) (32)
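As a sanity check on how to read these lines, here is a minimal Python
sketch that splits the logged register values into the fields the driver
prints in its "VM fault" summary. The bit positions are an assumption:
they are inferred by matching 0x09020014 against the summary line
(0x14, vmid 4, write, 32), not taken from register documentation. The
4 KiB page size and the name decode_fault are likewise assumptions made
for illustration.

```python
def decode_fault(status: int, addr: int, mc_client: int) -> dict:
    """Split the logged fault values into named fields.

    ASSUMPTION: the bit layout below is inferred from the log itself
    (it reproduces the driver's own summary line), not from a spec.
    """
    # The client tag is four ASCII bytes packed big-endian, NUL padded.
    client = mc_client.to_bytes(4, "big").rstrip(b"\x00").decode("ascii", "replace")
    return {
        "protections": status & 0xFF,           # printed as 0x14
        "client_id": (status >> 12) & 0xFF,     # printed as (32)
        "access": "write" if (status >> 24) & 1 else "read",
        "vmid": (status >> 25) & 0xF,           # printed as vmid 4
        "page": addr,                           # printed as page 1060930
        "byte_address": addr << 12,             # ASSUMPTION: 4 KiB pages
        "client": client,                       # printed as 'TC0'
    }


fault = decode_fault(0x09020014, 0x00103042, 0x54433000)
for name, value in fault.items():
    print(name, value)
```

With the values from the log this reproduces the summary line: vmid 4, a
write, page 1060930, client 'TC0'. Both reports above are for the same
page.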
I know from searching the web that this error can appear if there are
errors in the OpenCL program. However, we have run the exact same program
on multiple other hardware configurations and have not seen problems. On
Linux we have had success running on all machines tested with a discrete
AMD GPU, just not on the GX-424CC APU. On Windows we have had the code
running on a large number of platforms, including the GX-424CC, without
issues.
I'm prepared to believe we have an error in our OpenCL code, but have no
clue where to start looking. What does the error actually mean and why
does it happen? Is it related to buffer transfers between host and
device, or does it happen during execution of a kernel?
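One possible starting point is to check whether the faults always hit
the same page and VMID or move around between runs. The following is a
minimal Python sketch, assuming the kernel log has been captured as text
and that the "VM fault" summary line keeps the format shown above; the
names FAULT_RE and tally_faults are made up for illustration.

```python
import re
from collections import Counter

# Matches the driver's summary line as it appears in the log above.
FAULT_RE = re.compile(
    r"VM fault \(0x([0-9a-fA-F]+), vmid (\d+)\) at page (\d+), "
    r"(read|write) from '([^']*)'"
)


def tally_faults(log_text: str) -> Counter:
    """Count faults grouped by (vmid, page, access, client)."""
    counts = Counter()
    for match in FAULT_RE.finditer(log_text):
        _prot, vmid, page, access, client = match.groups()
        counts[(int(vmid), int(page), access, client)] += 1
    return counts


sample = """\
[ 99.369980] VM fault (0x14, vmid 4) at page 1060930, write from 'TC0' (0x54433000) (32)
[ 100.458758] VM fault (0x14, vmid 4) at page 1060930, write from 'TC0' (0x54433000) (32)
"""

for key, count in tally_faults(sample).items():
    print(key, count)
```

Feeding it a longer capture of the kernel log would show whether every
fault is a write to the same page, as in the two reports quoted here.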
Whilst attempting to investigate the problem I have tried a number of
kernel arguments for the driver. If I reduce the amount of memory
assigned to VRAM with vramlimit=64, the error appears to take longer to
occur.
If I run it with the arguments vm_debug=1 vm_fault_stop=1, the error no
longer appears. I would have expected it to occur at least once, given
vm_fault_stop=1, but it does not. Instead, I occasionally get this
error:
[ 7612.741693] amdgpu 0000:00:01.0: IH ring buffer overflow (0x00000010, 0x00000000, 0x00000020)
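To rule out the parameters simply not being applied, the values the
loaded module sees can be read back from sysfs. This is a minimal Python
sketch, assuming the driver exposes its parameters under
/sys/module/amdgpu/parameters (the standard location for module
parameters); some entries may be readable only by root, and the helper
name read_module_params is made up for illustration.

```python
from pathlib import Path


def read_module_params(names, base="/sys/module/amdgpu/parameters"):
    """Return {name: value or None} for each requested module parameter."""
    values = {}
    for name in names:
        try:
            values[name] = (Path(base) / name).read_text().strip()
        except OSError:
            # Not exposed, not readable by this user, or module not loaded.
            values[name] = None
    return values


if __name__ == "__main__":
    wanted = ["vramlimit", "vm_debug", "vm_fault_stop"]
    for name, value in read_module_params(wanted).items():
        print(f"{name} = {value}")
```

A value of None here only means the file could not be read; it does not
by itself show that the parameter was ignored.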
So is this a bug in the driver or in the OpenCL code? How can I make
progress debugging this issue?
Thanks
Will