[amdgpu] Errors with amdgpu-pro 17.50 running on GX-424CC SOC

Fri Mar 9 11:45:21 UTC 2018

Apologies if this is not the right list for this question. The kernel
MAINTAINERS file suggests it is, but please let me know if I should
repost elsewhere.

I have a custom OpenCL application running under Ubuntu 16.04.4 with the
HWE kernel 4.13 and the amdgpu-pro 17.50 drivers. This is running on a
Fujitsu D3313-S6 industrial mainboard
(http://www.fujitsu.com/fts/products/computing/peripheral/mainboards/industrial-mainboards/d3313s.html).

After the application has been running for anywhere from 5 minutes to
48 hours, we begin to see these kernel traces. At some point after the
errors first appear, the application fails.

[   99.348774] amdgpu 0000:00:01.0: GPU fault detected: 146 0x08492014
[   99.355041] amdgpu 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00103042
[   99.362509] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x09020014
[   99.369980] VM fault (0x14, vmid 4) at page 1060930, write from 'TC0' (0x54433000) (32)
[  100.437547] amdgpu 0000:00:01.0: GPU fault detected: 146 0x08492014
[  100.443811] amdgpu 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00103042
[  100.451288] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x09020014
[  100.458758] VM fault (0x14, vmid 4) at page 1060930, write from 'TC0' (0x54433000) (32)
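
To cross-check how the summary line relates to the raw register values,
here is a minimal Python sketch that decodes the two registers from the
log above. The bit layout of VM_CONTEXT1_PROTECTION_FAULT_STATUS used
here is an assumption on my part (I picked the masks because they
reproduce the vmid, client id and access type printed in the log), and
the 4 KiB page size is also assumed. decode_fault is just a name I made
up for illustration.

```python
# Assumed bit layout of VM_CONTEXT1_PROTECTION_FAULT_STATUS.
# These masks are an assumption; they were chosen because they
# reproduce the fields printed in the "VM fault" log line.
PROTECTIONS_MASK, PROTECTIONS_SHIFT = 0x000000FF, 0
CLIENT_ID_MASK, CLIENT_ID_SHIFT = 0x000FF000, 12
CLIENT_RW_MASK = 0x01000000
VMID_MASK, VMID_SHIFT = 0x1E000000, 25

PAGE_SIZE = 4096  # assumes 4 KiB GPU pages


def decode_fault(addr, status, mc_client):
    """Split the raw fault registers into the fields shown in the log."""
    return {
        # FAULT_ADDR is a page number, not a byte address.
        "page": addr,
        "byte_address": addr * PAGE_SIZE,
        "protections": (status & PROTECTIONS_MASK) >> PROTECTIONS_SHIFT,
        "client_id": (status & CLIENT_ID_MASK) >> CLIENT_ID_SHIFT,
        "access": "write" if status & CLIENT_RW_MASK else "read",
        "vmid": (status & VMID_MASK) >> VMID_SHIFT,
        # The client value is a 4-byte ASCII tag padded with NUL bytes.
        "client_name": mc_client.to_bytes(4, "big").rstrip(b"\0").decode("ascii"),
    }


fault = decode_fault(0x00103042, 0x09020014, 0x54433000)
for key, value in fault.items():
    print(key, "=", value)
```

With the values from the log this gives page 1060930 (0x00103042 in
decimal), protections 0x14, vmid 4, a write access and client 'TC0'
with id 32, which matches the "VM fault" line. So the summary line adds
no new information beyond the two registers; it does not tell me which
buffer or kernel the faulting page belongs to.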

I know from searching the web that this error can appear if there are
errors in the OpenCL program. However, we have run the exact same program
on multiple other hardware configurations and have not seen problems. On
Linux we have had success running on all machines tested with a discrete
AMD GPU, just not on the GX-424CC APU. On Windows we have had the code
running on a large number of platforms, including the GX-424CC, without
issues.

I'm prepared to believe we have an error in our OpenCL code, but have no
clue where to start looking. What does the error actually mean and why
does it happen? Is it related to buffer transfers between host and
device, or does it happen during execution of a kernel?

Whilst attempting to investigate the problem I have tried a number of
kernel arguments for the driver. If I reduce the amount of memory
assigned to VRAM with vramlimit=64, then the error appears to take
longer to occur.

If I run it with the arguments vm_debug=1 vm_fault_stop=1, the VM fault
no longer appears. I would have expected it to occur at least once due
to vm_fault_stop=1, but it does not. Instead, I occasionally get this
error:

[ 7612.741693] amdgpu 0000:00:01.0: IH ring buffer overflow (0x00000010, 0x00000000, 0x00000020)
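
To rule out the possibility that the arguments were simply not applied,
the values the loaded module is using can be read back from sysfs. This
is a minimal sketch, assuming the amdgpu module exposes these parameters
as readable files under /sys/module/amdgpu/parameters;
read_module_params is a hypothetical helper name, and any parameter that
cannot be read is reported as None rather than guessed.

```python
from pathlib import Path


def read_module_params(names, base="/sys/module/amdgpu/parameters"):
    """Return {name: value} for each parameter file found under base.

    A parameter that is missing or unreadable maps to None.
    """
    values = {}
    for name in names:
        try:
            values[name] = (Path(base) / name).read_text().strip()
        except OSError:
            values[name] = None
    return values


if __name__ == "__main__":
    params = read_module_params(["vm_debug", "vm_fault_stop", "vramlimit"])
    for name, value in params.items():
        print(name, "=", value if value is not None else "not readable")
```

If vm_fault_stop reads back as 1 here, then the parameter did take
effect and the absence of the VM fault is a real change in behaviour
rather than a configuration mistake.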

So is this a bug in the driver or in the OpenCL code? How can I make
progress debugging this issue?

Thanks
Will