GPU fault 146 0x0e903e0c VM fault read from VCE0 on Radeon Pro WX 2100 - possible driver bug or configuration error?

Mon May 7 03:41:38 UTC 2018

Hello List!

I'm encountering a strange potential bug with my brand-new Radeon Pro WX 2100 when it is passed through to a VM via VFIO. This might be a driver bug or a misconfiguration, so I'm looking for any advice you can offer!

The basic system setup is:

* Hypervisor is Debian 9.X running on a Dell C6100 blade (Intel E5649 CPU)
* GPU PCI device passed through via VFIO to a KVM/QEMU virtual machine - I have set this up with other cards, specifically a Radeon R9 270X and a Radeon HD6450 without trouble in the past.
* To work properly on my older CPUs, I'm setting "options vfio_iommu_type1 allow_unsafe_interrupts=1" in my modprobe configuration on the hypervisor.
* The VM is running Debian 9.X with the latest AMDGPU-PRO driver (which I understand is unsupported, but the drivers install fine, and the same problem happens with the open-source driver in the kernel as well)
* Inside the VM I've installed the standard VAAPI utilities to support transcode offloading on the GPU for ffmpeg; this configuration is completely headless aside from a virtual Cirrus display in the VM.

First, and this might be related, vainfo only returns information if I manually export radeonsi as the driver to use. This was not necessary with my HD6450, where the driver was detected by default:

# uname -a
Linux transcoder1 4.9.0-6-amd64 #1 SMP Debian 4.9.88-1 (2018-04-29) x86_64 GNU/Linux
# vainfo
error: XDG_RUNTIME_DIR not set in the environment.
error: can't connect to X server!
libva info: VA-API version 0.39.4
libva info: va_getDriverName() returns -1
libva error: va_getDriverName() failed with unknown libva error,driver_name=(null)
vaInitialize failed with error code -1 (unknown libva error),exit
# export LIBVA_DRIVER_NAME=radeonsi
# vainfo
error: XDG_RUNTIME_DIR not set in the environment.
error: can't connect to X server!
libva info: VA-API version 0.39.4
libva info: va_getDriverName() returns -1
libva info: User requested driver 'radeonsi'
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/radeonsi_drv_video.so
libva info: Found init function __vaDriverInit_0_39
libva info: va_openDriver() returns 0
vainfo: VA-API version: 0.39 (libva 1.7.3)
vainfo: Driver version: mesa gallium vaapi
vainfo: Supported profile and entrypoints
      VAProfileMPEG2Simple            : VAEntrypointVLD
      VAProfileMPEG2Main              : VAEntrypointVLD
      VAProfileVC1Simple              : VAEntrypointVLD
      VAProfileVC1Main                : VAEntrypointVLD
      VAProfileVC1Advanced            : VAEntrypointVLD
      VAProfileH264ConstrainedBaseline: VAEntrypointVLD
      VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
      VAProfileH264Main               : VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointEncSlice
      VAProfileH264High               : VAEntrypointVLD
      VAProfileH264High               : VAEntrypointEncSlice
      VAProfileHEVCMain               : VAEntrypointVLD
      VAProfileHEVCMain10             : VAEntrypointVLD
      VAProfileNone                   : VAEntrypointVideoProc

This setup doesn't crash when playing a 1080p x264 video file via ffmpeg; however, the output is badly corrupted (wrong colours, encoding failures, etc.), so there seems to be a more general problem. And the moment I try to decode a 4K HEVC video, the GPU crashes with the following error:

[  175.464769] amdgpu 0000:00:06.0: GPU fault detected: 146 0x0e92be14
[  175.466660] amdgpu 0000:00:06.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001011D2
[  175.468868] amdgpu 0000:00:06.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x030BE014
[  175.471056] amdgpu 0000:00:06.0: VM fault (0x14, vmid 1) at page 1053138, write from 'VCE0' (0x56434530) (190)
[  175.473920] amdgpu 0000:00:06.0: GPU fault detected: 146 0x0e92be14
[  175.475704] amdgpu 0000:00:06.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001011D4
[  175.477808] amdgpu 0000:00:06.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x030BE014
[  175.479900] amdgpu 0000:00:06.0: VM fault (0x14, vmid 1) at page 1053140, write from 'VCE0' (0x56434530) (190)
[  175.517997] amdgpu 0000:00:06.0: GPU fault detected: 146 0x0e903e0c
[  175.519798] amdgpu 0000:00:06.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001011D2
[  175.522011] amdgpu 0000:00:06.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0203E00C
[  175.524224] amdgpu 0000:00:06.0: VM fault (0x0c, vmid 1) at page 1053138, read from 'VCE0' (0x56434530) (62)
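For anyone cross-checking the fault lines above, here is a minimal Python sketch that decodes the raw register values into the fields the kernel prints. The bit layout of VM_CONTEXT1_PROTECTION_FAULT_STATUS used here is an assumption inferred from the values in this log, not taken from register documentation, and decode_fault is a hypothetical helper name.

```python
# Sketch only: the field positions below are ASSUMED from the log above
# (protections = bits 0-7, client id = bits 12-19, read/write = bit 24,
# vmid = bits 25-28). They reproduce the kernel's own summary lines but
# are not an authoritative register description.

def decode_fault(addr, status, client_tag):
    """Decode one amdgpu VM fault (hypothetical helper for illustration)."""
    return {
        # FAULT_ADDR is already a page number; the kernel prints it in decimal.
        "page": addr,
        "protections": status & 0xFF,
        "client_id": (status >> 12) & 0xFF,
        "access": "write" if (status >> 24) & 0x1 else "read",
        "vmid": (status >> 25) & 0xF,
        # The 32-bit client tag is four ASCII characters, most significant first.
        "client": client_tag.to_bytes(4, "big").decode("ascii"),
    }


# (FAULT_ADDR, FAULT_STATUS, client tag) triples copied from the log above.
faults = [
    (0x001011D2, 0x030BE014, 0x56434530),
    (0x001011D4, 0x030BE014, 0x56434530),
    (0x001011D2, 0x0203E00C, 0x56434530),
]

for addr, status, tag in faults:
    f = decode_fault(addr, status, tag)
    print(
        "VM fault (0x%02x, vmid %d) at page %d, %s from '%s' (%d)"
        % (f["protections"], f["vmid"], f["page"], f["access"],
           f["client"], f["client_id"])
    )
```

Under that assumed layout, the three decoded lines agree with the kernel's summary lines: two writes and one read, all from VCE0 in vmid 1, touching pages 1053138 and 1053140.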

After this point, the entire hypervisor host needs to be rebooted to return the GPU to a "working" state (i.e. so it won't lock up the hypervisor if I reboot the VM). And if I leave it long enough, eventually the hypervisor will simply lock up completely.

I've encountered this bug with the latest open-source AMDGPU driver in Linux kernels 4.17-rc3 and 4.15, as well as with the AMDGPU-PRO driver on Linux kernel 4.9 inside the VM as demonstrated above; the crash message is identical in every case. Trying VAAPI drivers other than radeonsi has no effect, and the r600 driver is in fact far worse, throwing dozens of VM faults instead of the three seen above.

I'm at a loss to determine what could possibly be wrong here as I've tried tweaking almost everything I could think of based on the advice I've been able to find online so far, which is sparse.

I'm willing to provide any further info which may help, especially regarding the passthrough, and any advice anyone could give would be helpful!

Joshua M. Boniface
Linux System Ærchitect - Boniface Labs
Sigmentation fault: core dumped