[Bug 188271] New: IOMMU DMAR fault with NVIDIA CUDA peer to peer

Mon Nov 21 17:16:57 UTC 2016

https://bugzilla.kernel.org/show_bug.cgi?id=188271

            Bug ID: 188271
           Summary: IOMMU DMAR fault with NVIDIA CUDA peer to peer
           Product: Drivers
           Version: 2.5
    Kernel Version: 4.8.6
          Hardware: x86-64
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Video(DRI - non Intel)
          Assignee: drivers_video-dri at kernel-bugs.osdl.org
          Reporter: vadim at sourced.tech
        Regression: No

My motherboard is a Supermicro X10DRG-Q (details in the attached dmidecode
output) with two Xeon E5-2620 v4 CPUs (details in the attached lscpu output).
Two Titan X 2016 GPUs are installed in PCIe slots (see the nvidia-smi output).
After peer-to-peer access is enabled between the two cards, a call to
cudaMemcpyPeer() hangs and dmesg shows:

[16193.612535] DMAR: DRHD: handling fault status reg 602
[16193.617662] DMAR: [DMA Write] Request device [82:00.0] fault addr 387fc000c000 [fault reason 05] PTE Write access is not set
[16193.661857] DMAR: DRHD: handling fault status reg 702
[16193.666976] DMAR: [DMA Write] Request device [82:00.0] fault addr f8139000 [fault reason 05] PTE Write access is not set
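
For reference, the call sequence described above can be sketched roughly as
follows. This is a minimal illustration, not the actual application: the device
indices 0 and 1, the 64 MiB buffer size, and the CHECK helper macro are all
assumptions made for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s failed: %s\n", #call,           \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

int main() {
    const int dev0 = 0, dev1 = 1;     // assumed device indices
    const size_t bytes = 64u << 20;   // assumed buffer size (64 MiB)

    // Check that each GPU reports it can access the other's memory.
    int can01 = 0, can10 = 0;
    CHECK(cudaDeviceCanAccessPeer(&can01, dev0, dev1));
    CHECK(cudaDeviceCanAccessPeer(&can10, dev1, dev0));
    if (!can01 || !can10) {
        std::fprintf(stderr, "peer access not supported between devices\n");
        return EXIT_FAILURE;
    }

    // Enable peer access in both directions and allocate one buffer per GPU.
    void *buf0 = nullptr, *buf1 = nullptr;
    CHECK(cudaSetDevice(dev0));
    CHECK(cudaDeviceEnablePeerAccess(dev1, 0));
    CHECK(cudaMalloc(&buf0, bytes));
    CHECK(cudaMemset(buf0, 0xAB, bytes));

    CHECK(cudaSetDevice(dev1));
    CHECK(cudaDeviceEnablePeerAccess(dev0, 0));
    CHECK(cudaMalloc(&buf1, bytes));

    // Direct GPU-to-GPU copy; this is the kind of call that hangs here.
    CHECK(cudaMemcpyPeer(buf1, dev1, buf0, dev0, bytes));
    CHECK(cudaDeviceSynchronize());
    std::printf("peer copy of %zu bytes completed\n", bytes);

    // Release the buffers on their owning devices.
    CHECK(cudaFree(buf1));
    CHECK(cudaSetDevice(dev0));
    CHECK(cudaFree(buf0));
    return EXIT_SUCCESS;
}
```

Built with nvcc and run on a machine with two peer-capable GPUs, a program of
this shape should either print the completion message or, on an affected
system, block in the copy while the DMAR faults appear in dmesg.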

I am using CoreOS, and everything runs inside a Docker container started with
--device /dev/nvidiactl --device /dev/nvidia0 --device /dev/nvidia1
--device /dev/nvidia-uvm --privileged --security-opt seccomp=unconfined

Adding intel_iommu=igfx_off to the kernel command line cures the problem, and
peer to peer then works perfectly.
