[Bug 217797] New: [amdgpu/mm?] HSA_AMD_SVM=y causes/triggers PAT issues

Tue Aug 15 17:11:35 UTC 2023

https://bugzilla.kernel.org/show_bug.cgi?id=217797

            Bug ID: 217797
           Summary: [amdgpu/mm?] HSA_AMD_SVM=y causes/triggers PAT issues
           Product: Drivers
           Version: 2.5
          Hardware: AMD
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P3
         Component: Video(DRI - non Intel)
          Assignee: drivers_video-dri at kernel-bugs.osdl.org
          Reporter: zaltys at natrix.lt
        Regression: No

I have a hunch this might be an MM/HMM issue, but I am reporting it as an
amdgpu bug because the problematic behavior is triggered by loading amdgpu
compiled with HSA_AMD_SVM=y. I have observed the problem on kernels 6.4 and
6.5-rc6; however, I have seen people saying it started with 5.14.

My system is on the X99 platform with an Intel Broadwell-E CPU. It has multiple
GPUs: an AMD W6600 (which drives the display) and an NVIDIA RTX 3080 (used for
compute and vfio). The IOMMU is on and not in PT mode. HSA_AMD_SVM=y somehow
messes up the PAT entries for the NVIDIA card. An example follows.

The NVIDIA card has two relevant BARs:
Region 1: Memory at 380000000000 (64-bit, prefetchable) [size=16G]
Region 3: Memory at 380400000000 (64-bit, prefetchable) [size=32M]

The example assumes that "cat /sys/kernel/debug/x86/pat_memtype_list | grep 380"
is used to check the PAT entries.
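
Grepping for "380" matches any line that happens to contain those digits. A
stricter filter could look like the sketch below, which keeps only entries that
overlap the two BARs listed above. It is only an illustration: the function
name entries_in_bars is made up, the line layout of pat_memtype_list is assumed
to contain a "0x<start>-0x<end>" pair (the exact format varies between kernel
versions), and the ranges are treated as half-open.

```python
import re

# Assumption: every entry line carries a "0x<start>-0x<end>" hex range.
# Header lines and anything else without such a range are skipped.
RANGE_RE = re.compile(r"0x([0-9a-fA-F]+)-0x([0-9a-fA-F]+)")

GIB = 1 << 30
MIB = 1 << 20

# The two BARs from the report, as half-open [start, end) ranges.
BARS = {
    "Region 1": (0x380000000000, 0x380000000000 + 16 * GIB),
    "Region 3": (0x380400000000, 0x380400000000 + 32 * MIB),
}


def entries_in_bars(text, bars=BARS):
    """Return (bar_name, line) pairs for memtype entries overlapping a BAR."""
    hits = []
    for line in text.splitlines():
        match = RANGE_RE.search(line)
        if not match:
            continue
        start = int(match.group(1), 16)
        end = int(match.group(2), 16)
        for name, (bar_start, bar_end) in bars.items():
            # Standard overlap test for two half-open intervals.
            if start < bar_end and end > bar_start:
                hits.append((name, line.strip()))
    return hits
```

To use it, read /sys/kernel/debug/x86/pat_memtype_list as root (debugfs must be
mounted) and pass the file contents to entries_in_bars().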

1) Fresh system start: amdgpu is loaded (blacklisting it prevents the issue) and
the NVIDIA card is deliberately not bound to any driver on boot. No PAT entries
for it are visible - good.
2) The card is bound to vfio-pci and passed to a VM; multiple PAT entries are
visible - good.
3) The VM is stopped and the card is unbound from vfio-pci. This is where the
difference is seen. With HSA_AMD_SVM=n, no PAT entries are visible - good. With
HSA_AMD_SVM=y, however, two PAT entries remain - BAD. In addition, the number
of leftover entries depends on how many times the card has been passed through.
It looks as if some cleanup routine fails.
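
The comparison in step 3 can be made mechanical by saving the contents of
pat_memtype_list before step 2 and again after step 3, then diffing the two
snapshots. The sketch below is a hypothetical helper (the names memtype_lines
and leftover_entries are not from any existing tool). It mimics the "grep 380"
filter and counts repeated lines separately, in case the same range is listed
more than once.

```python
from collections import Counter


def memtype_lines(text, needle="380"):
    """Mimic `grep 380`: keep stripped lines that contain the needle."""
    return [line.strip() for line in text.splitlines() if needle in line]


def leftover_entries(before, after, needle="380"):
    """Return {line: extra_count} for entries that appear in `after`
    more often than in `before`.

    Counter subtraction drops entries whose count is zero or negative,
    so an empty dict means nothing was left behind.
    """
    extra = Counter(memtype_lines(after, needle)) - Counter(
        memtype_lines(before, needle)
    )
    return dict(extra)
```

An empty result after the unbind corresponds to the HSA_AMD_SVM=n behavior
described above; a non-empty one corresponds to the HSA_AMD_SVM=y case.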

The above example is constructed to avoid requiring the out-of-tree NVIDIA
driver; however, the same result can be reproduced (probably with less hassle)
by binding the card to the nvidia driver, running a compute/render task,
unbinding it, and then checking for leftover PAT entries. This also shows it is
not a vfio-pci-only issue.

It looks benign at first, but in my real use case the card has to be switched
from the nvidia driver to vfio-pci and back without restarting the system. This
PAT issue breaks that, because leftover PAT entries from one driver are not
compatible with the other: vfio-pci needs UC-, otherwise the VM throws lots of
ioremap/memtype errors, and the nvidia driver prefers WC entries for
performance reasons.
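
Before rebinding, a check along the following lines could flag leftover entries
whose cache type differs from what the next driver wants. This is a sketch
under assumptions: conflicting_entries is a made-up name, the type names
("uncached-minus" for UC-, "write-combining" for WC, and so on) are assumed to
be what the kernel prints in pat_memtype_list, and ranges are treated as
half-open.

```python
import re

RANGE_RE = re.compile(r"0x([0-9a-fA-F]+)-0x([0-9a-fA-F]+)")

# Assumption: cache-type names as printed in pat_memtype_list.
# "uncached-minus" is listed before "uncached" so the longer name wins.
TYPE_RE = re.compile(
    r"\b(uncached-minus|uncached|write-back|write-combining|"
    r"write-through|write-protected)\b"
)


def conflicting_entries(text, bar, wanted):
    """Return (start, end, type) for entries that overlap `bar`
    (a half-open (start, end) tuple) and whose type is not `wanted`."""
    bar_start, bar_end = bar
    conflicts = []
    for line in text.splitlines():
        rng = RANGE_RE.search(line)
        typ = TYPE_RE.search(line)
        if not (rng and typ):
            continue  # header line or unrecognized format
        start = int(rng.group(1), 16)
        end = int(rng.group(2), 16)
        if start < bar_end and end > bar_start and typ.group(1) != wanted:
            conflicts.append((start, end, typ.group(1)))
    return conflicts
```

For example, calling it with Region 1 as (0x380000000000, 0x380400000000) and
wanted="uncached-minus" before handing the card to vfio-pci would list any
leftover write-combining entries in that BAR.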

If amdgpu is just a trigger and the issue is in the general MM part of the
kernel, please CC the relevant people.
