HSA_AMD_SVM=y causes/triggers PAT issues
I have a hunch this might be an MM/HMM issue, but I am reporting it as an amdgpu bug because the problematic behavior is triggered by loading amdgpu compiled with HSA_AMD_SVM=y. I reproduced it on kernels 6.4 and 6.5-rc6; however, I have seen other people say it started with 5.14.
My system is an X99 platform with an Intel Broadwell-E CPU. It has multiple GPUs: an AMD W6600 (which drives the display) and an NVIDIA RTX 3080 (used for compute and vfio). The IOMMU is on and not in PT (passthrough) mode. HSA_AMD_SVM=y somehow messes up the PAT entries for the NVIDIA card. An example follows.
The NVIDIA card has two BARs that are relevant in this example:
- Region 1: Memory at 380000000000 (64-bit, prefetchable) [size=16G]
- Region 3: Memory at 380400000000 (64-bit, prefetchable) [size=32M]
Let's suppose "cat /sys/kernel/debug/x86/pat_memtype_list | grep 380" is used to check PAT entries. The sequence is:
- Fresh system start: amdgpu is loaded (blacklisting it prevents the issue) and the NVIDIA card is deliberately not bound to any driver at boot. No PAT entries for it are visible - good.
- The card is bound to vfio-pci and passed to a VM. Multiple PAT entries are visible - good.
- The VM is stopped and the card is unbound from vfio-pci. This is where the difference shows. With HSA_AMD_SVM=n, no PAT entries are visible - good. With HSA_AMD_SVM=y, two PAT entries remain - BAD. In addition, the number of these entries depends on how many times the card has been passed through. It looks as if some cleanup routine silently fails or a lock is still held.
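Since "grep 380" matches any line that happens to contain "380", a range-based check is less ambiguous. Below is a minimal sketch of one, not part of my original test procedure. It assumes pat_memtype_list prints lines of the form "PAT: [mem 0x<start>-0x<end>] <type>" with an exclusive end address, which is what I believe recent kernels emit; adjust the regex if your kernel differs. The function names are made up for illustration, and the BAR addresses and sizes are the ones listed above.

```python
import re

# Assumed line format (recent kernels), e.g.:
#   PAT: [mem 0x0000380000000000-0x0000380400000000] write-combining
ENTRY_RE = re.compile(r"\[mem 0x([0-9a-fA-F]+)-0x([0-9a-fA-F]+)\]\s+(\S+)")

# BARs of the NVIDIA card from this report: name -> (start, size in bytes)
BARS = {
    "Region 1": (0x380000000000, 16 << 30),  # 16G
    "Region 3": (0x380400000000, 32 << 20),  # 32M
}


def parse_pat_list(text):
    """Return (start, end, memtype) tuples for every entry line in text."""
    entries = []
    for line in text.splitlines():
        m = ENTRY_RE.search(line)
        if m:  # header and unrelated lines are skipped
            entries.append((int(m.group(1), 16), int(m.group(2), 16), m.group(3)))
    return entries


def entries_in_bar(entries, bar_start, bar_size):
    """Keep entries overlapping the half-open range [bar_start, bar_start + bar_size)."""
    bar_end = bar_start + bar_size
    return [e for e in entries if e[0] < bar_end and e[1] > bar_start]


def leftover_entries(text):
    """Map each BAR name to the PAT entries that overlap it."""
    entries = parse_pat_list(text)
    return {
        name: entries_in_bar(entries, start, size)
        for name, (start, size) in BARS.items()
    }
```

To use it, run as root with debugfs mounted and pass in the file contents after each step, e.g. leftover_entries(open("/sys/kernel/debug/x86/pat_memtype_list").read()). After the card is unbound, empty lists for both regions correspond to the "good" case; anything else is a leftover entry.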
The example above is constructed to avoid requiring the out-of-tree NVIDIA driver. However, the same result can be reproduced (probably with less hassle) by binding the card to the nvidia driver, running a compute/render task, unbinding it, and then checking for leftover PAT entries. This also shows that it is not a vfio-pci-only issue.
It looks benign at first, but in real use the card has to be switched from the nvidia driver to vfio-pci and back without restarting the system. This PAT issue breaks that, because leftover PAT entries from one driver are not compatible with the other: vfio-pci needs UC-, otherwise the VM throws lots of ioremap/memtype errors, while the nvidia driver prefers WC entries for performance reasons.
So, why does amdgpu have an effect on a non-AMD GPU?