fault during qemu passthrough rendering to dmabuf imported from virtgpu
Brief summary of the problem:
Using VFIO passthrough we can render to the amdgpu card's display connectors without issue (GL and Vulkan). We also have Venus rendering working with virtgpu via qemu. We have modified the Vulkan WSI and Xwayland glamor paths to create dumb buffers and export them from virtgpu as dmabufs. We import them into the amdgpu driver, do the rendering, then blit from the original fb to the imported buffer. We then want to flip the imported buffers on the virtgpu display, i.e. render on the passed-through AMD GPU and display on the virtual display. gamescope is the compositor, presenting to virtgpu.
I realise that this is a complicated setup, and the faults we are seeing could well be caused by our own software stack hacks rather than by an amdgpu driver issue. Having said that, I would appreciate any advice you could provide on how to debug this.
Hardware description:
- CPU: AMD Ryzen 9 3900XT (4 cores passed through qemu)
- GPU: 01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 24 [Radeon RX 6400 / 6500 XT] [1002:743f] (rev c1)
- Virt GPU: 00:02.0 VGA compatible controller [0300]: Red Hat, Inc. Virtio GPU [1af4:1050] (rev 01)
- System Memory: 4GB via qemu
- Display(s): virtual display provided by virtio-gpu
- Type of Display Connection: none used on amdgpu
System information:
- Distro name and Version: Fedora 36
- Kernel version: 6.1.0-rc6 (also tried various others)
- Custom kernel: patches for KVM to handle non refcounted pages, virtio-gpu modifier support, a few patches for debugging.
- AMD official driver version: N/A
Attached files:
amdgpu-fault-log.txt (the log contains a lot of custom debug output).
In the log, we can see:
virtio-pci 0000:00:02.0: BOB_DEBUG: virtgpu_gem_map_dma_buf(): mapping shmem obj=ffff8947c30db400 bo=ffff8947c30db400
virtio-pci 0000:00:02.0: BOB_DEBUG: virtgpu_gem_map_dma_buf(): bo=ffff8947c30db400 dma_address=0x00000001034fa000
with those same addresses seen again during amdgpu_ttm_backend_bind().
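To make the comparison with the fault address easier, here is a minimal sketch of how the dma addresses can be pulled out of the log. It assumes the exact format of our custom BOB_DEBUG lines shown above and a 4 KiB page size (shift of 12). The helper name collect_dma_addresses is ours and purely illustrative.

```python
import re

# The two custom debug lines quoted above, copied verbatim from the log.
LOG_LINES = [
    "virtio-pci 0000:00:02.0: BOB_DEBUG: virtgpu_gem_map_dma_buf(): "
    "mapping shmem obj=ffff8947c30db400 bo=ffff8947c30db400",
    "virtio-pci 0000:00:02.0: BOB_DEBUG: virtgpu_gem_map_dma_buf(): "
    "bo=ffff8947c30db400 dma_address=0x00000001034fa000",
]

# Matches only the lines that carry a dma_address (format is our own debug print).
DMA_RE = re.compile(r"bo=([0-9a-f]+) dma_address=(0x[0-9a-fA-F]+)")

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def collect_dma_addresses(lines):
    """Return {bo pointer string: dma address as int} for every matching line."""
    result = {}
    for line in lines:
        match = DMA_RE.search(line)
        if match:
            result[match.group(1)] = int(match.group(2), 16)
    return result


dma_by_bo = collect_dma_addresses(LOG_LINES)
for bo, addr in dma_by_bo.items():
    print(f"bo={bo} dma_address={addr:#x} pfn={addr >> PAGE_SHIFT:#x}")
```

Running this over the whole log gives a list of dma addresses and page frame numbers that can be checked against whatever the fault handler reports.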
The part we would like advice on is how to translate amdgpu faults like this:
Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:8 vmid:1 pasid:32769, for process vkcube pid 999 thread vkcube pid 999)
Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: in page starting at address 0x0000800100700000 from client 0x1b (UTCL2)
Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00101A10
Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: Faulty UTCL2 client ID: SDMA0 (0xd)
Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: MORE_FAULTS: 0x0
Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: WALKER_ERROR: 0x0
Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: PERMISSION_FAULTS: 0x1
Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: MAPPING_ERROR: 0x0
Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: RW: 0x0
to anything useful for debugging further. The faulting address does not appear to be related to the dma addresses from the dmabufs. The fault handler in the amdgpu driver appears to shift the address it receives left by 12, which makes me think a page frame number (pfn) is sent. Please advise on how to interpret the fault messages. Specifically, what is that start address: a per-context VA, a GTT VA, a bus address as mapped via the IOMMU, or a looked-up physical address?
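For reference, this is a minimal sketch of how we currently read the numbers in the fault message. Two things in it are assumptions on our side. The bit layout of GCVM_L2_PROTECTION_FAULT_STATUS is our reading of the register headers; it reproduces every field the driver printed above, but we have not confirmed it beyond that. The address reconstruction (a 32-bit word shifted left by 12 plus 4 high bits shifted left by 44) is our reading of the interrupt handler. The function names are ours.

```python
# (shift, width) per field. ASSUMED layout, checked only against the printed log fields.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_status(status):
    """Split the raw status register value into its named bit fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


def fault_address(src_data0, src_data1):
    """Rebuild the reported address from the two interrupt payload words
    (our reading of the handler: low word is a 4 KiB page number)."""
    return (src_data0 << 12) | ((src_data1 & 0xF) << 44)


def split_fault_address(addr):
    """Inverse of fault_address(): recover the two payload words."""
    return (addr >> 12) & 0xFFFFFFFF, (addr >> 44) & 0xF


status = decode_status(0x00101A10)          # value from the log above
words = split_fault_address(0x0000800100700000)  # address from the log above
print(status)
print([hex(w) for w in words])
```

Decoded this way, the status value agrees with the log (PERMISSION_FAULTS 0x1, client ID 0xd, vmid 1, RW 0), and the fault address splits into a page number of 0x100700 with high bits 0x8. Neither the address nor its page number matches the dma_address 0x00000001034fa000 of the imported buffer, which is why we are unsure which address space the reported value belongs to.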