Ryzen 9 7900 iGPU (RDNA2 - Raphael) hang - GCVM_L2_PROTECTION_FAULT_STATUS
Brief summary of the problem:
I'm attempting to pass through a Ryzen 9 7900 iGPU (RDNA2 - Raphael) via VFIO/KVM/QEMU [Proxmox] for transcoding duties. The IOMMU grouping is clean and the GPU is isolated in its own group. The firmware was extracted from the BIOS files using UEFITool.
The guest VM (Ubuntu) is running kernel version 6.4.2 and bleeding-edge Mesa drivers [23.2.0 via https://launchpad.net/~ernstp/+archive/ubuntu/mesaaco], but note that the same issues also occur with the prior point-release kernel and stable Mesa.
Running "vainfo" hangs on the first attempt, and the hang is reproducible. See the logs below. Occasionally it also triggers a page fault on the host OS, which destabilizes the system and hangs the host; only a reboot recovers it.
Hardware description:
- CPU: Ryzen 9 7900
- GPU: RDNA2 Raphael
*-display
description: VGA compatible controller
product: Advanced Micro Devices, Inc. [AMD/ATI] [1002:164E]
vendor: Advanced Micro Devices, Inc. [AMD/ATI] [1002]
physical id: 0
bus info: pci@0000:01:00.0
version: c4
width: 64 bits
clock: 33MHz
capabilities: pm pciexpress msi msix vga_controller bus_master cap_list rom
configuration: driver=amdgpu latency=0
resources: iomemory:c00-bff iomemory:c00-bff irq:16 memory:c000000000-c00fffffff memory:c010000000-c0101fffff ioport:d000(size=256) memory:c1000000-c107ffff memory:c1080000-c108ffff
- System Memory: 32GB
- Display(s): Nil [Headless]
System information:
- Distro name and Version: Proxmox 8 (host) -> Ubuntu 22.04.2 (guest)
- Kernel version: 6.2.13 (host) -> 6.4.2-060402-generic [mainline] (guest)
- AMD official driver version: NA
Attached files:
IOMMU groups
Group 0: [1022:14da] 00:01.0 Host bridge Device 14da
Group 1: [1022:14db] [R] 00:01.1 PCI bridge Device 14db
Group 2: [1022:14db] [R] 00:01.2 PCI bridge Device 14db
Group 3: [1022:14da] 00:02.0 Host bridge Device 14da
Group 4: [1022:14db] [R] 00:02.1 PCI bridge Device 14db
Group 5: [1022:14db] [R] 00:02.2 PCI bridge Device 14db
Group 6: [1022:14da] 00:03.0 Host bridge Device 14da
Group 7: [1022:14da] 00:04.0 Host bridge Device 14da
Group 8: [1022:14da] 00:08.0 Host bridge Device 14da
Group 9: [1022:14dd] [R] 00:08.1 PCI bridge Device 14dd
Group 10: [1022:14dd] [R] 00:08.3 PCI bridge Device 14dd
Group 11: [1022:790b] 00:14.0 SMBus FCH SMBus Controller
[1022:790e] 00:14.3 ISA bridge FCH LPC Bridge
Group 12: [1022:14e0] 00:18.0 Host bridge Device 14e0
[1022:14e1] 00:18.1 Host bridge Device 14e1
[1022:14e2] 00:18.2 Host bridge Device 14e2
[1022:14e3] 00:18.3 Host bridge Device 14e3
[1022:14e4] 00:18.4 Host bridge Device 14e4
[1022:14e5] 00:18.5 Host bridge Device 14e5
[1022:14e6] 00:18.6 Host bridge Device 14e6
[1022:14e7] 00:18.7 Host bridge Device 14e7
Group 13: [1000:00af] [R] 01:00.0 Serial Attached SCSI controller SAS3408 Fusion-MPT Tri-Mode I/O Controller Chip (IOC)
Group 14: [1bb1:5018] [R] 02:00.0 Non-Volatile memory controller FireCuda 530 SSD
Group 15: [1022:43f4] [R] 03:00.0 PCI bridge Device 43f4
Group 16: [1022:43f5] [R] 04:00.0 PCI bridge Device 43f5
Group 17: [1022:43f5] [R] 04:01.0 PCI bridge Device 43f5
[8086:1533] [R] 06:00.0 Ethernet controller I210 Gigabit Network Connection
Group 18: [1022:43f5] [R] 04:02.0 PCI bridge Device 43f5
[8086:1533] [R] 07:00.0 Ethernet controller I210 Gigabit Network Connection
Group 19: [1022:43f5] [R] 04:03.0 PCI bridge Device 43f5
[1a03:1150] [R] 08:00.0 PCI bridge AST1150 PCI-to-PCI Bridge
[1a03:2000] [R] 09:00.0 VGA compatible controller ASPEED Graphics Family
Group 20: [1022:43f5] [R] 04:04.0 PCI bridge Device 43f5
Group 21: [1022:43f5] [R] 04:08.0 PCI bridge Device 43f5
[14e4:16d8] [R] 0b:00.0 Ethernet controller BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller
[14e4:16d8] [R] 0b:00.1 Ethernet controller BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller
Group 22: [1022:43f5] 04:0c.0 PCI bridge Device 43f5
[1022:43f7] [R] 0c:00.0 USB controller Device 43f7
USB: [046b:ff10] Bus 001 Device 006 American Megatrends, Inc. Virtual Keyboard and Mouse
USB: [046b:ffb0] Bus 001 Device 005 American Megatrends, Inc. Virtual Ethernet.
USB: [046b:ff31] Bus 001 Device 004 American Megatrends, Inc. Virtual HDisk Device
USB: [046b:ff20] Bus 001 Device 003 American Megatrends, Inc. Virtual Cdrom Device
USB: [046b:ff01] Bus 001 Device 002 American Megatrends, Inc. Virtual Hub
USB: [1d6b:0002] Bus 001 Device 001 Linux Foundation 2.0 root hub
USB: [1d6b:0003] Bus 002 Device 001 Linux Foundation 3.0 root hub
Group 23: [1022:43f5] 04:0d.0 PCI bridge Device 43f5
[1022:43f6] [R] 0d:00.0 SATA controller Device 43f6
Group 24: [1bb1:5018] [R] 0e:00.0 Non-Volatile memory controller FireCuda 530 SSD
Group 25: [1002:164e] [R] 0f:00.0 VGA compatible controller Raphael
Group 26: [1002:1640] [R] 0f:00.1 Audio device Rembrandt Radeon High Definition Audio Controller
Group 27: [1022:1649] 0f:00.2 Encryption controller VanGogh PSP/CCP
Group 28: [1022:15b6] [R] 0f:00.3 USB controller Device 15b6
USB: [1d6b:0002] Bus 003 Device 001 Linux Foundation 2.0 root hub
USB: [1d6b:0003] Bus 004 Device 001 Linux Foundation 3.0 root hub
Group 29: [1022:15b7] [R] 0f:00.4 USB controller Device 15b7
USB: [1d6b:0002] Bus 005 Device 001 Linux Foundation 2.0 root hub
USB: [1d6b:0003] Bus 006 Device 001 Linux Foundation 3.0 root hub
Group 30: [1022:15e2] 0f:00.5 Multimedia controller ACP/ACP3X/ACP6x Audio Coprocessor
Group 31: [1022:15e3] 0f:00.6 Audio device Family 17h/19h HD Audio Controller
Group 32: [1022:15b8] [R] 10:00.0 USB controller Device 15b8
USB: [1d6b:0002] Bus 007 Device 001 Linux Foundation 2.0 root hub
USB: [1d6b:0003] Bus 008 Device 001 Linux Foundation 3.0 root hub
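For anyone wanting to reproduce the grouping check, here is a minimal sketch that walks the standard sysfs tree (/sys/kernel/iommu_groups) and reports whether a device sits alone in its group. The function names list_iommu_groups and is_isolated are my own for illustration, and this is not the script that produced the listing above (that one also annotates reset support and USB devices).

```python
import os

def list_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return {group_number: [pci_address, ...]} by walking the sysfs tree."""
    groups = {}
    if not os.path.isdir(root):
        return groups  # IOMMU disabled, or not running on Linux
    for name in os.listdir(root):
        if not name.isdigit():
            continue
        devices_dir = os.path.join(root, name, "devices")
        if os.path.isdir(devices_dir):
            groups[int(name)] = sorted(os.listdir(devices_dir))
    return groups

def is_isolated(groups, pci_address):
    """True if pci_address is the only device in its IOMMU group."""
    return any(devices == [pci_address] for devices in groups.values())

if __name__ == "__main__":
    for number, devices in sorted(list_iommu_groups().items()):
        print(f"Group {number}: {' '.join(devices)}")
```

On the host above, is_isolated(list_iommu_groups(), "0000:0f:00.0") should come back True if the iGPU really is alone in group 25.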
cmd line parameters on host:
... amd_iommu=on iommu=pt modprobe.blacklist=amdgpu vfio-pci.ids=1002:164e,1002:1640
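As a sanity check on the parameters above, this rough sketch parses a kernel command line and confirms that both iGPU functions (VGA 1002:164e and HDMI audio 1002:1640) are listed in vfio-pci.ids. The helper names are hypothetical, and the string only contains the parameters quoted above (the elided leading part is left out); on a live host the real value could be read from /proc/cmdline instead.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {key: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params

def vfio_ids(params):
    """Return the set of vendor:device IDs bound via vfio-pci.ids."""
    raw = params.get("vfio-pci.ids") or ""
    return {entry.lower() for entry in raw.split(",") if entry}

# Only the parameters quoted in this report; the rest of the line is elided.
HOST_CMDLINE = ("amd_iommu=on iommu=pt modprobe.blacklist=amdgpu "
                "vfio-pci.ids=1002:164e,1002:1640")

params = parse_cmdline(HOST_CMDLINE)
wanted = {"1002:164e", "1002:1640"}  # iGPU VGA and audio functions
missing = wanted - vfio_ids(params)
print("missing from vfio-pci.ids:", sorted(missing))
```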
dmesg on guest boot
[ 0.982550] ACPI: bus type drm_connector registered
[ 2.241841] [drm] amdgpu kernel modesetting enabled.
[ 2.241902] amdgpu: CRAT table not found
[ 2.241904] amdgpu: Virtual CRAT table created for CPU
[ 2.241911] amdgpu: Topology: Add CPU node
[ 2.242153] [drm] initializing kernel modesetting (IP DISCOVERY 0x1002:0x164E 0x1002:0x164E 0xC4).
[ 2.242161] [drm] register mmio base: 0xC1000000
[ 2.242162] [drm] register mmio size: 524288
[ 2.242920] [drm] add ip block number 0 <nv_common>
[ 2.242921] [drm] add ip block number 1 <gmc_v10_0>
[ 2.242923] [drm] add ip block number 2 <navi10_ih>
[ 2.242924] [drm] add ip block number 3 <psp>
[ 2.242925] [drm] add ip block number 4 <smu>
[ 2.242926] [drm] add ip block number 5 <dm>
[ 2.242927] [drm] add ip block number 6 <gfx_v10_0>
[ 2.242928] [drm] add ip block number 7 <sdma_v5_2>
[ 2.242930] [drm] add ip block number 8 <vcn_v3_0>
[ 2.242931] [drm] add ip block number 9 <jpeg_v3_0>
[ 2.246229] [drm] BIOS signature incorrect 8a 98
[ 2.248566] amdgpu 0000:01:00.0: amdgpu: Fetched VBIOS from ROM BAR
[ 2.248576] amdgpu: ATOM BIOS: 102-RAPHAEL-008
[ 2.254718] [drm] VCN(0) decode is enabled in VM mode
[ 2.254720] [drm] VCN(0) encode is enabled in VM mode
[ 2.256439] [drm] JPEG decode is enabled in VM mode
[ 2.256464] amdgpu 0000:01:00.0: vgaarb: deactivate vga console
[ 2.256467] amdgpu 0000:01:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[ 2.256499] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[ 2.256504] amdgpu 0000:01:00.0: amdgpu: VRAM: 512M 0x000000F400000000 - 0x000000F41FFFFFFF (512M used)
[ 2.256507] amdgpu 0000:01:00.0: amdgpu: GART: 1024M 0x0000000000000000 - 0x000000003FFFFFFF
[ 2.256509] amdgpu 0000:01:00.0: amdgpu: AGP: 267419648M 0x000000F800000000 - 0x0000FFFFFFFFFFFF
[ 2.256514] [drm] Detected VRAM RAM=512M, BAR=256M
[ 2.256516] [drm] RAM width 128bits LPDDR5
[ 2.256773] [drm] amdgpu: 512M of VRAM memory ready
[ 2.256775] [drm] amdgpu: 7973M of GTT memory ready.
[ 2.256782] [drm] GART: num cpu pages 262144, num gpu pages 262144
[ 2.256901] [drm] PCIE GART of 1024M enabled (table at 0x000000F400000000).
[ 2.273538] [drm] Loading DMUB firmware via PSP: version=0x05000500
[ 2.273935] [drm] use_doorbell being set to: [true]
[ 2.273951] [drm] Found VCN firmware Version ENC: 1.24 DEC: 2 VEP: 0 Revision: 0
[ 2.273955] amdgpu 0000:01:00.0: amdgpu: Will use PSP to load VCN firmware
[ 2.296174] [drm] reserve 0xa00000 from 0xf41e000000 for PSP TMR
[ 2.361603] amdgpu 0000:01:00.0: amdgpu: RAS: optional ras ta ucode is not available
[ 2.367624] amdgpu 0000:01:00.0: amdgpu: RAP: optional rap ta ucode is not available
[ 2.367626] amdgpu 0000:01:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[ 2.367665] amdgpu 0000:01:00.0: amdgpu: smu driver if version = 0x00000004, smu fw if version = 0x00000005, smu fw program = 0, smu fw version = 0x00544fdd (84.79.221)
[ 2.367669] amdgpu 0000:01:00.0: amdgpu: SMU driver if version not matched
[ 2.368793] amdgpu 0000:01:00.0: amdgpu: SMU is initialized successfully!
[ 2.369694] [drm] Unsupported Connector type:21!
[ 2.369695] [drm] Unsupported Connector type:21!
[ 2.369697] [drm] Unsupported Connector type:21!
[ 2.369698] [drm] Unsupported Connector type:21!
[ 2.369699] [drm] Unsupported Connector type:21!
[ 2.369700] [drm] Display Core initialized with v3.2.230!
[ 2.369702] [drm] DP-HDMI FRL PCON supported
[ 2.370389] [drm] DMUB hardware initialized: version=0x05000500
[ 2.371292] [drm] kiq ring mec 2 pipe 1 q 0
[ 2.373814] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[ 2.373833] [drm] JPEG decode initialized successfully.
[ 2.375137] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[ 2.375190] amdgpu: sdma_bitmap: 3
[ 2.393873] amdgpu: HMM registered 512MB device memory
[ 2.394480] amdgpu: SRAT table not found
[ 2.394481] amdgpu: Virtual CRAT table created for GPU
[ 2.394563] amdgpu: Topology: Add dGPU node [0x164e:0x1002]
[ 2.394565] kfd kfd: amdgpu: added device 1002:164e
[ 2.394573] amdgpu 0000:01:00.0: amdgpu: SE 1, SH per SE 1, CU per SH 2, active_cu_number 2
[ 2.394622] amdgpu 0000:01:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 2.394624] amdgpu 0000:01:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 2.394626] amdgpu 0000:01:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 2.394627] amdgpu 0000:01:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[ 2.394629] amdgpu 0000:01:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[ 2.394630] amdgpu 0000:01:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[ 2.394632] amdgpu 0000:01:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[ 2.394633] amdgpu 0000:01:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[ 2.394635] amdgpu 0000:01:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[ 2.394637] amdgpu 0000:01:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[ 2.394638] amdgpu 0000:01:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 2.394640] amdgpu 0000:01:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 1
[ 2.394641] amdgpu 0000:01:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 1
[ 2.394643] amdgpu 0000:01:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 1
[ 2.394644] amdgpu 0000:01:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 1
[ 2.395254] [drm] Initialized amdgpu 3.52.0 20150101 for 0000:01:00.0 on minor 0
[ 4.300251] systemd[1]: Starting Load Kernel Module drm...
[ 4.305823] systemd[1]: modprobe@drm.service: Deactivated successfully.
[ 4.305943] systemd[1]: Finished Load Kernel Module drm.
Crash
[ 241.056865] amdgpu 0000:01:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:157 vmid:0 pasid:0, for process pid 0 thread pid 0)
[ 241.056878] amdgpu 0000:01:00.0: amdgpu: in page starting at address 0x0000000000005000 from client 0x1b (UTCL2)
[ 241.056882] amdgpu 0000:01:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B3A
[ 241.056884] amdgpu 0000:01:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5)
[ 241.056887] amdgpu 0000:01:00.0: amdgpu: MORE_FAULTS: 0x0
[ 241.056889] amdgpu 0000:01:00.0: amdgpu: WALKER_ERROR: 0x5
[ 241.056890] amdgpu 0000:01:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[ 241.056892] amdgpu 0000:01:00.0: amdgpu: MAPPING_ERROR: 0x1
[ 241.056894] amdgpu 0000:01:00.0: amdgpu: RW: 0x0
[ 251.104279] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=1, emitted seq=3
[ 251.104469] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
[ 251.104622] amdgpu 0000:01:00.0: amdgpu: GPU reset begin!
[ 251.148032] amdgpu 0000:01:00.0: amdgpu: MODE2 reset
[ 251.155986] amdgpu 0000:01:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 251.156086] [drm] PCIE GART of 1024M enabled (table at 0x000000F400000000).
[ 251.156378] [drm] PSP is resuming...
[ 251.177787] [drm] reserve 0xa00000 from 0xf41e000000 for PSP TMR
[ 251.372220] amdgpu 0000:01:00.0: amdgpu: RAS: optional ras ta ucode is not available
[ 251.377994] amdgpu 0000:01:00.0: amdgpu: RAP: optional rap ta ucode is not available
[ 251.377996] amdgpu 0000:01:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[ 251.377997] amdgpu 0000:01:00.0: amdgpu: SMU is resuming...
[ 251.377999] amdgpu 0000:01:00.0: amdgpu: smu driver if version = 0x00000004, smu fw if version = 0x00000005, smu fw program = 0, smu fw version = 0x00544fdd (84.79.221)
[ 251.378000] amdgpu 0000:01:00.0: amdgpu: SMU driver if version not matched
[ 251.378510] amdgpu 0000:01:00.0: amdgpu: SMU is resumed successfully!
[ 251.379157] [drm] DMUB hardware initialized: version=0x05000500
[ 251.379771] [drm] kiq ring mec 2 pipe 1 q 0
[ 251.381952] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[ 251.382435] [drm] JPEG decode initialized successfully.
[ 251.382437] amdgpu 0000:01:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 251.382438] amdgpu 0000:01:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 251.382439] amdgpu 0000:01:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 251.382439] amdgpu 0000:01:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[ 251.382440] amdgpu 0000:01:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[ 251.382441] amdgpu 0000:01:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[ 251.382441] amdgpu 0000:01:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[ 251.382442] amdgpu 0000:01:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[ 251.382443] amdgpu 0000:01:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[ 251.382443] amdgpu 0000:01:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[ 251.382444] amdgpu 0000:01:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 251.382445] amdgpu 0000:01:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 1
[ 251.382445] amdgpu 0000:01:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 1
[ 251.382446] amdgpu 0000:01:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 1
[ 251.382447] amdgpu 0000:01:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 1
[ 251.398814] amdgpu 0000:01:00.0: amdgpu: recover vram bo from shadow start
[ 251.398816] amdgpu 0000:01:00.0: amdgpu: recover vram bo from shadow done
[ 251.398823] amdgpu 0000:01:00.0: amdgpu: GPU reset(1) succeeded!
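To make the fault easier to read, here is a small sketch that splits the GCVM_L2_PROTECTION_FAULT_STATUS value from the crash log (0x00000B3A) into its fields. The bit layout in FIELDS is an assumption on my part: I chose it because it reproduces exactly the field values amdgpu printed above, not because I have checked it against the register headers.

```python
# (name, shift, width) for each field. ASSUMED layout, picked because it
# matches the decoded values amdgpu printed for 0x00000B3A in the log above.
FIELDS = [
    ("MORE_FAULTS", 0, 1),
    ("WALKER_ERROR", 1, 3),
    ("PERMISSION_FAULTS", 4, 4),
    ("MAPPING_ERROR", 8, 1),
    ("CID", 9, 9),  # UTCL2 client ID; the log reports 0x5 (CPC)
    ("RW", 18, 1),
]

def decode_fault_status(status):
    """Extract each bitfield from a raw fault status register value."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, shift, width in FIELDS
    }

decoded = decode_fault_status(0x00000B3A)
for name, value in decoded.items():
    print(f"{name}: {value:#x}")
```

Under that assumed layout the decode gives WALKER_ERROR 0x5, PERMISSION_FAULTS 0x3, MAPPING_ERROR 0x1 and client ID 0x5, matching the log, so the same helper could be pointed at other fault status values from later runs.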
I appreciate that trying to pass through new hardware adds complexity, but I would be grateful for any advice.