RX6700XT: [gfxhub] page fault + [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout
Brief summary of the problem:
The GPU hangs at random. Sometimes it recovers gracefully afterwards; sometimes the hang kills xorg-server. A dmesg snippet from one such crash:
[ 716.928693] gmc_v10_0_process_interrupt: 46 callbacks suppressed
[ 716.928700] amdgpu 0000:0d:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:1 pasid:32769, for process Xorg pid 1163 thread Xorg:cs0 pid 1165)
[ 716.928710] amdgpu 0000:0d:00.0: amdgpu: in page starting at address 0x0000800180384000 from client 0x1b (UTCL2)
[ 716.928716] amdgpu 0000:0d:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00101031
[ 716.928719] amdgpu 0000:0d:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[ 716.928723] amdgpu 0000:0d:00.0: amdgpu: MORE_FAULTS: 0x1
[ 716.928727] amdgpu 0000:0d:00.0: amdgpu: WALKER_ERROR: 0x0
[ 716.928730] amdgpu 0000:0d:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[ 716.928733] amdgpu 0000:0d:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 716.928735] amdgpu 0000:0d:00.0: amdgpu: RW: 0x0
[ 716.928743] amdgpu 0000:0d:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:1 pasid:32769, for process Xorg pid 1163 thread Xorg:cs0 pid 1165)
[ 716.928749] amdgpu 0000:0d:00.0: amdgpu: in page starting at address 0x0000800180285000 from client 0x1b (UTCL2)
[ 716.928753] amdgpu 0000:0d:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00101031
[ 716.928756] amdgpu 0000:0d:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[ 716.928759] amdgpu 0000:0d:00.0: amdgpu: MORE_FAULTS: 0x1
[ 716.928762] amdgpu 0000:0d:00.0: amdgpu: WALKER_ERROR: 0x0
[ 716.928765] amdgpu 0000:0d:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[ 716.928768] amdgpu 0000:0d:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 716.928771] amdgpu 0000:0d:00.0: amdgpu: RW: 0x0
[ 716.928777] amdgpu 0000:0d:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:1 pasid:32769, for process Xorg pid 1163 thread Xorg:cs0 pid 1165)
[ 716.928782] amdgpu 0000:0d:00.0: amdgpu: in page starting at address 0x0000800180388000 from client 0x1b (UTCL2)
[ 716.928785] amdgpu 0000:0d:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 716.928788] amdgpu 0000:0d:00.0: amdgpu: Faulty UTCL2 client ID: CB/DB (0x0)
[ 716.928792] amdgpu 0000:0d:00.0: amdgpu: MORE_FAULTS: 0x0
[ 716.928795] amdgpu 0000:0d:00.0: amdgpu: WALKER_ERROR: 0x0
[ 716.928797] amdgpu 0000:0d:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 716.928800] amdgpu 0000:0d:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 716.928803] amdgpu 0000:0d:00.0: amdgpu: RW: 0x0
[ 716.928810] amdgpu 0000:0d:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:1 pasid:32769, for process Xorg pid 1163 thread Xorg:cs0 pid 1165)
[ 716.928815] amdgpu 0000:0d:00.0: amdgpu: in page starting at address 0x0000800180388000 from client 0x1b (UTCL2)
[ 716.928818] amdgpu 0000:0d:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 716.928821] amdgpu 0000:0d:00.0: amdgpu: Faulty UTCL2 client ID: CB/DB (0x0)
[ 716.928824] amdgpu 0000:0d:00.0: amdgpu: MORE_FAULTS: 0x0
[ 716.928827] amdgpu 0000:0d:00.0: amdgpu: WALKER_ERROR: 0x0
[ 716.928830] amdgpu 0000:0d:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 716.928833] amdgpu 0000:0d:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 716.928836] amdgpu 0000:0d:00.0: amdgpu: RW: 0x0
[ 716.928842] amdgpu 0000:0d:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:1 pasid:32769, for process Xorg pid 1163 thread Xorg:cs0 pid 1165)
[ 716.928846] amdgpu 0000:0d:00.0: amdgpu: in page starting at address 0x0000800180285000 from client 0x1b (UTCL2)
[ 716.928850] amdgpu 0000:0d:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 716.928853] amdgpu 0000:0d:00.0: amdgpu: Faulty UTCL2 client ID: CB/DB (0x0)
[ 716.928856] amdgpu 0000:0d:00.0: amdgpu: MORE_FAULTS: 0x0
[ 716.928859] amdgpu 0000:0d:00.0: amdgpu: WALKER_ERROR: 0x0
[ 716.928861] amdgpu 0000:0d:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 716.928864] amdgpu 0000:0d:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 716.928867] amdgpu 0000:0d:00.0: amdgpu: RW: 0x0
[ 716.928874] amdgpu 0000:0d:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:1 pasid:32769, for process Xorg pid 1163 thread Xorg:cs0 pid 1165)
[ 716.928878] amdgpu 0000:0d:00.0: amdgpu: in page starting at address 0x0000800180388000 from client 0x1b (UTCL2)
[ 716.928882] amdgpu 0000:0d:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 716.928884] amdgpu 0000:0d:00.0: amdgpu: Faulty UTCL2 client ID: CB/DB (0x0)
[ 716.928887] amdgpu 0000:0d:00.0: amdgpu: MORE_FAULTS: 0x0
[ 716.928890] amdgpu 0000:0d:00.0: amdgpu: WALKER_ERROR: 0x0
[ 716.928893] amdgpu 0000:0d:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 716.928896] amdgpu 0000:0d:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 716.928899] amdgpu 0000:0d:00.0: amdgpu: RW: 0x0
[ 716.928906] amdgpu 0000:0d:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:1 pasid:32769, for process Xorg pid 1163 thread Xorg:cs0 pid 1165)
[ 716.928910] amdgpu 0000:0d:00.0: amdgpu: in page starting at address 0x0000800180285000 from client 0x1b (UTCL2)
[ 716.928913] amdgpu 0000:0d:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 716.928916] amdgpu 0000:0d:00.0: amdgpu: Faulty UTCL2 client ID: CB/DB (0x0)
[ 716.928919] amdgpu 0000:0d:00.0: amdgpu: MORE_FAULTS: 0x0
[ 716.928922] amdgpu 0000:0d:00.0: amdgpu: WALKER_ERROR: 0x0
[ 716.928925] amdgpu 0000:0d:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 716.928928] amdgpu 0000:0d:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 716.928931] amdgpu 0000:0d:00.0: amdgpu: RW: 0x0
[ 716.928937] amdgpu 0000:0d:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:1 pasid:32769, for process Xorg pid 1163 thread Xorg:cs0 pid 1165)
[ 716.928942] amdgpu 0000:0d:00.0: amdgpu: in page starting at address 0x0000800180388000 from client 0x1b (UTCL2)
[ 716.928945] amdgpu 0000:0d:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 716.928948] amdgpu 0000:0d:00.0: amdgpu: Faulty UTCL2 client ID: CB/DB (0x0)
[ 716.928951] amdgpu 0000:0d:00.0: amdgpu: MORE_FAULTS: 0x0
[ 716.928954] amdgpu 0000:0d:00.0: amdgpu: WALKER_ERROR: 0x0
[ 716.928957] amdgpu 0000:0d:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 716.928959] amdgpu 0000:0d:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 716.928962] amdgpu 0000:0d:00.0: amdgpu: RW: 0x0
[ 716.928969] amdgpu 0000:0d:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:1 pasid:32769, for process Xorg pid 1163 thread Xorg:cs0 pid 1165)
[ 716.928973] amdgpu 0000:0d:00.0: amdgpu: in page starting at address 0x0000800180285000 from client 0x1b (UTCL2)
[ 716.928977] amdgpu 0000:0d:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 716.928979] amdgpu 0000:0d:00.0: amdgpu: Faulty UTCL2 client ID: CB/DB (0x0)
[ 716.928983] amdgpu 0000:0d:00.0: amdgpu: MORE_FAULTS: 0x0
[ 716.928985] amdgpu 0000:0d:00.0: amdgpu: WALKER_ERROR: 0x0
[ 716.928988] amdgpu 0000:0d:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 716.928991] amdgpu 0000:0d:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 716.928994] amdgpu 0000:0d:00.0: amdgpu: RW: 0x0
[ 716.929000] amdgpu 0000:0d:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:1 pasid:32769, for process Xorg pid 1163 thread Xorg:cs0 pid 1165)
[ 716.929005] amdgpu 0000:0d:00.0: amdgpu: in page starting at address 0x0000800180390000 from client 0x1b (UTCL2)
[ 716.929008] amdgpu 0000:0d:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 716.929011] amdgpu 0000:0d:00.0: amdgpu: Faulty UTCL2 client ID: CB/DB (0x0)
[ 716.929014] amdgpu 0000:0d:00.0: amdgpu: MORE_FAULTS: 0x0
[ 716.929017] amdgpu 0000:0d:00.0: amdgpu: WALKER_ERROR: 0x0
[ 716.929020] amdgpu 0000:0d:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 716.929022] amdgpu 0000:0d:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 716.929025] amdgpu 0000:0d:00.0: amdgpu: RW: 0x0
[ 727.048598] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=26147, emitted seq=26149
[ 727.048951] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1163 thread Xorg:cs0 pid 1165
The number of page faults varies from hang to hang.
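The per-fault fields in the log above (MORE_FAULTS, PERMISSION_FAULTS, and so on) are bit slices of the GCVM_L2_PROTECTION_FAULT_STATUS value. Below is a minimal Python sketch of that decoding, for anyone who wants to decode a raw status value by hand. The bit layout is an assumption based on my reading of the GC 10.x register headers in the kernel tree and has not been checked against this exact kernel; it does reproduce the values the driver printed for 0x00101031, including vmid 1. `FIELDS` and `decode_fault_status` are hypothetical names for illustration, not part of amdgpu or umr.

```python
# Assumed bit layout of GCVM_L2_PROTECTION_FAULT_STATUS on GC 10.x:
# field name -> (shift, width in bits). Verify against the register
# headers of the kernel in use before relying on it.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),       # the "Faulty UTCL2 client ID" in the log
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_fault_status(status: int) -> dict[str, int]:
    """Split a raw status value into its named bit fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # Status value taken from the first fault in the dmesg snippet.
    for name, value in decode_fault_status(0x00101031).items():
        print(f"{name}: {value:#x}")
```

Under this layout, 0x00101031 decodes to MORE_FAULTS 0x1, PERMISSION_FAULTS 0x3 and CID 0x8 (TCP), matching the first two faults in the log, while the later faults with status 0x00000000 decode to all zeros.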
Hardware description:
- CPU: AMD Ryzen 7 5800X 8-Core Processor
- GPU: 0d:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [Radeon RX 6700/6700 XT/6750 XT / 6800M/6850M XT] [1002:73df] (rev c5) / Navi 22 [Radeon RX 6700/6700 XT/6750 XT / 6800M/6850M XT] / Navy Flounder / Asus Dual Radeon™ RX 6700 XT OC Edition
- System Memory: 32GiB
- Display(s): Eizo CS240
- Type of Display Connection: DP
System information:
- Distro name and Version: Debian stable 12.2
- Mesa version: 22.3.6-1+deb12u1 (from Debian repos)
- Custom kernel: Linux octo 6.5.7 #6 SMP PREEMPT_DYNAMIC Sun Oct 22 23:08:35 CEST 2023 x86_64 GNU/Linux (self-built via make bindeb-pkg + candidate patches from #2627 applied)
- AMD official driver version: NA, using Linux's amdgpu
- Firmware version: should be from linux-firmware.git at commit a3bcbbf2e5d13b49197ecd39ae47715515bf38c2 (the latest at this point).
VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 44, firmware version: 0x00000040
PFP feature version: 44, firmware version: 0x00000061
CE feature version: 44, firmware version: 0x00000025
RLC feature version: 1, firmware version: 0x0000004a
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
RLCP feature version: 0, firmware version: 0x00000000
RLCV feature version: 0, firmware version: 0x00000000
MEC feature version: 44, firmware version: 0x00000073
MEC2 feature version: 44, firmware version: 0x00000073
IMU feature version: 0, firmware version: 0x00000000
SOS feature version: 0, firmware version: 0x00220a0c
ASD feature version: 553648303, firmware version: 0x210000af
TA XGMI feature version: 0x00000000, firmware version: 0x00000000
TA RAS feature version: 0x00000000, firmware version: 0x00000000
TA HDCP feature version: 0x00000000, firmware version: 0x1700003a
TA DTM feature version: 0x00000000, firmware version: 0x12000015
TA RAP feature version: 0x00000000, firmware version: 0x07000213
TA SECUREDISPLAY feature version: 0x00000000, firmware version: 0x00000000
SMC feature version: 0, program: 0, firmware version: 0x00413b00 (65.59.0)
SDMA0 feature version: 52, firmware version: 0x00000050
SDMA1 feature version: 52, firmware version: 0x00000050
VCN feature version: 0, firmware version: 0x0211d002
DMCU feature version: 0, firmware version: 0x00000000
DMCUB feature version: 0, firmware version: 0x02020020
TOC feature version: 0, firmware version: 0x00000000
MES_KIQ feature version: 0, firmware version: 0x00000000
MES feature version: 0, firmware version: 0x00000000
VBIOS version: 115-D512BS0-100
How to reproduce the issue:
The hang happens at random, but for some reason background music playback (audacious and especially deadbeef) makes hangs much more frequent.
Log files (for system lockups / game freezes / crashes)
- Dmesg log (full log): dmesg.Fri_Oct_27_07_56_33_PM_CEST_2023.log
- Ring gfx_0.0.0 dump via umr, collected for a crash with gpu_recovery=0. I can also upload a binary snapshot of this ring (copied from debugfs via cp) if it's of any use.
- amdgpu_fence_info
- lshw output
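
To compare hangs across a full dmesg log, the page faults preceding each ring timeout can be tallied with a short script. This is a rough sketch, not a tool I used for this report: `summarize` and the regular expressions are hypothetical and are written only against the message formats visible in the snippet above. The counts are a lower bound, because the kernel rate-limits these messages (see the "callbacks suppressed" line).

```python
import re
from collections import Counter

# Patterns written against the message formats in the snippet above.
FAULT_RE = re.compile(r"\[gfxhub\] page fault")
ADDR_RE = re.compile(r"in page starting at address (0x[0-9a-fA-F]+)")
TIMEOUT_RE = re.compile(r"ring (\S+) timeout")


def summarize(lines):
    """Return one entry per ring timeout with the faults logged before it."""
    hangs = []
    faults, pages = 0, Counter()
    for line in lines:
        if FAULT_RE.search(line):
            faults += 1
        elif (m := ADDR_RE.search(line)):
            pages[m.group(1)] += 1
        elif (m := TIMEOUT_RE.search(line)):
            hangs.append(
                {"ring": m.group(1), "faults": faults, "pages": dict(pages)}
            )
            # Start a fresh tally for the next hang.
            faults, pages = 0, Counter()
    return hangs


if __name__ == "__main__":
    # Three lines copied from the dmesg snippet in this report.
    sample = [
        "[ 716.928700] amdgpu 0000:0d:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:1 pasid:32769, for process Xorg pid 1163 thread Xorg:cs0 pid 1165)",
        "[ 716.928710] amdgpu 0000:0d:00.0: amdgpu: in page starting at address 0x0000800180384000 from client 0x1b (UTCL2)",
        "[ 727.048598] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=26147, emitted seq=26149",
    ]
    print(summarize(sample))
```

To run it over a real log, pass an open file instead of the sample list, for example `summarize(open("dmesg.log"))`, where the file name is a placeholder.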