amdgpu: [gfxhub0] no-retry page fault (kernel 6.2.2-arch1-1-6.2.10-arch1-1) on Thinkpad T495 AMD Ryzen 7 3700U
Bug summary
This issue could be a duplicate of #2241 (closed). I recently hit something that looks like that bug (VM_L2_PROTECTION_FAULT_STATUS:0x0050113A) on an AMD Ryzen 7 PRO 3700U, using:
- mainline kernel 6.2.2-arch1-1 (the bug also happened on 6.2.10-arch1-1)
- linux-firmware: 20230210.bf4115c-1
- mesa: 23.0.2-2 from the official extra repository of Arch Linux
The X11 server freezes or locks up, after which it either recovers or crashes.
Hardware description:
- Device: 20UJS00K00 ThinkPad T14s Gen 1
- CPU: AMD Ryzen 7 PRO 3700U with Vega RX10
- GPU: AMD Radeon RX Vega 10, 2GB VRAM (UMA buffer allocated on RAM)
- System Memory: 16GB
- Display(s): laptop display & external monitor connected through a Lenovo USB-C docking station
- Type of Display Connection: HDMI
System Information
- Distro name and Version: Arch Linux x86_64
- Latest Kernel version affected: 6.2.10-arch1-1
- AMD official driver version: using the Arch Linux package mesa 23.0.2-2
When X11 does survive, I notice graphical glitches (areas changing color while I type a command) in Alacritty and Kitty, which are both GPU-accelerated terminals.
Here is the dmesg output:
[34732.131068] amdgpu 0000:06:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:157 vmid:5 pasid:32769, for process Xorg pid 2936 thread Xorg:cs0 pid 2960)
[34732.131080] amdgpu 0000:06:00.0: amdgpu: in page starting at address 0x0000800106a32000 from IH client 0x1b (UTCL2)
[34732.131087] amdgpu 0000:06:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x0050113A
[34732.131089] amdgpu 0000:06:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[34732.131091] amdgpu 0000:06:00.0: amdgpu: MORE_FAULTS: 0x0
[34732.131093] amdgpu 0000:06:00.0: amdgpu: WALKER_ERROR: 0x5
[34732.131094] amdgpu 0000:06:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[34732.131096] amdgpu 0000:06:00.0: amdgpu: MAPPING_ERROR: 0x1
[34732.131097] amdgpu 0000:06:00.0: amdgpu: RW: 0x0
[34742.218188] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, but soft recovered
[34742.219609] amdgpu 0000:06:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:173 vmid:5 pasid:32769, for process Xorg pid 2936 thread Xorg:cs0 pid 2960)
[34742.219623] amdgpu 0000:06:00.0: amdgpu: in page starting at address 0x0000800104442000 from IH client 0x1b (UTCL2)
[34742.219632] amdgpu 0000:06:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x0054115A
[34742.219637] amdgpu 0000:06:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[34742.219640] amdgpu 0000:06:00.0: amdgpu: MORE_FAULTS: 0x0
[34742.219643] amdgpu 0000:06:00.0: amdgpu: WALKER_ERROR: 0x5
[34742.219646] amdgpu 0000:06:00.0: amdgpu: PERMISSION_FAULTS: 0x5
[34742.219648] amdgpu 0000:06:00.0: amdgpu: MAPPING_ERROR: 0x1
[34742.219651] amdgpu 0000:06:00.0: amdgpu: RW: 0x1
[34742.219924] amdgpu 0000:06:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:173 vmid:5 pasid:32769, for process Xorg pid 2936 thread Xorg:cs0 pid 2960)
[34742.219931] amdgpu 0000:06:00.0: amdgpu: in page starting at address 0x000080010462a000 from IH client 0x1b (UTCL2)
[34742.219938] amdgpu 0000:06:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x0054115A
[34742.219941] amdgpu 0000:06:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[34742.219944] amdgpu 0000:06:00.0: amdgpu: MORE_FAULTS: 0x0
[34742.219946] amdgpu 0000:06:00.0: amdgpu: WALKER_ERROR: 0x5
[34742.219949] amdgpu 0000:06:00.0: amdgpu: PERMISSION_FAULTS: 0x5
[34742.219951] amdgpu 0000:06:00.0: amdgpu: MAPPING_ERROR: 0x1
[34742.219953] amdgpu 0000:06:00.0: amdgpu: RW: 0x1
[34752.245339] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, but soft recovered
[34752.247800] amdgpu 0000:06:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:157 vmid:5 pasid:32769, for process Xorg pid 2936 thread Xorg:cs0 pid 2960)
[34752.247812] amdgpu 0000:06:00.0: amdgpu: in page starting at address 0x000080010c73a000 from IH client 0x1b (UTCL2)
[34752.247819] amdgpu 0000:06:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x0050113A
[34752.247822] amdgpu 0000:06:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[34752.247825] amdgpu 0000:06:00.0: amdgpu: MORE_FAULTS: 0x0
[34752.247827] amdgpu 0000:06:00.0: amdgpu: WALKER_ERROR: 0x5
[34752.247830] amdgpu 0000:06:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[34752.247832] amdgpu 0000:06:00.0: amdgpu: MAPPING_ERROR: 0x1
[34752.247834] amdgpu 0000:06:00.0: amdgpu: RW: 0x0
[34752.248054] amdgpu 0000:06:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:157 vmid:5 pasid:32769, for process Xorg pid 2936 thread Xorg:cs0 pid 2960)
[34752.248060] amdgpu 0000:06:00.0: amdgpu: in page starting at address 0x000080010ca3a000 from IH client 0x1b (UTCL2)
[34752.248065] amdgpu 0000:06:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x0050113A
[34752.248067] amdgpu 0000:06:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[34752.248069] amdgpu 0000:06:00.0: amdgpu: MORE_FAULTS: 0x0
[34752.248071] amdgpu 0000:06:00.0: amdgpu: WALKER_ERROR: 0x5
[34752.248073] amdgpu 0000:06:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[34752.248075] amdgpu 0000:06:00.0: amdgpu: MAPPING_ERROR: 0x1
[34752.248078] amdgpu 0000:06:00.0: amdgpu: RW: 0x0
[34762.271480] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, but soft recovered
[34762.391709] amdgpu 0000:06:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:157 vmid:5 pasid:32769, for process Xorg pid 2936 thread Xorg:cs0 pid 2960)
[34762.391726] amdgpu 0000:06:00.0: amdgpu: in page starting at address 0x000080010733a000 from IH client 0x1b (UTCL2)
[34762.391735] amdgpu 0000:06:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x0050113A
[34762.391739] amdgpu 0000:06:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[34762.391742] amdgpu 0000:06:00.0: amdgpu: MORE_FAULTS: 0x0
[34762.391745] amdgpu 0000:06:00.0: amdgpu: WALKER_ERROR: 0x5
[34762.391747] amdgpu 0000:06:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[34762.391750] amdgpu 0000:06:00.0: amdgpu: MAPPING_ERROR: 0x1
[34762.391752] amdgpu 0000:06:00.0: amdgpu: RW: 0x0
[34772.511438] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, but soft recovered
[34783.604806] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, but soft recovered
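For anyone comparing these faults against #2241, here is a minimal sketch that splits the raw VM_L2_PROTECTION_FAULT_STATUS value into the fields the driver prints. The bit positions in `FIELDS` are an assumption: I inferred them by matching the raw values against the decoded lines in this log, not from a register specification, and `decode_fault_status` is a hypothetical helper name.

```python
# Assumed bit layout (shift, width) of VM_L2_PROTECTION_FAULT_STATUS.
# Inferred from the decoded dmesg lines above, not taken from a spec.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),      # "Faulty UTCL2 client ID" in the log
    "RW": (18, 1),
    "VMID": (20, 4),    # matches the vmid: value on the page fault line
}


def decode_fault_status(status):
    """Return a dict of field name -> value extracted from the raw status."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # The three distinct status values that appear in this report.
    for raw in (0x0050113A, 0x0054115A, 0x0010113A):
        fields = decode_fault_status(raw)
        decoded = ", ".join(f"{k}=0x{v:X}" for k, v in fields.items())
        print(f"0x{raw:08X}: {decoded}")
```

Under this layout, the only differences between 0x0050113A and 0x0054115A are PERMISSION_FAULTS (0x3 vs 0x5) and RW (0x0 vs 0x1), which agrees with the decoded lines in the log. WALKER_ERROR, MAPPING_ERROR and the client ID (TCP, 0x8) are the same in every fault.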
I then upgraded to linux-firmware-git (20230320.bcdcfbc-1) and I'm hitting a similar issue, except that the X11 server now crashes almost every time it freezes.
Part of the dmesg:
(many occurrences of the same fault just before this point)
[65749.018476] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, but soft recovered
[65750.946090] amdgpu 0000:06:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:157 vmid:1 pasid:32769, for process Xorg pid 1071554 thread Xorg:cs0 pid 1071556)
[65750.946107] amdgpu 0000:06:00.0: amdgpu: in page starting at address 0x000080010233a000 from IH client 0x1b (UTCL2)
[65750.946120] amdgpu 0000:06:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x0010113A
[65750.946123] amdgpu 0000:06:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[65750.946126] amdgpu 0000:06:00.0: amdgpu: MORE_FAULTS: 0x0
[65750.946129] amdgpu 0000:06:00.0: amdgpu: WALKER_ERROR: 0x5
[65750.946131] amdgpu 0000:06:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[65750.946134] amdgpu 0000:06:00.0: amdgpu: MAPPING_ERROR: 0x1
[65750.946136] amdgpu 0000:06:00.0: amdgpu: RW: 0x0
[65760.964565] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, but soft recovered
[65810.254377] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, signaled seq=3702609, emitted seq=3702611
[65810.255377] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1071554 thread Xorg:cs0 pid 1071556
[65810.256378] amdgpu 0000:06:00.0: amdgpu: GPU reset begin!
[65810.399029] amdgpu 0000:06:00.0: amdgpu: MODE2 reset
[65810.399700] amdgpu 0000:06:00.0: amdgpu: GPU reset succeeded, trying to resume
[65810.400022] [drm] PCIE GART of 1024M enabled.
[65810.400025] [drm] PTB located at 0x000000F400A00000
[65810.400046] [drm] PSP is resuming...
[65810.420078] [drm] reserve 0x400000 from 0xf47fc00000 for PSP TMR
[65810.489137] amdgpu 0000:06:00.0: amdgpu: RAS: optional ras ta ucode is not available
[65810.499211] amdgpu 0000:06:00.0: amdgpu: RAP: optional rap ta ucode is not available
[65811.978457] [drm] kiq ring mec 2 pipe 1 q 0
[65811.989596] [drm] VCN decode and encode initialized successfully(under SPG Mode).
[65811.989606] amdgpu 0000:06:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
[65811.989614] amdgpu 0000:06:00.0: amdgpu: ring gfx_low uses VM inv eng 1 on hub 0
[65811.989619] amdgpu 0000:06:00.0: amdgpu: ring gfx_high uses VM inv eng 4 on hub 0
[65811.989623] amdgpu 0000:06:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 5 on hub 0
[65811.989626] amdgpu 0000:06:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 6 on hub 0
[65811.989630] amdgpu 0000:06:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 7 on hub 0
[65811.989634] amdgpu 0000:06:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 8 on hub 0
[65811.989637] amdgpu 0000:06:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 9 on hub 0
[65811.989640] amdgpu 0000:06:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 10 on hub 0
[65811.989644] amdgpu 0000:06:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 11 on hub 0
[65811.989647] amdgpu 0000:06:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 12 on hub 0
[65811.989651] amdgpu 0000:06:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 13 on hub 0
[65811.989654] amdgpu 0000:06:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
[65811.989658] amdgpu 0000:06:00.0: amdgpu: ring vcn_dec uses VM inv eng 1 on hub 1
[65811.989661] amdgpu 0000:06:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 4 on hub 1
[65811.989664] amdgpu 0000:06:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 5 on hub 1
[65811.989667] amdgpu 0000:06:00.0: amdgpu: ring jpeg_dec uses VM inv eng 6 on hub 1
[65812.003410] amdgpu 0000:06:00.0: amdgpu: recover vram bo from shadow start
[65812.003415] amdgpu 0000:06:00.0: amdgpu: recover vram bo from shadow done
[65812.003443] [drm] Skip scheduling IBs!
[65812.003451] [drm] Skip scheduling IBs!
[65812.003459] [drm] Skip scheduling IBs!
[65812.003465] [drm] Skip scheduling IBs!
[65812.003469] [drm] Skip scheduling IBs!
[65812.003472] [drm] Skip scheduling IBs!
[65812.003475] [drm] Skip scheduling IBs!
[65812.003478] [drm] Skip scheduling IBs!
[65812.003481] [drm] Skip scheduling IBs!
[65812.003484] [drm] Skip scheduling IBs!
[65812.003493] amdgpu 0000:06:00.0: amdgpu: GPU reset(30) succeeded!
[65812.011109] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
If more logs would help, let me know how to collect them and I'll do my best to provide them to narrow down this issue.
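In case a condensed view is easier to work with than the raw dmesg, below is a rough sketch of how the logs could be summarized: it counts page faults per (process, ring, vmid), collects the faulting page addresses, and counts ring timeouts. The regular expressions are only written against the message formats visible in this report, so treat them as assumptions, and `summarize` is a hypothetical helper name.

```python
import re
from collections import Counter

# Patterns written against the message formats seen in this report only.
FAULT_RE = re.compile(
    r"\[(?P<hub>\w+)\] no-retry page fault \(src_id:(?P<src>\d+) "
    r"ring:(?P<ring>\d+) vmid:(?P<vmid>\d+) pasid:(?P<pasid>\d+), "
    r"for process (?P<proc>\S+) pid (?P<pid>\d+)"
)
ADDR_RE = re.compile(r"in page starting at address (0x[0-9a-fA-F]+)")
TIMEOUT_RE = re.compile(r"ring (\S+) timeout")


def summarize(log_text):
    """Return (fault counts, faulting page addresses, timeout counts)."""
    faults = Counter()
    addresses = []
    timeouts = Counter()
    for line in log_text.splitlines():
        if m := FAULT_RE.search(line):
            faults[(m["proc"], int(m["ring"]), int(m["vmid"]))] += 1
        elif m := ADDR_RE.search(line):
            addresses.append(int(m.group(1), 16))
        elif m := TIMEOUT_RE.search(line):
            timeouts[m.group(1)] += 1
    return faults, addresses, timeouts


# Three lines copied from the first dmesg excerpt in this report.
SAMPLE = """\
[34742.218188] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, but soft recovered
[34742.219609] amdgpu 0000:06:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:173 vmid:5 pasid:32769, for process Xorg pid 2936 thread Xorg:cs0 pid 2960)
[34742.219623] amdgpu 0000:06:00.0: amdgpu: in page starting at address 0x0000800104442000 from IH client 0x1b (UTCL2)
"""

if __name__ == "__main__":
    faults, addresses, timeouts = summarize(SAMPLE)
    for (proc, ring, vmid), count in faults.items():
        print(f"{proc} ring:{ring} vmid:{vmid} -> {count} fault(s)")
    print("pages:", [hex(a) for a in addresses])
    print("timeouts:", dict(timeouts))
```

The same function can be fed the full text of a saved kernel log, which should make it easier to see whether the faults always come from the same process and ring.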