random system freezes with "amdgpu: [gfxhub0] retry page fault VM_L2_PROTECTION_FAULT_STATUS"
Brief summary of the problem:
This issue started a few weeks ago after I upgraded my 5.10.x LTS kernel. I was using sway (Wayland) at the time and began experiencing random system freezes, most often while using Firefox or Chromium or playing videos. Since then, I've upgraded to kernels 5.11 and 5.12 and moved to KDE + Xorg, but the issue persists. The system still freezes randomly, and I lose work because I have to hard-reboot each time.
Hardware description:
- CPU: Ryzen 3500U
- GPU: Vega 8
- System Memory: 16 GB
- Display(s) and Type of Display Connection: laptop 1080p display and an external 4K monitor connected via USB-C
System information:
- Distro name and Version: Arch Linux
- Kernel version: 5.10.37, 5.11.21-hardened, and 5.12.4
- Custom kernel: no, kernels installed from official Arch Linux repositories
- AMD package version: xf86-video-amdgpu 19.1.0, mesa 21.1.0
How to reproduce the issue:
Boot any of the kernel versions listed above, play a video in Firefox or Chromium, and wait for a complete system freeze. The freeze is random and not always reproducible. It sometimes also occurs with other programs, such as KDE System Settings, the sway window manager, the Wayland process, and the Xorg process.
Attached files:
May 12 20:41:58 kernel: amdgpu 0000:04:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32777, for process systemsettings5 pid 417305 thread systemsett:cs0 pid 417308)
May 12 20:41:58 kernel: amdgpu 0000:04:00.0: amdgpu: in page starting at address 0x8001094d0000 from client 27
May 12 20:41:58 kernel: amdgpu 0000:04:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00141051
May 12 20:41:58 kernel: amdgpu 0000:04:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
May 12 20:41:58 kernel: amdgpu 0000:04:00.0: amdgpu: MORE_FAULTS: 0x1
May 12 20:41:58 kernel: amdgpu 0000:04:00.0: amdgpu: WALKER_ERROR: 0x0
May 12 20:41:58 kernel: amdgpu 0000:04:00.0: amdgpu: PERMISSION_FAULTS: 0x5
May 12 20:41:58 kernel: amdgpu 0000:04:00.0: amdgpu: MAPPING_ERROR: 0x0
May 12 20:41:58 kernel: amdgpu 0000:04:00.0: amdgpu: RW: 0x1
May 17 13:14:43 kernel: amdgpu 0000:04:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:7 pasid:32782, for process chromium pid 4408 thread chromium:cs0 pid 4441)
May 17 13:14:43 kernel: amdgpu 0000:04:00.0: amdgpu: in page starting at address 0x800134c21000 from client 27
May 17 13:14:43 kernel: amdgpu 0000:04:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00701031
May 17 13:14:43 kernel: amdgpu 0000:04:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
May 17 13:14:43 kernel: amdgpu 0000:04:00.0: amdgpu: MORE_FAULTS: 0x1
May 17 13:14:43 kernel: amdgpu 0000:04:00.0: amdgpu: WALKER_ERROR: 0x0
May 17 13:14:43 kernel: amdgpu 0000:04:00.0: amdgpu: PERMISSION_FAULTS: 0x3
May 17 13:14:43 kernel: amdgpu 0000:04:00.0: amdgpu: MAPPING_ERROR: 0x0
May 17 13:14:43 kernel: amdgpu 0000:04:00.0: amdgpu: RW: 0x0
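
For anyone comparing faults across reports, the raw VM_L2_PROTECTION_FAULT_STATUS values above can be unpacked by hand. The following is only a minimal sketch: the bit layout in FIELDS is an assumption (it is not stated anywhere in this report), chosen because it reproduces the fields the driver printed for both faults, including the vmid shown on the "retry page fault" lines. The names FIELDS and decode_fault_status are hypothetical helpers, not part of any amdgpu tool.

```python
# Assumed bit layout of VM_L2_PROTECTION_FAULT_STATUS as (name, shift, width).
# This layout is an assumption; it matches the decoded values in the logs above
# but should be checked against the driver headers for the GPU in question.
FIELDS = [
    ("MORE_FAULTS", 0, 1),
    ("WALKER_ERROR", 1, 3),
    ("PERMISSION_FAULTS", 4, 4),
    ("MAPPING_ERROR", 8, 1),
    ("CID", 9, 9),   # the "Faulty UTCL2 client ID" number
    ("RW", 18, 1),
    ("VMID", 20, 4),
]


def decode_fault_status(status):
    """Split a raw status word into named bit fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, shift, width in FIELDS
    }


if __name__ == "__main__":
    # The two status words taken from the log excerpts in this report.
    for raw in (0x00141051, 0x00701031):
        fields = decode_fault_status(raw)
        summary = ", ".join(f"{name}={value:#x}" for name, value in fields.items())
        print(f"{raw:#010x}: {summary}")
```

Under this assumed layout, both faults decode to client ID 0x8 with no walker or mapping error, and they differ only in PERMISSION_FAULTS (0x5 vs 0x3), RW (0x1 vs 0x0), and VMID (1 vs 7), which agrees with the lines the kernel logged.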