RX580 GPU crash at maximum VRAM clock speed (kernel 6.8 and later)
For the last month or so I've been trying to narrow down a crash that happens when playing video games or doing other intensive 3D work. The kernel log looks something like this when it happens:
Nov 02 16:58:34 clevergirl kernel: [drm] scheduler comp_1.1.0 is not ready, skipping
Nov 02 16:58:34 clevergirl kernel: [drm:gfx_v8_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
Nov 02 16:58:34 clevergirl kernel: amdgpu 0000:01:00.0: amdgpu: ring gfx timeout, signaled seq=2271933, emitted seq=2271935
Nov 02 16:58:34 clevergirl kernel: amdgpu 0000:01:00.0: amdgpu: Process information: process Wrath.exe pid 56036 thread dxvk-submit pid 56085
Nov 02 16:58:34 clevergirl kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset begin!
Nov 02 16:58:34 clevergirl kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Nov 02 16:58:34 clevergirl steam[52635]: radv/amdgpu: The CS has been cancelled because the context is lost. This context is innocent.
Nov 02 16:58:34 clevergirl kernel: amdgpu: cp is busy, skip halt cp
Nov 02 16:58:34 clevergirl kernel: amdgpu: rlc is busy, skip halt rlc
Nov 02 16:58:34 clevergirl kernel: amdgpu 0000:01:00.0: amdgpu: Dumping IP State
Nov 02 16:58:34 clevergirl kernel: amdgpu 0000:01:00.0: amdgpu: Dumping IP State Completed
Nov 02 16:58:34 clevergirl kernel: amdgpu 0000:01:00.0: amdgpu: BACO reset
Nov 02 16:58:34 clevergirl kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset succeeded, trying to resume
Nov 02 16:58:34 clevergirl kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400380000).
Nov 02 16:58:34 clevergirl kernel: [drm] VRAM is lost due to GPU reset!
Nov 02 16:58:35 clevergirl kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.1.0 test failed (-110)
Nov 02 16:58:35 clevergirl kernel: [drm] UVD and UVD ENC initialized successfully.
Nov 02 16:58:35 clevergirl kernel: [drm] VCE initialized successfully.
Nov 02 16:58:35 clevergirl kernel: amdgpu 0000:01:00.0: amdgpu: recover vram bo from shadow start
Nov 02 16:58:35 clevergirl kernel: amdgpu 0000:01:00.0: amdgpu: recover vram bo from shadow done
Nov 02 16:58:35 clevergirl kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset(2) succeeded!
Nov 02 16:58:35 clevergirl kwin_wayland[2295]: kwin_scene_opengl: 0x3: GL_CONTEXT_LOST in context lost
When the crash happens, the system becomes unresponsive to input for a couple of seconds; then the cursor freezes, the screen goes black, and finally a frozen, corrupted image is displayed. This happens regardless of GPU temperature.
The only way I've found to reliably avoid the crash is to use CoreCtrl to limit the VRAM speed to no more than 1000 MHz, as opposed to the default maximum of 2120 MHz. I've also noticed that, when the VRAM is allowed to reach its maximum speed, the GPU fans will often make an annoying whining noise during or after any 3D rendering. Limiting the VRAM speed reliably prevents this as well.
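For anyone who wants to try the same cap without CoreCtrl, below is a rough sketch of how it could be done through the amdgpu sysfs files pp_dpm_mclk and power_dpm_force_performance_level. It is not necessarily what CoreCtrl does internally. The card0 path, the function names, and the sample clock values in the docstring are assumptions for illustration, and the levels listed on a real card may not match the 2120 MHz figure shown by CoreCtrl.

```python
import re
from pathlib import Path

# Assumption: the RX 580 is card0; adjust if the system has more than one GPU.
DEVICE = Path("/sys/class/drm/card0/device")


def parse_mclk_levels(text):
    """Parse pp_dpm_mclk contents into (index, MHz) pairs.

    The file lists one memory clock level per line, with '*' marking the
    active one, e.g. (illustrative values only):
        0: 300Mhz
        1: 1000Mhz
        2: 2000Mhz *
    """
    levels = []
    for line in text.splitlines():
        match = re.match(r"\s*(\d+):\s*(\d+)\s*MHz", line, re.IGNORECASE)
        if match:
            levels.append((int(match.group(1)), int(match.group(2))))
    return levels


def levels_at_or_below(levels, cap_mhz):
    """Return the indices of all levels whose clock does not exceed cap_mhz."""
    return [index for index, mhz in levels if mhz <= cap_mhz]


def apply_cap(cap_mhz=1000):
    """Restrict the memory clock to levels at or below cap_mhz (needs root)."""
    levels = parse_mclk_levels((DEVICE / "pp_dpm_mclk").read_text())
    allowed = levels_at_or_below(levels, cap_mhz)
    if not allowed:
        raise RuntimeError(f"no memory clock level at or below {cap_mhz} MHz")
    # The driver only accepts a level mask while in manual performance mode.
    (DEVICE / "power_dpm_force_performance_level").write_text("manual\n")
    # Writing a space-separated list of indices enables just those levels.
    (DEVICE / "pp_dpm_mclk").write_text(" ".join(map(str, allowed)) + "\n")
```

Calling apply_cap() as root should restrict the card to the listed levels until the next reboot or until power_dpm_force_performance_level is set back to auto. Whether this avoids the crash as reliably as the CoreCtrl limit is untested here.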
I'm not sure what is going on exactly, as I know basically nothing about graphics hardware. The specific card in question has a history of working fine on Windows, and also IIRC on 5.x kernels, so I don't think it's a hardware issue.
Edit: this is currently with kernel 6.11.7 on Fedora Kinoite 41. However, I've seen it happen with kernels as old as 6.8.0 on Ubuntu 24.04 LTS.