Radeon Vega 64 Crash with screen rainbow snow corruption
Vega 64 Crash with screen rainbow snow corruption ring gfx_low timeout
Possibly Related Issues
Brief summary of the problem:
Vega 64 Crash with screen rainbow snow corruption New crash behavior since upgrading from Ubuntu 23.04 to 23.10. Same system/hardware was stable on 23.04. Crash seems random, not consistently while low or high activity, gpu utilization, or heat/power. Clustered around 10-15m after boot, but not always.
Hardware description:
- CPU: AMD Ryzen 7 3700X 8-Core Processor
- GPU: Vega 10 XL/XT [Radeon RX Vega 56/64] [1002:687F]
- System Memory: 32 GB DDR4
- Display(s): Occurs with different, but HDMI at 1080 & 4k
- Type of Display Connection: HDMI and/or DP
- Card Info from VGABIOS
Vega10 A1 XT D05001 32Mx128 852e/945m 0.95V
ATOMBIOSBK-AMD VER016.001.001.000.008892 D0500100.105
PowerPlayInfo table v8.1 at offset 0x917A
System information:
- Distro name and Version: 23.10
- Kernel version: 6.5.0-14-generic #14 (closed)-Ubuntu SMP PREEMPT_DYNAMIC Tue Nov 14 2023 x86_64 x86_64 x86_64 GNU/Linux
- Mesa/DRM Info: Mesa 23.2.1-1ubuntu3.1 vega10, LLVM 15.0.7, DRM 3.54, 6.5.0-14-generic apiVersion 1.3.255
- Custom kernel: No
- AMD official driver version: N/A
How to reproduce the issue:
- Boot the system to GUI (Wayland/Gnome I think)
- Log in to gui with non-root user
- Do something, or nothing. Wait 10-15m for most common case, 2d+ in other cases.
- In summary, not reproducible deterministically on demand.
Attached files:
Screenshots/video files
- ToDo. Similar to #1804 (closed)
Log files (for system lockups / game freezes / crashes)
Key log lines correlated with the crash event follow this pattern:
2023-12-11T13:52:02.087588-08:00 brian-desktop kernel: [ 116.152964] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, signaled seq=7773, emitted seq=7775
2023-12-11T13:52:02.087601-08:00 brian-desktop kernel: [ 116.153280] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 4018 thread firefox:cs0 pid 4176
amdgpu boot time parse errors
Potentially Related to crashes, but also possibly unrelated. At driver startup there are a number of vega10 parsing errors. Startup Parse Error Occurrence frequency and array index
grep -A1 UBSAN kern.log | grep type | awk '{print $7,$14}' | sort | uniq -c
12 1 'ATOM_Vega10_CLK_Dependency_Record
8 1 'ATOM_Vega10_MCLK_Dependency_Record
6 1 'ATOM_Vega10_PCIE_Record # init_powerplay_extended_tables
2 1 'ATOM_Vega10_Voltage_Lookup_Record # get_vddc_lookup_table.isra
2 1 'atom_voltage_gpio_map_lut # this one stack trace is from pp_atomfwctrl_get_voltage_table_v4
2 2 'ATOM_Vega10_MM_Dependency_Record
1 3 'ATOM_Vega10_MCLK_Dependency_Record # stack trace of interest vega10_get_powerplay_table_entry
1 5 'ATOM_Vega10_CLK_Dependency_Record
# The common pattern in the call stack seems to be:
pm/powerplay/hwmgr/vega10_processpptables.c
init_powerplay_extended_tables