RX 7600 XT: crashing with "ring gfx_0.0.0 timeout" at 65 °C
Brief summary of the problem:
The GPU crashes with the message "ring gfx_0.0.0 timeout" appearing in the kernel logs.
This happens reliably under full load once the GPU reaches a temperature of about 65 °C.
Interestingly enough, undervolting by 100 mV and setting the max GPU clock speed to 2470 MHz allows it to reach up to 75 °C before it crashes. (A script for doing this is attached below.)
There are already a couple of similar issues, but I didn't see any mention of the crash happening at a specific temperature.
Is this likely a case of faulty hardware, or something that is fixable in firmware?
Hardware description:
- CPU: Ryzen 5 7500F
- GPU: 33:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Rembrandt [1002:1681] (rev c1)
- GPU model: Sapphire Pulse RX 7600 XT
- System Memory: 32 GB of DDR5 at 4800 MHz
- Display(s): LG 24MB37PM-B at 1920x1080
- Type of Display Connection: HDMI
System information:
- Distro name and Version: Ubuntu 23.10
- Kernel version: Linux SPC 6.8.0-060800-generic #202403131158 SMP PREEMPT_DYNAMIC Wed Mar 13 12:09:54 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
- Custom kernel: Kernel from Ubuntu Mainline PPA
- AMD official driver version: N/A
- Mesa version: 24.0.2 - kisak-mesa PPA
- Linux firmware: 20240202.git36777504 from Ubuntu 24.04 (slight frankendebian)
- Display server: Wayland
How to reproduce the issue:
- Start "Ratchet and Clank: Rift Apart" (or really any other game that can eat a lot of GPU resources)
  MangoHud can show the current GPU temperature and load: MANGOHUD_CONFIG=gpu_temp,gpu_power mangohud %command%
- Choose maximum graphics settings
  and make sure that this causes GPU usage to be close to 100%
- Wait a little until the GPU has heated up enough
  (the crash usually happens at about 65 °C)
- The screen turns black and the monitor complains that there is no signal. Sound may continue to play for a short while.
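To pin down the temperature at which the crash occurs without relying on an on-screen overlay, the GPU temperature can also be logged to a file while the game runs. The following is only a rough sketch: it assumes the amdgpu hwmon file temp1_input reports the edge temperature in millidegrees Celsius, and the helper names millideg_to_c and log_gpu_temp, the card number, and the log file name are made up for illustration.

```shell
#!/bin/sh
# Convert a hwmon reading in millidegrees Celsius to whole degrees.
millideg_to_c() {
    echo $(( $1 / 1000 ))
}

# Append "HH:MM:SS <temp in °C>" to a log file at a fixed interval.
# $1: path to a temp*_input file, $2: log file, $3: interval in seconds (default 1)
log_gpu_temp() {
    while :; do
        printf '%s %s\n' "$(date +%T)" "$(millideg_to_c "$(cat "$1")")" >> "$2"
        sync   # flush to disk so the last reading survives a hard lockup
        sleep "${3:-1}"
    done
}

# Example call (assumption: the GPU is card1 and has a single hwmon directory):
# log_gpu_temp /sys/class/drm/card1/device/hwmon/hwmon*/temp1_input gpu_temp.log
```

The last line written to the log before the screen goes black should then be close to the temperature at the moment of the crash.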
Attached files:
Script for undervolting and underclocking
After making sure that OverDrive is enabled by setting the bit 0x4000 in the kernel parameter amdgpu.ppfeaturemask (mine is now 0xfff7ffff), the following script can be used to undervolt and underclock slightly for a little more stability.
It needs to be run as root from within /sys/class/drm/card*/device (change * to your card number).
# change mode to manual
echo "manual" | tee power_dpm_force_performance_level
# set max clock to 2470 MHz
echo "s 1 2470" | tee pp_od_clk_voltage
# undervolt by 100 millivolts
echo "vo -100" | tee pp_od_clk_voltage
# commit changes
echo "c" | tee pp_od_clk_voltage
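Since the script above only works when the OverDrive bit is set, it may be worth checking the active mask first. This is a minimal sketch, assuming the current mask can be read from /sys/module/amdgpu/parameters/ppfeaturemask; the function names overdrive_enabled and with_overdrive_bit are hypothetical helpers, not part of any tool.

```shell
#!/bin/sh
# Succeeds (exit status 0) if the OverDrive bit (0x4000) is set in the given mask.
overdrive_enabled() {
    [ $(( $1 & 0x4000 )) -ne 0 ]
}

# Print the given mask with the OverDrive bit added, as an 8-digit hex value.
with_overdrive_bit() {
    printf '0x%08x\n' $(( $1 | 0x4000 ))
}

# Example use (assumption: the amdgpu module is loaded and exposes this file):
# mask=$(cat /sys/module/amdgpu/parameters/ppfeaturemask)
# if overdrive_enabled "$mask"; then
#     echo "OverDrive is enabled"
# else
#     echo "boot with amdgpu.ppfeaturemask=$(with_overdrive_bit "$mask")"
# fi
```

With the mask from this report, overdrive_enabled 0xfff7ffff succeeds, because all of the low 16 bits are set.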
Log files (for system lockups / game freezes / crashes)
- Dmesg log (obtained via journalctl -b -1 -k -o short-precise, since the system was in an unusable state after the crash)
journal.txt