By default the kernel sets a maximum GPU clock that exceeds the manufacturer's specifications, causing hardware crashes
Brief summary of the problem:
On my computer I get sporadic, nondeterministic GPU crashes while playing various games through Valve Proton, such as Forza Horizon 5, Need for Speed Unbound, and Horizon Zero Dawn.
Gentoo documents these crashes and attributes them to a misconfiguration of the maximum power consumption and maximum clock. I encountered the same errors described on their wiki, and following their documentation seems to have fixed the problem (https://wiki.gentoo.org/wiki/AMDGPU#Frequent_and_Sporadic_Crashes)
dmesg: https://gist.github.com/andrew-ld/2f60712b7fdc5d8d735f8eb9e72fd092
Hardware description:
- CPU: AMD 7800x3d
- GPU: 03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 [Radeon RX 7900 XT/7900 XTX] [1002:744c] (rev c8)
- System Memory: 32GB CMK32GX5M2E6000Z36
- Display(s): Dell G3223Q
- Type of Display Connection: DP
System information:
- Distro name and Version: Arch Linux
- Kernel version: Linux arch 6.7.1-zen1-1.1-zen ZEN SMP PREEMPT_DYNAMIC Wed, 24 Jan 2024 06:23:15 +0000 x86_64 GNU/Linux
- Custom kernel: linux zen compiled with march x86-64-v3
- AMD official driver version: N/A
How to reproduce the issue:
The issue occurs with the default configuration, without making any changes (it also occurs on the mainline Linux kernel).
https://www.techpowerup.com/gpu-specs/sapphire-pulse-rx-7900-xtx.b9966
According to the specifications of this video card, the default maximum clock is 2525 MHz in boost mode and the estimated maximum power consumption is about 370 W.
[andrew@arch ~]$ cat /sys/class/drm/card1/device/pp_od_clk_voltage
OD_SCLK:
0: 500Mhz
1: 2930Mhz
OD_MCLK:
0: 97Mhz
1: 1250MHz
OD_VDDGFX_OFFSET:
0mV
OD_RANGE:
SCLK: 500Mhz 5000Mhz
MCLK: 97Mhz 1500Mhz
VDDGFX_OFFSET: -450mv 0mv
The first problem is that although the video card has a maximum clock of 2525 MHz, Linux appears to allow it to run at up to 2930 MHz.
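To make the comparison reproducible, here is a minimal Python sketch that parses the `pp_od_clk_voltage` text shown above and reports how far the top OD_SCLK level sits above the rated boost clock. The function names are hypothetical, and the 2525 MHz figure comes from the TechPowerUp page, not from the driver.

```python
import re

# Matches level lines such as "1: 2930Mhz" (the unit's case varies in the output).
LEVEL_RE = re.compile(r"^(\d+):\s*(\d+)\s*mhz$", re.IGNORECASE)

def parse_od_levels(text, section="OD_SCLK"):
    """Return {level_index: MHz} for one section of pp_od_clk_voltage output."""
    levels = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith(":"):
            # Section headers ("OD_SCLK:", "OD_RANGE:") are the lines ending in a colon.
            current = line[:-1]
            continue
        match = LEVEL_RE.match(line)
        if current == section and match:
            levels[int(match.group(1))] = int(match.group(2))
    return levels

def sclk_excess_mhz(text, rated_boost_mhz):
    """Difference between the highest OD_SCLK level and the rated boost clock."""
    levels = parse_od_levels(text, "OD_SCLK")
    return max(levels.values()) - rated_boost_mhz

# Sample copied from the output in this report.
SAMPLE = """OD_SCLK:
0: 500Mhz
1: 2930Mhz
OD_MCLK:
0: 97Mhz
1: 1250MHz
OD_VDDGFX_OFFSET:
0mV
OD_RANGE:
SCLK: 500Mhz 5000Mhz
MCLK: 97Mhz 1500Mhz
VDDGFX_OFFSET: -450mv 0mv
"""

if __name__ == "__main__":
    print(sclk_excess_mhz(SAMPLE, 2525))  # prints 405
```

On a live system the same text could be read from /sys/class/drm/card1/device/pp_od_clk_voltage; the card index varies between machines.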
[andrew@arch ~]$ sensors
amdgpu-pci-1200
Adapter: PCI adapter
vddgfx: 1.08 V
vddnb: 1.24 V
edge: +37.0°C
PPT: 34.06 W
nvme-pci-0400
Adapter: PCI adapter
Composite: +37.9°C (low = -273.1°C, high = +81.8°C)
(crit = +84.8°C)
Sensor 1: +37.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 2: +38.9°C (low = -273.1°C, high = +65261.8°C)
k10temp-pci-00c3
Adapter: PCI adapter
Tctl: +42.5°C
Tccd1: +35.6°C
amdgpu-pci-0300
Adapter: PCI adapter
vddgfx: 685.00 mV
fan1: 0 RPM (min = 0 RPM, max = 3600 RPM)
edge: +55.0°C (crit = +100.0°C, hyst = -273.1°C)
(emerg = +105.0°C)
junction: +61.0°C (crit = +110.0°C, hyst = -273.1°C)
(emerg = +115.0°C)
mem: +64.0°C (crit = +108.0°C, hyst = -273.1°C)
(emerg = +113.0°C)
PPT: 33.00 W (cap = 303.00 W)
The second problem seems to be that the card can draw at most 303 W, rather than the maximum stated by the manufacturer.
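The cap reported by `sensors` can also be read directly from the amdgpu hwmon directory. The sketch below assumes the standard hwmon attributes `power1_cap` and `power1_cap_max`, which hold values in microwatts. The function name is hypothetical, and the hwmon directory (something like /sys/class/drm/card1/device/hwmon/hwmonN) differs between systems.

```python
from pathlib import Path

MICROWATTS_PER_WATT = 1_000_000

def read_power_caps_watts(hwmon_dir):
    """Return (current_cap, max_cap) in watts from an amdgpu hwmon directory."""
    hwmon = Path(hwmon_dir)
    # hwmon exposes power limits as integer microwatts, one value per file.
    cap = int((hwmon / "power1_cap").read_text().strip())
    cap_max = int((hwmon / "power1_cap_max").read_text().strip())
    return cap / MICROWATTS_PER_WATT, cap_max / MICROWATTS_PER_WATT
```

Comparing the two values shows whether the active cap is already at the highest limit the driver will accept, or whether there is room to raise it.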
I was able to change both settings at runtime using https://github.com/ilya-zlobintsev/LACT (it simply writes the new values to the amdgpu sysfs files).
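For reference, the equivalent manual writes can be sketched in Python as follows. This is not LACT's actual code, only an illustration under these assumptions: the `s 1 <MHz>` and `c` (commit) commands follow the kernel's amdgpu documentation for the two-level OD_SCLK layout shown above, `power1_cap` takes microwatts, the helper names are hypothetical, and the writes need root.

```python
from pathlib import Path

def od_sclk_commands(max_mhz):
    """Commands that set OD_SCLK level 1 (the maximum) and then commit the change."""
    return [f"s 1 {int(max_mhz)}", "c"]

def set_max_sclk(device_dir, max_mhz):
    """Write each command to pp_od_clk_voltage as a separate write, like separate echo calls."""
    od_file = Path(device_dir) / "pp_od_clk_voltage"
    for command in od_sclk_commands(max_mhz):
        od_file.write_text(command + "\n")

def set_power_cap(hwmon_dir, watts):
    """Write the power cap in microwatts to the hwmon power1_cap attribute."""
    microwatts = round(watts * 1_000_000)
    (Path(hwmon_dir) / "power1_cap").write_text(f"{microwatts}\n")

if __name__ == "__main__":
    # Example paths only; the card and hwmon indices vary between systems.
    set_max_sclk("/sys/class/drm/card1/device", 2525)
```

Settings written this way do not persist across reboots, so they have to be reapplied at startup.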