RX580 still hangs under moderate load
As instructed in #2141 (comment 1924659):
Brief summary of the problem:
I have an RX580 (XFX GTR-S RX 580 Black Limited Edition OC+)
0c:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] [1002:67df] (rev e7) (prog-if 00 [VGA controller])
Subsystem: XFX Pine Group Inc. Radeon RX 580 XTR [1682:9588]
The card hangs whenever I do anything graphically intensive (such as playing a game). Otherwise, it runs fine for months with just desktop compositing and multiple monitors attached.
This is always reproducible.
This has been happening ever since the card was purchased, and it also happens with generic Ubuntu kernels.
My workaround has been disabling the higher clocks in pp_dpm_sclk:
echo manual | sudo tee /sys/class/drm/card0/device/power_dpm_force_performance_level
echo 5 | sudo tee /sys/class/drm/card0/device/pp_dpm_sclk
This works around the issue; however, every time the monitors sleep or I switch modes on one of the monitors, the performance level resets to auto. If I forget to set it back to manual, I get a GPU hang again.
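Since the setting has to be re-applied by hand after every reset, the check could be scripted. The following is only a rough sketch, not something I run in production: `reapply_sclk` is a hypothetical helper name, and the default device path and level simply mirror the two commands above. It assumes it is run as root so that plain redirection into sysfs works.

```shell
#!/bin/sh
# Hypothetical helper: re-apply the manual sclk setting if the driver
# has reset power_dpm_force_performance_level to something else.
# Arguments (both optional, defaults taken from the commands above):
#   $1  device directory
#   $2  sclk level to write to pp_dpm_sclk
reapply_sclk() {
    dev=${1:-/sys/class/drm/card0/device}
    level=${2:-5}

    # Read the current performance level; bail out if the file is missing.
    current=$(cat "$dev/power_dpm_force_performance_level") || return 1

    # Only write when the level is no longer "manual".
    if [ "$current" != "manual" ]; then
        echo manual > "$dev/power_dpm_force_performance_level" || return 1
        echo "$level" > "$dev/pp_dpm_sclk" || return 1
        echo "re-applied sclk level $level"
    fi
}
```

Called periodically as root (for example `while sleep 5; do reapply_sclk; done`), this would narrow the window in which the card runs on auto, but it cannot close it: a hang can still occur between the reset and the next poll.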
The contents of pp_dpm_sclk:
$ cat /sys/class/drm/card0/device/pp_dpm_sclk
0: 300Mhz
1: 751Mhz
2: 1048Mhz
3: 1158Mhz
4: 1240Mhz
5: 1309Mhz *
6: 1364Mhz
7: 1430Mhz
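To check quickly which level is currently selected, the line marked with `*` can be extracted from that output. This is a small sketch; `active_sclk` is a hypothetical helper name, and it assumes the file keeps the `level: clock *` layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: print "<level> <clock>" for the entry marked
# with '*' in a pp_dpm_sclk-style file.
#   $1  path to the file (optional, defaults to card0)
active_sclk() {
    awk '$NF == "*" { sub(":", "", $1); print $1, $2 }' \
        "${1:-/sys/class/drm/card0/device/pp_dpm_sclk}"
}
```

With the contents listed above, this should report level 5 at 1309Mhz.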
Some time ago I came across 816b6931315b ("drm/amdgpu/powerplay: Add special avfs cases for some polaris asics (v3)").
The ASIC on this card is one of those affected (0x67df rev 0xe7), and I have tried commenting out/removing the case for it; however, that made it crash under light loads as well.
In my kernel cmdline I have amdgpu.ppfeaturemask=0xffffffff amdgpu.dc=1.
Hardware description:
- GPU:
08:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] [1002:67df] (rev e7) (prog-if 00 [VGA controller])
System information:
- Distro name and Version: Ubuntu 20.04.6
- Kernel version:
Linux version 6.3.4-060304-generic (kernel@kathleen) (x86_64-linux-gnu-gcc-12 (Ubuntu 12.2.0-17ubuntu1) 12.2.0, GNU ld (GNU Binutils for Ubuntu) 2.40) #202305241735 SMP PREEMPT_DYNAMIC Wed May 24 17:46:36 UTC 2023
- Mesa 23.1.0
How to reproduce the issue:
Run a GPU-intensive process and wait. Any modern game will do. If I had to guess, anything more intensive than, for example, Quake Live will cause a hang.