RX580 hangs when under moderate load
Brief summary of the problem:
I have an RX580 (XFX GTR-S RX 580 Black Limited Edition OC+)
0c:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] [1002:67df] (rev e7) (prog-if 00 [VGA controller])
Subsystem: XFX Pine Group Inc. Radeon RX 580 XTR [1682:9588]
which hangs whenever doing anything graphically intensive (like playing a game). Otherwise, it runs fine for months with just desktop compositing and 4 monitors attached.
[2801491.949138] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=2095446190, emitted seq=2095446192
[2801491.949379] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process PAC-MAN WORLD R pid 1550729 thread dxvk-submit pid 1550787
[2801491.949592] amdgpu 0000:0c:00.0: amdgpu: GPU reset begin!
[2801491.949839] amdgpu 0000:0c:00.0: amdgpu:
[2801491.949839] last message was failed ret is 65535
[2801491.949844] amdgpu 0000:0c:00.0: amdgpu:
[2801491.949844] last message was failed ret is 65535
[2801491.949849] amdgpu 0000:0c:00.0: amdgpu:
[2801491.949849] last message was failed ret is 65535
[2801491.949854] amdgpu 0000:0c:00.0: amdgpu:
[2801491.949854] last message was failed ret is 65535
[2801491.949858] amdgpu 0000:0c:00.0: amdgpu:
[2801491.949858] last message was failed ret is 65535
[2801491.949863] amdgpu 0000:0c:00.0: amdgpu:
[2801491.949863] last message was failed ret is 65535
[2801491.949868] amdgpu 0000:0c:00.0: amdgpu:
[2801491.949868] last message was failed ret is 65535
[2801491.949872] amdgpu 0000:0c:00.0: amdgpu:
[2801491.949872] last message was failed ret is 65535
[2801491.949877] amdgpu 0000:0c:00.0: amdgpu:
[2801491.949877] last message was failed ret is 65535
[2801491.949881] amdgpu 0000:0c:00.0: amdgpu:
[2801491.949881] last message was failed ret is 65535
[2801491.949886] amdgpu 0000:0c:00.0: amdgpu:
[2801491.949886] last message was failed ret is 65535
[2801491.949890] amdgpu 0000:0c:00.0: amdgpu:
[2801491.949890] last message was failed ret is 65535
[2801491.949895] amdgpu 0000:0c:00.0: amdgpu:
[2801491.949895] last message was failed ret is 65535
[2801491.949899] amdgpu 0000:0c:00.0: amdgpu:
[2801491.949899] last message was failed ret is 65535
[2801491.949904] amdgpu 0000:0c:00.0: amdgpu:
[2801491.949904] last message was failed ret is 65535
[2801491.949908] amdgpu 0000:0c:00.0: amdgpu:
[2801491.949908] last message was failed ret is 65535
[2801491.949913] amdgpu 0000:0c:00.0: amdgpu:
[2801491.949913] last message was failed ret is 65535
[2801491.950023] amdgpu 0000:0c:00.0: amdgpu:
[2801491.950023] last message was failed ret is 65535
[2801491.950028] amdgpu 0000:0c:00.0: amdgpu:
[2801491.950028] last message was failed ret is 65535
[2801491.950031] amdgpu 0000:0c:00.0: amdgpu:
[2801491.950031] last message was failed ret is 65535
[2801491.983483] [drm] REG_WAIT timeout 10us * 3000 tries - dce110_stream_encoder_dp_blank line:956
[2801513.298928] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
[2801513.299035] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing DC32 (len 824, WS 0, PS 0) @ 0xDDB2
[2801513.299124] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing DAEC (len 326, WS 0, PS 0) @ 0xDBE9
[2801513.299212] [drm:dce110_link_encoder_disable_output [amdgpu]] *ERROR* dce110_link_encoder_disable_output: Failed to execute VBIOS command table!
[2801514.847450] [drm] REG_WAIT timeout 10us * 3000 tries - dce110_stream_encoder_dp_blank line:956
[2801521.771442] sysrq: Keyboard mode set to system default
[2801523.984429] sysrq: Terminate All Tasks
[2801543.062279] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
[2801543.062387] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing B49C (len 1227, WS 8, PS 8) @ 0xB731
[2801543.062477] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing DAEC (len 326, WS 0, PS 0) @ 0xDC08
[2801543.062566] [drm:dce110_link_encoder_disable_output [amdgpu]] *ERROR* dce110_link_encoder_disable_output: Failed to execute VBIOS command table!
[2801563.065911] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
[2801563.066012] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing C596 (len 62, WS 0, PS 0) @ 0xC5B2
[2801583.069614] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
[2801583.069703] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing DC32 (len 824, WS 0, PS 0) @ 0xDDB2
[2801583.069790] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing DAEC (len 326, WS 0, PS 0) @ 0xDBDC
[2801583.069878] [drm:dce110_link_encoder_disable_output [amdgpu]] *ERROR* dce110_link_encoder_disable_output: Failed to execute VBIOS command table!
[2801614.085137] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
[2801614.085246] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing B49C (len 1227, WS 8, PS 8) @ 0xB724
This is always reproducible.
This has been happening ever since I bought this card a few years ago, and also with generic Ubuntu kernels.
My workaround has been disabling the higher clocks in pp_dpm_sclk:
echo manual | sudo tee /sys/class/drm/card0/device/power_dpm_force_performance_level
echo 5 | sudo tee /sys/class/drm/card0/device/pp_dpm_sclk
This works around the issue, however every time the monitors sleep or I switch modes on one of the monitors, it resets back to auto. If I forget to set it back to manual, I get a GPU hang again.
The contents of pp_dpm_sclk:
$ cat /sys/class/drm/card0/device/pp_dpm_sclk
0: 300Mhz
1: 751Mhz
2: 1048Mhz
3: 1158Mhz
4: 1240Mhz
5: 1309Mhz *
6: 1364Mhz
7: 1430Mhz
Some time ago I came across 816b6931315b ("drm/amdgpu/powerplay: Add special avfs cases for some polaris asics (v3)").
The ASIC on this card is one of those affected (0x67df
rev 0xe7
), and I have tried commenting/removing the case for it, however that made it crash with light loads as well.
In my kernel cmdline I have amdgpu.ppfeaturemask=0xffffffff amdgpu.dc=1
.
Hardware description:
- CPU: R9 3950X
- GPU:
0c:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] [1002:67df] (rev e7) (prog-if 00 [VGA controller])
- System Memory: 64GB
- Display(s): GW2255, GW2760, SyncMaster 205BW, VG27AQ1A
- Type of Display Connection: 2xDP, 1xHDMI, 1xDVI
System information:
- Distro name and Version: Ubuntu 20.04.5
- Kernel version:
Linux tatokis-PC 5.18.11 #1 SMP PREEMPT_DYNAMIC Thu Jul 14 03:43:22 EEST 2022 x86_64 x86_64 x86_64 GNU/Linux
- Custom kernel: Personal custom config
- AMD official driver version: N/A
How to reproduce the issue:
Run a GPU intensive process and wait. Any modern game will do. If I had to guess, anything more intensive than, for example, Quake Live, will cause a hang.