ring gfx_0.0.0 timeout on 6800XT at idle during display sleep, Xorg task hung, requires physical power cycle
Brief summary of the problem:
When the system is idle and the display is sleeping, a ring gfx_0.0.0 timeout occurs. The journal reports "GPU reset succeeded, trying to resume", but the resume then fails (GPU Recovery Failed: -62). This leaves Xorg:cs0 and a kworker task blocked forever. Neither keyboard or mouse input nor turning the monitor on with its power button wakes the display. I can still ssh into the machine, but systemctl reboot hangs after shutting down most services, and the machine must be reset using the physical reset button on the case.
To be clear, the timeout does not happen when the display goes to sleep, nor when it tries to wake up, but between those two times, while it is asleep. The system itself remains fully powered, not asleep or suspended.
I have never overclocked this GPU. I have undervolted it, using "vo -100" > pp_od_clk_voltage, but I have also experienced this timeout at stock voltage.
The error messages are very similar to #2709, but in that issue the reset occurs under load (gaming or video playback). I have never had the GPU reset under load, only when the display is sleeping. That issue also suggests vBIOS involvement, but the only BIOS I found for this card is the one currently in use.
Hardware description:
- CPU: AMD Ryzen 9 5950X 16-Core Processor
- GPU: 10:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] [1002:73bf] (rev c1)
- GPU branding: PowerColor Red Dragon 6800XT
- GPU vBIOS version (cat /sys/class/drm/card1/device/vbios_version): 113-001-X01
- Motherboard: ASRock X570 Taichi
- System Memory: 4x16GB Samsung M391A2G43BB2-CWE DDR4 ECC
- Display(s): LG 27UL600-W (4k60, VRR enabled)
- Type of Display Connection: DisplayPort 1.4 (according to monitor OSD)
System information:
- Distro name and Version: Arch Linux
- Kernel version: 6.6.8-arch1-1
- Custom kernel: no, stock Arch kernel
- Kernel command line: amdgpu.ppfeaturemask=0xff7ffff amdgpu.gpu_recovery=1
- AMD official driver version: n/a
- glxinfo Device string: AMD Radeon RX 6800 XT (radeonsi, navi21, LLVM 16.0.6, DRM 3.54, 6.6.8-arch1-1) (0x73bf)
- mesa 1:23.3.1-1, vulkan-radeon 1:23.3.1-1
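As a rough aid for reading the ppfeaturemask value on the kernel command line, here is a minimal sketch that tests a few individual bits. The bit values are an assumption: they are what I recall from the kernel's PP_FEATURE_MASK enum (drivers/gpu/drm/amd/include/amd_shared.h) and should be checked against the source of the running kernel. The function name decode_ppfeaturemask is made up for this illustration.

```python
# Bit values as recalled from enum PP_FEATURE_MASK in amd_shared.h.
# ASSUMPTION: verify against the kernel source you are actually running.
PP_FEATURE_BITS = {
    "PP_OVERDRIVE_MASK": 0x4000,
    "PP_GFXOFF_MASK": 0x8000,
    "PP_STUTTER_MODE": 0x20000,
    "PP_GFX_DCS_MASK": 0x80000,
}


def decode_ppfeaturemask(mask: int) -> dict:
    """Return, for each known feature name, whether its bit is set in mask."""
    return {name: bool(mask & bit) for name, bit in PP_FEATURE_BITS.items()}


if __name__ == "__main__":
    # Value taken from the kernel command line in this report.
    for name, enabled in decode_ppfeaturemask(0xff7ffff).items():
        print(f"{name}: {'set' if enabled else 'clear'}")
```

Under those assumed bit values, the mask in this report has the overdrive and GFXOFF bits set. The GFXOFF bit may be relevant because the first failing SMU message in the journal is DisallowGfxOff.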
How to reproduce the issue:
Leave the system on overnight with Firefox open displaying a static page (no animations or video). The timeout happens about once a week. (I realize this is not a useful reproducer.)
Attached files:
journal:
Dec 31 11:55:33 promenade kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=48077159, emitted seq=48077161
Dec 31 11:55:33 promenade kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 1788 thread firefox:cs0 pid 1866
Dec 31 11:55:33 promenade kernel: amdgpu 0000:10:00.0: amdgpu: GPU reset begin!
Dec 31 11:55:33 promenade kernel: amdgpu 0000:10:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:41 param:0x00000000 message:DisallowGfxOff?
Dec 31 11:55:33 promenade kernel: amdgpu 0000:10:00.0: amdgpu: Failed to disable gfxoff!
Dec 31 11:55:33 promenade kernel: amdgpu 0000:10:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:7 param:0x00000000 message:DisableAllSmuFeatures?
Dec 31 11:55:33 promenade kernel: amdgpu 0000:10:00.0: amdgpu: Failed to disable smu features.
Dec 31 11:55:33 promenade kernel: amdgpu 0000:10:00.0: amdgpu: Fail to disable dpm features!
Dec 31 11:55:33 promenade kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <smu> failed -121
Dec 31 11:55:33 promenade kernel: [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* ring_buffer_start = 000000003c46dd15; ring_buffer_end = 000000004d6843ac; write_frame = 000000004583cf5d
Dec 31 11:55:33 promenade kernel: [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* write_frame is pointing to address out of bounds
Dec 31 11:55:33 promenade kernel: [drm:psp_suspend [amdgpu]] *ERROR* Failed to terminate ras ta
Dec 31 11:55:33 promenade kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <psp> failed -22
Dec 31 11:55:33 promenade kernel: amdgpu 0000:10:00.0: amdgpu: MODE1 reset
Dec 31 11:55:33 promenade kernel: amdgpu 0000:10:00.0: amdgpu: GPU mode1 reset
Dec 31 11:55:33 promenade kernel: amdgpu 0000:10:00.0: amdgpu: GPU smu mode1 reset
Dec 31 11:55:33 promenade kernel: amdgpu 0000:10:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:48 param:0x00000000 message:Mode1Reset?
Dec 31 11:55:33 promenade kernel: amdgpu 0000:10:00.0: amdgpu: GPU mode1 reset failed
Dec 31 11:55:33 promenade kernel: amdgpu 0000:10:00.0: amdgpu: ASIC reset failed with error, -121 for drm dev, 0000:10:00.0
Dec 31 11:55:44 promenade kernel: amdgpu 0000:10:00.0: amdgpu: GPU reset succeeded, trying to resume
Dec 31 11:55:44 promenade kernel: [drm] PCIE GART of 512M enabled (table at 0x00000083FEB00000).
Dec 31 11:55:44 promenade kernel: [drm] VRAM is lost due to GPU reset!
Dec 31 11:55:44 promenade kernel: [drm] PSP is resuming...
Dec 31 11:55:44 promenade kernel: [drm:psp_hw_start [amdgpu]] *ERROR* PSP create ring failed!
Dec 31 11:55:44 promenade kernel: [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
Dec 31 11:55:44 promenade kernel: [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: amdgpu 0000:10:00.0: amdgpu: GPU reset(2) failed
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: snd_hda_intel 0000:10:00.1: Unable to change power state from D3hot to D0, device inaccessible
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: [drm] Skip scheduling IBs!
Dec 31 11:55:44 promenade kernel: snd_hda_intel 0000:10:00.1: CORB reset timeout#2, CORBRP = 65535
Dec 31 11:55:44 promenade kernel: amdgpu 0000:10:00.0: amdgpu: GPU reset end with ret = -62
Dec 31 11:55:44 promenade kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -62
Dec 31 11:55:54 promenade kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=283817, emitted seq=283819
Dec 31 11:55:54 promenade kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
Dec 31 11:55:54 promenade kernel: amdgpu 0000:10:00.0: amdgpu: GPU reset begin!
Dec 31 11:55:54 promenade kernel: amdgpu 0000:10:00.0: amdgpu: Failed to disallow df cstate
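Because the reset sequence above is buried in repeated lines, a small sketch for pulling the key events out of a kernel journal may help when comparing occurrences. The regular expressions are assumptions derived only from the message formats visible in this excerpt, and the function name summarize is made up for this illustration. The sample lines are copied from the journal above.

```python
import re

# Patterns inferred from the journal excerpt in this report (assumptions).
RING_TIMEOUT = re.compile(
    r"ring (\S+) timeout, signaled seq=(\d+), emitted seq=(\d+)")
SMU_NO_RESPONSE = re.compile(
    r"SMU: response:0xFFFFFFFF for index:(\d+) param:0x[0-9A-Fa-f]+ message:(\w+)")
RECOVERY_FAILED = re.compile(r"GPU Recovery Failed: (-?\d+)")


def summarize(journal_text: str) -> dict:
    """Collect ring timeouts, unanswered SMU messages and recovery errors."""
    summary = {"timeouts": [], "smu_no_response": [], "recovery_failed": []}
    for line in journal_text.splitlines():
        if m := RING_TIMEOUT.search(line):
            # Record the ring name and how many fences were still pending.
            pending = int(m.group(3)) - int(m.group(2))
            summary["timeouts"].append((m.group(1), pending))
        elif m := SMU_NO_RESPONSE.search(line):
            # Record the SMU message index and name that got 0xFFFFFFFF back.
            summary["smu_no_response"].append((int(m.group(1)), m.group(2)))
        elif m := RECOVERY_FAILED.search(line):
            summary["recovery_failed"].append(int(m.group(1)))
    return summary


SAMPLE = """\
Dec 31 11:55:33 promenade kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=48077159, emitted seq=48077161
Dec 31 11:55:33 promenade kernel: amdgpu 0000:10:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:41 param:0x00000000 message:DisallowGfxOff?
Dec 31 11:55:44 promenade kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -62
Dec 31 11:55:54 promenade kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=283817, emitted seq=283819
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

To run it on a real log, the text could come from something like the output of journalctl -k -b -1 (kernel messages from the previous boot) saved to a file and read into a string.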
The Xorg log is full of repetitions of
[357865.433] (II) AMDGPU(0): EDID vendor "GSM", prod id 30471
[357865.433] (II) AMDGPU(0): Using hsync ranges from config file
[357865.433] (II) AMDGPU(0): Using vrefresh ranges from config file
[357865.433] (II) AMDGPU(0): Printing DDC gathered Modelines:
[357865.433] (II) AMDGPU(0): Modeline "3840x2160"x0.0 533.25 3840 3888 3920 4000 2160 2214 2219 2222 +hsync -vsync (133.3 kHz eP)
[357865.433] (II) AMDGPU(0): Modeline "3840x2160"x0.0 266.64 3840 3848 3992 4000 2160 2214 2219 2222 +hsync -vsync (66.7 kHz e)
[357865.433] (II) AMDGPU(0): Modeline "1920x1080"x0.0 148.50 1920 2008 2052 2200 1080 1084 1089 1125 +hsync +vsync (67.5 kHz e)
[357865.433] (II) AMDGPU(0): Modeline "2560x1440"x0.0 241.50 2560 2608 2640 2720 1440 1443 1448 1481 +hsync -vsync (88.8 kHz e)
[357865.433] (II) AMDGPU(0): Modeline "1280x720"x0.0 74.25 1280 1390 1430 1650 720 725 730 750 +hsync +vsync (45.0 kHz e)
[357865.433] (II) AMDGPU(0): Modeline "720x480"x0.0 27.00 720 736 798 858 480 489 495 525 -hsync -vsync (31.5 kHz e)
[357865.433] (II) AMDGPU(0): Modeline "640x480"x0.0 25.18 640 656 752 800 480 490 492 525 -hsync -vsync (31.5 kHz e)
[357865.433] (II) AMDGPU(0): Modeline "800x600"x0.0 40.00 800 840 968 1056 600 601 605 628 +hsync +vsync (37.9 kHz e)
[357865.433] (II) AMDGPU(0): Modeline "1024x768"x0.0 65.00 1024 1048 1184 1344 768 771 777 806 -hsync -vsync (48.4 kHz e)
[357865.433] (II) AMDGPU(0): Modeline "(null)"x60.0 81.75 1152 1216 1336 1520 864 867 871 897 -hsync +vsync (53.8 kHz e)
[357865.433] (II) AMDGPU(0): Modeline "1280x1024"x0.0 108.00 1280 1328 1440 1688 1024 1025 1028 1066 +hsync +vsync (64.0 kHz e)
[357865.433] (II) AMDGPU(0): Modeline "(null)"x59.9 118.25 1600 1696 1856 2112 900 903 908 934 -hsync +vsync (56.0 kHz e)
[357865.433] (II) AMDGPU(0): Modeline "1280x800"x0.0 83.50 1280 1352 1480 1680 800 803 809 831 -hsync +vsync (49.7 kHz e)
with nothing seeming to correspond to the hang, ending with:
[681730.516] (EE)
[681730.516] (EE) Backtrace:
[681730.517] (EE) 0: /usr/lib/Xorg (xorg_backtrace+0x2dd) [0x563a52f08c5d]
[681730.517] (EE) 1: /usr/lib/libc.so.6 (__sigaction+0x50) [0x7fbddd3f6710]
[681730.518] (EE) 2: /usr/lib/libc.so.6 (pthread_key_delete+0x14c) [0x7fbddd44683c]
[681730.519] (EE) 3: /usr/lib/libc.so.6 (raise+0x18) [0x7fbddd3f6668]
[681730.520] (EE) 4: /usr/lib/libc.so.6 (abort+0xd7) [0x7fbddd3de4b8]
[681730.520] (EE) 5: /usr/lib/dri/radeonsi_dri.so (radeon_drm_winsys_create+0x155027) [0x7fbddaca5497]
[681730.520] (EE) 6: /usr/lib/dri/radeonsi_dri.so (radeon_drm_winsys_create+0x15c015) [0x7fbddacac485]
[681730.521] (EE) 7: /usr/lib/dri/radeonsi_dri.so (__driDriverGetExtensions_d3d12+0x48a5d) [0x7fbdda51391d]
[681730.521] (EE) 8: /usr/lib/dri/radeonsi_dri.so (__driDriverGetExtensions_d3d12+0x3fccc) [0x7fbdda50ab8c]
[681730.522] (EE) 9: /usr/lib/libc.so.6 (pthread_condattr_setpshared+0x51b) [0x7fbddd4449eb]
[681730.523] (EE) 10: /usr/lib/libc.so.6 (clone+0x1bc) [0x7fbddd4c87cc]
[681730.523] (EE)
[681730.523] (EE)
Fatal server error:
[681730.523] (EE) Caught signal 6 (Aborted). Server aborting
[681730.523] (EE)
[681730.523] (EE)
Please consult the The X.Org Foundation support
at http://wiki.x.org
for help.
[681730.523] (EE) Please also check the log file at "/home/jbosboom/.local/share/xorg/Xorg.0.log" for additional information.
[681730.523] (EE)