Random system reboot on 5.15+ and no video out until a full shutdown and start up
Unfortunately, the logging output before the system restarts hasn't been very consistent. There is a forum thread, updated over the last couple of months, where a number of people report similar issues (at least two different issues are mixed together there, one of which is already on this issue tracker). It definitely seems related to 5.15+ and the most recent generation or two of AMD graphics cards: https://bbs.archlinux.org/viewtopic.php?pid=2007216#p2007216
Description of the issue
After booting a 5.15.x or 5.16.x (so far) kernel, the computer runs for a random amount of time until the display suddenly turns off and stays off. After about a minute I can ssh in and see that the system has rebooted. I can't get the display to turn back on without doing a full shutdown and then starting it back up (rebooting leaves the display unresponsive).
Sometimes there's a lot of graphics-related output before the crash and other times there's almost nothing, so I'm not sure how relevant it is. I'll paste the journalctl -b-1 logs from the two most recent reboots caused by the bug in case they're helpful (in both cases I've filtered out unrelated stuff like mailnag, UFW, etc):
Most recent (doesn't have much relevant info at all):
Jan 12 20:57:11 command kernel: clocksource: timekeeping watchdog on CPU0: hpet retried 2 times before success
Jan 12 20:57:13 command kernel: sched: RT throttling activated
Jan 12 20:57:13 command kernel: clocksource: timekeeping watchdog on CPU4: hpet retried 2 times before success
Previous (has a lot of graphics-related output):
Jan 12 15:19:59 command kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Jan 12 15:20:05 command kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Jan 12 15:20:05 command kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Jan 12 15:20:09 command kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=874, emitted seq=875
Jan 12 15:20:09 command kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 1981 thread gnome-shel:cs0 pid 1987
Jan 12 15:20:09 command kernel: amdgpu 0000:0d:00.0: amdgpu: GPU reset begin!
Jan 12 15:20:14 command kernel: amdgpu 0000:0d:00.0: amdgpu: Failed to disable gfxoff!
Jan 12 15:20:14 command kernel: [drm] REG_WAIT timeout 1us * 200 tries - hubp2_set_blank line:950
Jan 12 15:20:14 command kernel: [drm] REG_WAIT timeout 1us * 200 tries - hubp2_set_blank line:950
Jan 12 15:20:15 command kernel: [drm:drm_atomic_helper_wait_for_flip_done] *ERROR* [CRTC:77:crtc-0] flip_done timed out
Jan 12 15:21:01 command kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jan 12 15:21:01 command gnome-shell: amdgpu: The CS has been cancelled because the context is lost.
Jan 12 15:21:13 command kernel: [UFW BLOCK] IN=enp6s0 OUT= MAC=33:33:00:00:01:14:f4:4d:30:6b:54:30:86:dd SRC=fe80:0000:0000:0000:5376:4079:e39f:b25a DST=ff02:0000:0000:0000:0000:0000:0000:0114 LEN=98 TC=0 HOPLIMIT=1 FLOWLBL=485931 PROTO=UDP SPT=9001 DPT=9001 LEN=58
Jan 12 15:21:20 command kernel: [drm] psp gfx command UNKNOWN CMD(0x0) failed and response status is (0x0)
Jan 12 15:21:21 command kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
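For anyone collecting comparable logs from the previous boot, the kind of filtering described above could be approximated with a small script. This is only a sketch, not what was actually used to produce the excerpts: the function names and both keyword lists are assumptions chosen to match the lines shown here, so adjust them to your own system.

```python
import re
import subprocess

# Hypothetical keyword lists (assumptions, not from the original report).
# KEEP: messages that look related to the GPU or to the timing warnings above.
KEEP = re.compile(r"amdgpu|\[drm|clocksource|sched:", re.IGNORECASE)
# DROP: noise that should never be shown, even if it also matches KEEP.
DROP = re.compile(r"\[UFW BLOCK\]|mailnag", re.IGNORECASE)


def filter_journal_lines(lines):
    """Return the lines that match KEEP and do not match DROP, in order."""
    return [ln for ln in lines if KEEP.search(ln) and not DROP.search(ln)]


def read_previous_boot():
    """Read the journal of the previous boot (same as `journalctl -b -1`)."""
    result = subprocess.run(
        ["journalctl", "-b", "-1", "--no-pager"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.splitlines()
```

Calling `filter_journal_lines(read_previous_boot())` after one of these reboots should give a list roughly like the excerpts above. Reading the journal may require root or membership in a group with journal access, depending on the distro's defaults.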
- CPU: AMD Ryzen 9 5900X
- GPU: AMD Radeon RX 6800 XT
- System Memory: 32GB DDR4 3200
- Display(s): 1080p IPS with FreeSync over DisplayPort
- Distro name and Version: Arch Linux
- Kernel version: 5.15.13-zen1-1-zen, 5.15.13-arch-1, 5.16.0-arch-1 (started happening after the first upgrade to 5.15.x, tested again on 5.16.0)
- Custom kernel: Happens on both stock and zen
- AMD official driver version: OSS drivers/Mesa (21.3.3-2)
This one seems to share a lot in common, though it's definitely different behaviour: #1862
Thanks for your time, and let me know if I can help in any way!
I should add that 5.14.16 and older kernels are rock solid: my system has been running without a reboot since I captured those logs ~6 days ago.