Not for me unfortunately
I experience the same issue on 5.15.5 with a Vega 56 (no overclock). The difference, however, is that it doesn't happen when Xorg starts but when I try to run LibreOffice (100% reproducible): the process hangs, and with radeontop I see the graphics pipe and shader interpolator at 100%. A few seconds later, the screen goes black, the fans become loud and I need to hard reset. Here are the messages the journal caught after the crash:
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring comp_1.1.0 timeout, signaled seq=2, emitted seq=3
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
amdgpu 0000:0c:00.0: amdgpu: GPU reset begin!
amdgpu 0000:0c:00.0: amdgpu: failed to suspend display audio
[drm] psp gfx command UNLOAD_TA(0x2) failed and response status is (0x117)
[drm] free PSP TMR buffer
amdgpu 0000:0c:00.0: amdgpu: BACO reset
Then the same messages again with the same <TASK>, and finally:
amdgpu 0000:0c:00.0: amdgpu: GPU reset succeeded, trying to resume
[drm] PCIE GART of 512M enabled.
[drm] PTB located at 0x000000F400E10000
[drm] VRAM is lost due to GPU reset!
[drm] PSP is resuming...
[drm] reserve 0x400000 from 0xf5fec00000 for PSP TMR
[drm] kiq ring mec 2 pipe 1 q 0
[drm] UVD and UVD ENC initialized successfully.
[drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* got no status for stream 00000000953205e1 on acrtc00000000a96d022b
[drm] VCE initialized successfully.
amdgpu 0000:0c:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
amdgpu 0000:0c:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
amdgpu 0000:0c:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
amdgpu 0000:0c:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
amdgpu 0000:0c:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
amdgpu 0000:0c:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
amdgpu 0000:0c:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
amdgpu 0000:0c:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
amdgpu 0000:0c:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
amdgpu 0000:0c:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
amdgpu 0000:0c:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
amdgpu 0000:0c:00.0: amdgpu: ring page0 uses VM inv eng 1 on hub 1
amdgpu 0000:0c:00.0: amdgpu: ring sdma1 uses VM inv eng 4 on hub 1
amdgpu 0000:0c:00.0: amdgpu: ring page1 uses VM inv eng 5 on hub 1
amdgpu 0000:0c:00.0: amdgpu: ring uvd_0 uses VM inv eng 6 on hub 1
amdgpu 0000:0c:00.0: amdgpu: ring uvd_enc_0.0 uses VM inv eng 7 on hub 1
amdgpu 0000:0c:00.0: amdgpu: ring uvd_enc_0.1 uses VM inv eng 8 on hub 1
amdgpu 0000:0c:00.0: amdgpu: ring vce0 uses VM inv eng 9 on hub 1
amdgpu 0000:0c:00.0: amdgpu: ring vce1 uses VM inv eng 10 on hub 1
amdgpu 0000:0c:00.0: amdgpu: ring vce2 uses VM inv eng 11 on hub 1
amdgpu 0000:0c:00.0: amdgpu: recover vram bo from shadow start
amdgpu 0000:0c:00.0: amdgpu: recover vram bo from shadow done
[drm] Skip scheduling IBs!
amdgpu 0000:0c:00.0: amdgpu: GPU reset(1) succeeded!
[drm] Skip scheduling IBs!
[...]
I even get a NULL pointer dereference:
BUG: kernel NULL pointer dereference, address: 00000000000000f8
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 6 PID: 337 Comm: kworker/u64:15 Tainted: G W OE 5.15.5-arch1-1 #1 f0168f793e3f707b46715a62fafabd6a40826924
Hardware name: System manufacturer System Product Name/CROSSHAIR VI HERO, BIOS 7901 07/31/2020
Workqueue: events_unbound commit_work
RIP: 0010:dce_pipe_control_lock+0x21/0x220 [amdgpu]
Code: b6 09 e9 f2 f4 12 00 66 90 0f 1f 44 00 00 41 56 41 55 41 89 d5 41 54 49 89 f4 55 0f b6 ea 53 48 83 ec 18 48 8b 9f 20 ed 00 00 <48> 8b be f8 00 00 00 65 48 8b 04 25 28 00 00 00 48 89 44 24 10 31
RSP: 0018:ffffb536c11279c8 EFLAGS: 00010292
RAX: ffffffffc03e8f50 RBX: ffff8f7195235a00 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8f7193040000
RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000002
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000001 R14: 0000000000000006 R15: 0000000000001e48
FS: 0000000000000000(0000) GS:ffff8f748eb80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000f8 CR3: 0000000181dd2000 CR4: 0000000000350ee0
Call Trace:
<TASK>
? dce12_update_clocks+0xd8/0x110 [amdgpu 86b3af2709c89c963e3a1e2db7a5a116368b897c]
dc_commit_updates_for_stream+0xd27/0x1e30 [amdgpu 86b3af2709c89c963e3a1e2db7a5a116368b897c]
? flush_workqueue+0x1b4/0x440
amdgpu_dm_atomic_commit_tail+0x164f/0x2670 [amdgpu 86b3af2709c89c963e3a1e2db7a5a116368b897c]
commit_tail+0x94/0x130
process_one_work+0x1e8/0x3c0
worker_thread+0x50/0x3c0
? process_one_work+0x3c0/0x3c0
kthread+0x132/0x160
? set_kthread_struct+0x50/0x50
ret_from_fork+0x22/0x30
</TASK>
amdgpu: The CS has been cancelled because the context is lost.
[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
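For anyone who wants to check their own journal for the same ring timeout, here is a minimal sketch that pulls the ring name and the two sequence numbers out of lines shaped like the `amdgpu_job_timedout` message quoted above. The regular expression is modeled only on that one line, so it is an assumption that other kernel versions print the same format; `parse_ring_timeouts` and `sample` are names made up for this illustration.

```python
import re

# Pattern modeled on the single "ring ... timeout" line quoted in this report.
# Assumption: other kernels print the same wording; adjust if yours differs.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeouts(lines):
    """Return one dict per ring-timeout line found in `lines`."""
    hits = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            hits.append({
                "ring": match.group("ring"),
                "signaled": int(match.group("signaled")),
                "emitted": int(match.group("emitted")),
            })
    return hits


# Two lines copied from the log above: one timeout, one unrelated message.
sample = [
    "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring comp_1.1.0 timeout, "
    "signaled seq=2, emitted seq=3",
    "amdgpu 0000:0c:00.0: amdgpu: GPU reset begin!",
]

for hit in parse_ring_timeouts(sample):
    # emitted - signaled = number of jobs submitted but never completed
    print(hit["ring"], hit["emitted"] - hit["signaled"])
```

To run it on real data, the kernel messages of the previous boot (for example from `journalctl -k -b -1`) can be saved to a file and passed in line by line.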
Hi, I rebased the MR on top of current main, applying some refactoring. I tried it on Chromium with EGL on Wayland (using XWayland) and I can play an H.264 video with hardware acceleration without stutter :)
Here is the patch if people want to try it: ANGLE_sync_control_rate.diff
And here is my branch: https://gitlab.freedesktop.org/t.clastres/mesa/-/tree/GetMscANGLE
However, please note that I have no knowledge of this codebase or of 3D graphics in general, so I will not be able to do much more than that.
Let me know if I should open a new MR though.
Térence Clastres (7ee50e65) at 22 Nov 18:48
mesa: implement ANGLE_sync_control_rate
... and 2 more commits
Térence Clastres (bb6fb606) at 22 Nov 18:43
After I opened this issue, I disabled screen blanking and things seemed to get better. At some point, I stopped using the computer and it remained in suspend to RAM for 3 days. When I woke it up, the screen was black again. Interestingly, the sysrq keys worked this time (no USB problem?).
The log looks different from before so here it is: freeze3.txt
I could try, but I first need a way to quickly reproduce it. Also, because of the needed power reset, I get errors on my disks and one time I couldn't even mount my root partition, so I would also need a way to minimize this risk.
When my computer is inactive, I have set my DE (GNOME on Xorg) to blank the screen after 15 minutes and then suspend (to RAM) after an hour.
When it resumes, it sometimes takes a while to have something on the screen. In the worst cases, the computer completely freezes, I am unable to ssh to it and I also can't use sysrq keys. I'm forced to do a power cycle.
I think this started happening around 5.10. I didn't report it at first because I found related bug reports already open. However, most of them involved 2 screens, and I couldn't find in them some of the errors I had in the attached logs. The frequency of the freezes also seemed to change between certain kernel upgrades (or maybe it was due to my usage?), which sometimes made me think the issue was fixed.
I don't have a solid reproducer but it's either after the screen is blanked or after resuming from suspend or maybe a combination of both.
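One possible way to exercise the suspend/resume half of this more quickly is to loop over `rtcwake`, which suspends the machine and programs the RTC to wake it again. The sketch below is untested against this bug and does not cover the screen-blanking half. The function names, the 60-second wake delay and the `sync` call before each cycle are all assumptions made for illustration. It defaults to a dry run that only lists the commands.

```python
import subprocess


def build_rtcwake_cmd(wake_after_s):
    """Build an rtcwake command: suspend to RAM, wake after `wake_after_s` seconds."""
    return ["rtcwake", "-m", "mem", "-s", str(wake_after_s)]


def run_cycles(cycles, wake_after_s=60, dry_run=True):
    """Run `cycles` suspend/resume cycles; with dry_run=True only list the commands."""
    commands = []
    for _ in range(cycles):
        cmd = build_rtcwake_cmd(wake_after_s)
        commands.append(cmd)
        if not dry_run:
            # Flush dirty pages first, to limit disk damage if a hard reset follows.
            subprocess.run(["sync"], check=True)
            # Needs root; returns once the machine has resumed.
            subprocess.run(cmd, check=True)
    return commands


# Dry run: shows what would be executed without suspending anything.
print(run_cycles(3))
```

Calling `run_cycles(20, dry_run=False)` as root would actually suspend the machine, so it is safer to try it with nothing important open and with the filesystems in a state you can afford to lose.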