amdgpu [RX Vega 56]: ring sdma0 timeout
Submitted by Matthias Heinz
Assigned to Default DRI bug account
Link to original bug (#112242)
Description
Hi,
I reported this at bugzilla.kernel.org but didn't get any help there. Maybe nobody expects bug reports about the amdgpu driver on the kernel's bug tracker?
This started a while ago, when I updated from kernel 5.0.0 to a newer version. I'm currently on 5.3.0, and in almost every game I play I run into this problem:
Aug 24 11:13:33 egalite kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring sdma0 timeout, signaled seq=368056, emitted seq=368057
Aug 24 11:13:33 egalite kernel: [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] ERROR [CRTC:47:crtc-0] flip_done timed out
Aug 24 11:13:33 egalite kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: process 7DaysToDie.x86_ pid 8108 thread 7DaysToDie:cs0
Aug 24 11:13:33 egalite kernel: amdgpu 0000:0c:00.0: GPU reset begin!
Aug 24 11:13:33 egalite kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx timeout, but soft recovered
Only a hard reset got the machine working again after that.
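For anyone comparing their own logs against the lines above, here is a minimal sketch that pulls the ring name and the signaled/emitted sequence numbers out of journal lines in this format. The function and class names are made up for illustration, and the regular expression only assumes the exact message layout shown in this report; other kernel versions may word the message differently.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class RingTimeout(NamedTuple):
    """One 'ring <name> timeout' event (hypothetical helper type)."""
    timestamp: str
    ring: str
    signaled: int
    emitted: int


# Matches lines such as:
#   Aug 24 11:13:33 host kernel: [...] ERROR ring sdma0 timeout, signaled seq=1, emitted seq=2
# Lines without sequence numbers ("... timeout, but soft recovered") are skipped.
_TIMEOUT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]+)\s.*?"
    r"ERROR ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def find_ring_timeouts(lines: Iterable[str]) -> Iterator[RingTimeout]:
    """Yield one RingTimeout per matching log line."""
    for line in lines:
        m = _TIMEOUT_RE.search(line)
        if m:
            yield RingTimeout(
                m["ts"], m["ring"], int(m["signaled"]), int(m["emitted"])
            )
```

Feeding it the saved output of `journalctl -k` (or a kernel log file opened line by line) lists which rings timed out and how many submitted jobs had not been signaled, i.e. `emitted - signaled`.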
I captured some kernel traces, which I can copy over to this report if necessary. For now you can download them here: https://bugzilla.kernel.org/show_bug.cgi?id=204683
It also looks a bit like this bug: https://bugzilla.kernel.org/show_bug.cgi?id=201957 , because I also get the "ring gfx timeout". And there are lots and lots of people having this issue.
I tried bisecting it, but failed. Either I missed the commit that causes this, or there are multiple causes, or this really goes way back to the time when 4.18 was the base for drm-next. That base no longer compiles with modern compilers, and Steam doesn't want to run on kernels that old, so even when I managed to compile an older kernel there was no way to test it.
I even tried debugging it over ethernet (KGDBoE is a nice thing if you need performance), but somehow this slowed everything down enough to not trigger the bug.
I also tried the suggestions from https://bugs.freedesktop.org/show_bug.cgi?id=109955, but forbidding the lowest clock mode doesn't help either. (It does fix my Rocket League problems, though.)
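As a rough sketch of what "forbidding the lowest clock mode" can look like, assuming it is done through the amdgpu sysfs files: the driver lists its DPM levels in `pp_dpm_sclk` and `pp_dpm_mclk`, and once `power_dpm_force_performance_level` is set to `manual`, writing a space-separated list of level indices to one of those files restricts the GPU to those levels. Which clock domain the linked report targets is not stated here, so the helper below (a hypothetical name) only builds the index list from the file's text, and the clock values in the usage example are made up.

```python
import re


def levels_without_lowest(pp_dpm_text: str) -> str:
    """Return all DPM level indices except the lowest, space-separated.

    `pp_dpm_text` is the content of a pp_dpm_sclk or pp_dpm_mclk file,
    one "<index>: <clock>" entry per line; the active level carries a "*".
    Assumes the smallest index is the lowest clock level.
    """
    indices = sorted(
        int(m.group(1))
        for m in re.finditer(r"^\s*(\d+):", pp_dpm_text, re.MULTILINE)
    )
    return " ".join(str(i) for i in indices[1:])


# Illustrative file content only; real values depend on the card.
example = "0: 852Mhz\n1: 991Mhz *\n2: 1138Mhz\n"
mask = levels_without_lowest(example)  # "1 2"
```

The resulting string is what would be written (as root) to the chosen `pp_dpm_*` file under `/sys/class/drm/card0/device/`, after switching the performance level to `manual`; the card index is an assumption and differs between systems.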
Please advise what I should try next.
Best regards
Matthias