[amdgpu] drm:amdgpu_job_timedout ring sdma0 timeout
Hi,
I am currently using:
- 5.6.0-0.rc5.git0.2.fc32.x86_64
- Mesa 20.0.2
- Radeon RX 5500 XT
- AMD_DEBUG=nodma,nongg set in the environment
and I am observing the following error, which seems to be a common one on the issue trackers:
Mar 22 15:57:01 HOSTNAME kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Mar 22 15:57:01 HOSTNAME kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=351048, emitted seq=351050
Mar 22 15:57:01 HOSTNAME kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
Mar 22 15:57:01 HOSTNAME kernel: amdgpu 0000:12:00.0: GPU reset begin!
Mar 22 15:57:04 HOSTNAME kernel: amdgpu: [powerplay] failed send message: DisallowGfxOff (42) param: 0x00000000 response 0xffffffc2
Mar 22 15:57:04 HOSTNAME firefox[4070]: Error flushing display: Resource temporarily unavailable
followed by
Mar 22 15:57:08 HOSTNAME kernel: snd_hda_intel 0000:12:00.1: refused to change power state from D0 to D3hot
Mar 22 15:57:11 HOSTNAME kernel: amdgpu: [powerplay] Msg issuing pre-check failed and SMU may be not in the right state!
Mar 22 15:57:11 HOSTNAME kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <smu> failed -62
Mar 22 15:57:13 HOSTNAME kernel: amdgpu: [powerplay] Msg issuing pre-check failed and SMU may be not in the right state!
Mar 22 15:57:13 HOSTNAME kernel: [drm:amdgpu_device_gpu_recover.cold [amdgpu]] *ERROR* ASIC reset failed with error, -62 for drm dev, 0000:12:00.0
Mar 22 15:57:13 HOSTNAME kernel: amdgpu 0000:12:00.0: GPU reset(1) failed
Mar 22 15:57:13 HOSTNAME kernel: amdgpu 0000:12:00.0: GPU reset end with ret = -62
I have seen this issue with several kernels. So far I have tried:
- 5.4.0-rc7 from amd-mainline-dkms-5.4 branch
- 5.5.0-rc7
- 5.5.8
- 5.5.10
- 5.6.0.rc5 (directly from kernel sources)
- 5.6.0-0-rc5 (as shipped with Fedora 32)
Just having Firefox and a terminal running can be sufficient to trigger it (as I painfully found out while typing this bug report), but it happens sporadically. For all but the latest of my kernels I have been using Mesa 19.2.X; for 5.6.0-0-rc5 (as shipped with Fedora 32) I am using Mesa 20.0.2.
Following the suggestions in #934 (comment 441847), I have set the AMD_DEBUG=nodma,nongg variable, but I am still getting the sdma timeouts.
This bug seems related to
While 1. uses Mesa 19.3.2 with LLVM 9.0.1 and kernel 5.5.4 and describes sdma1 instead of sdma0, I share its observation that "even rebooting doesnt bring stuff back up, I have to power off and on to get signal out", though I have different hardware (Radeon RX 5500 XT on my side).
On the other hand, 2. has the same hardware, but the hangs I observe are unrelated to physical changes such as cables, and I am using newer kernel and Mesa versions.
System logs:
- journalctl output after running for a while: full_log_crash.txt
- journalctl output after a crash while opening this bug report (just a terminal and Firefox running): full_log_crash_short.txt (this one seems to stop prematurely)
Let me know if there is anything else I can supply. As I get the error regularly, it should be easy for me to obtain more logs.
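In case it helps with triage, the ring timeout lines can be pulled out of a journal dump with a short script. This is only a minimal sketch that assumes the message format shown in the log excerpt above. The function name `find_ring_timeouts`, the regular expression, and the `gap` field are my own illustration and not part of any amdgpu tooling. I also assume, without having checked the driver source, that the difference between the emitted and signaled sequence numbers indicates how many submissions were still outstanding.

```python
import re
import sys

# Matches kernel messages of the form seen in the excerpt above, e.g.
# "... *ERROR* ring sdma0 timeout, signaled seq=351048, emitted seq=351050"
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def find_ring_timeouts(lines):
    """Return one dict per ring timeout message found in `lines`."""
    events = []
    for line in lines:
        match = RING_TIMEOUT.search(line)
        if match is None:
            continue
        signaled = int(match.group("signaled"))
        emitted = int(match.group("emitted"))
        events.append({
            "ring": match.group("ring"),
            "signaled": signaled,
            "emitted": emitted,
            # Assumption: emitted - signaled approximates outstanding work.
            "gap": emitted - signaled,
        })
    return events


if __name__ == "__main__":
    # Reads log text from standard input and prints each timeout found.
    for event in find_ring_timeouts(sys.stdin):
        print(event)
```

Saved under a hypothetical name such as ring_timeouts.py, it could be fed the kernel messages with `journalctl -k | python3 ring_timeouts.py`.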
Best regards,
Fabian