AMD Radeon 5700 / Navi: amdgpu.gpu_recovery not working
Submitted by KLingel
Assigned to Default DRI bug account
Link to original bug (#112174)
Description
I have set "amdgpu.gpu_recovery=1" in my kernel boot parameters. When my GPU crashes, recovery does not work.
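To rule out the parameter simply not being applied, the value the loaded module actually sees can be read back. This is a minimal sketch, assuming the driver exposes the parameter under /sys/module/amdgpu/parameters/gpu_recovery (the usual place for readable module parameters). The helper names are made up for illustration.

```python
from pathlib import Path

# Assumed sysfs location of the amdgpu module parameter; it only exists
# when the amdgpu module is loaded and exposes the parameter as readable.
PARAM_PATH = Path("/sys/module/amdgpu/parameters/gpu_recovery")


def describe_gpu_recovery(raw):
    """Map the raw parameter string to a human-readable description."""
    meanings = {"1": "enabled", "0": "disabled", "-1": "auto (driver default)"}
    value = raw.strip()
    return meanings.get(value, "unknown value: " + value)


def read_gpu_recovery(path=PARAM_PATH):
    """Return a description of the current setting, or None if unreadable."""
    try:
        return describe_gpu_recovery(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print("amdgpu.gpu_recovery:", read_gpu_recovery() or "not available")
```

If this reports "enabled" and the reset still fails as in the log, the problem is in the recovery path itself and not in the boot configuration.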
Syslog:
[drm:amdgpu_dm_atomic_commit_tail [amdgpu]] ERROR Waiting for fences timed out!
[drm:amdgpu_job_timedout [amdgpu]] ERROR ring sdma0 timeout, signaled seq=1935, emitted seq=1937
[drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: process Xorg pid 1861 thread Xorg:cs0 pid 1864
amdgpu 0000:45:00.0: GPU reset begin!
[drm] ring test on 10 succeeded in 22 usecs
[drm] ring test on 10 succeeded in 29 usecs
amdgpu 0000:45:00.0: GPU reset succeeded, trying to resume
[drm] PCIE GART of 512M enabled (table at 0x00000080001E8000).
[drm] PSP is resuming...
[drm] reserve 0x7200000 from 0x81f7c00000 for PSP TMR
amdgpu: [powerplay] SMU is resuming...
amdgpu: [powerplay] SMU is resumed successfully!
[drm] kiq ring mec 2 pipe 1 q 0
[drm] ring test on 10 succeeded in 33 usecs
[drm] ring test on 10 succeeded in 8 usecs
[drm] gfx 0 ring me 0 pipe 0 q 0
[drm:gfx_v10_0_ring_test_ring [amdgpu]] ERROR amdgpu: ring 0 test failed (scratch(0xC040)=0xCAFEDEAD)
[drm:amdgpu_device_ip_resume_phase2 [amdgpu]] ERROR resume of IP block <gfx_v10_0> failed -22
amdgpu 0000:45:00.0: GPU reset(1) failed
amdgpu 0000:45:00.0: GPU reset end with ret = -22
[drm:amdgpu_job_timedout [amdgpu]] ERROR ring sdma0 timeout, signaled seq=1937, emitted seq=1937
[drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: process Xorg pid 1861 thread Xorg:cs0 pid 1864
amdgpu 0000:45:00.0: GPU reset begin!
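For anyone collecting more logs like the one above, a small script can pair each "GPU reset begin!" with its "GPU reset end with ret = ..." line and show which attempts failed or never finished. This is only a sketch: the patterns are copied from the messages in this report, other kernel versions may word them differently, and the function name is made up for illustration.

```python
import re

# Patterns taken from the messages in this report; treat them as assumptions
# for any other kernel version.
RESET_BEGIN = re.compile(r"amdgpu (\S+): GPU reset begin!")
RESET_END = re.compile(r"amdgpu (\S+): GPU reset end with ret = (-?\d+)")


def summarize_resets(lines):
    """Return one (pci_address, return_code) tuple per reset attempt.

    return_code is None when no matching "reset end" line was seen.
    """
    attempts = []
    for line in lines:
        begin = RESET_BEGIN.search(line)
        if begin:
            attempts.append([begin.group(1), None])
            continue
        end = RESET_END.search(line)
        if end and attempts:
            last = attempts[-1]
            # Only close the most recent open attempt for the same device.
            if last[1] is None and last[0] == end.group(1):
                last[1] = int(end.group(2))
    return [tuple(a) for a in attempts]


sample = [
    "amdgpu 0000:45:00.0: GPU reset begin!",
    "amdgpu 0000:45:00.0: GPU reset(1) failed",
    "amdgpu 0000:45:00.0: GPU reset end with ret = -22",
    "amdgpu 0000:45:00.0: GPU reset begin!",
]
print(summarize_resets(sample))
```

On the excerpt above this gives one attempt that ended with -22 (EINVAL on Linux) and a second attempt with no recorded end, which matches what the log shows.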
GPU recovery is really important, especially given the current stability issues with Navi.
Please fix this and enable recovery by default.