(RADV) ACO triggers crash with "high" video settings whereas LLVM is OK (replicable example provided)
System information
- OS: Ubuntu 20.10
- GPU: Navi 10 RX 5700 XT [1002:731f] (rev c1)
- Kernel version: 5.9.8-050908-generic #202011101634 SMP Wed Nov 11 00:51:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
- Mesa version: 4.6 (Compatibility Profile) Mesa 20.2.2 - kisak-mesa PPA
- Xserver version: X.Org X Server 1.20.9
- Desktop manager and compositor: GNOME 3.38
- Wine/Proton version: Proton 5.13
Describe the issue
When playing at 4K, I noticed that several games crash after a while with a "ring gfx" timeout (log below). I looked for a fast, 100% reproducible case and found one: "Trials Fusion". In-game, with the video settings at 4K resolution and the "Normal" profile (or anything lower), there is no crash. With anything higher, such as 4K "High" or "Ultra", the game crashes while loading the first level, every time.
Regression
Yes. It worked before ACO became the default backend in Mesa (I think that means before Mesa 20.2.0).
Workaround
Launch the games (or even Steam itself) with
RADV_DEBUG=llvm %command%
and the crashes are gone, at any video settings.
Maybe ACO is not stable enough to be the default backend?
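To test the same workaround outside Steam, the variable can be set in the environment of the launched process. The following is only a minimal sketch: it assumes RADV_DEBUG is read as a comma-separated option list (so any options already set are kept), and the helper names `llvm_env` and `run_with_llvm` are hypothetical, not part of Mesa or Steam.

```python
import os
import subprocess


def llvm_env(base_env=None):
    """Return a copy of the environment with 'llvm' added to RADV_DEBUG."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: RADV_DEBUG is a comma-separated list of options.
    options = [opt for opt in env.get("RADV_DEBUG", "").split(",") if opt]
    if "llvm" not in options:
        options.append("llvm")
    env["RADV_DEBUG"] = ",".join(options)
    return env


def run_with_llvm(command):
    """Run a command (list of arguments) with the LLVM backend requested."""
    return subprocess.run(command, env=llvm_env()).returncode
```

For example, `run_with_llvm(["vkcube"])` would start a Vulkan test program with the LLVM backend requested, provided that program is installed.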
dmesg log
nov. 19 19:51:16 host kernel: [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
nov. 19 19:51:16 host kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=66807, emitted seq=66809
nov. 19 19:51:16 host kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process trials_fusion.e pid 14374 thread trials_fus:cs0 pid 14456
nov. 19 19:51:16 host kernel: amdgpu 0000:28:00.0: amdgpu: GPU reset begin!
nov. 19 19:51:16 host kernel: [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
nov. 19 19:51:16 host kernel: amdgpu 0000:28:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
nov. 19 19:51:16 host kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
nov. 19 19:51:17 host kernel: amdgpu 0000:28:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
nov. 19 19:51:17 host kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
nov. 19 19:51:17 host psensor.desktop[5200]: [2020-11-19T00:51:17] [ERR] lmsensor: Cannot get value of subfeature fan1_input: Can't read.
nov. 19 19:51:17 host psensor.desktop[5200]: [2020-11-19T00:51:17] [ERR] lmsensor: Cannot get value of subfeature temp1_input: Can't read.
nov. 19 19:51:17 host psensor.desktop[5200]: [2020-11-19T00:51:17] [ERR] lmsensor: Cannot get value of subfeature temp2_input: Can't read.
nov. 19 19:51:17 host psensor.desktop[5200]: [2020-11-19T00:51:17] [ERR] lmsensor: Cannot get value of subfeature temp3_input: Can't read.
nov. 19 19:51:17 host kernel: [drm:gfx_v10_0_cp_gfx_enable.isra.0 [amdgpu]] *ERROR* failed to halt cp gfx
nov. 19 19:51:17 host kernel: [drm] free PSP TMR buffer
nov. 19 19:51:20 host kernel: amdgpu 0000:28:00.0: amdgpu: GPU reset succeeded, trying to resume
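When comparing runs with and without the workaround, it can help to pull only the ring timeout lines out of a saved log. The sketch below assumes the message format shown in the log above; the function name `find_ring_timeouts` is hypothetical, and other kernel versions may word the message differently.

```python
import re

# Matches lines such as:
#   ... ring gfx_0.0.0 timeout, signaled seq=66807, emitted seq=66809
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def find_ring_timeouts(log_text):
    """Return (ring, signaled_seq, emitted_seq) for each timeout line."""
    results = []
    for line in log_text.splitlines():
        match = RING_TIMEOUT.search(line)
        if match:
            results.append(
                (
                    match.group("ring"),
                    int(match.group("signaled")),
                    int(match.group("emitted")),
                )
            )
    return results


sample = (
    "nov. 19 19:51:16 host kernel: [drm:amdgpu_job_timedout [amdgpu]] "
    "*ERROR* ring gfx_0.0.0 timeout, signaled seq=66807, emitted seq=66809"
)
print(find_ring_timeouts(sample))  # [('gfx_0.0.0', 66807, 66809)]
```

An empty result for a session run with RADV_DEBUG=llvm, against a non-empty one for the default backend, would match the behaviour described in this report.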