RADV: ring gfx timeout leading to GPU reset
Description
I'm writing a game that uses Vulkan for rendering, and I can reliably reproduce a GPU reset in under a minute on both an RX 560 and an RX 6700 XT. The reset crashes the entire desktop environment (DE) and is only sometimes recoverable: occasionally I can restart the DE and continue, but other times the GPU loses display output or hangs completely and I have to reboot.
I've tried an older version of Ubuntu (18.04) as well as updating to the latest kernel (5.19) and the latest Mesa (22.1.7), with identical behavior. Using LLVM doesn't affect the outcome. The reset does not happen on Intel or Nvidia GPUs, or with AMDVLK.
Steps to reproduce
- Extract the attached capture (gfxrecon_capture_20220902T163743.tar.xz): tar -xf gfxrecon_capture_20220902T163743.tar.xz
- Replay it: gfxrecon-replay gfxrecon_capture_20220902T163743.gfxr
- It should crash very quickly, usually within a couple seconds; otherwise you may have to run it more than once.
System information
- OS: Ubuntu 22.04.1 LTS (Jammy)
- GPU: VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Radeon RX 550 640SP / RX 560/560X] [1002:67ff] (rev cf)
- Kernel version: Linux 5.19.5-051905-generic #202208291036 SMP PREEMPT_DYNAMIC Mon Aug 29 10:47:31 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
- Mesa version: OpenGL version string: 4.6 (Compatibility Profile) Mesa 22.1.7 - kisak-mesa PPA
- Desktop environment: ubuntu:GNOME
- Xserver version: X.Org X Server 1.21.1.3
API captures (if applicable, optional)
Here's a dump from using RADV_DEBUG=hang,llvm,checkir,info,nooutoforder: radv_dumps_11644_2022.09.02_15.33.05.tar.xz
dmesg
15:01:15 kernel: [ 510.037887] [drm:amdgpu_dm_commit_planes [amdgpu]] *ERROR* Waiting for fences timed out!
15:01:20 kernel: [ 510.037887] [drm:amdgpu_dm_commit_planes [amdgpu]] *ERROR* Waiting for fences timed out!
15:01:20 kernel: [ 515.167358] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=156162, emitted seq=156164
15:01:20 kernel: [ 515.167611] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process arena pid 9865 thread arena pid 9865
15:01:20 kernel: [ 515.167852] amdgpu 0000:41:00.0: amdgpu: GPU reset begin!
15:01:20 kernel: [ 515.646503] amdgpu: cp is busy, skip halt cp
15:01:21 kernel: [ 515.903636] amdgpu: rlc is busy, skip halt rlc
15:01:21 kernel: [ 515.904657] amdgpu 0000:41:00.0: amdgpu: BACO reset
15:01:21 kernel: [ 516.541722] amdgpu 0000:41:00.0: amdgpu: GPU reset succeeded, trying to resume
15:01:22 kernel: [ 516.869044] amdgpu 0000:41:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.0.1 test failed (-110)
15:01:22 gnome-shell[3691]: amdgpu: amdgpu_cs_query_fence_status failed.
15:01:22 gnome-shell[7008]: amdgpu: The CS has been cancelled because the context is lost.
15:01:22 kernel: [ 517.125913] amdgpu 0000:41:00.0: amdgpu: recover vram bo from shadow start
15:01:22 kernel: [ 517.125977] amdgpu 0000:41:00.0: amdgpu: recover vram bo from shadow done
15:01:22 kernel: [ 517.126019] amdgpu 0000:41:00.0: amdgpu: GPU reset(2) succeeded!
15:01:22 kernel: [ 517.127198] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
15:01:22 kernel: [ 517.130145] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
15:01:22 kernel: [ 517.130361] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
15:01:22 gnome-shell[7008]: amdgpu: The CS has been cancelled because the context is lost.
15:01:22 gnome-shell[3691]: amdgpu: The CS has been cancelled because the context is lost.
15:01:22 gnome-shell[3691]: amdgpu: The CS has been cancelled because the context is lost.
15:01:22 gnome-shell[3691]: amdgpu: amdgpu_cs_query_fence_status failed.
15:01:22 gnome-shell[3691]: message repeated 2 times: [ amdgpu: amdgpu_cs_query_fence_status failed.]
15:01:22 gnome-shell[3691]: amdgpu: The CS has been cancelled because the context is lost.
15:01:22 gnome-shell[3691]: amdgpu: The CS has been cancelled because the context is lost.
15:01:22 gnome-shell[3691]: amdgpu: amdgpu_cs_query_fence_status failed.
15:01:22 kernel: [ 517.242085] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
15:01:22 kernel: [ 517.242923] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
15:01:22 kernel: [ 517.244059] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
15:01:22 kernel: [ 517.244833] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
15:01:22 gnome-shell[3691]: amdgpu: The CS has been cancelled because the context is lost.
15:01:22 kernel: [ 517.245932] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
15:01:22 kernel: [ 517.246640] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
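Most of the log above is fallout from the reset. The line that identifies the hang is the `ring gfx timeout` error, which reports the last signaled and last emitted fence sequence numbers. For anyone comparing several runs, here is a minimal sketch for pulling that line out of saved log text. It assumes the message format shown in this report, which may differ on other kernel versions. The function name `find_ring_timeouts` is made up for illustration and is not part of any amdgpu or Mesa tool.

```python
import re

# Pattern for the amdgpu timeout message as it appears in the log above.
# Assumption: the format is taken from this report only and may vary by kernel.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def find_ring_timeouts(log_text):
    """Return a list of (ring, signaled_seq, emitted_seq) tuples found in log_text."""
    results = []
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            results.append(
                (
                    match.group("ring"),
                    int(match.group("signaled")),
                    int(match.group("emitted")),
                )
            )
    return results


if __name__ == "__main__":
    # Sample line copied from the dmesg output in this report.
    sample = (
        "15:01:20 kernel: [ 515.167358] [drm:amdgpu_job_timedout [amdgpu]] "
        "*ERROR* ring gfx timeout, signaled seq=156162, emitted seq=156164"
    )
    for ring, signaled, emitted in find_ring_timeouts(sample):
        # The difference is the number of fences emitted but not yet signaled.
        print(f"ring {ring}: {emitted - signaled} fence(s) emitted but not signaled")
```

To use it on a real system, pass in the text of a saved `dmesg` or journal dump. Lines that do not match the pattern are ignored.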