RADV 6700XT GPU hang with ACO compiler with certain games/samples

Description

Describe what you are doing, what you expect and what you're seeing instead. How frequent is the issue? Is it a one time occurrence? Does it appear multiple times but randomly? Can you easily reproduce it?

With some games and Vulkan samples, using the ACO compiler results in a GPU hang (the screen freezes and flickers a few times). I have to switch to a TTY and kill the user session to get back to the login screen. With Elden Ring, for example, when loading a save file (I tried different save files, starting in different locations), the game initially loads the file, but trying to pan the camera results in a GPU hang. With the bloom sample in this repo, there is a GPU hang after 5-10 seconds. I tested different versions of Mesa, with varying results, as described in this table.

Game/App	Mesa Version	Result
Elden Ring	21.2.2	Crash
Elden Ring	21.2.6	Crash
Elden Ring	22.0.3	Crash
Bloom	21.2.2	Success
Bloom	21.2.6	Success
Bloom	22.0.3	Crash

For both Elden Ring and the bloom sample, setting RADV_DEBUG=llvm avoids the hang, although Elden Ring then randomly exits (no core file, no errors in kernel logs) after some period of time (could be 10 minutes, could be an hour). For the bloom sample, setting RADV_DEBUG=syncshaders also appears to avoid the hang (I discovered this while trying to get a hang report and saw that no GPU hang occurred), although this doesn't work for Elden Ring.
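
For anyone retrying these configurations, here is a minimal sketch of how the runs could be scripted. It is not part of the original testing. It assumes only that RADV_DEBUG is read from the process environment as a comma-separated list of flags; the helper names `radv_env` and `run_with` are hypothetical and exist only for this illustration.

```python
import os
import subprocess


def radv_env(debug_flags, base_env=None):
    """Return a copy of the environment with RADV_DEBUG set to the given flags."""
    env = dict(os.environ if base_env is None else base_env)
    if debug_flags:
        # RADV_DEBUG takes a comma-separated list of flags.
        env["RADV_DEBUG"] = ",".join(debug_flags)
    else:
        # No flags means the default compiler path (ACO); drop any inherited value.
        env.pop("RADV_DEBUG", None)
    return env


# The configurations mentioned in this report; labels are for readability only.
CONFIGS = {
    "aco (default)": [],
    "llvm": ["llvm"],
    "syncshaders": ["syncshaders"],
    "hang": ["hang"],
}


def run_with(debug_flags, command):
    """Run `command` (a list of arguments) under the given flags, return its exit code."""
    return subprocess.run(command, env=radv_env(debug_flags)).returncode
```

For example, `run_with(CONFIGS["llvm"], ["./bloom"])` would launch the sample with the LLVM backend, where `./bloom` is a placeholder for wherever the sample binary was built.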

I do see that #6113 is very similar, but that report appears to describe a crash in Elden Ring after some amount of time, not a hang right after loading a world and panning.

Screenshots/video files

For rendering errors, attach screenshots of the problem and (if possible) of how it should look. For freezes, it may be useful to provide a screenshot of the affected game scene. Prefer screenshots over videos.

No screenshot available. I can try getting a screenshot if needed.

Log files (for system lockups / game freezes / crashes)

Backtrace (for crashes)
Output of dmesg
Hang reports: Run with RADV_DEBUG=hang and attach the files created in $HOME/radv_dumps_*/

dmesg log:

[  329.462137] ext4 filesystem being remounted at /newroot/boot supports timestamps until 2038 (0x7fffffff)
[  331.735218] ext4 filesystem being remounted at /newroot/boot supports timestamps until 2038 (0x7fffffff)
[  407.252714] [drm:amdgpu_dm_commit_planes [amdgpu]] *ERROR* Waiting for fences timed out!
[  407.508465] [drm:amdgpu_dm_commit_planes [amdgpu]] *ERROR* Waiting for fences timed out!
[  412.382489] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=608282, emitted seq=608284
[  412.382650] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process eldenring.exe pid 28007 thread eldenring.exe pid 28197
[  412.382778] amdgpu 0000:2d:00.0: amdgpu: GPU reset begin!
[  412.875897] amdgpu 0000:2d:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[  412.876022] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
[  413.169058] amdgpu 0000:2d:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[  413.169168] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[  413.462317] [drm:gfx_v10_0_cp_gfx_enable.isra.0 [amdgpu]] *ERROR* failed to halt cp gfx
[  413.476502] [drm] free PSP TMR buffer
[  413.521215] amdgpu 0000:2d:00.0: amdgpu: MODE1 reset
[  413.521218] amdgpu 0000:2d:00.0: amdgpu: GPU mode1 reset
[  413.521289] amdgpu 0000:2d:00.0: amdgpu: GPU smu mode1 reset
[  414.036790] amdgpu 0000:2d:00.0: amdgpu: GPU reset succeeded, trying to resume
[  414.036968] [drm] PCIE GART of 512M enabled (table at 0x0000008000E10000).
[  414.037070] [drm] VRAM is lost due to GPU reset!
[  414.037497] [drm] PSP is resuming...
[  414.229651] [drm] reserve 0xa00000 from 0x82fe000000 for PSP TMR
[  414.310275] amdgpu 0000:2d:00.0: amdgpu: RAS: optional ras ta ucode is not available
[  414.320803] amdgpu 0000:2d:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[  414.320805] amdgpu 0000:2d:00.0: amdgpu: SMU is resuming...
[  414.377994] amdgpu 0000:2d:00.0: amdgpu: SMU is resumed successfully!
[  414.379355] [drm] DMUB hardware initialized: version=0x02020003
[  414.693999] [drm] kiq ring mec 2 pipe 1 q 0
[  414.696104] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[  414.696390] [drm] JPEG decode initialized successfully.
[  414.696403] amdgpu 0000:2d:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[  414.696405] amdgpu 0000:2d:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[  414.696406] amdgpu 0000:2d:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[  414.696406] amdgpu 0000:2d:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[  414.696407] amdgpu 0000:2d:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[  414.696407] amdgpu 0000:2d:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[  414.696408] amdgpu 0000:2d:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[  414.696408] amdgpu 0000:2d:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[  414.696408] amdgpu 0000:2d:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[  414.696409] amdgpu 0000:2d:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[  414.696409] amdgpu 0000:2d:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[  414.696410] amdgpu 0000:2d:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[  414.696410] amdgpu 0000:2d:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 1
[  414.696411] amdgpu 0000:2d:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 1
[  414.696411] amdgpu 0000:2d:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 1
[  414.696412] amdgpu 0000:2d:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 1
[  414.699928] amdgpu 0000:2d:00.0: amdgpu: recover vram bo from shadow start
[  414.709767] amdgpu 0000:2d:00.0: amdgpu: recover vram bo from shadow done
[  414.709769] [drm] Skip scheduling IBs!
[  414.709770] [drm] Skip scheduling IBs!
[  414.709794] amdgpu 0000:2d:00.0: amdgpu: GPU reset(4) succeeded!
[  414.709796] [drm] Skip scheduling IBs!
[  414.709805] [drm] Skip scheduling IBs!
[  414.709807] [drm] Skip scheduling IBs!
[  414.709808] [drm] Skip scheduling IBs!
[  414.709810] [drm] Skip scheduling IBs!
[  414.709811] [drm] Skip scheduling IBs!
[  414.709813] [drm] Skip scheduling IBs!
[  414.709815] [drm] Skip scheduling IBs!
[  414.709815] [drm] Skip scheduling IBs!
[  414.709817] [drm] Skip scheduling IBs!
[  414.709819] [drm] Skip scheduling IBs!
... (message repeated roughly once every microsecond)
[  414.710957] [drm] Skip scheduling IBs!
[  414.710958] [drm] Skip scheduling IBs!
[  414.710959] [drm] Skip scheduling IBs!
[  414.710960] [drm] Skip scheduling IBs!
[  414.710961] [drm] Skip scheduling IBs!
[  414.710962] [drm] Skip scheduling IBs!
[  414.710963] [drm] Skip scheduling IBs!
[  414.710964] [drm] Skip scheduling IBs!
[  414.710965] [drm] Skip scheduling IBs!
[  414.710966] [drm] Skip scheduling IBs!
[  414.710967] [drm] Skip scheduling IBs!
[  414.710969] [drm] Skip scheduling IBs!
[  414.710970] [drm] Skip scheduling IBs!
[  414.710970] [drm] Skip scheduling IBs!
[  414.714968] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  414.717457] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  415.125549] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  415.125758] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  415.295924] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  415.296279] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  415.323764] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  415.324142] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  415.376176] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  415.376778] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  424.723271] amdgpu_cs_ioctl: 1 callbacks suppressed
[  424.723273] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  424.723523] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  424.732513] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  424.732668] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  424.732746] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  424.732813] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  424.732856] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  424.745759] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  424.752623] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  424.809309] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  431.738698] amdgpu_cs_ioctl: 12 callbacks suppressed
[  431.738700] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  431.742228] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
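
Most of the log above is reset noise. The two `amdgpu_job_timedout` lines name the ring that timed out and the offending process. As a rough sketch for pulling those fields out of a saved dmesg dump, the snippet below matches the message format exactly as it appears in this log. The format may differ on other kernel versions, and process names containing spaces would not be captured. The function name `summarize_hang` is hypothetical.

```python
import re

# Matches e.g. "[  412.382489] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, ..."
RING_TIMEOUT = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+\[drm:amdgpu_job_timedout \[amdgpu\]\] \*ERROR\* "
    r"ring (?P<ring>\S+) timeout, signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)
# Matches the follow-up line that identifies the process.
PROCESS_INFO = re.compile(
    r"\*ERROR\* Process information: process (?P<process>\S+) pid (?P<pid>\d+)"
)


def summarize_hang(dmesg_text):
    """Return a dict describing the first ring timeout in the log, or None if there is none."""
    summary = None
    for line in dmesg_text.splitlines():
        timeout = RING_TIMEOUT.search(line)
        if timeout and summary is None:
            summary = {
                "time": float(timeout["time"]),
                "ring": timeout["ring"],
                "signaled": int(timeout["signaled"]),
                "emitted": int(timeout["emitted"]),
            }
            continue
        process = PROCESS_INFO.search(line)
        if process and summary is not None:
            # Attach the first process line that follows the timeout, then stop.
            summary["process"] = process["process"]
            summary["pid"] = int(process["pid"])
            break
    return summary
```

Run against the log above, this should report ring gfx_0.0.0 and process eldenring.exe (pid 28007).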

Hang report for Elden Ring on 22.0.3: radv_dumps_28007_2022.05.08_22.47.18.tar.gz

Steps to reproduce

How can Mesa developers reproduce the issue? When reporting a game issue, start explaining from a fresh save file and don't assume prior knowledge of the game's story.

For Elden Ring, loading a new game and trying to pan around should cause the GPU hang as well (although I haven't tried it locally).

For bloom, just loading it and moving around a bit results in a GPU hang 5-10 seconds afterwards.

System information

Please post inxi -GSC -xx output (fenced with triple backticks) OR fill information below manually

OS: Ubuntu 21.10, with the Kisak Mesa PPA to get Mesa 22.0.3
GPU: 2d:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [Radeon RX 6700/6700 XT / 6800M] [1002:73df] (rev c1)
Kernel version: Linux saikrishna-Lemur 5.13.0-40-generic #45-Ubuntu SMP Tue Mar 29 14:48:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Mesa version: 4.6 (Compatibility Profile) Mesa 22.0.3 - kisak-mesa PPA
Desktop environment: KDE

If applicable

Xserver version: Using Wayland, but X version appears to be 1.20.13
DXVK version:
Wine/Proton version: Proton experimental (Elden Ring), None (Bloom sample)

Regression

Did it used to work in a previous Mesa version? It can greatly help to know when the issue started.

Elden Ring: Unknown with this GPU. My previous GPU (RX 580) worked fine.
Bloom: Yes, 21.2.6

API captures (if applicable, optional)

Consider recording a GFXReconstruct (preferred), RenderDoc, or apitrace capture of the issue with the RADV driver active. This can tremendously help when debugging issues, but you're still encouraged to report issues if you can't provide a capture file.

Further information (optional)

Does the issue reproduce with the LLVM backend (RADV_DEBUG=llvm) or on the AMDGPU-PRO drivers?

RADV_DEBUG=llvm does fix it for both Elden Ring and bloom, although with a potential for a random exit for Elden Ring (not sure for bloom).

Does your environment set any of the variables ACO_DEBUG, RADV_DEBUG, and RADV_PERFTEST?

Not by default.
