Metro Exodus Enhanced Edition (RT) frequent gpu hangs on "The Caspian" (desert) level
Description
First of all: phenomenal RT progress in the last few months. Metro Exodus has become smoothly playable with the latest BVH optimizations.
To the issue: when playing the game for a longer period (30-60 minutes), I fairly consistently see GPU hangs.
Tested on: kernel 6.0.15 and mesa-git (bb4aa8a3), RADV_PERFTEST=rt VKD3D_CONFIG=dxr,dxr11, game set to Ultra Preset with Medium Ray Tracing
This GPU (RX 6800) was previously affected by the amdgpu no-retry page fault issue (drm/amd#2113 (closed)), which showed similar symptoms, though the error output is different now. Can someone rule out, based on the log below alone, that this is a mesa/amdgpu issue?
I will test it with different kernel versions next.
Log files (for system lockups / game freezes / crashes)
amdgpu log (removed duplicates)
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=7635062, emitted seq=7635064
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 1854 thread gnome-shel:cs0 pid 1874
amdgpu 0000:0a:00.0: amdgpu: GPU reset begin!
amdgpu 0000:0a:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
[drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
[drm] free PSP TMR buffer
amdgpu 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0xf7e46a12700 flags=0x0010]
amdgpu 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0xf7d6fc82080 flags=0x0010]
amdgpu 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0xf7d6fc80000 flags=0x0010]
amdgpu 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0xf7de1d44200 flags=0x0010]
amdgpu 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0xf7de1d45200 flags=0x0010]
amdgpu 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0xf7de1d08a00 flags=0x0010]
amdgpu 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0xf7de1d47200 flags=0x0010]
amdgpu 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0xf7de1d0aa00 flags=0x0010]
amdgpu 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0xf7de1d46200 flags=0x0010]
amdgpu 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0xf7de1d0ba00 flags=0x0010]
amdgpu 0000:0a:00.0: amdgpu: MODE1 reset
amdgpu 0000:0a:00.0: amdgpu: GPU mode1 reset
amdgpu 0000:0a:00.0: amdgpu: GPU smu mode1 reset
AMD-Vi: IOMMU event log overflow
amdgpu 0000:0a:00.0: amdgpu: GPU reset succeeded, trying to resume
[drm] PCIE GART of 512M enabled (table at 0x0000008001300000).
[drm] VRAM is lost due to GPU reset!
[drm] PSP is resuming...
[drm] reserve 0xa00000 from 0x83fd000000 for PSP TMR
amdgpu 0000:0a:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
amdgpu 0000:0a:00.0: amdgpu: SMU is resuming...
amdgpu 0000:0a:00.0: amdgpu: smu driver if version = 0x00000040, smu fw if version = 0x00000041, smu fw program = 0, version = 0x003a5600 (58.86.0)
amdgpu 0000:0a:00.0: amdgpu: SMU driver if version not matched
amdgpu 0000:0a:00.0: amdgpu: use vbios provided pptable
amdgpu 0000:0a:00.0: amdgpu: SMU is resumed successfully!
[drm] DMUB hardware initialized: version=0x02020017
[drm] kiq ring mec 2 pipe 1 q 0
[drm] VCN decode and encode initialized successfully(under DPG Mode).
[drm] JPEG decode initialized successfully.
amdgpu 0000:0a:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
amdgpu 0000:0a:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
amdgpu 0000:0a:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
amdgpu 0000:0a:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
amdgpu 0000:0a:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
amdgpu 0000:0a:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
amdgpu 0000:0a:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
amdgpu 0000:0a:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
amdgpu 0000:0a:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
amdgpu 0000:0a:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
amdgpu 0000:0a:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
amdgpu 0000:0a:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
amdgpu 0000:0a:00.0: amdgpu: ring sdma2 uses VM inv eng 14 on hub 0
amdgpu 0000:0a:00.0: amdgpu: ring sdma3 uses VM inv eng 15 on hub 0
amdgpu 0000:0a:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 1
amdgpu 0000:0a:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 1
amdgpu 0000:0a:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 1
amdgpu 0000:0a:00.0: amdgpu: ring vcn_dec_1 uses VM inv eng 5 on hub 1
amdgpu 0000:0a:00.0: amdgpu: ring vcn_enc_1.0 uses VM inv eng 6 on hub 1
amdgpu 0000:0a:00.0: amdgpu: ring vcn_enc_1.1 uses VM inv eng 7 on hub 1
amdgpu 0000:0a:00.0: amdgpu: ring jpeg_dec uses VM inv eng 8 on hub 1
amdgpu 0000:0a:00.0: amdgpu: recover vram bo from shadow start
amdgpu 0000:0a:00.0: amdgpu: recover vram bo from shadow done
[drm] Skip scheduling IBs!
amdgpu 0000:0a:00.0: amdgpu: GPU reset(2) succeeded!
[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
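For anyone comparing this log against their own hang, here is a minimal Python sketch that pulls out the three pieces that matter most for triage: the ring that timed out, the process named on the timeout line, and the IOMMU page-fault addresses. The regular expressions are written against the message wording in the log above and are an assumption for other kernel versions; the function name `summarize_amdgpu_log` is made up for illustration.

```python
import re
from collections import Counter

# Patterns follow the message wording in the log above. Other kernel
# versions may phrase these lines differently (assumption).
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)
PROCESS_RE = re.compile(
    r"Process information: process (?P<process>\S+) pid (?P<pid>\d+)"
)
PAGE_FAULT_RE = re.compile(
    r"IO_PAGE_FAULT domain=(?P<domain>0x[0-9a-fA-F]+) "
    r"address=(?P<address>0x[0-9a-fA-F]+)"
)


def summarize_amdgpu_log(text):
    """Collect ring timeouts, named processes and page-fault addresses."""
    summary = {"timeouts": [], "processes": [], "page_faults": Counter()}
    for line in text.splitlines():
        if m := TIMEOUT_RE.search(line):
            # (ring name, last signaled fence, last emitted fence)
            summary["timeouts"].append(
                (m["ring"], int(m["signaled"]), int(m["emitted"]))
            )
        elif m := PROCESS_RE.search(line):
            summary["processes"].append((m["process"], int(m["pid"])))
        elif m := PAGE_FAULT_RE.search(line):
            # Count repeated faults per address instead of listing each one.
            summary["page_faults"][int(m["address"], 16)] += 1
    return summary
```

Fed the text of the kernel log (for example the output of `journalctl -k` or `dmesg`), the function returns a dictionary with one entry per timeout, one per named process, and a count per faulting address. Run on the excerpt above, it would report the `gfx_0.0.0` ring, the process `gnome-shell` (pid 1854), and ten distinct fault addresses.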
Steps to reproduce
The hang happens frequently and independently of level or situation, usually after 30 to 60 minutes of gameplay.
System information
System:
Host: mershl-desktop Kernel: 6.0.15-300.fc37.x86_64 arch: x86_64 bits: 64
compiler: gcc v: 2.38-25.fc37 Desktop: GNOME v: 43.2 tk: GTK v: 3.24.36
wm: gnome-shell dm: GDM Distro: Fedora release 37 (Thirty Seven)
CPU:
Info: 8-core model: AMD Ryzen 7 3700X bits: 64 type: MT MCP arch: Zen 2
rev: 0 cache: L1: 512 KiB L2: 4 MiB L3: 32 MiB
Speed (MHz): avg: 2196 high: 2268 min/max: 2200/4426 boost: enabled cores:
1: 2268 2: 2200 3: 2200 4: 2200 5: 2198 6: 2199 7: 2200 8: 2200 9: 2200
10: 2200 11: 2200 12: 2071 13: 2200 14: 2200 15: 2200 16: 2200
bogomips: 115186
Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
Graphics:
Device-1: AMD Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] driver: amdgpu
v: kernel arch: RDNA-2 pcie: speed: 16 GT/s lanes: 16 ports: active: DP-1
empty: DP-2,DP-3,HDMI-A-1 bus-ID: 0a:00.0 chip-ID: 1002:73bf
Display: wayland server: X.org v: 1.20.14 with: Xwayland v: 22.1.7
compositor: gnome-shell driver: X: loaded: amdgpu
unloaded: fbdev,modesetting,radeon,vesa dri: radeonsi gpu: amdgpu
display-ID: 0
Monitor-1: DP-1 model: Samsung C34H89x res: 3440x1440 dpi: 110
diag: 864mm (34")
API: OpenGL v: 4.6 Mesa 22.3.1 renderer: AMD Radeon RX 6800 (navi21 LLVM
15.0.6 DRM 3.48 6.0.15-300.fc37.x86_64) direct render: Yes