amdgpu: GPU reset for Radeon RX 6700XT completely broken
Summary of the problem:
If the GPU encounters any error that requires a GPU reset, the reset never succeeds. The monitor permanently loses signal and the system needs a hard reset to recover (SSH still works, but a reboot does not complete because a kernel thread hangs). This happens very often, mainly in Vulkan/DX12 games and when using GPU hardware video decoding. It also happens on Windows: the same loss of signal, and the driver shows as disabled in Device Manager after reboot.
Is this particular GPU even capable of resetting? Searching online turns up many reports for this specific card and for some other 6700XT models, including with GPU passthrough to a VM (the "vfio reset bug"), while some other 6700XT models seem to work fine.
Hardware description:
- CPU: AMD Ryzen 9 5950X
- GPU: Sapphire AMD Radeon RX 6700XT 11306-01-20G RX 6700XT Gaming NITRO+ (Micron memory, BIOS 1, BIOS 2)
- System Memory: G.SKILL DDR4 32GB (2x16GB) 3600MHz PC4-28800 TRIDENT Z NEO (F4-3600C14D-32GTZN)
- Motherboard: ASUS X570 ROG Crosshair VIII Dark Hero
- Display: Xiaomi MI 2k 27''
- Type of Display Connection: DP
System information:
- Distro name and Version: ArchLinux
- Kernel version: 6.4.3, custom 6.5.0-rc2
- AMD official driver version: N/A
- glxinfo: AMD Radeon RX 6700 XT (navi22, LLVM 15.0.7, DRM 3.52, 6.4.3-arch1-2)
- mesa: 23.1.3
How to reproduce the issue:
- cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover (fails even from a plain KMS framebuffer console, with no graphical session running)
- gfxrecon-replay gfxrecon_capture_20230718T062606.gfxr (using RADV)
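The first reproduction step can also be scripted. This is only a minimal sketch, assuming debugfs is mounted at /sys/kernel/debug, the card is DRM device 0 as in the path above, and the script runs as root. The name `trigger_gpu_recover` is made up for illustration. Reading the node is what requests the reset, so on an affected card the call is expected to end in the loss of signal described in the summary.

```python
from pathlib import Path

# Assumption: same debugfs path as in the reproduction step; adjust the
# DRM minor ("0") if the card is enumerated differently.
RECOVER_NODE = Path("/sys/kernel/debug/dri/0/amdgpu_gpu_recover")


def trigger_gpu_recover(node: Path = RECOVER_NODE) -> str:
    """Read the debugfs node and return whatever text the kernel reports.

    The read itself asks amdgpu to start a GPU recovery, so this is the
    scripted equivalent of running `cat` on the node.
    """
    return node.read_text().strip()
```

Calling `trigger_gpu_recover()` with no arguments targets the real node. Passing another path is only useful for dry runs against an ordinary file.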
Attached files:
gfxrecon_capture_20230718T062606.gfxr.gz
Log files:
[ 794.763383] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=315706, emitted seq=315708
[ 794.763663] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process bg3.exe pid 16381 thread bg3.exe pid 16381
[ 794.763920] amdgpu 0000:0c:00.0: amdgpu: GPU reset begin!
[ 795.001749] amdgpu 0000:0c:00.0: amdgpu: MODE1 reset
[ 795.001752] amdgpu 0000:0c:00.0: amdgpu: GPU mode1 reset
[ 795.001810] amdgpu 0000:0c:00.0: amdgpu: GPU smu mode1 reset
[ 806.473774] amdgpu 0000:0c:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 806.473955] [drm] PCIE GART of 512M enabled (table at 0x0000008000F00000).
[ 806.474010] [drm] VRAM is lost due to GPU reset!
[ 806.474011] [drm] PSP is resuming...
[ 813.796592] [drm:psp_v11_0_memory_training [amdgpu]] *ERROR* send training msg failed.
[ 813.796752] [drm:psp_resume [amdgpu]] *ERROR* Failed to process memory training!
[ 813.796892] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62
[ 813.797018] amdgpu 0000:0c:00.0: amdgpu: GPU reset(2) failed
[ 813.797053] [drm] Skip scheduling IBs!
[ 813.797057] [drm] Skip scheduling IBs!
[ 813.797061] [drm] Skip scheduling IBs!
[ 813.797064] [drm] Skip scheduling IBs!
[ 813.797065] [drm] Skip scheduling IBs!
[ 813.797067] [drm] Skip scheduling IBs!
[ 813.797070] [drm] Skip scheduling IBs!
[ 813.797071] [drm] Skip scheduling IBs!
[ 813.797075] [drm] Skip scheduling IBs!
[ 813.797077] [drm] Skip scheduling IBs!
[ 813.797079] [drm] Skip scheduling IBs!
[ 813.797080] [drm] Skip scheduling IBs!
[ 813.797081] [drm] Skip scheduling IBs!
[ 813.797082] [drm] Skip scheduling IBs!
[ 813.797083] [drm] Skip scheduling IBs!
[ 813.797085] [drm] Skip scheduling IBs!
[ 813.797086] [drm] Skip scheduling IBs!
[ 813.797087] [drm] Skip scheduling IBs!
[ 813.797089] [drm] Skip scheduling IBs!
[ 813.797090] [drm] Skip scheduling IBs!
[ 813.797091] [drm] Skip scheduling IBs!
[ 813.797092] [drm] Skip scheduling IBs!
[ 813.797094] [drm] Skip scheduling IBs!
[ 813.797095] [drm] Skip scheduling IBs!
[ 813.797097] [drm] Skip scheduling IBs!
[ 813.797103] [drm] Skip scheduling IBs!
[ 813.797106] [drm] Skip scheduling IBs!
[ 813.797109] [drm] Skip scheduling IBs!
[ 813.797110] [drm] Skip scheduling IBs!
[ 813.797111] [drm] Skip scheduling IBs!
[ 813.797113] [drm] Skip scheduling IBs!
[ 813.797115] [drm] Skip scheduling IBs!
[ 813.797121] [drm] Skip scheduling IBs!
[ 813.797124] [drm] Skip scheduling IBs!
[ 813.797124] [drm] Skip scheduling IBs!
[ 813.797125] [drm] Skip scheduling IBs!
[ 813.797126] [drm] Skip scheduling IBs!
[ 813.797128] [drm] Skip scheduling IBs!
[ 813.797132] [drm] Skip scheduling IBs!
[ 813.797137] [drm] Skip scheduling IBs!
[ 813.797140] [drm] Skip scheduling IBs!
[ 813.797142] [drm] Skip scheduling IBs!
[ 813.797144] [drm] Skip scheduling IBs!
[ 813.797146] [drm] Skip scheduling IBs!
[ 813.797148] [drm] Skip scheduling IBs!
[ 813.797150] [drm] Skip scheduling IBs!
[ 813.797150] [drm] Skip scheduling IBs!
[ 813.797154] [drm] Skip scheduling IBs!
[ 813.797157] [drm] Skip scheduling IBs!
[ 813.797166] [drm] Skip scheduling IBs!
[ 813.911574] snd_hda_intel 0000:0c:00.1: CORB reset timeout#2, CORBRP = 65535
[ 813.911614] amdgpu 0000:0c:00.0: amdgpu: GPU reset end with ret = -62
[ 813.911615] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -62
[ 823.980029] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=9498, emitted seq=9500
[ 823.980402] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
[ 823.980741] amdgpu 0000:0c:00.0: amdgpu: GPU reset begin!
[ 823.980998] amdgpu 0000:0c:00.0: amdgpu: Failed to disallow df cstate
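To compare logs from different kernels more easily, the reset attempts can be pulled out of a dmesg dump with a short script. This is a rough sketch, not part of the driver or any official tool: the names `ResetAttempt` and `parse_resets` are invented here, and the patterns only match the message wording seen in the log above. For reference, a return value of -62 corresponds to ETIME (timer expired) on Linux.

```python
import re
from dataclasses import dataclass, field
from typing import Iterable, List, Optional

# Patterns copied from the message wording in the log above (assumption:
# other kernel versions may phrase these lines differently).
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")
BEGIN = re.compile(r"amdgpu: GPU reset begin!")
END = re.compile(r"amdgpu: GPU reset end with ret = (-?\d+)")
ERROR = re.compile(r"\*ERROR\* (.*)")


@dataclass
class ResetAttempt:
    started: float                      # dmesg timestamp of "GPU reset begin!"
    ret: Optional[int] = None           # None if no "reset end" line was seen
    errors: List[str] = field(default_factory=list)


def parse_resets(lines: Iterable[str]) -> List[ResetAttempt]:
    """Group *ERROR* messages by the reset attempt they occurred in."""
    attempts: List[ResetAttempt] = []
    current: Optional[ResetAttempt] = None
    for line in lines:
        stamp = STAMP.match(line)
        if stamp is None:
            continue                    # not a timestamped dmesg line
        if BEGIN.search(line):
            current = ResetAttempt(started=float(stamp.group(1)))
            attempts.append(current)
            continue
        if current is None:
            continue                    # outside any reset attempt
        end = END.search(line)
        if end:
            current.ret = int(end.group(1))
            current = None
            continue
        err = ERROR.search(line)
        if err:
            current.errors.append(err.group(1))
    return attempts
```

Fed the excerpt above, for example via `parse_resets(open("dmesg.txt"))` with a hypothetical file name, it should report two attempts: the first ending with -62 after the PSP memory training errors, the second never reaching a "reset end" line.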