[navi21] GPU hang during video decoding in Firefox (ring vcn_dec_1 timeout)
I started getting periodic GPU hangs during VAAPI-accelerated video playback in Firefox.
- GPU: Sapphire Pulse RX 6800 XT
- Kernel: 5.19.8
- Mesa VAAPI: 22.2.0~rc3-1
- Firmware: latest from https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/log/amdgpu, i.e. including this commit: https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/amdgpu?id=3c1662d9a546ab01e42a34aec30ca44965fa4d7d
- Firefox: 105.0b9
Here is an example dmesg log from a recent hang:
[10422.428815] [drm] failed to load ucode VCN0_RAM(0x35)
[10422.428845] [drm] psp gfx command LOAD_IP_FW(0x6) failed and response status is (0x0)
[10430.435115] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring vcn_dec_1 timeout, signaled seq=22205, emitted seq=22208
[10430.435272] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process RDD Process pid 2591 thread firefox-bi:cs0 pid 4478
[10430.435404] amdgpu 0000:0f:00.0: amdgpu: GPU reset begin!
[10434.435382] amdgpu 0000:0f:00.0: amdgpu: failed to suspend display audio
[10434.777329] [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002
[10435.024724] [drm] Register(0) [mmUVD_RBC_RB_RPTR] failed to reach value 0x00000070 != 0x00000000
[10435.272396] [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002
[10435.525357] [drm] Register(1) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002
[10435.778083] [drm] Register(1) [mmUVD_RBC_RB_RPTR] failed to reach value 0x000000a0 != 0x00000000
[10436.042909] [drm] Register(1) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002
[10436.131193] [drm] free PSP TMR buffer
[10436.178967] CPU: 0 PID: 39695 Comm: kworker/u64:1 Not tainted 5.19.8 #1
[10436.178969] Hardware name: To Be Filled By O.E.M. X570 Taichi/X570 Taichi, BIOS P4.80 02/16/2022
[10436.178971] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
[10436.178976] Call Trace:
[10436.178977] <TASK>
[10436.178978] dump_stack_lvl+0x44/0x5c
[10436.178982] amdgpu_do_asic_reset+0x26/0x459 [amdgpu]
[10436.179088] amdgpu_device_gpu_recover_imp.cold+0x59d/0x8e0 [amdgpu]
[10436.179176] amdgpu_job_timedout+0x156/0x190 [amdgpu]
[10436.179269] ? __switch_to+0x106/0x430
[10436.179272] drm_sched_job_timedout+0x76/0x110 [gpu_sched]
[10436.179274] process_one_work+0x1c7/0x380
[10436.179276] worker_thread+0x4d/0x380
[10436.179277] ? process_one_work+0x380/0x380
[10436.179278] kthread+0xe9/0x110
[10436.179279] ? kthread_complete_and_exit+0x20/0x20
[10436.179280] ret_from_fork+0x22/0x30
[10436.179282] </TASK>
[10436.179286] amdgpu 0000:0f:00.0: amdgpu: MODE1 reset
[10436.179289] amdgpu 0000:0f:00.0: amdgpu: GPU mode1 reset
[10436.179394] amdgpu 0000:0f:00.0: amdgpu: GPU smu mode1 reset
[10436.703700] amdgpu 0000:0f:00.0: amdgpu: GPU reset succeeded, trying to resume
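For anyone comparing logs from several hangs, here is a minimal Python sketch that pulls the ring timeout lines out of saved dmesg output. The regular expression is only modelled on the `ring vcn_dec_1 timeout` line above, and the function name `find_ring_timeouts` is made up for illustration; other kernel versions may format this message differently.

```python
import re

# Pattern modelled on the timeout line in the log above (an assumption:
# other kernel versions may word or format this message differently).
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+"
    r"\[drm:amdgpu_job_timedout \[amdgpu\]\] \*ERROR\* "
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def find_ring_timeouts(log_text):
    """Return one dict per ring timeout line found in dmesg text."""
    results = []
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match is None:
            continue
        signaled = int(match.group("signaled"))
        emitted = int(match.group("emitted"))
        results.append({
            "timestamp": float(match.group("ts")),
            "ring": match.group("ring"),
            "signaled": signaled,
            "emitted": emitted,
            # Difference between the emitted and signaled sequence numbers.
            "gap": emitted - signaled,
        })
    return results
```

Passing the text of a saved `dmesg` dump to `find_ring_timeouts` gives a list that can be grouped by ring name to see whether every hang is on `vcn_dec_1`.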
Edited by Shmerl