Unrecoverable GPU crash when playing a 4K H.265 HDR10+ video
I first noticed the issue on a 6.2 kernel, and it persists on 6.3. Playing a 4K H.265 HDR10+ video file in the mpv player with hardware acceleration enabled immediately makes the entire desktop very laggy. After some time, perhaps 30 seconds at most, the GPU crashes with the backtrace listed below. The driver then tries to recover the GPU, without success, until the computer is restarted.
This appears to be a regression, because I was able to play such videos correctly in the past. A Kodi sample video that triggers the issue: https://mega.nz/file/nehDka6Z#C5_OPbSZkONdOp1jRmc09C9-viDc3zMj8ZHruHcWKyA
- Kernel 6.3 (Arch Linux)
- ThinkPad T14 Gen1 AMD, Ryzen 4750U
- Mesa 23.0.3
- mpv 0.35.1 set to use vo=gpu-next, profile=gpu-hq, hwdec=auto
dub 29 14:09:00 Sad-Silke kernel: worker_thread+0x51/0x390
dub 29 14:09:00 Sad-Silke kernel: ? __pfx_worker_thread+0x10/0x10
dub 29 14:09:00 Sad-Silke kernel: kthread+0xde/0x110
dub 29 14:09:00 Sad-Silke kernel: ? __pfx_kthread+0x10/0x10
dub 29 14:09:00 Sad-Silke kernel: ret_from_fork+0x2c/0x50
dub 29 14:09:00 Sad-Silke kernel: </TASK>
dub 29 14:09:00 Sad-Silke kernel: ---[ end trace 0000000000000000 ]---
dub 29 14:09:00 Sad-Silke kernel: ------------[ cut here ]------------
dub 29 14:09:00 Sad-Silke kernel: WARNING: CPU: 8 PID: 438 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:599 amdgpu_irq_put+0x49/0x70 [amdgpu]
dub 29 14:09:00 Sad-Silke kernel: Modules linked in: snd_seq_dummy snd_hrtimer snd_seq snd_seq_device ccm algif_aead des_generic libdes ecb md4 cmac algif_hash algif_skcipher af_alg bnep>
dub 29 14:09:00 Sad-Silke kernel: videodev snd_acp_config drm_display_helper r8169 irqbypass ipmi_devintf snd_pcm snd_soc_acpi videobuf2_common ucsi_acpi video sp5100_tco realtek mc crc>
dub 29 14:09:00 Sad-Silke kernel: CPU: 8 PID: 438 Comm: kworker/u32:8 Tainted: G W 6.3.0-arch1-1 #1 886a6cf902611b6d92bd971f7c5b1561ca9722fc
dub 29 14:09:00 Sad-Silke kernel: Hardware name: LENOVO 20UDS02D00/20UDS02D00, BIOS R1BET74W(1.43 ) 03/01/2023
dub 29 14:09:00 Sad-Silke kernel: Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
dub 29 14:09:00 Sad-Silke kernel: RIP: 0010:amdgpu_irq_put+0x49/0x70 [amdgpu]
dub 29 14:09:00 Sad-Silke kernel: Code: 48 8b 4e 10 48 83 39 00 74 2c 89 d1 48 8d 04 88 8b 08 85 c9 74 14 f0 ff 08 b8 00 00 00 00 74 05 e9 b0 14 d8 e8 e9 57 fd ff ff <0f> 0b b8 ea ff ff >
dub 29 14:09:00 Sad-Silke kernel: RSP: 0018:ffffc14a02f2fca8 EFLAGS: 00010246
dub 29 14:09:00 Sad-Silke kernel: RAX: ffff9bac50dd4960 RBX: ffff9bac5b160000 RCX: 0000000000000000
dub 29 14:09:00 Sad-Silke kernel: RDX: 0000000000000000 RSI: ffff9bac5b160c48 RDI: ffff9bac5b160000
dub 29 14:09:00 Sad-Silke kernel: RBP: ffff9bac5b160000 R08: 0000000000000000 R09: 0000000000000000
dub 29 14:09:00 Sad-Silke kernel: R10: 0000000000000001 R11: 0000000000000100 R12: 0000000000001050
dub 29 14:09:00 Sad-Silke kernel: R13: ffff9bac5b1789a0 R14: ffff9bac90fbf800 R15: 0000000000000000
dub 29 14:09:00 Sad-Silke kernel: FS: 0000000000000000(0000) GS:ffff9bb32fa00000(0000) knlGS:0000000000000000
dub 29 14:09:00 Sad-Silke kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
dub 29 14:09:00 Sad-Silke kernel: CR2: 00007fdde7832000 CR3: 000000024bc20000 CR4: 0000000000350ee0
dub 29 14:09:00 Sad-Silke kernel: Call Trace:
dub 29 14:09:00 Sad-Silke kernel: <TASK>
dub 29 14:09:00 Sad-Silke kernel: gmc_v9_0_hw_fini+0x7e/0xa0 [amdgpu b08609e9020f2bdfbda0cdb5cdd2797bbfba90bb]
dub 29 14:09:00 Sad-Silke kernel: amdgpu_device_ip_suspend_phase2+0x107/0x1a0 [amdgpu b08609e9020f2bdfbda0cdb5cdd2797bbfba90bb]
dub 29 14:09:00 Sad-Silke kernel: ? amdgpu_device_ip_suspend_phase1+0x71/0xe0 [amdgpu b08609e9020f2bdfbda0cdb5cdd2797bbfba90bb]
dub 29 14:09:00 Sad-Silke kernel: amdgpu_device_ip_suspend+0x36/0x70 [amdgpu b08609e9020f2bdfbda0cdb5cdd2797bbfba90bb]
dub 29 14:09:00 Sad-Silke kernel: amdgpu_device_pre_asic_reset+0xd3/0x2b0 [amdgpu b08609e9020f2bdfbda0cdb5cdd2797bbfba90bb]
dub 29 14:09:00 Sad-Silke kernel: amdgpu_device_gpu_recover+0x4c7/0xd80 [amdgpu b08609e9020f2bdfbda0cdb5cdd2797bbfba90bb]
dub 29 14:09:00 Sad-Silke kernel: amdgpu_job_timedout+0x18d/0x240 [amdgpu b08609e9020f2bdfbda0cdb5cdd2797bbfba90bb]
dub 29 14:09:00 Sad-Silke kernel: drm_sched_job_timedout+0x7a/0x110 [gpu_sched e7e788809fb0c001ccfa741268f1ea6d83082fdb]
dub 29 14:09:00 Sad-Silke kernel: process_one_work+0x1c8/0x3c0
dub 29 14:09:00 Sad-Silke kernel: worker_thread+0x51/0x390
dub 29 14:09:00 Sad-Silke kernel: ? __pfx_worker_thread+0x10/0x10
dub 29 14:09:00 Sad-Silke kernel: kthread+0xde/0x110
dub 29 14:09:00 Sad-Silke kernel: ? __pfx_kthread+0x10/0x10
dub 29 14:09:00 Sad-Silke kernel: ret_from_fork+0x2c/0x50
dub 29 14:09:00 Sad-Silke kernel: </TASK>
dub 29 14:09:00 Sad-Silke kernel: ---[ end trace 0000000000000000 ]---
dub 29 14:09:00 Sad-Silke kernel: amdgpu 0000:07:00.0: amdgpu: MODE2 reset
dub 29 14:09:00 Sad-Silke kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset succeeded, trying to resume
dub 29 14:09:00 Sad-Silke kernel: [drm] PCIE GART of 1024M enabled.
dub 29 14:09:00 Sad-Silke kernel: [drm] PTB located at 0x000000F41FC00000
dub 29 14:09:00 Sad-Silke kernel: [drm] PSP is resuming...
dub 29 14:09:01 Sad-Silke kernel: [drm] reserve 0x400000 from 0xf41f800000 for PSP TMR
dub 29 14:09:02 Sad-Silke kernel: amdgpu 0000:07:00.0: amdgpu: RAS: optional ras ta ucode is not available
dub 29 14:09:02 Sad-Silke kernel: amdgpu 0000:07:00.0: amdgpu: RAP: optional rap ta ucode is not available
dub 29 14:09:02 Sad-Silke kernel: [drm] psp gfx command LOAD_TA(0x1) failed and response status is (0x7)
dub 29 14:09:02 Sad-Silke kernel: [drm] psp gfx command INVOKE_CMD(0x3) failed and response status is (0x4)
dub 29 14:09:02 Sad-Silke kernel: amdgpu 0000:07:00.0: amdgpu: Secure display: Generic Failure.
dub 29 14:09:02 Sad-Silke kernel: amdgpu 0000:07:00.0: amdgpu: SECUREDISPLAY: query securedisplay TA failed. ret 0x0
dub 29 14:09:02 Sad-Silke kernel: amdgpu 0000:07:00.0: amdgpu: SMU is resuming...
dub 29 14:09:02 Sad-Silke kernel: amdgpu 0000:07:00.0: amdgpu: SMU is resumed successfully!
dub 29 14:09:02 Sad-Silke kernel: [drm] DMUB hardware initialized: version=0x01010026
dub 29 14:09:02 Sad-Silke kernel: [drm] kiq ring mec 2 pipe 1 q 0
dub 29 14:09:02 Sad-Silke kernel: amdgpu 0000:07:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
dub 29 14:09:02 Sad-Silke kernel: [drm:amdgpu_gfx_enable_kcq [amdgpu]] *ERROR* KCQ enable failed
dub 29 14:09:02 Sad-Silke kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v9_0> failed -110
dub 29 14:09:02 Sad-Silke kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset(3) failed
dub 29 14:09:02 Sad-Silke kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset end with ret = -110
dub 29 14:09:02 Sad-Silke kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -110
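For anyone trying to compare their own logs against this one, the failure lines can be picked out of a kernel log with a small filter. The following is only a rough sketch: the pattern list is an assumption based on the messages visible in the log above, and the script name `triage.py` used in the usage note is hypothetical.

```python
import re
import sys

# Patterns chosen by hand from the messages in the log above.
# This selection is an assumption, not an exhaustive list of amdgpu failures.
FAILURE_PATTERNS = re.compile(
    r"\*ERROR\*|WARNING:|GPU reset|GPU Recovery Failed|failed"
)


def failure_lines(log_text: str) -> list[str]:
    """Return the log lines that match any of the failure patterns."""
    return [
        line
        for line in log_text.splitlines()
        if FAILURE_PATTERNS.search(line)
    ]


if __name__ == "__main__":
    # Read a kernel log from standard input and print only the matching lines.
    for line in failure_lines(sys.stdin.read()):
        print(line)
```

It could be fed from the journal of the previous boot, for example with `journalctl -k -b -1 | python3 triage.py`.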