GPU hang on video playback
System information
- OS (cat /etc/os-release | grep "NAME"): Arch Linux
- GPU (lspci -nn | grep VGA or lshw -C display -numeric): 00:01.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Stoney [Radeon R2/R3/R4/R5 Graphics] [1002:98e4] (rev d2)
- Kernel version (uname -a): 5.18.9-arch1-1
- Mesa version (glxinfo -B | grep "OpenGL version string"): mesa 22.1.3
- Xserver version, if applicable (sudo X -version):
- Desktop manager and compositor: GNOME Wayland
If applicable
- DXVK version: N/A
- Wine/Proton version: N/A
Describe the issue
I can consistently reproduce the issue by seeking in an HEVC-encoded video file with the media player Clapper. It also happens with other video files, although inconsistently. Clapper uses gst-plugins-va for hardware decoding. My entire system locks up during the hang and I have to hard reboot (amdgpu.gpu_recovery=1 doesn't always help it recover).
Regression
I'm not able to downgrade mesa to check, due to an LLVM and Clang rebuild in Arch.
As far as I can tell this isn't a regression in the kernel; the lockup happens on both the 5.17.9 and 5.17 kernels for me.
Log files as attachment
This is what I see after the hang (booting with amdgpu.lockup_timeout=100 amdgpu.vm_debug=1 amdgpu.gpu_recovery=1). The entries are listed newest first:
Jul 08 09:55:31 arch kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 08 09:55:30 arch kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 08 09:55:29 arch kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 08 09:55:28 arch kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 08 09:55:27 arch kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 08 09:55:26 arch kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 08 09:55:25 arch kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 08 09:55:24 arch kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 08 09:55:23 arch kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 08 09:55:22 arch kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 08 09:55:21 arch kernel: amdgpu 0000:00:01.0: amdgpu: SRBM_SOFT_RESET=0x00040000
Jul 08 09:55:21 arch kernel: amdgpu 0000:00:01.0: amdgpu: GPU reset begin!
Jul 08 09:55:21 arch kernel: amdgpu 0000:00:01.0: amdgpu: IP block:uvd_v6_0 is hung!
Jul 08 09:55:21 arch kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
Jul 08 09:55:21 arch kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring uvd timeout, signaled seq=213, emitted seq=215
Jul 08 09:55:21 arch kernel: amdgpu 0000:00:01.0: amdgpu: GPU reset(3) succeeded!
Jul 08 09:55:21 arch kernel: amdgpu 0000:00:01.0: amdgpu: recover vram bo from shadow done
Jul 08 09:55:21 arch kernel: amdgpu 0000:00:01.0: amdgpu: recover vram bo from shadow start
Jul 08 09:55:21 arch kernel: </TASK>
Jul 08 09:55:21 arch kernel: ret_from_fork+0x22/0x30
Jul 08 09:55:21 arch kernel: ? kthread_complete_and_exit+0x20/0x20
Jul 08 09:55:21 arch kernel: kthread+0xde/0x110
Jul 08 09:55:21 arch kernel: ? rescuer_thread+0x3a0/0x3a0
Jul 08 09:55:21 arch kernel: worker_thread+0x51/0x380
Jul 08 09:55:21 arch kernel: process_one_work+0x1c7/0x380
Jul 08 09:55:21 arch kernel: drm_sched_job_timedout+0x76/0x100 [gpu_sched b54a976254cd79f6332eedc913d0037b3c33b883]
Jul 08 09:55:21 arch kernel: amdgpu_job_timedout+0x18c/0x1c0 [amdgpu c3399060640045ce33894f35f697ceceab8d3be0]
Jul 08 09:55:21 arch kernel: amdgpu_device_gpu_recover_imp.cold+0x537/0x8cc [amdgpu c3399060640045ce33894f35f697ceceab8d3be0]
Jul 08 09:55:21 arch kernel: amdgpu_do_asic_reset+0x2a/0x470 [amdgpu c3399060640045ce33894f35f697ceceab8d3be0]
Jul 08 09:55:21 arch kernel: dump_stack_lvl+0x48/0x5d
Jul 08 09:55:21 arch kernel: <TASK>
Jul 08 09:55:21 arch kernel: Call Trace:
Jul 08 09:55:21 arch kernel: Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
Jul 08 09:55:21 arch kernel: Hardware name: Acer Aspire A315-21/Squirtle_SR, BIOS V1.12 06/19/2018
Jul 08 09:55:21 arch kernel: CPU: 0 PID: 8 Comm: kworker/u8:0 Not tainted 5.18.9-arch1-1 #1 137f0035b2ece06cb65382579db27e9de66af
Jul 08 09:55:21 arch kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, giving up!!!
Jul 08 09:55:21 arch kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 08 09:55:20 arch kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 08 09:55:19 arch kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 08 09:55:18 arch kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 08 09:55:17 arch kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 08 09:55:16 arch kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 08 09:55:15 arch kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 08 09:55:14 arch kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 08 09:55:13 arch kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 08 09:55:11 arch kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 08 09:55:10 arch kernel: amdgpu 0000:00:01.0: amdgpu: SRBM_SOFT_RESET=0x00040000
Jul 08 09:55:10 arch kernel: amdgpu 0000:00:01.0: amdgpu: GPU reset begin!
Jul 08 09:55:10 arch kernel: amdgpu 0000:00:01.0: amdgpu: IP block:uvd_v6_0 is hung!
Jul 08 09:55:10 arch kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
Jul 08 09:55:10 arch kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring uvd timeout, signaled seq=209, emitted seq=211
Jul 08 09:55:10 arch kernel: amdgpu 0000:00:01.0: amdgpu: GPU reset(2) succeeded!
Jul 08 09:55:10 arch kernel: amdgpu 0000:00:01.0: amdgpu: recover vram bo from shadow done
Jul 08 09:55:10 arch kernel: amdgpu 0000:00:01.0: amdgpu: recover vram bo from shadow start
Jul 08 09:55:10 arch kernel: </TASK>
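To make the reset loop in the log above easier to read at a glance, here is a minimal sketch that pulls out the ring timeouts, the GPU reset counter, and the number of VCPU reset retries. The function name summarize and the "pending" field are made up for illustration; the sample lines are copied from the log above, and reading emitted minus signaled as "fences still outstanding" is my interpretation, not something the driver prints.

```python
import re

# A few lines copied from the log above (newest first, as in the report).
SAMPLE = """\
Jul 08 09:55:22 arch kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jul 08 09:55:21 arch kernel: amdgpu 0000:00:01.0: amdgpu: IP block:uvd_v6_0 is hung!
Jul 08 09:55:21 arch kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring uvd timeout, signaled seq=213, emitted seq=215
Jul 08 09:55:21 arch kernel: amdgpu 0000:00:01.0: amdgpu: GPU reset(3) succeeded!
Jul 08 09:55:10 arch kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring uvd timeout, signaled seq=209, emitted seq=211
Jul 08 09:55:10 arch kernel: amdgpu 0000:00:01.0: amdgpu: GPU reset(2) succeeded!
"""

TIMEOUT_RE = re.compile(r"ring (\w+) timeout, signaled seq=(\d+), emitted seq=(\d+)")
RESET_RE = re.compile(r"GPU reset\((\d+)\) succeeded")
VCPU_RETRY = "UVD not responding, trying to reset the VCPU"


def summarize(log_text):
    """Collect ring timeouts, reset counters and VCPU retry count from log text."""
    timeouts, resets, vcpu_retries = [], [], 0
    for line in log_text.splitlines():
        m = TIMEOUT_RE.search(line)
        if m:
            signaled, emitted = int(m.group(2)), int(m.group(3))
            timeouts.append({
                "ring": m.group(1),
                "signaled": signaled,
                "emitted": emitted,
                # Assumption: emitted - signaled = fences not yet signaled.
                "pending": emitted - signaled,
            })
        m = RESET_RE.search(line)
        if m:
            resets.append(int(m.group(1)))
        if VCPU_RETRY in line:
            vcpu_retries += 1
    return {"timeouts": timeouts, "resets": resets, "vcpu_retries": vcpu_retries}


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

The same function could be fed the output of journalctl -k for the affected boot instead of the embedded sample.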
This is during the system lockup. When I try to kill -9 clapper, a defunct process keeps running:
Jul 08 01:37:19 arch kernel: R13: 0000000000000021 R14: 00007f5a389d9910 R15: 00007f5a389d9640
Jul 08 01:37:19 arch kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
Jul 08 01:37:19 arch kernel: RBP: 00007ffee8b94f30 R08: 0000000000000000 R09: 00000000ffffffff
Jul 08 01:37:19 arch kernel: RDX: 0000000000000021 RSI: 0000000000000109 RDI: 00007f5a389d9910
Jul 08 01:37:19 arch kernel: RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00007f5a44866e1d
Jul 08 01:37:19 arch kernel: RSP: 002b:00007ffee8b94f00 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
Jul 08 01:37:19 arch kernel: RIP: 0033:0x7f5a44866e1d
Jul 08 01:37:19 arch kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
Jul 08 01:37:19 arch kernel: ? do_syscall_64+0x6b/0x90
Jul 08 01:37:19 arch kernel: ? syscall_exit_to_user_mode+0x26/0x50
Jul 08 01:37:19 arch kernel: ? do_syscall_64+0x6b/0x90
Jul 08 01:37:19 arch kernel: ? syscall_exit_to_user_mode+0x26/0x50
Jul 08 01:37:19 arch kernel: ? do_syscall_64+0x6b/0x90
Jul 08 01:37:19 arch kernel: ? syscall_exit_to_user_mode+0x26/0x50
Jul 08 01:37:19 arch kernel: do_syscall_64+0x6b/0x90
Jul 08 01:37:19 arch kernel: syscall_exit_to_user_mode+0x26/0x50
Jul 08 01:37:19 arch kernel: exit_to_user_mode_prepare+0xd3/0x140
Jul 08 01:37:19 arch kernel: arch_do_signal_or_restart+0x48/0x760
Jul 08 01:37:19 arch kernel: get_signal+0x986/0x990
Jul 08 01:37:19 arch kernel: do_group_exit+0x31/0xa0
Jul 08 01:37:19 arch kernel: do_exit+0x337/0xac0
Jul 08 01:37:19 arch kernel: task_work_run+0x60/0x90
Jul 08 01:37:19 arch kernel: __fput+0x89/0x240
Jul 08 01:37:19 arch kernel: drm_release+0x69/0x110
Jul 08 01:37:19 arch kernel: drm_file_free.part.0+0x204/0x250
Jul 08 01:37:19 arch kernel: amdgpu_driver_postclose_kms+0x7d/0x2e0 [amdgpu c3399060640045ce33894f35f697ceceab8d3be0]
Jul 08 01:37:19 arch kernel: amdgpu_uvd_free_handles+0xc5/0x130 [amdgpu c3399060640045ce33894f35f697ceceab8d3be0]
Jul 08 01:37:19 arch kernel: dma_fence_wait_timeout+0xe4/0x100
Jul 08 01:37:19 arch kernel: ? __bpf_trace_dma_fence+0x10/0x10
Jul 08 01:37:19 arch kernel: dma_fence_default_wait+0x1d0/0x270
Jul 08 01:37:19 arch kernel: schedule_timeout+0x119/0x150
Jul 08 01:37:19 arch kernel: schedule+0x4f/0xb0
Jul 08 01:37:19 arch kernel: __schedule+0x37c/0x11f0
Jul 08 01:37:19 arch kernel: <TASK>
Jul 08 01:37:19 arch kernel: Call Trace:
Jul 08 01:37:19 arch kernel: task:com.github.rafo state:D stack: 0 pid: 4228 ppid: 4227 flags:0x00004002
Jul 08 01:37:19 arch kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 08 01:37:19 arch kernel: Not tainted 5.18.9-arch1-1 #1
Jul 08 01:37:19 arch kernel: INFO: task com.github.rafo:4228 blocked for more than 491 seconds.
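For anyone triaging the hung-task report above, a rough sketch of how the blocked task and the driver frames in its trace could be extracted. The function name parse_hung_task is hypothetical and the regular expressions are only matched against the line shapes seen in this report. Frames are returned in the order they appear in the text, which is newest first here and therefore not call order.

```python
import re

# A few lines copied from the hung-task trace above.
SAMPLE = """\
Jul 08 01:37:19 arch kernel: amdgpu_driver_postclose_kms+0x7d/0x2e0 [amdgpu c3399060640045ce33894f35f697ceceab8d3be0]
Jul 08 01:37:19 arch kernel: amdgpu_uvd_free_handles+0xc5/0x130 [amdgpu c3399060640045ce33894f35f697ceceab8d3be0]
Jul 08 01:37:19 arch kernel: dma_fence_wait_timeout+0xe4/0x100
Jul 08 01:37:19 arch kernel: INFO: task com.github.rafo:4228 blocked for more than 491 seconds.
"""

HUNG_RE = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")
# Matches only frames annotated with a module, e.g. "func+0x7d/0x2e0 [amdgpu ...]".
FRAME_RE = re.compile(r"kernel: (?:\? )?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[(\w+)")


def parse_hung_task(log_text):
    """Return the blocked task's name, pid, blocked seconds and module frames."""
    result = {"task": None, "pid": None, "seconds": None, "module_frames": []}
    for line in log_text.splitlines():
        m = HUNG_RE.search(line)
        if m:
            result["task"] = m.group(1)
            result["pid"] = int(m.group(2))
            result["seconds"] = int(m.group(3))
            continue
        m = FRAME_RE.search(line)
        if m:
            # (function name, kernel module) in text order.
            result["module_frames"].append((m.group(1), m.group(2)))
    return result


if __name__ == "__main__":
    print(parse_hung_task(SAMPLE))
```

On the trace above this picks out the two amdgpu frames around the fence wait, amdgpu_uvd_free_handles and amdgpu_driver_postclose_kms, which is where the process stays blocked after kill -9.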
Screenshots/video files (if applicable)
The video file is large and I can't really share it.
I can also reproduce the issue with the following video: https://www.libde265.org/hevc-bitstreams/tos-1720x720-cfg01.mkv
Any extra information would be greatly appreciated
N/A