GPU Lockup on Radeon Pitcairn - VAAPI related
Brief summary of the problem:
VA-API initialization appears to cause a GPU lockup. It is unrecoverable and requires a hard reboot.
I have chromium set up to use VA-API decoding, and I'm also developing an application that uses libavcodec's VA-API support. The lockup is fairly reproducible across all the recent kernels I've used in the last few months; I can typically trigger it within a day. It seems tied to VA-API initialization, in a sequence like: open chromium, watch youtube, run another application that uses VA-API, close chromium, and repeat. When VA-API initializes there is a good chance the GPU locks up and the system becomes unresponsive (I can't ssh in, but I can tell some USB devices are not 100% dead). This does not happen if I just open and close one application; it might matter that two different applications use VA-API at the same time.
I'm reporting here because apparently it's a better place than kernel.org, but the same issue is also reported there: https://bugzilla.kernel.org/show_bug.cgi?id=217110
Hardware description:
- CPU: i7-4930K
- GPU: VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Curacao XT / Trinidad XT [Radeon R7 370 / R9 270X/370X] [1002:6810]
- System Memory: 48GB
- Display(s): 2x@4K, 1x@1080p
- Type of Display Connection: 2x@DP, 1x@HDMI
System information:
- Distro name and Version: Slackware-Current
- Kernel version: Linux caterpillar 6.1.10 #1 SMP PREEMPT_DYNAMIC Mon Feb 6 14:14:20 CST 2023 x86_64 Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz GenuineIntel GNU/Linux
- Custom kernel: N/A - (Slackware/huge)
- AMD official driver version: N/A
- VA-API version: 1.17 (libva 2.17.1)
- Driver version: Mesa Gallium driver 22.3.4 for PITCAIRN (, LLVM 14.0.6, DRM 2.50, 6.1.10)
How to reproduce the issue:
- Open chromium
- Watch youtube
- Run another application that uses VA-API
- Close chromium
- Repeat
When VA-API initializes there is a good chance the GPU locks up and the system becomes unresponsive.
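The steps above are manual, so here is a rough sketch of how the "two VA-API clients initializing at once" part could be automated. It is untested against this lockup and makes several assumptions: `vainfo` stands in for a real VA-API client, the helper names (`init_once`, `run_cycle`, `repeat`) are made up for illustration, and it does not reproduce the chromium/youtube part of the sequence at all.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def init_once(cmd, timeout_s=30.0):
    """Run one client to completion; return its exit code, or None if it
    timed out or could not be started."""
    try:
        return subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        ).returncode
    except (subprocess.TimeoutExpired, OSError):
        return None


def run_cycle(cmd, clients=2, timeout_s=30.0):
    """Start `clients` copies of `cmd` at roughly the same time and
    collect their exit codes."""
    with ThreadPoolExecutor(max_workers=clients) as pool:
        futures = [pool.submit(init_once, cmd, timeout_s) for _ in range(clients)]
        return [f.result() for f in futures]


def repeat(cmd, cycles=100, clients=2):
    """Run several cycles back to back, printing the exit codes of each."""
    for cycle in range(1, cycles + 1):
        codes = run_cycle(cmd, clients=clients)
        print(f"cycle {cycle}: exit codes {codes}", flush=True)
```

Calling `repeat(["vainfo"], cycles=100)` would run 100 cycles of two concurrent `vainfo` processes. Running it with netconsole or a serial console attached is advisable, since a lockup would take the machine down with it.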
Attached files:
Log files (for system lockups / game freezes / crashes)
Note: this message showed up hours before the lockup. I don't think it's related (it happens on screen wake):
[drm:si_dpm_set_power_state [radeon]] *ERROR* si_restrict_performance_levels_before_switch failed
This is what I get when the system crashes:
radeon 0000:01:00.0: ring 3 stalled for more than 10080msec
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: GPU lockup (current fence id 0x00000000001a67f9 last fence id 0x00000000001a6905 on ring 3)
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: ring 0 stalled for more than 10324msec
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000008a6c32e last fence id 0x0000000008a6c366 on ring 0)
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: ring 5 stalled for more than 10080msec
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: GPU lockup (current fence id 0x00000000000187a2 last fence id 0x00000000000187a3 on ring 5)
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: Saved 4561 dwords of commands on ring 0.
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: GPU softreset: 0x000003CC
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: GRBM_STATUS = 0xA0003028
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: GRBM_STATUS_SE0 = 0x00000006
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: GRBM_STATUS_SE1 = 0x00000006
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: SRBM_STATUS = 0x20024FC0
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: SRBM_STATUS2 = 0x00000000
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00000802
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: R_008680_CP_STAT = 0x800000E3
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: R_00D034_DMA_STATUS_REG = 0x44CFC046
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: R_00D834_DMA_STATUS_REG = 0x44C83D57
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: Wait for MC idle timedout !
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: GRBM_SOFT_RESET=0x0000DDFF
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00128500
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: GRBM_STATUS = 0x00003028
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: GRBM_STATUS_SE0 = 0x00000006
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: GRBM_STATUS_SE1 = 0x00000006
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: SRBM_STATUS = 0x200006C0
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: SRBM_STATUS2 = 0x00000000
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00000000
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: R_008680_CP_STAT = 0x00000000
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: R_00D834_DMA_STATUS_REG = 0x44C83D57
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: GPU reset succeeded, trying to resume
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
[drm:atom_op_jump [radeon]] *ERROR* atombios stuck in loop for more than 5secs aborting
[drm:atom_execute_table_locked [radeon]] *ERROR* atombios stuck executing C021 (len 254, WS 0, PS 4) @ 0xC04B
[drm:atom_execute_table_locked [radeon]] *ERROR* atombios stuck executing B6E3 (len 94, WS 12, PS 8) @ 0xB72C
[drm] PCIE gen 3 link speeds already enabled
radeon 0000:01:00.0: Wait for MC idle timedout !
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: Wait for MC idle timedout !
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
[drm] PCIE GART of 2048M enabled (table at 0x00000000001D6000).
radeon 0000:01:00.0: WB enabled
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000100000c00
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000100000c04
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000100000c08
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000100000c0c
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000100000c10
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
radeon 0000:01:00.0: failed VCE resume (-22).
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
debugfs: File 'radeon_ring_gfx' in directory '0' already present!
debugfs: File 'radeon_ring_cp1' in directory '0' already present!
debugfs: File 'radeon_ring_cp2' in directory '0' already present!
debugfs: File 'radeon_ring_dma1' in directory '0' already present!
debugfs: File 'radeon_ring_dma2' in directory '0' already present!
[drm:r600_ring_test [radeon]] *ERROR* radeon: ring 0 test failed (scratch(0x850C)=0xCAFEDEAD)
[drm:si_resume [radeon]] *ERROR* si startup failed on resume
[drm:si_dpm_set_power_state [radeon]] *ERROR* si_set_sw_state failed
[drm:radeon_vce_get_create_msg [radeon]] *ERROR* radeon: failed to schedule ib (-12).
[drm:radeon_vce_ib_test [radeon]] *ERROR* radeon: failed to get create msg (-12).
[drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed testing IB on ring 6 (-12).
radeon 0000:01:00.0: scheduling IB failed (-12).
SUBSYSTEM=pci
DEVICE=+pci:0000:01:00.0
[drm:radeon_vce_get_create_msg [radeon]] *ERROR* radeon: failed to schedule ib (-12).
[drm:radeon_vce_ib_test [radeon]] *ERROR* radeon: failed to get create msg (-12).
[drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed testing IB on ring 7 (-12).
------------[ cut here ]------------
WARNING: CPU: 7 PID: 21050 at drivers/gpu/drm/radeon/radeon_object.c:62 radeon_ttm_bo_destroy+0xd2/0xe0 [radeon]
Modules linked in: cdc_acm netconsole fuse tun it87 hwmon_vid ip6table_filter ip6_tables iptable_filter ip_tables x_tables autofs4 efivarfs cfg80211 rfkill 8021q garp mrp bridge stp llc ipv6 amdgpu iommu_v2 gpu_sched drm_buddy ir_rc6_decoder rc_rc6_mce hid_picolcd lcd snd_usb_audio snd_usbmidi_lib ftdi_sio snd_rawmidi usbserial snd_seq_device usblp joydev hid_generic uas usbhid usb_storage hid rc_hauppauge ir_kbd_i2c ivtv_alsa rc_core tuner_simple tuner_types intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp wm8775 mxm_wmi evdev coretemp tuner snd_hda_codec_realtek radeon snd_hda_codec_generic ledtrig_audio video drm_ttm_helper snd_hda_codec_hdmi ttm kvm_intel drm_display_helper snd_hda_intel snd_intel_dspcfg cx25840 drm_kms_helper snd_intel_sdw_acpi snd_hda_codec kvm ivtv drm snd_hda_core irqbypass cx2341x crct10dif_pclmul crc32_pclmul tveeprom polyval_clmulni polyval_generic snd_hwdep ghash_clmulni_intel agpgart videodev sha512_4,1440,265142482775,-,ncfrag=965/983
snd_pcm syscopyarea i2c_i801 xhci_pci mc snd_timer xhci_pci_renesas sysfillrect rapl i2c_smbus i2c_algo_bit sysimgblt mei_me snd lpc_ich ehci_pci ioatdma intel_cstate xhci_hcd i2c_core mfd_core e1000e soundcore ehci_hcd mei dca wmi button loop
CPU: 7 PID: 21050 Comm: chromium Not tainted 6.1.10 #1
Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./X79-UP4, BIOS F7 03/19/2014
RIP: 0010:radeon_ttm_bo_destroy+0xd2/0xe0 [radeon]
Code: 00 00 00 74 0f 48 8b b3 a8 01 00 00 48 89 df e8 84 6f 7b ff 48 89 df e8 bc 66 7a ff 4c 89 e7 5b 5d 41 5c 41 5d e9 9e de 1a da <0f> 0b eb cd 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 81 bf
RSP: 0018:ffffb74a83693bd0 EFLAGS: 00010297
RAX: ffff95c80b202668 RBX: ffff95c80b202478 RCX: 000000008020001f
RDX: ffff95c7b667ad80 RSI: 0000000000000001 RDI: ffff95c7471f9cc8
RBP: ffffffffffffffff R08: 0000000000000000 R09: ffffffffc0a50201
R10: ffff95c766c04688 R11: 0000000000000000 R12: ffff95c80b202400
drm_file_free.part.0+0x1d9/0x2b0 [drm]
drm_release+0x64/0xd0 [drm]
__fput+0x89/0x250
R13: 0000000000000001 R14: ffff95c7b7db7a40 R15: ffff95cbe6f74028
task_work_run+0x59/0x90
FS: 0000000000000000(0000) GS:ffff95d24fbc0000(0000) knlGS:0000000000000000
do_exit+0x335/0xa80
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055bdabf78210 CR3: 000000010960e004 CR4: 00000000001706e0
Call Trace:
<TASK>
radeon_bo_unref+0x1a/0x30 [radeon]
radeon_gem_object_free+0x30/0x50 [radeon]
drm_gem_object_release_handle+0x50/0x60 [drm]
? drm_gem_object_handle_put_unlocked+0xf0/0xf0 [drm]
idr_for_each+0x5e/0xe0
? ktime_get_mono_fast_ns+0x3d/0x90
drm_gem_release+0x1c/0x30 [drm]
? radeon_gem_wait_idle_ioctl+0xb4/0x100 [radeon]
do_group_exit+0x2d/0x80
get_signal+0x953/0x960
? radeon_gem_busy_ioctl+0xb0/0xb0 [radeon]
arch_do_signal_or_restart+0x30/0x710
? __rseq_handle_notify_resume+0xa6/0x480
? __pm_runtime_suspend+0x6a/0x100
exit_to_user_mode_prepare+0xca/0x190
syscall_exit_to_user_mode+0x1d/0x40
do_syscall_64+0x46/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f20bcad84e8
Code: Unable to access opcode bytes at 0x7f20bcad84be.
RSP: 002b:00007ffd48fb2f18 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: fffffffffffffff5 RBX: 000028c403117180 RCX: 00007f20bcad84e8
RDX: 00007ffd48fb2f68 RSI: 0000000040086464 RDI: 0000000000000017
RBP: 00007ffd48fb2f68 R08: 0000000000000000 R09: ffffffffffffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000040086464
R13: 0000000000000017 R14: 000028c400226e00 R15: 0000000000000000
</TASK>
---[ end trace 0000000000000000 ]---
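To make the three "GPU lockup" lines in the log above easier to compare, here is a small sketch that pulls out the ring number and the gap between the two fence ids. Two assumptions are mine and not from the log: that "current" is the last signaled fence and "last" is the last emitted one (so the difference is the number of pending fences), and that ring 5 is UVD and rings 6/7 are VCE, which is my reading of the radeon driver. Rings 0 to 4 follow the order of the debugfs file names printed in the log.

```python
import re

# Assumption: names for rings 5-7 come from my reading of the radeon driver;
# rings 0-4 follow the order of the debugfs file names in the log.
RING_NAMES = {
    0: "gfx", 1: "cp1", 2: "cp2", 3: "dma1", 4: "dma2",
    5: "uvd", 6: "vce1", 7: "vce2",
}

LOCKUP_RE = re.compile(
    r"GPU lockup \(current fence id (0x[0-9a-fA-F]+) "
    r"last fence id (0x[0-9a-fA-F]+) on ring (\d+)\)"
)


def parse_lockups(log_text):
    """Return (ring, name, pending) for every 'GPU lockup' line in log_text."""
    results = []
    for match in LOCKUP_RE.finditer(log_text):
        current = int(match.group(1), 16)  # assumed: last signaled fence
        last = int(match.group(2), 16)     # assumed: last emitted fence
        ring = int(match.group(3))
        results.append((ring, RING_NAMES.get(ring, "unknown"), last - current))
    return results


# The three lockup lines copied from the log above.
SAMPLE = """\
radeon 0000:01:00.0: GPU lockup (current fence id 0x00000000001a67f9 last fence id 0x00000000001a6905 on ring 3)
radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000008a6c32e last fence id 0x0000000008a6c366 on ring 0)
radeon 0000:01:00.0: GPU lockup (current fence id 0x00000000000187a2 last fence id 0x00000000000187a3 on ring 5)
"""

for ring, name, pending in parse_lockups(SAMPLE):
    print(f"ring {ring} ({name}): {pending} fences pending")
```

Under those assumptions, ring 3 has 268 fences pending, ring 0 has 56, and ring 5 has a single one outstanding.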
vainfo:
$ vainfo -a
Trying display: wayland
Trying display: x11
vainfo: VA-API version: 1.17 (libva 2.17.1)
vainfo: Driver version: Mesa Gallium driver 22.3.4 for PITCAIRN (, LLVM 14.0.6, DRM 2.50, 6.1.10)
vainfo: Supported config attributes per profile/entrypoint pair
VAProfileMPEG2Simple/VAEntrypointVLD
VAConfigAttribRTFormat : VA_RT_FORMAT_YUV420
VA_RT_FORMAT_YUV422
VAProfileMPEG2Main/VAEntrypointVLD
VAConfigAttribRTFormat : VA_RT_FORMAT_YUV420
VA_RT_FORMAT_YUV422
VAProfileVC1Simple/VAEntrypointVLD
VAConfigAttribRTFormat : VA_RT_FORMAT_YUV420
VA_RT_FORMAT_YUV422
VAProfileVC1Main/VAEntrypointVLD
VAConfigAttribRTFormat : VA_RT_FORMAT_YUV420
VA_RT_FORMAT_YUV422
VAProfileVC1Advanced/VAEntrypointVLD
VAConfigAttribRTFormat : VA_RT_FORMAT_YUV420
VA_RT_FORMAT_YUV422
VAProfileH264ConstrainedBaseline/VAEntrypointVLD
VAConfigAttribRTFormat : VA_RT_FORMAT_YUV420
VA_RT_FORMAT_YUV422
VAProfileH264ConstrainedBaseline/VAEntrypointEncSlice
VAConfigAttribRTFormat : VA_RT_FORMAT_YUV420
VAConfigAttribRateControl : VA_RC_CBR
VA_RC_VBR
VA_RC_CQP
VAConfigAttribEncPackedHeaders : VA_ENC_PACKED_HEADER_NONE
VAConfigAttribEncMaxRefFrames : l0=1
l1=0
VAConfigAttribEncMaxSlices : 1
VAProfileH264Main/VAEntrypointVLD
VAConfigAttribRTFormat : VA_RT_FORMAT_YUV420
VA_RT_FORMAT_YUV422
VAProfileH264Main/VAEntrypointEncSlice
VAConfigAttribRTFormat : VA_RT_FORMAT_YUV420
VAConfigAttribRateControl : VA_RC_CBR
VA_RC_VBR
VA_RC_CQP
VAConfigAttribEncPackedHeaders : VA_ENC_PACKED_HEADER_NONE
VAConfigAttribEncMaxRefFrames : l0=1
l1=0
VAConfigAttribEncMaxSlices : 1
VAProfileH264High/VAEntrypointVLD
VAConfigAttribRTFormat : VA_RT_FORMAT_YUV420
VA_RT_FORMAT_YUV422
VAProfileH264High/VAEntrypointEncSlice
VAConfigAttribRTFormat : VA_RT_FORMAT_YUV420
VAConfigAttribRateControl : VA_RC_CBR
VA_RC_VBR
VA_RC_CQP
VAConfigAttribEncPackedHeaders : VA_ENC_PACKED_HEADER_NONE
VAConfigAttribEncMaxRefFrames : l0=1
l1=0
VAConfigAttribEncMaxSlices : 1
VAProfileNone/VAEntrypointVideoProc
VAConfigAttribRTFormat : VA_RT_FORMAT_YUV420
VA_RT_FORMAT_YUV420_10
VA_RT_FORMAT_RGB32
VA_RT_FORMAT_YUV420_10BPP