Navi 22 - Occasional vcn timeouts when VAAPI encoding and decoding simultaneously
Brief summary of the problem:
When recording with OBS using any VAAPI encoder and playing media with VAAPI decoding (Firefox, mpv), I will occasionally get either a vcn_enc_0.0 timeout from OBS, or a vcn_dec_0 timeout from whatever application was playing.
Hardware description:
- CPU: R9 5900X
- GPU:
06:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [Radeon RX 6700/6700 XT/6750 XT / 6800M/6850M XT] [1002:73df] (rev c1)
(PowerColor RX 6700 XT Red Devil)
- System Memory: 32GB DDR4 3600
- Display(s): Asus XG27AQ (2560x1440 144hz VRR), Asus VS239 (1920x1080 60hz)
- Type of Display Connection: Displayport and HDMI respectively
System information:
- Distro name and Version: Fedora 38 KDE (Wayland)
- Kernel version: 6.2.15-301.fsync.fc38.x86_64
- Custom kernel: https://copr.fedorainfracloud.org/coprs/sentry/kernel-fsync/
- AMD official driver version: N/A
How to reproduce the issue:
- Play a YouTube video or Twitch stream in a browser with hardware acceleration, or a video in mpv with VAAPI decoding enabled. The format (h264 or h265) doesn't seem to matter.
- Record with OBS using a VAAPI encoder (built-in ffmpeg VAAPI, obs-gstreamer VAAPI, or obs-vaapi). I use obs-vaapi's h265 option and haven't tested this issue against h264 yet.
- Either OBS will throw a vcn_enc_0.0 timeout, or whatever is playing media will throw a vcn_dec_0 timeout. The other application will then coredump, along with some others. The screen will freeze, sound will continue, and trying to change TTYs won't show anything. The monitors may go to sleep.
The rate of occurrence seems random. It can happen anywhere from 3 minutes to over an hour into a recording.
If amdgpu.gpu_recovery=0 is set, timeouts from both applications appear in the log, in either order. An example with OBS first:
May 10 13:24:30 system kernel: [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002
May 10 13:24:41 system kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring vcn_enc_0.0 timeout, signaled seq=2506271, emitted seq=2506273
May 10 13:24:41 system kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process obs pid 253158 thread obs:cs0 pid 253167
May 10 13:24:41 system kernel: amdgpu 0000:06:00.0: amdgpu: GPU recovery disabled.
May 10 13:24:41 system kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring vcn_dec_0 timeout, signaled seq=1049847, emitted seq=1049850
May 10 13:24:41 system kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process RDD Process pid 249039 thread firefox:cs0 pid 249129
May 10 13:24:41 system kernel: amdgpu 0000:06:00.0: amdgpu: GPU recovery disabled.
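For anyone comparing their own journal against the excerpt above, here is a minimal Python sketch that pulls the ring name and sequence numbers out of the timeout lines. The function name `find_ring_timeouts` and the `pending` field are hypothetical, made up for illustration. Reading `emitted - signaled` as a rough count of submitted-but-unfinished jobs is an assumption, not something the driver documents in this log.

```python
import re

# Matches amdgpu ring timeout lines in the form shown in the excerpt above.
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def find_ring_timeouts(lines):
    """Return one dict per ring-timeout line found in `lines`."""
    hits = []
    for line in lines:
        match = RING_TIMEOUT.search(line)
        if match is None:
            continue
        signaled = int(match.group("signaled"))
        emitted = int(match.group("emitted"))
        hits.append(
            {
                "ring": match.group("ring"),
                "signaled": signaled,
                "emitted": emitted,
                # Assumption: the difference approximates jobs still in flight.
                "pending": emitted - signaled,
            }
        )
    return hits


# Two lines copied from the log excerpt in this report.
SAMPLE = [
    "May 10 13:24:41 system kernel: [drm:amdgpu_job_timedout [amdgpu]] "
    "*ERROR* ring vcn_enc_0.0 timeout, signaled seq=2506271, emitted seq=2506273",
    "May 10 13:24:41 system kernel: [drm:amdgpu_job_timedout [amdgpu]] "
    "*ERROR* ring vcn_dec_0 timeout, signaled seq=1049847, emitted seq=1049850",
]

if __name__ == "__main__":
    for hit in find_ring_timeouts(SAMPLE):
        print(hit)
```

The same function could be fed lines from a saved kernel log (for example the output of `journalctl -k` written to a file) instead of `SAMPLE`.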
Without recovery, a second scenario seems to sometimes happen:
- The normal freeze happens.
- The recording in OBS stalls. Trying to end the recording gets stuck on "Stopping." Killing OBS turns it into a zombie process. If a game was running, closing it is fine. In the case of Firefox, whatever video was playing goes black and won't play back. Refreshing the page doesn't work. Checking dmesg or the journal shows the two timeouts. The session generally becomes unstable. If Steam's hardware acceleration is enabled, exiting Steam causes the screen to freeze, along with the other symptoms described above.
I mainly use OBS for recording gameplay, and sometimes I'll play a YouTube video while doing so, which has then led to this issue. The same happens with live Twitch streams.
Having OBS recording just the desktop with mpv looping any video will trigger it as well.
I've been using this copr of mesa-git instead of 23.0.3. It doesn't help.
Some other issues involving VAAPI have mentioned an issue with kernel 6.1.7 and above.
I tried on Fedora's build of kernel 6.1.6 and was able to reproduce the issue. I could try some older versions if needed.
sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info | grep MEC
returns the following:
MEC feature version: 42, firmware version: 0x00000068
MEC2 feature version: 42, firmware version: 0x00000068
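To compare firmware versions between machines or kernel builds, the output above can be turned into a small lookup table. This is only a sketch: `parse_firmware_info` is a hypothetical helper, and it assumes every line of interest follows the same `NAME feature version: N, firmware version: 0x...` layout as the two MEC lines shown here.

```python
import re

# Line layout assumed from the two MEC lines quoted in this report.
FW_LINE = re.compile(
    r"^(?P<name>\S+) feature version: (?P<feature>\d+), "
    r"firmware version: (?P<fw>0x[0-9a-fA-F]+)"
)


def parse_firmware_info(text):
    """Map component name -> (feature version, firmware version as int)."""
    info = {}
    for line in text.splitlines():
        match = FW_LINE.match(line.strip())
        if match:
            info[match.group("name")] = (
                int(match.group("feature")),
                int(match.group("fw"), 16),
            )
    return info


# The output quoted in this report.
SAMPLE = """\
MEC feature version: 42, firmware version: 0x00000068
MEC2 feature version: 42, firmware version: 0x00000068
"""

if __name__ == "__main__":
    for name, (feature, fw) in parse_firmware_info(SAMPLE).items():
        print(f"{name}: feature={feature} firmware=0x{fw:08x}")
```

Dropping the `grep MEC` filter and passing the whole file through the same function would collect any other components listed in that format.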
The first instance I could find in my journal was February 12th 2023; my journal entries go back to December 2022.
I want to say I've had a vcn_dec_0 timeout from Firefox without OBS recording, but my journal only had events that include OBS running.
I don't believe I've had a vcn_enc_0.0 timeout happen when OBS is recording a game without VAAPI decoding taking place. The other day I recorded for about 6 hours without Firefox running to be safe and had no issue.
Attached files:
This test video from a different VAAPI issue can trigger the issue, but any video will do it for me.
mpv arguments used: mpv --no-config --loop --hwdec=vaapi
Log files (for system lockups / game freezes / crashes)
journal_system_6.2.15_fsync_recovery_0.txt
journal_system_6.1.6.txt
OBS log with obs-vaapi encoder settings. I changed rate control to CQP, and qpi, qpp, and qpb from 26 to 24; everything else is default. Raising those QP values to 28 or leaving them at 26 gives the same results. obs-gstreamer with rate-control=cqp init-qp=26 gives the same results, as do the built-in ffmpeg VAAPI options with CQP.
obs.txt