[R9 390X] Broken hardware acceleration in 6.10 kernel

Ditto here with an AMD Radeon 520:

Kernel 6.10.6 (LoongArch)
Mesa 24.2.1

As additional information: in my testing, attempting to play H.264 videos triggers the aforementioned error, but H.265 and VP8 seem to be okay.

mentioned in issue #3501 (closed)

I have bisected the regression to commit f3572db3c049b4d32bb5ba77ad5305616c44c7c1 (included since 6.10.4); reverting it fixes hardware decoding on my Radeon 520.

It was intended as a fix for #3501 (closed).

For me it was 6.10.3

Hardware description:

CPU: AMD PRO A4-3350B
GPU: AMD Radeon R4 Graphics
System Memory: 8GiB
Display(s): 1366x768 @ 60 Hz in 14″
Type of Display Connection: eDP-1

System information:

Distro name and Version: Fedora Linux 40 (Workstation Edition)
Kernel version: 6.10.3-200.fc40
Custom kernel: N/A
AMD official driver version: N/A

❯ vainfo
Trying display: wayland
libva info: VA-API version 1.21.0
libva info: Trying to open /usr/lib64/dri-nonfree/radeonsi_drv_video.so
libva info: Trying to open /usr/lib64/dri-freeworld/radeonsi_drv_video.so
libva info: Trying to open /usr/lib64/dri/radeonsi_drv_video.so
libva info: Found init function __vaDriverInit_1_21
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.21 (libva 2.21.0)
vainfo: Driver version: Mesa Gallium driver 24.1.7 for AMD Radeon R4 Graphics (radeonsi, kabini, LLVM 18.1.6, DRM 3.57, 6.10.3-200.fc40.x86_64)
vainfo: Supported profile and entrypoints
      VAProfileMPEG2Simple            :	VAEntrypointVLD
      VAProfileMPEG2Main              :	VAEntrypointVLD
      VAProfileVC1Simple              :	VAEntrypointVLD
      VAProfileVC1Main                :	VAEntrypointVLD
      VAProfileVC1Advanced            :	VAEntrypointVLD
      VAProfileH264ConstrainedBaseline:	VAEntrypointVLD
      VAProfileH264ConstrainedBaseline:	VAEntrypointEncSlice
      VAProfileH264Main               :	VAEntrypointVLD
      VAProfileH264Main               :	VAEntrypointEncSlice
      VAProfileH264High               :	VAEntrypointVLD
      VAProfileH264High               :	VAEntrypointEncSlice
      VAProfileNone                   :	VAEntrypointVideoProc

Please try this patch, it should fix the issue.

0001-drm-amdgpu-fix-forcing-BOs-into-the-UVD-segment.patch

Unfortunately, with the patch applied, I still get the same "CS has been rejected" error (-22); the dmesg output also looks identical:

[  125.523712] [drm:amdgpu_uvd_cs_pass2 [amdgpu]] *ERROR* msg/fb buffer ff019e8000-ff019ec000 out of 256MB segment!
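
For readers wondering what the message means: a rough sketch of the check, assuming that on UVD hardware without 64-bit addressing the msg/fb buffer must lie in the same 256MB-aligned segment (same address bits above bit 27) as the UVD firmware buffer. The `UVD_BASE` value and the helper names are hypothetical placeholders, since the firmware address is not shown in this thread.

```python
# Sketch of the 256MB segment comparison, under the assumptions stated above.
SEGMENT_SHIFT = 28  # 2**28 bytes = 256 MiB

# Hypothetical UVD firmware GPU address; the real value is not shown in this thread.
UVD_BASE = 0xF400000000


def segment(addr):
    """Return the index of the 256MB-aligned segment that contains addr."""
    return addr >> SEGMENT_SHIFT


def in_uvd_segment(start, uvd_base=UVD_BASE):
    """True if a buffer starting at `start` shares the firmware's segment."""
    return segment(start) == segment(uvd_base)


# Address range copied from the dmesg line above.
start, end = 0xFF019E8000, 0xFF019EC000
print(hex(segment(start)), hex(segment(UVD_BASE)), in_uvd_segment(start))
```

If this reading is right, the fix has to change where the buffer is placed; the buffer size itself (0x4000 bytes here) is not the problem.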

Mhm, then I don't fully know why that doesn't work. Maybe something in the allocation backend doesn't work as expected.

Can you try this patch as well? Thanks in advance. 0001-drm-amdgpu-WIP-test-patch.patch

@ckoenig Unfortunately, no dice:

[   49.708329] [drm:amdgpu_uvd_cs_pass2 [amdgpu]] *ERROR* msg/fb buffer ff019e0000-ff019e4000 out of 256MB segment!

The memory address range did seem to have changed, though...

I'm running out of ideas what that could be. Could you add this code chunk here and see if you have any "Test..." prints in dmesg right before the problem happens?

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index eaf75248800e..211faf64f1d9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -1802,5 +1802,7 @@ int amdgpu_cs_find_mapping(struct amdgpu_cs_parser *parser,
        if (r)
                return r;
 
+       printk("Test 0x%08x\n", bo->resource->placement);
+
        return amdgpu_ttm_alloc_gart(&(*bo)->tbo);
 }

Thanks a lot for the help.

@ckoenig

Getting a failure here (I should have mentioned, I'm currently on 6.10.7, sorry!):

drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c: In function ‘amdgpu_cs_find_mapping’:
drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:1793:35: error: ‘bo’ is a pointer to pointer; did you mean to dereference it before applying ‘->’ to it?
 1793 |         printk("Test 0x%08x\n", bo->resource->placement);
      |                                   ^~
./include/linux/printk.h:433:33: note: in definition of macro ‘printk_index_wrap’
  433 |                 _p_func(_fmt, ##__VA_ARGS__);                           \
      |                                 ^~~~~~~~~~~
drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:1793:9: note: in expansion of macro ‘printk’
 1793 |         printk("Test 0x%08x\n", bo->resource->placement);
      |         ^~~~~~

Sorry, that was my fault. I typed it from memory and didn't double-check it. The added debug line should look like this:

printk("Test 0x%08x\n", (*bo)->tbo.resource->placement);

@ckoenig Hi good morning (UTC+8 here), here are the debug prints:

[   47.037906] Test 0x00000000
[   47.040888] Test 0x00000000
[   47.043730] Test 0x00000000
[   47.046515] [drm:amdgpu_uvd_cs_pass2 [amdgpu]] *ERROR* msg/fb buffer ff019e4000-ff019e8000 out of 256MB segment!
[   47.064942] audit: type=1701 audit(1725416982.689:44): auid=1001 uid=1001 gid=1002 ses=3 subj=unconfined pid=1768 comm="mpv:cs0" exe="/usr/bin/mpv" sig=6 res=1

Oh, well that explains it. @arunpravin24 we have a bug in the VRAM backend.

When TTM_PL_FLAG_CONTIGUOUS is given it doesn't end up in the resource->placement flags for some reason.
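
To make the printed values easier to read, here is a small decoding sketch. The bit values are an assumption based on how `include/drm/ttm/ttm_placement.h` defined these flags around this kernel version; please check them against your own tree. The `decode_placement` helper is hypothetical and only for illustration.

```python
# Flag bits assumed from include/drm/ttm/ttm_placement.h; verify against your kernel tree.
TTM_PL_FLAGS = {
    "TTM_PL_FLAG_CONTIGUOUS": 1 << 0,
    "TTM_PL_FLAG_TOPDOWN": 1 << 1,
    "TTM_PL_FLAG_TEMPORARY": 1 << 2,
}


def decode_placement(value):
    """Return the names of all known flags set in a placement value."""
    return [name for name, bit in TTM_PL_FLAGS.items() if value & bit]


# 0x00000000 is the value from the debug prints above; the others are illustrative.
for raw in ("0x00000000", "0x00000001", "0x00000003"):
    print(raw, decode_placement(int(raw, 16)))
```

Under these assumptions, a value of 0x00000000 means the contiguous flag is not set on the resource, which matches the diagnosis above.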

Sure, I will check. Thanks.

@ckoenig Well at least we are on some sort of firm ground now! Let me know if you need any further help with testing - I'm happy to help.

And sorry if this comes across as presumptuous, but what is AMD's current support policy for radeon-powered cards, such as TeraScale 2 and older generations? I'm currently running into a bug at #3604 and would love to know whether anyone remains who can look into issues like this.

Here is another testing hack. I don't have time to fix the underlying issue at the moment, but Arun should keep looking at it.

The patch doesn't have all the checks and is only compile-tested, but it could be that this actually works as a workaround for now.

0001-drm-amdgpu-WIP-test-patch-v2.patch

@ckoenig Unfortunately, that made it worse...

[   40.908913] [drm:amdgpu_uvd_cs_pass2 [amdgpu]] *ERROR* msg/fb buffer ff019d0000-ff019d4000 out of 256MB segment!
[   40.927581] audit: type=1701 audit(1725545674.371:43): auid=1001 uid=1001 gid=1002 ses=3 subj=unconfined pid=1759 comm="mpv:cs0" exe="/usr/bin/mpv" sig=6 res=1
[   43.568148] r8169 0000:02:00.0 enp2s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
[   44.567703] r8169 0000:02:00.0 enp2s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
[   44.589278] r8169 0000:02:00.0 enp2s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
[   45.544297] snd_hda_intel 0000:05:00.1: azx_get_response timeout, switching to polling mode: last cmd=0x00270600
[   46.558607] snd_hda_intel 0000:05:00.1: No response from codec, disabling MSI: last cmd=0x00270600
[   47.569302] snd_hda_intel 0000:05:00.1: azx_get_response timeout, switching to single_cmd mode: last cmd=0x00270600
[   48.023422] r8169 0000:02:00.0 enp2s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
[   48.044838] r8169 0000:02:00.0 enp2s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
[   50.718298] r8169 0000:02:00.0 enp2s0: NETDEV WATCHDOG: CPU: 3: transmit queue 0 timed out 5755 ms
[   50.727230] r8169 0000:02:00.0: can't disable ASPM; OS doesn't have ASPM control
[   50.747382] r8169 0000:02:00.0 enp2s0: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
[   51.166311] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=1055, emitted seq=1056
[   51.184322] amdgpu 0000:05:00.0: amdgpu: GPU reset begin!
[   51.766808] amdgpu 0000:05:00.0: amdgpu: PCI CONFIG reset
[   51.776194] amdgpu 0000:05:00.0: amdgpu: GPU reset succeeded, trying to resume
[   51.783495] [drm] PCIE gen 2 link speeds already enabled
[   51.790972] amdgpu 0000:05:00.0: amdgpu: PCIE GART of 1024M enabled (table at 0x000000F400600000).
[   51.800020] [drm] VRAM is lost due to GPU reset!
[   52.023196] r8169 0000:02:00.0 enp2s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
[   52.044570] r8169 0000:02:00.0 enp2s0: rtl_counters_cond == 1 (loop: 1000, delay: 10).
[   52.172201] amdgpu 0000:05:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
[   52.188431] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v6_0> failed -110
[   52.204754] [drm:si_dpm_set_power_state [amdgpu]] *ERROR* si_restrict_performance_levels_before_switch failed
[   52.304534] amdgpu 0000:05:00.0: amdgpu: GPU reset(1) failed
[   52.310170] amdgpu 0000:05:00.0: amdgpu: GPU reset end with ret = -110
[   52.316659] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -110

By the way, did you mean for me to apply both WIP patches? They both seem to apply, but I have only applied v2 in this round of testing.

@arunpravin24 Hi, do we have any updates on this issue?

@MingcongBai Analyzing the problem and working on it, I will provide a patch for your testing.

Same for me with 6.10.9 and every version before. In fact, I have had trouble and have reported several radeon-related bugs since Fedora 38, my first installation. I'm now on Fedora 40 and still on the original 6.8.7 kernel, as it's the only one that behaves properly as far as radeon/amdgpu is concerned. Nothing between 6.8.8 and 6.10.9 worked for me.

Symptom: Kodi crashing instantly as soon as I click on a video.

amdgpu: The CS has been rejected, see dmesg for more information (-22).
[drm:amdgpu_uvd_cs_pass2 [amdgpu]] *ERROR* msg/fb buffer ff00d0e000-ff00d10000 out of 256MB segment!

I diffed the driver between the two versions (git diff v6.8..v6.11 -- drivers/gpu/drm/amd/amdgpu/) and there are several significant changes, so I'm wondering whether these changes are actually being tested on real hardware or only in theory.

OS: Fedora Linux 40 (Workstation Edition) x86_64

Kernel: 6.8.7-300.fc40.x86_64

CPU: AMD A10-7850K Radeon R7 4C+8G (4) @ 3.700GHz

GPU: AMD ATI Radeon R7 Graphics

@arunpravin24 Sorry to nag but do we have any updates on fixing hardware acceleration for GCN 1.0/2.0 cards? It has been difficult to follow the latest kernel updates on my hardware.

@MingcongBai Sorry, I was busy with other stuff. I will try to post a patch this week for your testing.

I will try to reproduce the issue locally here. I should have that hw generation still in my lab somewhere.

Just FYI, it seems that this isn't an issue on the 6.9.x kernel series, but 6.10 is all busted. I confirmed this with 6.9.12 on Fedora 40 KDE in my own testing, and it was also discussed here: https://bbs.archlinux.org/viewtopic.php?pid=2191561#p2191561

@arunpravin24 @ckoenig Now that it's been two weeks, do we have any updates on this?

@MingcongBai I am trying to reproduce the problem on another machine. I will update you on this.

Kernel 6.11 is out on Fedora 40. Can we assume it's broken as well?

If I can help with testing a fix, I am all ears. I am eager to see this bug fixed.

I attempted to replicate the issue on new cards by limiting the memory, but I was unsuccessful. Finally today I will receive the R9 390 card. I will check and update.

Hello, got ZEN kernel update today to 6.11, still the same issue. Vainfo:

vainfo: VA-API version: 1.22 (libva 2.22.0)
vainfo: Driver version: Mesa Gallium driver 24.2.5-arch1.1 for AMD Radeon R7 Graphics (radeonsi, kaveri, LLVM 18.1.8, DRM 3.59, 6.11.5-zen1-1-zen)

dmesg:

[drm:amdgpu_uvd_ring_parse_cs [amdgpu]] *ERROR* msg/fb buffer ff00d14000-ff00d16000 out of 256MB segment!

I'm getting the same "out of 256MB segment" dmesg error as above with any application that tries to use H.264 acceleration. mpv with hwdec and other players simply crash, while Firefox falls back to software decoding. Honestly it's not THAT big a deal, but better web performance and battery life from offloading the CPU are pretty helpful on my devices.

Here is my devices' hardware info for testing; I'm also eager for this bug to be fixed.

Device 1 - HP-T620:

OS: Debian Testing/Trixie on Kernel 6.11.4, CPU: AMD GX-415GA, GPU: HD 8330E IGPU

Device 2 - Optiplex 780:

OS: Debian Testing/Trixie on Kernel 6.11.4, CPU: Core 2 Duo e8500, GPU: R5-240 (GDDR3)

Device 3 - Old Custom:

OS: Debian Testing/Trixie on Kernel 6.11.4, CPU: FX-6350, GPU: HD-7770

All of them are running AMDGPU for vulkan support and they're all affected it seems.

I'm getting a similar error on Fedora 40 on Radeon R7 Kaveri iGPU built into an AMD A8-7600 CPU.

Video players (clapper, mpv) configured to use VA-API crash with:

amdgpu: The CS has been rejected, see dmesg for more information (-22)

dmesg shows this:

kernel: [drm:amdgpu_uvd_cs_pass2 [amdgpu]] *ERROR* msg/fb buffer ff00d12000-ff00d14000 out of 256MB segment!

Affected kernels are at least 6.10.12 and later. Mesa 24.1.7. I haven't checked earlier versions yet.

6.9.12 is the last known working version (confirmed) thus far.

Could you please try the attached patch? 0001-drm-amdgpu-WIP-test-patch-for-UVD-CS-issue.patch

Hi Arun, I tried your WIP test patch.

GPU: R7 370 4GB (GCN 1)

Distro: CachyOS (Arch Linux)

Kernel: 6.11.6

Mesa: 24.2.4

When running mpv --hwdec=auto I get the following dmesg errors:

[nov 1 18:29] UVD Test 0x00000001
[  +0,002102] UVD Test 0x00000001
[  +0,000028] UVD Test 0x00000001
[  +0,000005] [drm:amdgpu_uvd_cs_pass2 [amdgpu]] *ERROR* msg/fb buffer ff00bd0000-ff00bd2000 out of 256MB segment!

How do I apply a patch?

See this screenshot

@mahmoudshmaitelly What OS are you using? What app are you using to edit?

Thanks for testing. I have modified the contiguous setting. Please try the attached patch.

0001-drm-amdgpu-WIP-test-patch-for-UVD-CS-issue.patch

This one seems to work!

GPU: R7 370 4GB (GCN 1)

Output from mpv:

mpv --hwdec=vaapi file_example_MP4_1280_10MG.mp4

● Video  --vid=1  (h264 1280x720 30 fps) [default]

● Audio  --aid=1  (aac 2ch 48000 Hz 160 kbps) [default]

Using hardware decoding (vaapi).

AO: [pipewire] 48000Hz stereo 2ch floatp

VO: [gpu] 1280x720 vaapi[nv12]

AV: 00:00:09 / 00:00:30 (32%) A-V:  0.000

dmesg gets spammed with the printk statement with:

UVD Test 0x00000001

and

UVD Test 0x00000003

@Shendisx Thanks for the quick result. I have removed the debug log. Please apply the attached patch to avoid spamming dmesg with the printk logs. I will post this patch for review. 0001-drm-amdgpu-WIP-test-patch-for-UVD-CS-issue.patch

My pleasure, thank you for fixing it!

Please let us know when it is merged into the latest kernel so we can update and test without rebuilding the kernel on our side. Very much deserved thanks for your fix.

Thank you for the patch. There has been definite progress, but unfortunately the problems have not been fully solved for my card.

I applied the patch to kernel version 6.11.5 on Debian. With an HD 7750 card, I am now able to start video playback, but the following error appears. Before the patch, I was not able to start video playback at all on 6.11.5. Video playback worked almost perfectly (crashing roughly once a month at most) on the earlier 5.x series kernels.

Playing: 02.redacted.s01.E02.(2023).HDTV.(1080р).by.redacted.ts
 (+) Video --vid=1 (h264 1920x1080 25.000fps)
 (+) Audio --aid=1 --alang=Und (mp2 2ch 48000Hz)
Using hardware decoding (vaapi).
VO: [gpu] 1920x1080 vaapi[nv12]
[ffmpeg/video] h264: Failed to allocate a vaapi/nv12 frame from a fixed pool of hardware frames.
[ffmpeg/video] h264: Consider setting extra_hw_frames to a larger value (currently set to -1, giving a pool size of 22).
[ffmpeg/video] h264: get_buffer() failed
[ffmpeg/video] h264: decode_slice_header error
[ffmpeg/video] h264: no frame!
Error while decoding frame (hardware decoding)!
[ffmpeg/video] h264: get_buffer() failed
[ffmpeg/video] h264: decode_slice_header error
[ffmpeg/video] h264: no frame!
Error while decoding frame (hardware decoding)!
[ffmpeg/video] h264: get_buffer() failed
[ffmpeg/video] h264: decode_slice_header error
[ffmpeg/video] h264: no frame!
Error while decoding frame (hardware decoding)!
Attempting next decoding method after failure of h264-vaapi.
[ffmpeg/video] h264: co located POCs unavailable
[ffmpeg/video] h264: co located POCs unavailable
AO: [pipewire] 48000Hz stereo 2ch doublep
AV: 00:00:00 / 00:48:22 (0%) A-V: -0.007

There's nothing appearing in the dmesg log when this error is triggered in mpv.

My vainfo output with the patch is:

Trying display: wayland
Trying display: x11
libva info: VA-API version 1.22.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/radeonsi_drv_video.so
libva info: Found init function __vaDriverInit_1_22
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.22 (libva 2.22.0)
vainfo: Driver version: Mesa Gallium driver 24.2.4-1 for AMD Radeon HD 7700 Series (radeonsi, verde, LLVM 19.1.1, DRM 3.59, 6.11.5-fix-amdgpu-01)
vainfo: Supported profile and entrypoints
      VAProfileMPEG2Simple            : VAEntrypointVLD
      VAProfileMPEG2Main              : VAEntrypointVLD
      VAProfileVC1Simple              : VAEntrypointVLD
      VAProfileVC1Main                : VAEntrypointVLD
      VAProfileVC1Advanced            : VAEntrypointVLD
      VAProfileH264ConstrainedBaseline: VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointVLD
      VAProfileH264High               : VAEntrypointVLD
      VAProfileNone                   : VAEntrypointVideoProc

Also, I started having an error in the dmesg when playing gzdoom with mods that load a lot of resources (like Project Brutality, but also Hexen Serpent Resurrection does that, albeit being a bit more difficult to trigger).

The error is a repetition of

redacted [drm:amdgpu_cs_parser_bos.isra.0 [amdgpu]] *ERROR* amdgpu_vm_validate() failed.
redacted [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!

With the patch, the error is easier to trigger.

I could play around with module options and see how things unfold. What are the good settings to try playing around with?

UPDATE: With the following set of amdgpu module options, I was able to launch video playback without errors:

options amdgpu si_support=1
options amdgpu cik_support=1
options amdgpu dpm=1
options amdgpu runpm=0
options amdgpu bapm=1
options amdgpu aspm=1
options amdgpu gttsize=32768
options amdgpu msi=1
options amdgpu hw_i2c=1
options amdgpu pcie_gen2=1
options amdgpu gpu_recovery=1

Relevant excerpts from mpv.conf:

gpu-api=opengl
hwdec=vaapi
x11-bypass-compositor=yes
vo=gpu
vo-sixel-dither=burkes

Unfortunately, hardware decoding via vaapi does not work with Vulkan API in mpv on my system at this point.

I'll test with gzdoom later, and report in this thread. Also, would like to test with longer video playback today.

I noticed a small problem when compiling the kernel earlier. The compiler emitted a warning about an unused variable i. Probably, it would be elided during compilation, and we shouldn't see any observable effects.

I patched the following kernel and the issue is not resolved; I believe we are uncovering another bug. I am debugging the core dump to get the full backtrace.

uname -a
Linux eos 6.11.5-x64v2-xanmod1-1-git #1 SMP PREEMPT_DYNAMIC Mon, 04 Nov 2024 23:20:09 +0000 x86_64 GNU/Linux

mpv --hwdec=vaapi -vo=gpu-next ~/Videos/I\ Am\ Legend\ -\ Trailer.mp4
[ffmpeg/demuxer] mov,mp4,m4a,3gp,3g2,mj2: stream 0, timescale not set
● Video  --vid=1  (h264 1920x816 23.976 fps) [default]
○ Image  --vid=2  (mjpeg)
● Audio  --aid=1  (aac 6ch 48000 Hz 258 kbps) [default]
File tags:
 Artist: Warner Bros.
 Date: 2007
 Genre: Science-Fiction
 Title: I Am Legend - Trailer
[vo/gpu-next/wayland] Unable to set DRM atomic cap: Operation not supported
amdgpu: The CS has been rejected, see dmesg for more information (-22).
[1]    5303 IOT instruction (core dumped)  mpv --hwdec=vaapi -vo=gpu-next ~/Videos/I\ Am\ Legend\ -\ Trailer.mp4

dmesg
dmesg: read kernel buffer failed: Operation not permitted

vainfo
Trying display: wayland
vainfo: VA-API version: 1.22 (libva 2.22.0)
vainfo: Driver version: Mesa Gallium driver 24.2.6-arch1.1 for AMD Radeon HD 7800 Series (radeonsi, pitcairn, ACO, DRM 3.59, 6.11.5-x64v2-xanmod1-1-git)
vainfo: Supported profile and entrypoints
      VAProfileMPEG2Simple            : VAEntrypointVLD
      VAProfileMPEG2Main              : VAEntrypointVLD
      VAProfileVC1Simple              : VAEntrypointVLD
      VAProfileVC1Main                : VAEntrypointVLD
      VAProfileVC1Advanced            : VAEntrypointVLD
      VAProfileH264ConstrainedBaseline: VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointVLD
      VAProfileH264High               : VAEntrypointVLD
      VAProfileNone                   : VAEntrypointVideoProc

Attached is the debug log vaapi_sig_abort_debug_log

After reverting to kernel 6.9.7, mpv works successfully.

I was able to run mpv with hwdec for over 10 hours today non-stop with the patched kernel. I adjusted slightly the config I presented in my earlier post, since it's more relevant to my use case (Chromium is configured to use Vulkan and VAAPI).

gpu-api=vulkan
hwdec=vaapi-copy
x11-bypass-compositor=yes
vo=gpu
vo-sixel-dither=burkes

I have been monitoring the card's state with nvtop, and have observed consistent 300-500 MiB/s in TX and RX directions.

However, I'm still experiencing crashes when using 3D graphics intensively. In contrast to previous crashes, these are completely silent. I have an ssh session open with dmesg output being captured remotely, and not a single line is output. I'm pretty sure I would have observed the same even with a serial console.

So, the only change in experience with 3d graphics, is that now there're no errors in the log about insufficient memory to submit a command, or inability to allocate contiguous 256mb block, or some sdma0 or sdma1 error. The log is silent.

I noticed that the crash usually occurs when TX in nvtop goes to a high value, like 19.80 GiB/s or 16.38 GiB/s. Well, this is around the theoretical limit of a 16-lane PCIe 3.0 slot (about 15.75 GB/s per direction), but it is certainly something that happens consistently at the time of the crash with gzdoom.
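
For reference, a rough estimate of the theoretical per-direction throughput of a PCIe 3.0 x16 link, which the nvtop readings can be compared against. It assumes 8 GT/s per lane with 128b/130b encoding and ignores packet and protocol overhead, so real-world figures are lower. How nvtop samples and aggregates its TX/RX counters is not known here.

```python
GIB = 2**30


def pcie3_gib_per_s(lanes):
    """Theoretical one-direction PCIe 3.0 rate in GiB/s (no protocol overhead)."""
    transfers_per_s = 8e9  # 8 GT/s per lane
    encoding = 128 / 130   # 128b/130b line encoding
    bytes_per_s = transfers_per_s * encoding / 8 * lanes
    return bytes_per_s / GIB


print(f"{pcie3_gib_per_s(16):.2f} GiB/s")  # prints "14.67 GiB/s"
```

By this estimate, sustained readings of 16.38 GiB/s or 19.80 GiB/s would be above what the link can carry in one direction, so they may be sampling artifacts rather than real transfer rates.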

Now, I also tried to play Hearts of Iron IV with this card, and crash occurred without the TX spiking.

@mahmoudshmaitelly What behavior do you observe with gpu-api=vulkan and hwdec=vaapi-copy? mpv seems to be capricious when it comes to the choice of VO options, but those settings seem to make a lot of sense overall, and are probably the only sound ones when Vulkan is desired as the rendering API.

@shang.tsung.sea mpv failed to start hardware decoding and played the video in software mode. No core dump.

[vaapi] Failed to initialize VAAPI: resource allocation failed
[vaapi] libva: /usr/lib/dri/radeonsi_drv_video.so init failed

sudo dmesg | grep amdgpu
[    3.978676] fbcon: amdgpudrmfb (fb0) is primary device
[    4.307422] amdgpu 0000:01:00.0: [drm] fb0: amdgpudrmfb frame buffer device
[   38.931050] [drm:amdgpu_uvd_cs_pass2 [amdgpu]] *ERROR* msg/fb buffer ff013e6000-ff013e8000 out of 256MB segment!

mpv --hwdec=vaapi-copy --gpu-api=vulkan --vo=gpu-next ~/Videos/I\ Am\ Legend\ -\ Trailer.mp4
[ffmpeg/demuxer] mov,mp4,m4a,3gp,3g2,mj2: stream 0, timescale not set
● Video  --vid=1  (h264 1920x816 23.976 fps) [default]
○ Image  --vid=2  (mjpeg)
● Audio  --aid=1  (aac 6ch 48000 Hz 258 kbps) [default]
File tags:
 Artist: Warner Bros.
 Date: 2007
 Genre: Science-Fiction
 Title: I Am Legend - Trailer
[vo/gpu-next/wayland] Unable to set DRM atomic cap: Operation not supported
WARNING: radv is not a conformant Vulkan implementation, testing use only.
[vaapi] libva: /usr/lib/dri/radeonsi_drv_video.so init failed
[vaapi] Failed to initialize VAAPI: resource allocation failed
AO: [pipewire] 48000Hz 5.1 6ch floatp
VO: [gpu-next] 1920x816 yuv420p
AV: 00:00:25 / 00:02:03 (21%) A-V:  0.000
Exiting... (Quit)
Execution time: 26s

uname -a
Linux eos 6.11.5-x64v2-xanmod1-1-git #1 SMP PREEMPT_DYNAMIC Mon, 04 Nov 2024 23:20:09 +0000 x86_64 GNU/Linux

I think I am still running without the patch.

I changed the code in the file as per the patch, then ran makepkg -si.

When I reviewed the source file afterwards, it had been restored to the git version! What should I do to recompile the kernel and install it with the patch?

Can anyone share a patched kernel, so I can do pacman -U ....?

@i300220 Thanks, I will follow the instructions and try again.

@mahmoudshmaitelly Hi, Maybe you skipped some steps. Here's the procedure: https://wiki.archlinux.org/title/Kernel/Arch_build_system

@mahmoudshmaitelly This is really weird. In my case, after applying the patch, I'm getting the following result:

mpv --hwdec=vaapi-copy --gpu-api=vulkan --vo=gpu-next (#i)*S??e<2->.*.mkv
Playing: redacted.mkv
 (+) Video --vid=1 (*) 'redacted' (h264 1920x1080 23.976fps)
 (+) Audio --aid=1 --alang=eng (*) 'redacted' (ac3 6ch 48000Hz)
     Subs  --sid=1 --slang=eng (*) (subrip)
 (+) Subs  --sid=2 --slang=eng (subrip)
File tags:
 Title: redacted
WARNING: radv is not a conformant Vulkan implementation, testing use only.
[ffmpeg/video] h264: Increasing reorder buffer to 1
Using hardware decoding (vaapi-copy).
VO: [gpu-next] 1920x1080 nv12
AO: [pipewire] 48000Hz 5.1(side) 6ch doublep
AV: 00:18:21 / 00:43:08 (43%) A-V:  0.000
Exiting... (Quit)

Same with --vo=gpu.

I wonder whether you could test the same mpv command on X11. If that makes a difference, we would have a curious situation indeed.

On Debian, there's a small caveat with kernel module building and module signing, especially if you use dkms. Check out this post and also the procedure in the Debian manual.

If you have 20-24 GB of RAM + swap to spare, I would suggest performing the actual build in a tmpfs (usually mounted on /tmp, but beware of the maximum quota set there) and doing an entirely clean build.

I applied the patch as per the Arch Wiki and installed the patched kernel.

I am still seeing the same failure as before.

Video:
mpv --hwdec=vaapi --vo=gpu-next ~/Videos/I\ Am\ Legend\ -\ Trailer.mp4
[ffmpeg/demuxer] mov,mp4,m4a,3gp,3g2,mj2: stream 0, timescale not set
● Video  --vid=1  (h264 1920x816 23.976 fps) [default]
○ Image  --vid=2  (mjpeg)
● Audio  --aid=1  (aac 6ch 48000 Hz 258 kbps) [default]
File tags:
 Artist: Warner Bros.
 Date: 2007
 Genre: Science-Fiction
 Title: I Am Legend - Trailer
[vo/gpu-next/wayland] Unable to set DRM atomic cap: Operation not supported
amdgpu: The CS has been rejected, see dmesg for more information (-22).
[1]    9978 IOT instruction (core dumped)  mpv --hwdec=vaapi --vo=gpu-next ~/Videos/I\ Am\ Legend\ -\ Trailer.mp4

dmesg:
[   35.620352] [drm:amdgpu_uvd_cs_pass2 [amdgpu]] *ERROR* msg/fb buffer ff012ba000-ff012bc000 out of 256MB segment!
[   35.644787] firefox:cs0[5056]: segfault at 0 ip 0000561fcb8ac780 sp 00007b577f5ff9b0 error 6 in firefox[96780,561fcb833000+a4000] likely on CPU 3 (core 3, socket 0)
[   35.644810] Code: 53 50 48 89 fb 4c 8b 35 36 c1 02 00 49 8b 36 ff 15 ed c1 02 00 49 8b 36 bf 0a 00 00 00 ff 15 57 c2 02 00 48 89 1d e0 f2 02 00 <c7> 04 25 00 00 00 00 23 00 00 00 e8 00 00 00 00 f3 0f 1e fa 50 48
[  101.902196] [drm:amdgpu_uvd_cs_pass2 [amdgpu]] *ERROR* msg/fb buffer ff01410000-ff01412000 out of 256MB segment!
[  101.961276] firefox:cs0[6575]: segfault at 0 ip 00005a3e316b4780 sp 000076c4059ff9b0 error 6 in firefox[96780,5a3e3163b000+a4000] likely on CPU 0 (core 0, socket 0)
[  101.961285] Code: 53 50 48 89 fb 4c 8b 35 36 c1 02 00 49 8b 36 ff 15 ed c1 02 00 49 8b 36 bf 0a 00 00 00 ff 15 57 c2 02 00 48 89 1d e0 f2 02 00 <c7> 04 25 00 00 00 00 23 00 00 00 e8 00 00 00 00 f3 0f 1e fa 50 48
[  182.681000] [drm:amdgpu_uvd_cs_pass2 [amdgpu]] *ERROR* msg/fb buffer ff0142c000-ff0142e000 out of 256MB segment!
[  182.692410] firefox:cs0[6839]: segfault at 0 ip 00006434d345a780 sp 0000744c095ff9b0 error 6 in firefox[96780,6434d33e1000+a4000] likely on CPU 4 (core 0, socket 0)
[  182.692423] Code: 53 50 48 89 fb 4c 8b 35 36 c1 02 00 49 8b 36 ff 15 ed c1 02 00 49 8b 36 bf 0a 00 00 00 ff 15 57 c2 02 00 48 89 1d e0 f2 02 00 <c7> 04 25 00 00 00 00 23 00 00 00 e8 00 00 00 00 f3 0f 1e fa 50 48
[  747.325100] switching from power state:
[  747.325105]  ui class: performance
[  747.325106]  internal class: none
[  747.325107]  caps:
[  747.325108] [drm]    uvd    vclk: 0 dclk: 0
[  747.325109] [drm]            power level 0    sclk: 30000 mclk: 15000 vddc: 900 vddci: 850 pcie gen: 2
[  747.325111] [drm]            power level 1    sclk: 45000 mclk: 120000 vddc: 900 vddci: 975 pcie gen: 2
[  747.325112] [drm]            power level 2    sclk: 100000 mclk: 120000 vddc: 1219 vddci: 975 pcie gen: 2
[  747.325113]  status: c
[  747.325114] switching to power state:
[  747.325115]  ui class: performance
[  747.325116]  internal class: none
[  747.325117]  caps:
[  747.325117] [drm]    uvd    vclk: 0 dclk: 0
[  747.325118] [drm]            power level 0    sclk: 30000 mclk: 15000 vddc: 825 vddci: 850 pcie gen: 2
[  747.325119] [drm]            power level 1    sclk: 45000 mclk: 120000 vddc: 900 vddci: 975 pcie gen: 2
[  747.325120] [drm]            power level 2    sclk: 110000 mclk: 120000 vddc: 1219 vddci: 975 pcie gen: 2
[  747.325121]  status: r
[  772.321693] switching from power state:
[  772.321698]  ui class: performance
[  772.321699]  internal class: none
[  772.321701]  caps:
[  772.321702] [drm]    uvd    vclk: 0 dclk: 0
[  772.321703] [drm]            power level 0    sclk: 30000 mclk: 15000 vddc: 900 vddci: 850 pcie gen: 2
[  772.321705] [drm]            power level 1    sclk: 45000 mclk: 120000 vddc: 900 vddci: 975 pcie gen: 2
[  772.321706] [drm]            power level 2    sclk: 100000 mclk: 120000 vddc: 1219 vddci: 975 pcie gen: 2
[  772.321707]  status: c
[  772.321709] switching to power state:
[  772.321709]  ui class: performance
[  772.321710]  internal class: none
[  772.321711]  caps:
[  772.321712] [drm]    uvd    vclk: 0 dclk: 0
[  772.321713] [drm]            power level 0    sclk: 30000 mclk: 15000 vddc: 825 vddci: 850 pcie gen: 2
[  772.321714] [drm]            power level 1    sclk: 45000 mclk: 120000 vddc: 900 vddci: 975 pcie gen: 2
[  772.321715] [drm]            power level 2    sclk: 110000 mclk: 120000 vddc: 1219 vddci: 975 pcie gen: 2
[  772.321716]  status: r
[  801.265473] [drm:amdgpu_uvd_cs_pass2 [amdgpu]] *ERROR* msg/fb buffer ff01476000-ff01478000 out of 256MB segment!
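
When the log is this noisy, it can help to pull out only the UVD rejections. Below is a minimal sketch that extracts the rejected address ranges from saved dmesg text. The regular expression is written against the message format seen in this thread, and `uvd_rejections` is a hypothetical helper name. The sample uses two lines from the output above plus a placeholder for unrelated lines.

```python
import re

# Matches the "out of 256MB segment" lines as they appear in this thread.
UVD_ERROR = re.compile(
    r"amdgpu_uvd_\w+ \[amdgpu\]\] \*ERROR\* msg/fb buffer "
    r"([0-9a-fA-F]+)-([0-9a-fA-F]+) out of 256MB segment!"
)


def uvd_rejections(dmesg_text):
    """Yield (start, end, size) for each rejected msg/fb buffer."""
    for match in UVD_ERROR.finditer(dmesg_text):
        start, end = (int(group, 16) for group in match.groups())
        yield start, end, end - start


sample = """\
[   35.620352] [drm:amdgpu_uvd_cs_pass2 [amdgpu]] *ERROR* msg/fb buffer ff012ba000-ff012bc000 out of 256MB segment!
[   35.644787] (unrelated line)
[  101.902196] [drm:amdgpu_uvd_cs_pass2 [amdgpu]] *ERROR* msg/fb buffer ff01410000-ff01412000 out of 256MB segment!
"""

for start, end, size in uvd_rejections(sample):
    print(f"{start:#x}-{end:#x} size={size:#x} segment={start >> 28:#x}")
```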

@mahmoudshmaitelly And is your mpv gpu-api set to OpenGL or Vulkan? In my case, it made the difference between being able to launch the app and not being able to at all.

Also, have you tried to apply the patch that prints out UVD Test lines posted earlier in this thread? What does it output in your case?

I found time to test the patches on x86 and LoongArch, with a Radeon HD7850 (GCN 1.0) yesterday. The issue seems to have gone away on x86 with the video samples we use here at AOSC:

https://repo.aosc.io/ahvl/sample-videos-20241006.tar.zst

However, on LoongArch, the GPU driver seems to time out and reset whilst playing the AVC, 4K@60fps sample. The traceback changes between the two test trials, but the system locked up in both cases.

1st Run:

[  435.347306] amdgpu 0000:07:00.0: amdgpu: Dumping IP State
[  435.352670] amdgpu 0000:07:00.0: amdgpu: Dumping IP State Completed
[  435.358900] amdgpu 0000:07:00.0: amdgpu: ring gfx timeout, signaled seq=30402, emitted seq=30404

2nd Run:

[  145.842934] amdgpu 0000:07:00.0: amdgpu: Dumping IP State
[  145.848309] amdgpu 0000:07:00.0: amdgpu: Dumping IP State Completed
[  145.854603] amdgpu 0000:07:00.0: amdgpu: ring sdma0 timeout, signaled seq=2979, emitted seq=2982

> However, on LoongArch, the GPU driver seems to time out and reset whilst playing the AVC, 4K@60fps sample. The traceback changes between the two test trials, but the system locked up in both cases.

In my case, such content plays just fine. Caveat: mpv uses software decoding, since the HD 7750 cannot hardware-decode videos of such dimensions. I actually do some downscaling to play at 60fps, but at 30fps playback is fine at full resolution (with software decoding).

@shang.tsung.sea I can use either the OpenGL or the Vulkan API. mpv/SMPlayer does not crash but reverts to software decoding when I play one of the VC-1 samples shared here: https://repo.aosc.io/ahvl/sample-videos-20241006.tar.zst

I will apply the patch that adds the debug message to the kernel and report back.

I am failing to patch the kernel on Arch. I ran patch < "patchname" manually, and patch complained that the a/... and b/.. files were not found.

I will wait for the fix to be pushed upstream.

@mahmoudshmaitelly Did you use this command to patch the extracted kernel source?

patch -p1 < 0001-drm-amdgpu-WIP-test-patch-for-UVD-CS-issue.patch
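
The a/... and b/... complaint is the usual symptom of running patch without a strip level. Git-style diffs prefix paths with a/ and b/, and -p1 drops that first component, while omitting -p makes patch look only for the file's base name. Below is a simplified Python sketch of that path handling, for illustration only; the file path is a hypothetical example, not taken from the patch in this thread.

```python
def strip_components(path, count=None):
    """Mimic how patch -pN picks the file name from a diff header path.

    count=None models running patch without -p (only the base name is kept);
    count=N drops the first N leading path components.
    """
    parts = path.split("/")
    if count is None:
        return parts[-1]
    return "/".join(parts[count:])


# Hypothetical header path in the style of a git-generated kernel patch.
header_path = "a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c"

for level in (None, 0, 1):
    print(level, "->", strip_components(header_path, level))
```

Only the -p1 result is a path that exists relative to the top of an extracted kernel tree, which is why the command above is run from that directory.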

I did, and the patch command reported success. I ran makepkg -sf and then yay -U ....., rebooted, and played the same video, which is available online: https://ia904502.us.archive.org/25/items/IAmLegendTrailer/IAmLegendTrailer.mp4

patched kernel:
uname -a
Linux eos 6.11.5-x64v2-xanmod1-1-rdm-patch-git #5 SMP PREEMPT_DYNAMIC Fri, 08 Nov 2024 01:05:01 +0000 x86_64 GNU/Linux

mpv --hwdec=vaapi --vo=gpu-next ~/Videos/I\ Am\ Legend\ -\ Trailer.mp4
[ffmpeg/demuxer] mov,mp4,m4a,3gp,3g2,mj2: stream 0, timescale not set
● Video  --vid=1  (h264 1920x816 23.976 fps) [default]
○ Image  --vid=2  (mjpeg)
● Audio  --aid=1  (aac 6ch 48000 Hz 258 kbps) [default]
File tags:
 Artist: Warner Bros.
 Date: 2007
 Genre: Science-Fiction
 Title: I Am Legend - Trailer
[vo/gpu-next/wayland] Unable to set DRM atomic cap: Operation not supported
amdgpu: The CS has been rejected, see dmesg for more information (-22).
[1]    7472 IOT instruction (core dumped)  mpv --hwdec=vaapi --vo=gpu-next ~/Videos/I\ Am\ Legend\ -\ Trailer.mp4
dmesg
[   19.054627] switching to power state:
[   19.054628]  ui class: performance
[   19.054628]  internal class: none
[   19.054629]  caps:
[   19.054630] [drm]    uvd    vclk: 0 dclk: 0
[   19.054631] [drm]            power level 0    sclk: 30000 mclk: 15000 vddc: 825 vddci: 850 pcie gen: 2
[   19.054632] [drm]            power level 1    sclk: 45000 mclk: 120000 vddc: 900 vddci: 975 pcie gen: 2
[   19.054633] [drm]            power level 2    sclk: 110000 mclk: 120000 vddc: 1219 vddci: 975 pcie gen: 2
[   19.054634]  status: r
[   19.054659] switching from power state:
[   19.054659]  ui class: performance
[   19.054660]  internal class: none
[   19.054661]  caps:
[   19.054662] [drm]    uvd    vclk: 0 dclk: 0
[   19.054662] [drm]            power level 0    sclk: 30000 mclk: 15000 vddc: 900 vddci: 850 pcie gen: 2
[   19.054663] [drm]            power level 1    sclk: 45000 mclk: 120000 vddc: 900 vddci: 975 pcie gen: 2
[   19.054664] [drm]            power level 2    sclk: 100000 mclk: 120000 vddc: 1219 vddci: 975 pcie gen: 2
[   19.054665]  status: c
[   19.054666] switching to power state:
[   19.054667]  ui class: performance
[   19.054667]  internal class: none
[   19.054668]  caps:
[   19.054669] [drm]    uvd    vclk: 0 dclk: 0
[   19.054670] [drm]            power level 0    sclk: 30000 mclk: 15000 vddc: 825 vddci: 850 pcie gen: 2
[   19.054671] [drm]            power level 1    sclk: 45000 mclk: 120000 vddc: 900 vddci: 975 pcie gen: 2
[   19.054672] [drm]            power level 2    sclk: 110000 mclk: 120000 vddc: 1219 vddci: 975 pcie gen: 2
[   19.054673]  status: r
[   48.672103] [drm:amdgpu_uvd_cs_pass2 [amdgpu]] *ERROR* msg/fb buffer ff01368000-ff0136a000 out of 256MB segment!

I can proceed with the review/upstream process if you confirm that the patch fixes the issue.

The patch fixed the problem for me on Debian, system Mesa Gallium driver 24.2.0-devel for AMD Radeon HD 7700 Series (radeonsi, verde, LLVM 18.1.7, DRM 3.59, 6.11.5-fix-amdgpu-01). The only caveat I noticed is that the compiler complained about an unused variable `i`. I didn't fix it manually, since I wanted to keep the patch pristine.

This patch fixed the issue for me on x86, but causes a GPU reset on LoongArch.

However, I do suspect that it is an issue outside of this driver (of course, I would recommend arranging more debugging down the line). See the thread here.

It seems the celebration was premature on my end. I just got this when I attempted to play a video.

[  +6.416998] amdgpu 0000:1b:00.0: amdgpu: IH ring buffer overflow (0x00009E40, 0x0000C630, 0x00009E50)
[  +0.000012] amdgpu 0000:1b:00.0: amdgpu: GPU fault detected: 147 0x086f4402
[  +0.000004] amdgpu 0000:1b:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101407
[  +0.000003] amdgpu 0000:1b:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0F008002
[  +0.000002] amdgpu 0000:1b:00.0: amdgpu: VM fault (0x02, vmid 7) at page 1053703, write from '' (0x00000000) (8)
[  +0.526832] [drm] Fence fallback timer expired on ring sdma0
[  +0.001155] switching from power state:
[  +0.000012]   ui class: performance
[  +0.000005]   internal class: none
[  +0.000008]   caps:
[  +0.000005] [drm]     uvd    vclk: 0 dclk: 0
[  +0.000006] [drm]             power level 0    sclk: 30000 mclk: 112500 vddc: 950 vddci: 950 pcie gen: 3
[  +0.000009] [drm]             power level 1    sclk: 40000 mclk: 112500 vddc: 950 vddci: 950 pcie gen: 3
[  +0.000008] [drm]             power level 2    sclk: 80000 mclk: 112500 vddc: 1100 vddci: 950 pcie gen: 3
[  +0.000008]   status: c
[  +0.000006] switching to power state:
[  +0.000004]   ui class: none
[  +0.000004]   internal class: uvd
[  +0.000006]   caps: video
[  +0.000007] [drm]     uvd    vclk: 72000 dclk: 56000
[  +0.000005] [drm]             power level 0    sclk: 40000 mclk: 112500 vddc: 950 vddci: 950 pcie gen: 3
[  +0.000007] [drm]             power level 1    sclk: 40000 mclk: 112500 vddc: 950 vddci: 950 pcie gen: 3
[  +0.000007] [drm]             power level 2    sclk: 80000 mclk: 112500 vddc: 1100 vddci: 950 pcie gen: 3
[  +0.000007]   status: r

I am continuing the rest of the dmesg output in another post, since the whole piece looked like spam to this forum's automation...

And immediately following it:

[  +6.543766] amdgpu 0000:1b:00.0: amdgpu: GPU fault detected: 146 0x0f8c440c
[  +0.000013] amdgpu 0000:1b:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00100FFC
[  +0.000005] amdgpu 0000:1b:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C04400C
[  +0.000005] amdgpu 0000:1b:00.0: amdgpu: VM fault (0x0c, vmid 6) at page 1052668, read from '' (0x00000000) (68)
[  +0.000011] amdgpu 0000:1b:00.0: amdgpu: GPU fault detected: 146 0x0fcc440c
[  +0.000004] amdgpu 0000:1b:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010100D
[  +0.000003] amdgpu 0000:1b:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C04400C
[  +0.000004] amdgpu 0000:1b:00.0: amdgpu: VM fault (0x0c, vmid 6) at page 1052685, read from '' (0x00000000) (68)
[  +0.000008] amdgpu 0000:1b:00.0: amdgpu: GPU fault detected: 146 0x0fac440c
[  +0.000004] amdgpu 0000:1b:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101061
[  +0.000003] amdgpu 0000:1b:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C00400C
[  +0.000003] amdgpu 0000:1b:00.0: amdgpu: VM fault (0x0c, vmid 6) at page 1052769, read from '' (0x00000000) (4)
[  +0.000008] amdgpu 0000:1b:00.0: amdgpu: GPU fault detected: 146 0x0fec440c
[  +0.000003] amdgpu 0000:1b:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101094
[  +0.000003] amdgpu 0000:1b:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C04400C
[  +0.000003] amdgpu 0000:1b:00.0: amdgpu: VM fault (0x0c, vmid 6) at page 1052820, read from '' (0x00000000) (68)
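
A quick way to cross-check these fault lines: the page number in each "VM fault" line is the decimal form of the preceding VM_CONTEXT1_PROTECTION_FAULT_ADDR value. The sketch below verifies that on an excerpt of the entries above (STATUS lines omitted) and converts the page to a byte address, assuming 4 KiB GPU pages. That page size is an assumption, not something stated in the log.

```python
import re

ADDR_RE = re.compile(r"VM_CONTEXT1_PROTECTION_FAULT_ADDR\s+0x([0-9A-Fa-f]+)")
PAGE_RE = re.compile(r"VM fault \(0x[0-9a-f]+, vmid (\d+)\) at page (\d+), (read|write)")

ASSUMED_PAGE_SHIFT = 12  # assumption: 4 KiB GPU pages

log = """\
[  +0.000013] amdgpu 0000:1b:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00100FFC
[  +0.000005] amdgpu 0000:1b:00.0: amdgpu: VM fault (0x0c, vmid 6) at page 1052668, read from '' (0x00000000) (68)
[  +0.000004] amdgpu 0000:1b:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010100D
[  +0.000004] amdgpu 0000:1b:00.0: amdgpu: VM fault (0x0c, vmid 6) at page 1052685, read from '' (0x00000000) (68)
"""


def parse_faults(text):
    """Pair each FAULT_ADDR register value with the VM fault line that follows it."""
    faults = []
    pending_addr = None
    for line in text.splitlines():
        m = ADDR_RE.search(line)
        if m:
            pending_addr = int(m.group(1), 16)
            continue
        m = PAGE_RE.search(line)
        if m and pending_addr is not None:
            faults.append({
                "vmid": int(m.group(1)),
                "page": int(m.group(2)),
                "access": m.group(3),
                "reg_addr": pending_addr,
            })
            pending_addr = None
    return faults


faults = parse_faults(log)
for f in faults:
    byte_addr = f["page"] << ASSUMED_PAGE_SHIFT
    print(f"vmid {f['vmid']} {f['access']} page {f['page']} "
          f"(reg {f['reg_addr']:#x}, byte address {byte_addr:#x})")
```

Feeding a full dmesg capture through parse_faults makes it easier to see that the faulting pages climb steadily within one vmid rather than being scattered.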

This is really weird, since for two days there were no problems whatsoever with video. The same set of video files had actually been playing continuously earlier today as well.

Third installment coming...

The final piece, immediately following, is:

[  +0.000008] amdgpu 0000:1b:00.0: amdgpu: GPU fault detected: 146 0x0fac480c
[  +0.000003] amdgpu 0000:1b:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001010C2
[  +0.000003] amdgpu 0000:1b:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C04400C
[  +0.000003] amdgpu 0000:1b:00.0: amdgpu: VM fault (0x0c, vmid 6) at page 1052866, read from '' (0x00000000) (68)
[  +0.000007] amdgpu 0000:1b:00.0: amdgpu: GPU fault detected: 146 0x0f8c480c
[  +0.000003] amdgpu 0000:1b:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001010E8
[  +0.000003] amdgpu 0000:1b:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C04400C
[  +0.000003] amdgpu 0000:1b:00.0: amdgpu: VM fault (0x0c, vmid 6) at page 1052904, read from '' (0x00000000) (68)
[  +0.000007] amdgpu 0000:1b:00.0: amdgpu: GPU fault detected: 146 0x0fcc040c
[  +0.000003] amdgpu 0000:1b:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101115
[  +0.000003] amdgpu 0000:1b:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C00800C
[  +0.000003] amdgpu 0000:1b:00.0: amdgpu: VM fault (0x0c, vmid 6) at page 1052949, read from '' (0x00000000) (8)

Hopefully won't look like spam to this forum.

But it did. Another installment in line...

Hopefully final one

[  +0.000007] amdgpu 0000:1b:00.0: amdgpu: GPU fault detected: 146 0x0f8c080c
[  +0.000003] amdgpu 0000:1b:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010113F
[  +0.000003] amdgpu 0000:1b:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C00800C
[  +0.000004] amdgpu 0000:1b:00.0: amdgpu: VM fault (0x0c, vmid 6) at page 1052991, read from '' (0x00000000) (8)
[  +0.000006] amdgpu 0000:1b:00.0: amdgpu: GPU fault detected: 146 0x0e0c480c
[  +0.000004] amdgpu 0000:1b:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010116C
[  +0.000003] amdgpu 0000:1b:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C00800C
[  +0.000003] amdgpu 0000:1b:00.0: amdgpu: VM fault (0x0c, vmid 6) at page 1053036, read from '' (0x00000000) (8)
[  +0.000006] amdgpu 0000:1b:00.0: amdgpu: GPU fault detected: 146 0x0e2c040c
[  +0.000004] amdgpu 0000:1b:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101195
[  +0.000003] amdgpu 0000:1b:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C04800C
[  +0.000003] amdgpu 0000:1b:00.0: amdgpu: VM fault (0x0c, vmid 6) at page 1053077, read from '' (0x00000000) (72)
[  +0.530796] [drm] Fence fallback timer expired on ring sdma0
[Nov 7 22:04] switching from power state:
[  +0.000010]   ui class: none
[  +0.000003]   internal class: uvd
[  +0.000005]   caps: video
[  +0.000004] [drm]     uvd    vclk: 72000 dclk: 56000
[  +0.000004] [drm]             power level 0    sclk: 80000 mclk: 112500 vddc: 1100 vddci: 950 pcie gen: 3
[  +0.000006] [drm]             power level 1    sclk: 80000 mclk: 112500 vddc: 1100 vddci: 950 pcie gen: 3
[  +0.000004] [drm]             power level 2    sclk: 80000 mclk: 112500 vddc: 1100 vddci: 950 pcie gen: 3
[  +0.000004]   status: c
[  +0.000004] switching to power state:
[  +0.000002]   ui class: performance
[  +0.000002]   internal class: none
[  +0.000004]   caps:
[  +0.000003] [drm]     uvd    vclk: 0 dclk: 0
[  +0.000003] [drm]             power level 0    sclk: 30000 mclk: 15000 vddc: 825 vddci: 900 pcie gen: 3
[  +0.000004] [drm]             power level 1    sclk: 40000 mclk: 112500 vddc: 950 vddci: 950 pcie gen: 3
[  +0.000003] [drm]             power level 2    sclk: 80000 mclk: 112500 vddc: 1100 vddci: 950 pcie gen: 3
[  +0.000004]   status: r

Hope it helps.

IMO, this amdgpu bug is just the tip of the iceberg. I have been on Fedora since Fedora 38, installed in October 2023 on this computer with an AMD A10-7850K Radeon R7 4C+8G (4) @ 3.700GHz CPU, and every Linux kernel since 6.5.6 has had some kind of radeon/amdgpu crash or issue happening almost daily. I duly reported them, but the reports were ignored and expired, in the hope that the next Fedora release would be better (it wasn't). Curiously, kernel 6.8.7 is free of all of those bugs (nothing visible in dmesg or the system logs). It was the kernel shipped with the initial release of Fedora 40, so I stuck with it, because every attempt to upgrade to the currently available kernel triggered one amdgpu issue or another. I have no idea how things were prior to 6.5.6: that computer ran Windows until a year ago, and 6.5.6 was the very first Linux kernel ever to run on it.

Not wanting to direct any developer, just hoping this information will be useful in their efforts to get rid of amdgpu issues.

Upgrading motherboard is not an option. If I have to run an obsolete OS, so be it. Not so big a deal.

I have another computer with an AMD Phenom II X6 1055T (6) @ 2.800GHz CPU and a Radeon 3000 integrated GPU, and I never had any issue running Debian 9, 10, 11, and 12 on it. For reference, Debian 11 was running kernel 5.10, which is still stable on Debian 12. Debian 12 shipped with kernel 6.1, but I tested the 6.9/6.10 backports as well. The only freezes I experienced, two of them, were with the 6.10 branch, so I now avoid it.

> Upgrading motherboard is not an option.

@i300220 It is worth noting that some of the older GPUs, like the HD 7750, are still relevant because they can drive six displays simultaneously. This is an important feature in video distribution setups and multi-display workstations. Six to eight displays is not unheard of.

Got another instance of the GPU Fault error just now while playing video using vaapi-copy with mpv.

[  +8.366295] amdgpu 0000:1b:00.0: amdgpu: IH ring buffer overflow (0x00006B90, 0x00009CC0, 0x00006BA0)
[  +0.000016] amdgpu 0000:1b:00.0: amdgpu: GPU fault detected: 147 0x03cb0802
[  +0.000006] amdgpu 0000:1b:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101C01
[  +0.000004] amdgpu 0000:1b:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B048002
[  +0.000004] amdgpu 0000:1b:00.0: amdgpu: VM fault (0x02, vmid 5) at page 1055745, write from '' (0x000000000) (72)
[Nov 9 03:38] [drm] Fence fallback timer expired on ring sdma0

This time, I was using the modesetting driver with Xorg. In the previous report, I was using the amdgpu driver with Xorg.

I have posted the patch for review - https://patchwork.freedesktop.org/patch/623871/

Also, you are getting excellent feedback.

@arunpravin24 Is there a test we could perform on a live system to debug further?

DuckDuckGo brought me here.

Just moved from Debian (Bookworm) to Fedora (41).

My Radeon HD 8570 (GCN 1.0) is also experiencing the same issue.

Running mpv with --hwdec=auto produces an error:

Cannot load libcuda.so.1
amdgpu: The CS has been rejected, see dmesg for more information (-22).
Aborted (core dumped)

dmesg:

[drm:amdgpu_uvd_cs_pass2 [amdgpu]] *ERROR* msg/fb buffer ff00b7e000-ff00b80000 out of 256MB segment!

I was unable to launch video playback at all.

But it is fine on Debian, although I didn't check dmesg.

Weird. I've had the problem with unpatched (see above) kernels 6.11.5, 6.11.7, and 6.11.9, on Debian trixie/sid.

Debian Bookworm is using kernel 6.1

Fedora 41 is using kernel 6.11

Here's a screenshot of mpv on the Debian live CD.

There is no error in dmesg on Debian.

Maybe I should go back to Debian.

This regression started with the 6.10 kernel series and persists in newer versions. I am staying on the 6.9.x releases, which have no HD decode crashes on GCN 1.0 GPUs.

Yes I know, I have read all the posts here.

I'm just confirming that I'm using Debian Bookworm, not trixie/sid, to avoid confusion.

Thanks anyway.

I hope we find a solution.

I compiled LTS kernels 5.4.286 and 4.19.324. The problem disappears in those kernels. In 5.4, there's still a lockup if dpm=1 is enabled. In 4.19, there's been more stability with dpm=1.

For now, until this issue is fixed, I am using the LTS kernel from kwizart on my Fedora machine. Sorry for not helping.

I'm comparing the kwizart 6.6.63-200.fc40.x86_64 kernel with the 6.8.7-300.fc40.x86_64 kernel I've been running here on Fedora 40, which is very stable. The sole problem at boot is shown below, but it eventually fixes itself and the module loads properly.

systemd[1]: systemd-modules-load.service: Failed with result 'exit-code'.
systemd[1]: Failed to start systemd-modules-load.service - Load Kernel Modules.
systemd-modules-load[255]: Failed to insert module 'snd_pcm': Invalid argument

$ journalctl -b -u systemd-modules-load
nov 26 04:56:43 systemd-modules-load[268]: modprobe: FATAL: Module snd-seq not found in directory /lib/modules/6.6.63-200.fc40.x86_64

Fixed itself anyway

$ lsmod | grep snd_pcm
snd_pcm               180224  7 snd_hda_codec_hdmi,snd_hda_intel,snd_oxygen_lib,snd_hda_codec,snd_aloop,snd_hda_core
snd_timer              49152  4 snd_seq,snd_hrtimer,snd_aloop,snd_pcm
snd                   147456  30 snd_seq,snd_seq_device,snd_hda_codec_hdmi,snd_hwdep,snd_hda_intel,snd_oxygen_lib,snd_hda_codec,snd_timer,snd_virtuoso,snd_mpu401_uart,snd_aloop,snd_pcm,snd_rawmidi

@samuelken Thanks for that excellent idea.

Please try the attached patch. v2-0001-drm-amdgpu-Fix-UVD-contiguous-CS-mapping-problem.patch

IT WORKS!

I built Fedora kernel with this patch.

Hardware acceleration works on mpv without any errors on dmesg.

Hardware: Radeon HD 8570 (GCN 1.0)
OS: Fedora 41
Kernel: Linux fedora 6.11.10-300.uvd_test_kernel.fc41.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Nov 27 10:50:04 XXX 2024 x86_64 GNU/Linux

● Video  --vid=1               (h264 1280x720 30 fps) [default]
● Audio  --aid=1  --alang=eng  (aac 2ch 44100 Hz 128 kbps) [default]
Cannot load libcuda.so.1
Using hardware decoding (vaapi).
AO: [pipewire] 44100Hz stereo 2ch floatp
VO: [gpu] 1280x720 vaapi[nv12]
AV: 00:09:11 / 01:32:07 (10%) A-V:  0.000
Exiting... (Quit)

Thanks @arunpravin24 !!!

I can confirm that recompiling the XanMod version of kernel 6.12.1 with the patch resolved the bug.

mpv --hwdec=vaapi --vo=gpu-next ~/Videos/I\ Am\ Legend\ -\ Trailer.mp4
[ffmpeg/demuxer] mov,mp4,m4a,3gp,3g2,mj2: stream 0, timescale not set
● Video  --vid=1  (h264 1920x816 23.976 fps) [default]
○ Image  --vid=2  (mjpeg)
● Audio  --aid=1  (aac 6ch 48000 Hz 258 kbps) [default]
File tags:
 Artist: Warner Bros.
 Date: 2007
 Genre: Science-Fiction
 Title: I Am Legend - Trailer
[vo/gpu-next/wayland] Unable to set DRM atomic cap: Operation not supported
Using hardware decoding (vaapi).
AO: [pipewire] 48000Hz 5.1 6ch floatp
VO: [gpu-next] 1920x816 vaapi[nv12]
AV: 00:02:00 / 00:02:03 (98%) A-V:  0.000 Dropped: 5
Exiting... (Quit)
Execution time: 02m:02s

uname -a
Linux eos 6.12.1-x64v2-xanmod1-1-edge #1 SMP PREEMPT_DYNAMIC Thu, 28 Nov 2024 00:30:19 +0000 x86_64 GNU/Linux

I just want to report that I tried your patch and it worked on my system. Thank you!

VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Hawaii XT / Grenada XT [Radeon R9 290X/390X]
OS: Debian GNU/Linux trixie/sid
Kernel: Linux titan 6.11.10--bymattia3 #7 SMP PREEMPT_DYNAMIC Fri Nov 29 01:51:17 CET 2024 x86_64 GNU/Linux

I used the latest Debian kernel available.

I was finally able to reproduce the issue and worked on Arun's patch a bit.

The first version was already pretty close to solving it, while the second just reverted the feature that initially caused the problem.

Here is an updated patch, please test it.

0001-drm-amdgpu-fix-UVD-contiguous-CS-mapping-problem.patch

Tested good with my Radeon 520 playing a combination of AV1, AVC, HEVC, and VC-1 format videos via VA-API (mpv).

Works fine without errors. Thanks.

Radeon HD 8570.

Failed on latest patched kernel, on Radeon HD 7800

mpv --hwdec=vaapi --vo=gpu-next ~/Videos/I\ Am\ Legend\ -\ Trailer.mp4
[ffmpeg/demuxer] mov,mp4,m4a,3gp,3g2,mj2: stream 0, timescale not set
● Video  --vid=1  (h264 1920x816 23.976 fps) [default]
○ Image  --vid=2  (mjpeg)
● Audio  --aid=1  (aac 6ch 48000 Hz 258 kbps) [default]
File tags:
 Artist: Warner Bros.
 Date: 2007
 Genre: Science-Fiction
 Title: I Am Legend - Trailer
[vo/gpu-next/wayland] Unable to set DRM atomic cap: Operation not supported
amdgpu: The CS has been rejected, see dmesg for more information (-22).
[1]    7382 IOT instruction (core dumped)  mpv --hwdec=vaapi --vo=gpu-next ~/Videos/I\ Am\ Legend\ -\ Trailer.mp4

uname -a
Linux eos 6.12.1-x64v2-xanmod2-2-edge #1 SMP PREEMPT_DYNAMIC Sat, 30 Nov 2024 05:00:25 +0000 x86_64 GNU/Linux

vainfo
Trying display: wayland
vainfo: VA-API version: 1.22 (libva 2.22.0)
vainfo: Driver version: Mesa Gallium driver 25.0.0-devel for AMD Radeon HD 7800 Series (radeonsi, pitcairn, ACO, DRM 3.59, 6.12.1-x64v2-xanmod2-2-edge)
vainfo: Supported profile and entrypoints
      VAProfileMPEG2Simple            : VAEntrypointVLD
      VAProfileMPEG2Main              : VAEntrypointVLD
      VAProfileVC1Simple              : VAEntrypointVLD
      VAProfileVC1Main                : VAEntrypointVLD
      VAProfileVC1Advanced            : VAEntrypointVLD
      VAProfileH264ConstrainedBaseline: VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointVLD
      VAProfileH264High               : VAEntrypointVLD
      VAProfileNone                   : VAEntrypointVideoProc

_ dmesg G amdgpu
[   33.571795] [drm:amdgpu_uvd_cs_pass2 [amdgpu]] *ERROR* msg/fb buffer ff0093a000-ff0093c000 out of 256MB segment!
[   64.938562] amdgpu 0000:01:00.0: amdgpu: Disabling VM faults because of PRT request!
[  104.072735] [drm:amdgpu_uvd_cs_pass2 [amdgpu]] *ERROR* msg/fb buffer ff0093a000-ff0093c000 out of 256MB segment!

Please double check that you correctly applied the patch. When two people report back that the patch works it is really unlikely that it still breaks for you.

I applied the patch to 6.11.10-300.fc41.x86_64. Using AMD A10-7870K Radeon R7. Checked with multiple H264 videos in mpv using HW acceleration. Playback was successful, and no kernel errors were reported.

@ckoenig I will do another kernel build and test it. I will report back.

mentioned in commit agd5f/linux@6f4b6d55

mentioned in commit agd5f/linux@12f325bc

20241210 is now out and I notice it includes an update to DCN314 (https://gitlab.com/kernel-firmware/linux-firmware/-/commit/209c18b0e7cd2de304ad11c1042b085429fab1b4). Is this issue resolved or is it still necessary to downgrade that firmware file?

Woops wrong issue

Just tried the Linux 6.13-rc3 kernel announced yesterday which includes the above patch.

Tested with kernel from https://copr.fedorainfracloud.org/coprs/g/kernel-vanilla/mainline-wo-mergew/

Works fine.

Thanks @arunpravin24 @ckoenig

Edit:

Can't wait for Fedora to build kernel 6.12.6, which includes the above patch. https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.12.6
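
For anyone checking whether a given kernel should already carry the fix, here is a rough sketch that parses a kernel release string and compares it against 6.12.6, the stable release mentioned above. It assumes only releases at or after 6.12.6 include the patch, so it cannot see distro backports or hand-patched builds, and it treats every 6.13 release candidate alike even though the report above only covers -rc3.

```python
import re

FIXED_IN = (6, 12, 6)  # stable release reported above as containing the patch


def parse_release(release):
    """Extract (major, minor, patch) from strings like '6.12.6-200.fc41.x86_64'."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = m.groups()
    return int(major), int(minor), int(patch or 0)


def likely_has_fix(release):
    """Assumption: only upstream releases >= 6.12.6 carry the fix."""
    return parse_release(release) >= FIXED_IN


# Release strings taken from reports in this thread.
for rel in ("6.11.10-300.fc41.x86_64",
            "6.12.1-x64v2-xanmod2-2-edge",
            "6.12.6-200.fc41.x86_64"):
    print(rel, "->", likely_has_fix(rel))
```

On a live system, platform.release() from the Python standard library returns the string to pass in.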

Can confirm hardware acceleration (VA-API) works again with AMD Tobago PRO [Radeon R7 360 / R9 OEM] on a patched 6.12.6-1-liquorix-amd64 kernel.

Fedora just (50 minutes ago) pushed kernel 6.12.6 to the stable repository.

So (my) problem is solved (tested).

Linux fedora 6.12.6-200.fc41.x86_64 # 1 SMP PREEMPT_DYNAMIC Thu Dec 19 21:06:34 UTC 2024 x86_64 GNU/Linux

Fixed for me as well. Linux fedora 6.12.6-200.fc41.x86_64; AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G (4) @ 3.70 GHz.

Thanks @ckoenig @arunpravin24

Can confirm that the problem is solved and VA-API works. Kernel: 6.12.6-arch1-1. I appreciate all of you who have contributed to fixing this problem!

closed
