I noticed the issue with a 6.2 kernel but it persists in 6.3. An attempt to play a 4K H.265 HDR10+ video file with the mpv player and HW acceleration enabled leads to the entire desktop immediately becoming very laggy. After some time, perhaps 30 seconds at most, the GPU crashes with the backtrace listed below. The driver then tries to recover the GPU without success until the computer is restarted.
I spoke too soon. Changing vo=gpu-next to vo=gpu does not really fix the problem. What does help is switching the video profile from profile=gpu-hq to profile=default. gpu-hq enables some additional and more expensive postprocessing filters etc. but in any case it shouldn't crash the GPU.
Not an AMD dev but judging by the backtrace, do you have any runtime power management tweaks enabled? I recall that Fedora comes with the tuned service that controls PM. Any chance you can disable it or force it not to apply any power saving features?
Here is a log with 6.3.1 and the patch applied. I left the computer running for a while once it got to the compromised state to make sure that at least something makes it into the log.
Are you sure it's applied? That should have cleared up the gmc_v_9_0 irq_put warning (the others are separate).
Nonetheless the warnings are red herrings because they only happen because the GPU reset is attempted and fails. The real problem is the first ring timeout:
My apologies. I got distracted and didn't modify the building script correctly. This time I made sure that the patch got applied. FTR, the relevant part of the code now looks like this:
static int gmc_v9_0_hw_fini(void *handle)
{
    struct amdgpu_device *adev = (struct amdgpu_device *)handle;

    gmc_v9_0_gart_disable(adev);

    if (amdgpu_sriov_vf(adev)) {
        /* full access mode, so don't touch any GMC register */
        DRM_DEBUG("For SRIOV client, shouldn't do anything.\n");
        return 0;
    }

    /*
     * Pair the operations did in gmc_v9_0_hw_init and thus maintain
     * a correct cached state for GMC. Otherwise, the "gate" again
     * operation on S3 resuming will fail due to wrong cached state.
     */
    if (adev->mmhub.funcs->update_power_gating)
        adev->mmhub.funcs->update_power_gating(adev, false);

    amdgpu_irq_put(adev, &adev->gmc.vm_fault, 0);

    return 0;
}
I assume this is correct...?
I ran the test again with the patch applied and amd_iommu=off in the kernel command line. The GPU still gets stuck in an unrecoverable state and I seem to be getting the same warnings as before.
Furthermore, it seems that reducing the fanciness of mpv's postprocessing filters greatly reduces the likelihood of this crash but doesn't completely eliminate it.
The other SDMA warning is known right now and mentioned in a few issues. There isn't a patch for it yet, but as I mentioned before, it's a red herring.
As you mentioned this seems to be a new issue with 6.2, is it possible that you also happened to upgrade GPU firmware around the time it showed up? If so; could you revert to older GPU firmware to see if it improves?
Is the GPU firmware a part of what Arch Linux ships as the linux-firmware (https://archlinux.org/packages/core/any/linux-firmware/) package? The last update was on 23-04-04, apparently. I can roll back to 23-03-10 and see how that goes.
Mind you, I don't have the same GPU as OP, but I get a very similar crash and for me, this has also started occurring recently (guessing around the release of 6.2 on Fedora).
I manually built an older version of the FW package (20230310). This time I had to play the test video twice to get the GPU to crash. I'm not sure if that's significant because there is no specific point in the video that would trigger the crash so maybe it was just a fluke. Assuming that I'm doing the right thing should I go back to an even older FW?
It might not be caused by the firmware, but as both of your distros track latest kernel and latest firmware it is worth trying to identify which one caused it.
If you're sure it's kernel and not GPU firmware can either of you guys possibly bisect back to a point that it was stable to identify the first problematic commit?
I'll see what I can do over the weekend. TBH I'm not sure if 6.2 is really the first problematic kernel because playing super high quality videos is not something I'd regularly do. If it helps, disabling VAAPI decoding in MPV had no effect, dialing down the postprocessing quality did. I'll get back to you if I manage to come up with something bisecting. Thanks!
@madcatx1 I'll let you try it once as you seem to be able to reproduce the problem easily. In my case, I gotta let the system run for a while and it happens randomly (As far as I can see).
I'll try to play with it too this weekend though, to see if I can reproduce it more frequently/accurately.
One more piece of information before I attempt to bisect this. I've reverted my system to a "standard" configuration - meaning the latest available firmware package and no additional kernel options - and tried the following:
To make sure that this is indeed a regression that can be tracked down within a reasonable span of kernel versions, I switched to kernel 6.1 which Arch Linux conveniently packages as linux-lts. The exact version was 6.1.27. My test video played fine 3 times in a row. Then I booted back into 6.3.1, expecting a problem. Naturally, there was no crash and everything seemed to work okay. Then I realized that I probably had Firefox running every time I experienced the crash. The moment I launched Firefox while the video was still playing, the GPU froze. I've tried quite hard to reproduce this under 6.1 but 6.1 seems okay. The gpu-hq profile MPV setting was probably a false clue. This is all happening under Wayland if that makes any difference.
We have a winner. I used Greg's repository as the source of stable kernels (https://github.com/gregkh/linux/) and went over all commits between 6.1.9 and 6.2.0. Since that would be almost 17k commits, I narrowed down the scope only to commits that mentioned "drm/amd" in the commit message. I eventually bisected the problem to this particular change:
3f4c175d62d89819121cbbd5a0a30f4b80862025 drm/amdgpu: MCBP based on DRM scheduler (v9)
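For readers unfamiliar with the approach, the narrowing-then-bisecting idea described above can be sketched roughly as follows. This is only an illustration in Python, not the procedure that was actually used: `filter_commits`, `first_bad` and the `is_bad` callback are hypothetical names, and the sketch assumes a linear list of commits ordered oldest to newest in which every commit from the culprit onward is bad. In practice `git bisect` automates the search over the real commit graph.

```python
from typing import Callable, List, Tuple

Commit = Tuple[str, str]  # (hash, subject line); hypothetical representation


def filter_commits(commits: List[Commit], needle: str = "drm/amd") -> List[Commit]:
    """Keep only commits whose subject line mentions the given string."""
    return [c for c in commits if needle in c[1]]


def first_bad(commits: List[Commit], is_bad: Callable[[Commit], bool]) -> Commit:
    """Binary-search for the first bad commit.

    Assumes commits are ordered oldest -> newest, that the newest one is bad,
    and that once a commit is bad all later ones are bad too.
    """
    if not commits:
        raise ValueError("no candidate commits to search")
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid          # culprit is at mid or earlier
        else:
            lo = mid + 1      # culprit is after mid
    return commits[lo]
```

Here `is_bad` stands for whatever manual test is used, for example building that kernel and checking whether the GPU hangs. Filtering by subject line can miss the culprit if it lives outside the filtered subsystem, so it is a shortcut rather than a guarantee.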
Then I speculated that the preemption logic introduced by this commit could issue instructions to the GPU in some invalid order. I applied this patch:
@farchord If my understanding is correct, then this issue is specific to the GFX9 class of chips but your 6600M is a GFX10. Therefore, I think that you're dealing with a different problem.
The preemption discards low-priority IBs (in this case the HW-accelerated video) when a high-priority IB arrives, then resubmits the skipped IBs. It looks like the preemption/resubmission breaks some dependency in the video IB lists.
@MadCatX The workaround disables the preemption. We hope to find the guilty IB packages.
I am trying to reproduce it on my side but I cannot download the video file from mega.nz.
Is there any other video that could be used to reproduce the hang?
@JiadongZhu Thanks for looking into this. I figured that setting the threshold high enough would effectively disable the preemption.
I think that any high bitrate video would do because I can reproduce the problem with multiple video files. I'll try to get another shareable video for you, in the meantime, here is my setup that can reproduce the problem with almost 100 % reliability.
mpv player (0.35.1) set as follows:
vo=gpu-next
profile=gpu-hq
gpu-context=wayland
hwdec=auto
KDE 5.27.4, KWin_wayland
2 screens, internal laptop screen and an external screen, both FHD. (I think I could repro this even with just the laptop screen.)
To repro, I:
1. Start the video.
2. Launch Firefox and flick through the tabs a bit.
3. If the GPU doesn’t freeze, I close Firefox and launch it again.
Let me know if you need any more help.
Can you post the file somewhere else perhaps that can be accessible? AMD I/T blocks mega.nz
No problem. I can't reliably repro this on my system as it's not using the same GPU as OP. But I'm kinda hoping that you guys fixing this fixes it for me too (crosses fingers).
This issue is firmware related.
After a preemption happens, the KMD resets the preempt register with a write_data command, and the CP waits for mmCP_VMID_PREEMPT to read all zeros before it finishes the preemption. Sometimes the write_data command does not take effect.
The hang shows up when preemption happens more frequently.
root@amd-Majolica-RN:~# umr -r renoir.gfx930.mmCP_VMID_PREEMPT
gfx930.mmCP_VMID_PREEMPT => 0x0000ffff
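A minimal sketch of how such a umr readout could be checked automatically, assuming the `name => 0xVALUE` output format shown above. The helper names are hypothetical, and the only rule taken from this thread is that the register is expected to read all zeros once the preemption has finished; the meaning of the individual bits is not interpreted here.

```python
import re
from typing import List, Tuple

# Matches lines such as "gfx930.mmCP_VMID_PREEMPT => 0x0000ffff"
_UMR_LINE = re.compile(r"^\s*(\S+)\s*=>\s*(0x[0-9a-fA-F]+)\s*$")


def parse_umr_register(line: str) -> Tuple[str, int]:
    """Split a umr register line into (register name, integer value)."""
    m = _UMR_LINE.match(line)
    if m is None:
        raise ValueError(f"unrecognised umr output: {line!r}")
    return m.group(1), int(m.group(2), 16)


def set_bits(value: int) -> List[int]:
    """Return the indices of all bits that are set in value."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


def preemption_finished(value: int) -> bool:
    """Per the description above, the register should be all zeros when done."""
    return value == 0
```

For the readout above, `parse_umr_register` yields the value `0xffff`, so `preemption_finished` returns `False`, which matches the stuck state being described.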
The MEC version on Ubuntu 23.04 is 1d4:
root@amd-Majolica-RN:/home/strix# cat /sys/kernel/debug/dri/0/amdgpu_firmware_info |grep MEC
MEC feature version: 53, firmware version: 0x000001d4
You might try an older version of mec.bin (version 1d0 works on my side).
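To check which MEC firmware a machine is running, the debugfs output shown above can be parsed with a few lines of Python. This is a rough sketch that assumes the line format quoted in this thread; `mec_firmware_version` and `describe` are hypothetical helpers, and the two version notes only restate what was reported here for Renoir, not a general compatibility table.

```python
import re
from typing import Optional

# Matches "MEC feature version: 53, firmware version: 0x000001d4"
# (anchored at line start so a "MEC2 ..." line is not picked up).
_MEC_LINE = re.compile(
    r"^MEC feature version:\s*\d+,\s*firmware version:\s*(0x[0-9a-fA-F]+)",
    re.MULTILINE,
)

# Observations from this thread only (Renoir); anything else is unknown.
REPORTED = {
    0x1D4: "reported to hang with preemption in this thread",
    0x1D0: "reported working in this thread",
}


def mec_firmware_version(info_text: str) -> Optional[int]:
    """Extract the MEC firmware version from amdgpu_firmware_info text."""
    m = _MEC_LINE.search(info_text)
    return int(m.group(1), 16) if m else None


def describe(version: Optional[int]) -> str:
    """Turn a version number into a short human-readable note."""
    if version is None:
        return "MEC firmware version not found"
    return f"MEC 0x{version:x}: {REPORTED.get(version, 'no report in this thread')}"
```

The input text would typically come from reading /sys/kernel/debug/dri/0/amdgpu_firmware_info as root, as in the command above.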
I had the latest linux-firmware package installed which contains Renoir MEC firmware 1d4. I went back to an unpatched kernel and 1d0 MEC firmware (appears to be from Feb 2022) and the issue seems to be gone. Hopefully you can sort this out with the firmware guys.
The firmware issue is a timing issue that could be solved by this patch:
https://patchwork.freedesktop.org/series/118260/
You might have a try with the patch built in together with the latest firmware.
Getting the expected log output in dmesg when stressing the APU/'GPU'.
[ +3.437095] [drm] ring 0 timeout to preempt ib
[ +1.724453] [drm] ring 0 timeout to preempt ib
[ +0.667181] [drm] ring 0 timeout to preempt ib
[ +6.961722] [drm] ring 0 timeout to preempt ib
[ +11.361640] [drm] ring 0 timeout to preempt ib
[ +0.167330] [drm] ring 0 timeout to preempt ib
[ +5.683224] [drm] ring 0 timeout to preempt ib
[Jun12 14:25] [drm] ring 0 timeout to preempt ib
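If it helps anyone compare runs, the number of these messages can be counted from saved dmesg output with a small Python helper. This is only a sketch: `count_preempt_timeouts` is a hypothetical name and it matches the exact message text shown above, so a differently worded message on another kernel version would not be counted.

```python
def count_preempt_timeouts(dmesg_text: str, ring: int = 0) -> int:
    """Count '[drm] ring N timeout to preempt ib' messages for one ring."""
    needle = f"[drm] ring {ring} timeout to preempt ib"
    return dmesg_text.count(needle)
```

Called on the text saved from dmesg, it returns how many preemption timeouts were logged for the given ring.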
Currently testing with the 'Cube Diorama' from Blender's test files, using Blender 3.5.1.
The only thing I noticed is pretty hard cursor lagging, but I am pretty certain this is due to the integrated APU being completely overloaded.
My usual setup with 2x4k@60Hz via USB-C docking still works fine and does not crash the system.
Only issue is dmesg log is spammed with these drm messages and every time a new line is written the mouse lags / stutters for a few seconds (likely not caused by the logging but rather non functioning preemption).
So the underlying issue must still be fixed via amd-gpu-firmware update by AMD.
Though the issue here might be an edge case (changing the resolution of the Parsec host while being connected freezes the whole system for a few seconds until Parsec crashes, had to change the resolution multiple times, after three times both my screens turned (and stayed) black) and I found myself with the log output above.
Non professional (sorry):
Where do we have to escalate this issue so the firmware bugs will finally be addressed? This has been going on for many months and makes devices so unstable that you can't actually use them productively. I thought the patch was finally a good workaround but it's only thin band aid that falls apart after a few days.
I can't imagine the frustration of most end users that don't even know about this bug and have crashing machines all day long
The L2 protection fault errors are likely another issue that is not related to IB preemption. There are numerous issues similar to this one already reported here.
You're right @MadCatX, #2627 for example. Hmm, I judged too early due to *ERROR* ring gfx_low timeout. I will try to follow the more recent issue then, thanks for pointing that out.
At least the fix (patch above) is stable regarding all other cases which crashed the system before, which is a huge improvement.
My laptop, running the AMD APU 6800H, had a crash while the screen was in power savings. I'm running Gentoo with kernel: Linux lenny 6.3.4-gentoo-r1 #1 SMP Mon May 29 07:59:08 PDT 2023 x86_64 AMD Ryzen 7 6800H with Radeon Graphics AuthenticAMD GNU/Linux. There is a built-in Nvidia RTX 3060 which is disabled and used on occasion by QEMU/KVM, which is owned by the kernel driver: vfio-pci.
After the crash, X crashes and restarts and everything is okay. I have had something similar while playing Windows games on Proton on my media PC with my RX6650XT, but I haven't copied syslog, because I thought it was a Wine issue until it happened on my laptop.
Also, audio crashes permanently, but I'm not sure that's related.
I'm not sure this post belongs in this thread. If it doesn't, let me know and I'll open a new ticket.
Also, please let me know how I might improve the data for you guys.