Should I try only parameters with DPM in the name? Or others too (this GFXOFF seems kinda suspicious to me)?
BTW I tried amdgpu.runpm=0 as well and it didn't change anything. So can I rule out some of them already?
Probably makes sense to start with the DPM ones, but it could be some interaction between DPM and one of the other features. runpm is a separate feature. It just controls whether the GPU is runtime suspended (powered down) when it's idle to save power. The ppfeatures control power at runtime when the GPU is powered on.
I was running fine for a few days with mask amdgpu.ppfeaturemask=0xfffd3fff
Today I rebooted with mask amdgpu.ppfeaturemask=0xfffd7fff, which differs only in the bit PP_OVERDRIVE_MASK = 0x4000. After several hours (including a suspend to RAM followed by wakeup) I got the error:
[11397.145866] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
For several days I've been running with mask amdgpu.ppfeaturemask=0xffffbfff without any issues. It really seems that masking bit PP_OVERDRIVE_MASK = 0x4000 causes this issue.
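For anyone comparing the masks quoted in this thread, here is a minimal Python sketch that lists which feature bits a given amdgpu.ppfeaturemask value leaves cleared. PP_OVERDRIVE_MASK = 0x4000 is taken from the comment above; the other bit values are my recollection of enum PP_FEATURE_MASK in amd_shared.h and should be treated as assumptions to verify against your own kernel tree. The helper name cleared_features is made up for this example.

```python
# Bit values as recalled from enum PP_FEATURE_MASK in amd_shared.h.
# Assumption: verify against your kernel source before relying on them.
PP_FEATURE_BITS = {
    "PP_SCLK_DPM_MASK": 0x1,
    "PP_MCLK_DPM_MASK": 0x2,
    "PP_PCIE_DPM_MASK": 0x4,
    "PP_SOCCLK_DPM_MASK": 0x1000,
    "PP_DCEFCLK_DPM_MASK": 0x2000,
    "PP_OVERDRIVE_MASK": 0x4000,
    "PP_GFXOFF_MASK": 0x8000,
    "PP_STUTTER_MODE": 0x20000,
}


def cleared_features(mask):
    """Return the names of the listed features whose bit is 0 in mask."""
    return [name for name, bit in PP_FEATURE_BITS.items() if not mask & bit]


# The masks mentioned so far in this thread.
for mask in (0xfffd3fff, 0xfffd7fff, 0xffffbfff, 0xfffffffe):
    names = cleared_features(mask) or "none of the listed features"
    print(f"{mask:#010x} clears: {names}")
```

Under these assumed bit values, 0xfffd7fff has the 0x4000 bit set while 0xfffd3fff and 0xffffbfff have it cleared, which is worth keeping in mind when reading the reports here.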
I think my wife and I are hitting this. We're using AMDGPU with 6700xt cards.
Fedora 36
5.19.15-201.fc36.x86_64
Mesa 22.1.7-1
Fedora 37
5.19.13-300.fc37.x86_64
Mesa 22.2.0-7
Before I found this thread I started testing forcing high performance. I assume we should try the kernel command-line argument amdgpu.ppfeaturemask=0xfffd3fff as well?
Having the same problem with linux 6.0.2.arch1-1 and AMD 5600xt. Trying out the remedy in the post, hopefully will get back to this post in a couple of days, because it started to annoy me.
Edit1: just crashed again, even with the forced "echo 'high'" remedy applied. Will try the kernel setting.
Edit2: amdgpu.ppfeaturemask=0xffffbfff has also worked for me. Running with this kernel setting 2 days in a row without any crashes now.
I'm also getting this with the AMD 5600XT & kernel 6.0.3 and mesa-git on Arch Linux. I was using the "3D Fullscreen" power profile when this occurred (set through corectrl). I have booted with those flags, which removes the ability to set the profile (as it changed the boot flags corectrl uses, which is amdgpu.ppfeaturemask=0xffffffff).
Not to pile on, but I've been hitting this as well after a recent hardware upgrade. I have an RX 5700 (PowerColor Red Dragon Radeon RX 5700) and have been periodically having desktop freezes/crashes with "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered" printed in the kernel log.
This is with kernel 6.0.2 and (if relevant) xf86-video-amdgpu 22.0.0 and mesa 22.1.7 on Gentoo. This is occurring on a new PC with a Ryzen 7950X on an X670E motherboard. I was previously using this same video card in another computer with a 3900X and X570, I believe running kernel 5.19. It was stable on that system; I didn't start encountering these crashes until moving to the new system w/ the newer kernel.
I've tried running with both amdgpu.ppfeaturemask=0xffffbfff and amdgpu.ppfeaturemask=0xfffd3fff, but neither seems to have made a difference.
Happy to provide any additional info that may help with troubleshooting.
@nitro322 I crashed like a dozen times yesterday before I added amdgpu.ppfeaturemask=0xffffbfff. I'm also on the latest mesa-git and using Wayland (sway). I haven't crashed since, maybe try the latest? I'm also on xf86-video-amdgpu and was crashing on Chrome and apps that used OpenGL.
@joshuataylorx @nitro322 same here. When it started crashing I switched to Wayland to try it out. Currently the kernel setting is on and it hasn't crashed for some time now. However I still feel a bit of clunkiness and intermittent minimal hangups for unknown reasons.
(have been out of town since posting my previous comment)
Appreciate the suggestions regarding Wayland. I'm still running Xorg w/ KDE Plasma 5.25. I tried switching to Wayland, but discovered that screen sharing doesn't work in Wayland under Teams, which I require for work. So, I'm stuck on Xorg for the foreseeable future.
A bit off-topic. I had a similar situation with Teams & screen-sharing. I solved it by compiling minimal ungoogled-chromium with the screencast flag (using pipewire as the screen grabbing backend) and it works wonders under Wayland. I had to create Teams as a Chromium app in its own window with an icon shortcut in my dock for this, but it's indistinguishable from the regular Teams client (which is IIRC built on top of Electron). The only little annoyance is that it requires login + 2FA after a restart the next day, which the Electron version does not. (I was able to fix it by allowing 3rd party cookies.) Enabling WebRTCPipeWireCapturer in chrome://flags or using the startup option --enable-features is required.
I was plagued with this issue on a 5600xt, with this error happening within seconds of starting a game. I tried these various ppfeaturemasks and they seemed to help at times. However, I think I've narrowed it down to a power supply / power connector issue. I reseated the GPU in the PCIe slot, then removed and reattached the cables from the PSU. So far it has been going without a crash for 3 days, even with the latest kernel.
Well, just yesterday I removed my video card while changing CPU coolers, so I'm in the same situation now. Will report back if it seems to make any difference.
@nitro322 did it make a difference? It looks like the 7950X or AM5 is definitely doing something, since I'm also running a 7950X, with a 6800XT, but on B650, and otherwise, I'm in the same boat as you: I've tried high DPM performance level as well as a variety of mask values and none seem to help. I have also reseated and reconnected the power connector numerous times to swap to my backup graphics card (Intel A380). I'm going to try a PSU swap to see if that helps.
It seems like it may have. I haven't had any more crashes since that post. I'm currently running with amdgpu.ppfeaturemask=0xfffd3fff, but I had that in place before my last post and was still seeing periodic crashes.
Let me try removing amdgpu.ppfeaturemask=0xfffd3fff and run another few days, see what happens. Will report back.
Edit: One other note I should mention - I've also been running with video=3440x1400@120 since around the same time to deal with a high power consumption issue in amdgpu (mentioned here: #1301 (closed)). I wouldn't think it's related to this, but sharing just in case.
PSU swap seemed to help, but it started crashing again shortly after. I had an XMP profile active, so I reset to JEDEC speeds now, let's see if that helps. I'm still running 0xffffffff, if things remain broken I'll give 0xfffd3fff a try as well.
@joshuataylorx thanks, I'll try that next. I decided to give echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level another shot since I hadn't done that since swapping PSUs and disabling XMP. It's actually stable so far. At this point my bar is so low that if it only crashes once a day I'll take it...
@joshuataylorx interestingly, I actually see pretty much the same power consumption with high set, it may be because I have a reference card with conservative cooling, clocks, and power limits. And high is actually the most stable it has been yet. So I think the issue ultimately lies with the DPM power saving features after all.
So it seems the way high works is it just jumps between 0 and the max frequency
But it doesn't seem to have an impact on power usage: my idle usage was exactly the same with stock settings, and peak usage is capped at the power limit.
I decided to tweak the power_dpm_force_performance_level setting by switching high and low. After 2-3 switches, I had a crash the instant I shifted down to low. It seems there is definitely a bug with frequency scaling. @joshuataylorx I'm now giving your solution of setting min to max on corectrl a try, had to enable PP_OVERDRIVE_MASK for this of course. I expect things to still remain stable, since the bug seems to be the result of dynamic graphics clock frequency scaling, which points to masking PP_SCLK_DPM_MASK as being the final workaround. I will verify this is indeed the case, CC @agd5f
@agd5f I can reproduce this crash with power_dpm_force_performance_level=high set. Does that mean the problem (at least the one I'm seeing) is not in the DPM code?
Can you try with power_dpm_force_performance_level=low
Will try that if I get a crash with masking PP_SCLK_DPM i.e. 0xfffffffe (so far 5 days using this with no gfx timeout, but I've had it take longer than that before).
@rocketraman also check with cat /sys/class/drm/card0/device/power_dpm_force_performance_level after you echo into /sys/class/drm/card0/device/power_dpm_force_performance_level to ensure it's set properly. I was seeing in some cases that the value did not change, which could be because I had corectrl running in the background
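The echo-then-cat check described above can be wrapped in a small helper so the read-back is never skipped. This is only a sketch: set_and_verify is a hypothetical name, the card0 path is the one quoted in this thread and may be card1 on some systems, and writing to it requires root.

```python
from pathlib import Path

# Path as quoted in this thread; the card index may differ on your system.
SYSFS_LEVEL = "/sys/class/drm/card0/device/power_dpm_force_performance_level"


def set_and_verify(level, path=SYSFS_LEVEL):
    """Write level to the file, read it back, and report whether it stuck.

    Returns (ok, actual), where actual is the value read back afterwards.
    """
    p = Path(path)
    p.write_text(level + "\n")
    actual = p.read_text().strip()
    return actual == level, actual
```

Calling set_and_verify("high") as root and getting back (False, "auto"), for example, would suggest that something else (such as corectrl running in the background) is overriding the setting.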
Just had my first crash since removing amdgpu.ppfeaturemask=0xfffd3fff 4 days ago. Same "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered" error. "soft recovered", to be clear, does not mean my desktop session recovered. Had to kill X11 to get it usable again.
I'm back to running with "video=3440x1400@120 amdgpu.ppfeaturemask=0xfffd3fff" now. That was stable for me for over a week.
Can you try with power_dpm_force_performance_level=low as well?
I tested with power_dpm_force_performance_level set to low (with feature mask 0xffffffff), and the system crashed again. Same error messages, with the *ERROR* ring kiq_2.1.0 test failed (-110) message and everything.
If that doesn't help, then your issue would not likely be related to dynamic clocks.
Should I open a new issue then? And what should I try to debug/investigate next?
Can not absolutely confirm but yeah, might be. I haven't had any crashes in lts but again I haven't really used it under load. At the very least it's usable. I was getting constant crashes in v6, it was pretty unstable for me.
I have this crash with kernel 6.1.0-rc3 and an ASUS Radeon RX 6600 XT. The dmesg logs look just like the OP's, except that I also see a buffer underflow, use-after-free error from the kernel after the GPU reset:
I'd like to help narrow down this problem because this issue is severe, but need guidance in terms of understanding how to correlate the enabled features with the feature mask kernel argument.
/sys/class/drm/card1/device/pp_features does not map 1:1 with amdgpu.ppfeaturemask. If you can narrow down which feature(s) are causing problems using amdgpu.ppfeaturemask that would be helpful.
@agd5f And what would be the best way to do that? You use feature(s) with an (s) appended, so that means the problem may very well be combinations of features. Just randomly trying all combinations of all features is not really possible.
Is it sane to do a binary search i.e. start with amdgpu.ppfeaturemask=0x0 and go from there? In other words, is it right to say that disabling all power features should solve the problem and if 0x0 does not solve the problem, then the problem is not with the power management features at all?
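To make the binary-search idea concrete, here is a sketch of what such a bisection would look like. It rests on strong assumptions that may not hold: exactly one feature bit is the culprit, clearing it makes the system stable, and is_stable(mask) stands in for a human booting with that mask and waiting long enough (possibly days) to decide. It does not answer whether booting with 0x0 is safe. bisect_culprit and is_stable are hypothetical names.

```python
def bisect_culprit(bits, is_stable, base=0xffffffff):
    """Narrow a list of feature bits down to a single culprit.

    Assumes exactly one bit in bits is responsible and that clearing it
    in the mask makes the system stable. is_stable(mask) is a callback
    that reports the outcome of a (possibly multi-day) test run.
    """
    candidates = list(bits)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        mask = base
        for bit in half:
            mask &= ~bit  # disable the first half of the remaining suspects
        if is_stable(mask):
            candidates = half  # culprit was among the disabled bits
        else:
            candidates = candidates[len(half):]  # culprit is still enabled
    return candidates[0]
```

With six suspect bits this needs at most three test runs instead of six, but if the problem comes from a combination of features the single-culprit assumption breaks and the result is meaningless.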
I would start with the DPM features (PP_SCLK_DPM_MASK, PP_MCLK_DPM_MASK, PP_PCIE_DPM_MASK, PP_SOCCLK_DPM_MASK, PP_DCEFCLK_DPM_MASK) and GFXOFF (PP_GFXOFF_MASK). Disable each one individually and see if any of those improve things. Next you could also try disabling various clockgating features. See the AMD_CG_SUPPORT_* flags in https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/include/amd_shared.h#n118 and use them like a bit mask with the amdgpu.cg_mask module parameter.
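As a rough aid for the "disable each one individually" step, the following Python sketch prints one amdgpu.ppfeaturemask value per suspect feature, each with only that feature's bit cleared. The bit values are my recollection of enum PP_FEATURE_MASK in amd_shared.h and are assumptions to check against your kernel source; starting from 0xffffffff (everything enabled) is also an assumption, not necessarily your driver's default. mask_without is a made-up helper name.

```python
# Assumed bit values from enum PP_FEATURE_MASK in amd_shared.h; verify
# against your own kernel tree.
SUSPECTS = {
    "PP_SCLK_DPM_MASK": 0x1,
    "PP_MCLK_DPM_MASK": 0x2,
    "PP_PCIE_DPM_MASK": 0x4,
    "PP_SOCCLK_DPM_MASK": 0x1000,
    "PP_DCEFCLK_DPM_MASK": 0x2000,
    "PP_GFXOFF_MASK": 0x8000,
}

ALL_ON = 0xffffffff  # assumption: start from every feature bit enabled


def mask_without(bit, base=ALL_ON):
    """Return base with the given feature bit cleared."""
    return base & ~bit


for name, bit in SUSPECTS.items():
    print(f"amdgpu.ppfeaturemask={mask_without(bit):#010x}  # {name} disabled")
```

Each printed value is one boot-time experiment; for instance, the PP_SCLK_DPM_MASK line yields 0xfffffffe, the mask already mentioned earlier in this thread.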
Thanks. It sometimes takes a week for this issue to present itself, so this is going to be a long process. I wish there was a better way to find out where the problem is, like an instrumented kernel or something.
First report: 0xffffbfff (tried in vain because some people above reported success) does not work; it failed after about 5 days. I believe 0xffffbfff is the default anyway, so it makes sense that this does not work.
Still going strong. I'll keep you posted on the status, but i think this might be it, it's night and day, before things were extremely bad: it would crash in under 15 minutes all the time. Also, I don't have PCIe ASPM enabled in the bios, can that have an impact on the functioning of PCIe DPM?
@Ambyjkl I never crash this quickly. Makes me wonder if we're debugging the same issue or not. Are you doing anything special to make it crash? Or just normal desktop use / web browsing?
It's kinda sporadic actually, and happens a lot during regular use, typically in under an hour. But I have found a way to speed this up: run a VAAPI workload (video playback) on the side, and then continue regular usage like web browsing and with this config, the average survival time is only 15 minutes. Here is what a crash looks like to me:
And ring kiq_2.1.0 test failed is always present, just like in the original Dmesg log from the OP.
The message comes after the ring gfx_0.0.0 timeout and GPU reset, and so I suspect it's an unimportant side-effect. amdgpu_ring_test_helper (whatever that is) may just be having problems reconnecting to the graphics card, like everything else. What kernel version are you on? I'm trying 6.0.5 right now -- if you're on an earlier version perhaps amdgpu_ring_test_helper recovery was fixed.
@CodeDead I do see the refcount_warn_saturate error in my logs as well, about 2 seconds after the ring gfx_0.0.0 timeout. Do you get the ring gfx_0.0.0 timeout as well? My understanding was that the errors after ring gfx_0.0.0 timeout, such as refcount_warn_saturate, are downstream effects of the GPU reset rather than problems in and of themselves.
Not entirely sure @rocketraman . The problem is so random it is hard to diagnose. The only logs I have right now are the ones I provided. I'll be sure to take a closer look the next time it happens.
@CodeDead Based on your kernel, you're on Fedora. You can see the logs from the crashed boot by doing journalctl -b -1, where -1 is the previous boot. Use journalctl --list-boots if you need to go back farther than that.
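Once you have the log from the crashed boot, a short Python filter can pull out just the ring-timeout lines so they are easy to compare between reports. This is a sketch: the regular expression is modelled on the single error line quoted at the top of this thread and may need adjusting for other kernel versions, and ring_timeouts is a hypothetical helper name.

```python
import re

# Modelled on the line quoted in this thread:
# [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, ...
RING_TIMEOUT = re.compile(r"amdgpu_job_timedout.*ring (\S+) timeout")


def ring_timeouts(lines):
    """Return the ring name from every amdgpu job-timeout line in lines."""
    return [m.group(1) for line in lines if (m := RING_TIMEOUT.search(line))]
```

One way to use it is to save the kernel messages of the previous boot with journalctl -k -b -1 into a file and pass the file's lines to ring_timeouts; follow-up messages such as "ring kiq_2.1.0 test failed" are deliberately not matched.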
Yep. I generally switch to a TTY and do systemctl --user stop user.slice first to try and stop as many graphical programs as possible gracefully. Then if you aren't rebooting to try out a new DRM mask option (man, there has to be a better way!), you can even reset your DM with systemctl restart sddm.
@agd5f A little more than a week running with 0xfffffffe to turn off PP_SCLK_DPM without the gfx timeout. I did reboot a couple of times, so it's not one week of uptime, but so far this is promising.