ERROR ring gfx_0.0.0 timeout when using firefox, chrome or icaclient when dpm performance level = auto

Can you use the amdgpu.ppfeaturemask parameter to narrow down which power feature is causing problems? The bits in that parameter are defined by the PP_FEATURE_MASK enum here: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/include/amd_shared.h#n199

Should I try only parameters with DPM in the name? Or others too (this GFXOFF seems kind of suspicious to me)? BTW, I also tried amdgpu.runpm=0 and it didn't change anything. So can I rule out some of them already?

Probably makes sense to start with the DPM ones, but it could be some interaction between DPM and one of the other features. runpm is a separate feature. That just controls whether the GPU is runtime suspended (powered down) when it's idle to save power. The ppfeatures control power at runtime when the GPU is powered on.
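
Before changing anything it can help to check which values the driver is currently using. A minimal sketch, assuming the amdgpu module exposes its parameters under /sys/module/amdgpu/parameters/ (the usual place for module parameters); show_param is a made-up helper name, not part of any tool:

```shell
# Print a module parameter if its sysfs file is readable.
# show_param is a hypothetical helper; the directory is an argument so it can be tested.
show_param() {
    dir=$1
    name=$2
    if [ -r "$dir/$name" ]; then
        printf '%s=%s\n' "$name" "$(cat "$dir/$name")"
    else
        printf '%s: not available\n' "$name"
    fi
}

params=/sys/module/amdgpu/parameters   # assumed location
show_param "$params" ppfeaturemask     # runtime power features
show_param "$params" runpm             # runtime suspend when idle
```

The two values are independent, which matches the explanation above: changing runpm does not touch any ppfeaturemask bit.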

I have encountered this exact error with kernel 6.0.0-rc4 now. On "auto", a Chromium-based browser freezes and the error shows up in dmesg.

OS: Gentoo Base System release 2.8 x86_64
Host: MS-7B93 1.0
Kernel: 6.0.0-rc4-llvm
Uptime: 1 day, 13 hours, 43 mins
Packages: 1160 (emerge), 19 (flatpak), 10 (snap)
Shell: zsh 5.9
Resolution: 3840x2160
DE: GNOME 42.3.1
WM: Mutter
WM Theme: Adwaita
Theme: WhiteSur-dark-solid [GTK2/3]
Icons: Adwaita [GTK2/3]
Terminal: WezTerm
CPU: AMD Ryzen 9 5950X (32) @ 3.785GHz
GPU: AMD ATI Radeon RX 5600 OEM/5600 XT
Memory: 7881MiB / 64239MiB

Try adding the following to the kernel boot parameters: amdgpu.ppfeaturemask=0xfffd3fff

Then check whether the fault happens again. I managed to bisect some of the parameters, and this fault is caused by one or more of these features:

PP_OVERDRIVE_MASK = 0x4000, PP_GFXOFF_MASK = 0x8000, PP_STUTTER_MODE = 0x20000,
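
For reference, the suggested mask is just 0xffffffff with those three bits cleared. A small sketch of the arithmetic (clear_bits is a hypothetical helper name; the bit values are the ones listed above):

```shell
# Clear the given bits from an all-ones 32-bit mask and print the result.
clear_bits() {
    mask=$((0xffffffff))
    for bit in "$@"; do
        mask=$((mask & ~$bit))
    done
    printf '0x%08x\n' "$mask"
}

# PP_OVERDRIVE_MASK, PP_GFXOFF_MASK, PP_STUTTER_MODE
clear_bits 0x4000 0x8000 0x20000   # prints 0xfffd3fff
```

Passing a single bit, e.g. clear_bits 0x4000, gives the mask for testing that one feature on its own.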

Don't have time/will to play with it right now since the fault is very destructive.

Did so. I will report back in case of problems again.

So far, so good. Several days without issue. Is there anything I can do to help narrow down the problem?

Each bit in ppfeaturemask represents a power feature. Try clearing individual bits to narrow down which power feature seems to cause the problem.

I was running fine for a few days with mask amdgpu.ppfeaturemask=0xfffd3fff

Today I rebooted with mask amdgpu.ppfeaturemask=0xfffd7fff, which sets the PP_OVERDRIVE_MASK = 0x4000 bit again while leaving the other two bits cleared. After several hours (including suspend to RAM followed by wakeup) I got the error:

[11397.145866] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered

For several days I've been running with mask amdgpu.ppfeaturemask=0xffffbfff without any issues. It really seems that the feature behind PP_OVERDRIVE_MASK = 0x4000 causes this issue, and that masking that bit out avoids it.
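
Since it is easy to misread these hex values, here is a rough sketch that lists which bits are cleared (i.e. which features are turned off) in a given mask. cleared_bits is a hypothetical helper, and it only does the bit arithmetic; mapping the bits back to names still needs the PP_FEATURE_MASK enum:

```shell
# Print every bit that is cleared (feature disabled) in a 32-bit mask.
cleared_bits() {
    mask=$(($1))
    i=0
    while [ "$i" -lt 32 ]; do
        bit=$((1 << i))
        if [ $((mask & bit)) -eq 0 ]; then
            printf '0x%x\n' "$bit"
        fi
        i=$((i + 1))
    done
}

# The three masks mentioned in this thread so far.
for m in 0xfffd3fff 0xfffd7fff 0xffffbfff; do
    printf '%s clears: %s\n' "$m" "$(cleared_bits "$m" | tr '\n' ' ')"
done
```

For 0xfffd7fff this reports 0x8000 and 0x20000, and for 0xffffbfff only 0x4000.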

I think my wife and I are hitting this. We're using AMDGPU with 6700xt cards.

Fedora 36 5.19.15-201.fc36.x86_64 Mesa 22.1.7-1

Fedora 37 5.19.13-300.fc37.x86_64 Mesa 22.2.0-7

Before I found this thread I started testing forcing high performance. I assume we should try the kernel command-line argument amdgpu.ppfeaturemask=0xfffd3fff as well?

Having the same problem with linux 6.0.2.arch1-1 and AMD 5600xt. Trying out the remedy in the post, hopefully will get back to this post in a couple of days, because it started to annoy me.

Edit1: just crashed again, even with the forced "echo 'high'" remedy applied. Will try the kernel setting.

Edit2: amdgpu.ppfeaturemask=0xffffbfff has also worked for me. Running with this kernel setting 2 days in a row without any crashes now.

Hi!

I'm also getting this with the AMD 5600XT & kernel 6.0.3 and mesa-git on Arch Linux. I was using the "3D Fullscreen" power profile when this occurred (set through corectrl). I have now booted with those flags, which removes the ability to set the profile (as it replaces the boot flag corectrl uses, which is amdgpu.ppfeaturemask=0xffffffff).

It just started happening today.

Will see if any more occur.

Not to pile on, but I've been hitting this as well after a recent hardware upgrade. I have an RX 5700 (PowerColor Red Dragon Radeon RX 5700) and have been periodically having desktop freeze/crashes with "[drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx_0.0.0 timeout, but soft recovered" printed in the kernel log.

This is with kernel 6.0.2 and (if relevant) xf86-video-amdgpu 22.0.0 and mesa 22.1.7 on Gentoo. This is occurring on a new PC with a Ryzen 7950X on an X670E motherboard. I was previously using this same video card in another computer with a 3900X and X570, I believe running kernel 5.19. It was stable on that system; I didn't start encountering these crashes until moving to the new system w/ the newer kernel.

I've tried running with both amdgpu.ppfeaturemask=0xffffbfff and amdgpu.ppfeaturemask=0xfffd3fff, but neither seems to have made a difference.

Happy to provide any additional info that may help with troubleshooting.

@nitro322 I crashed like a dozen times yesterday before I added amdgpu.ppfeaturemask=0xffffbfff. I'm also on the latest mesa-git and using Wayland (sway). I haven't crashed since, maybe try the latest? I'm also on xf86-video-amdgpu and was crashing on Chrome and apps that used OpenGL.

@joshuataylorx @nitro322 same here. When it started crashing I switched to wayland to try it out. Currently the kernel setting is on and it didn't crash for some time now. However I still feel a bit of clunkiness and intermittent minimal hangups for unknown reasons.

Ok it didn't work.. It crashed again with amdgpu.ppfeaturemask=0xffffbfff.

(have been out of town since posting my previous comment)

Appreciate the suggestions regarding Wayland. I'm still running Xorg w/ KDE Plasma 5.25. I tried switching to Wayland, but discovered that screen sharing doesn't work in Wayland under Teams, which I require for work. So, I'm stuck on Xorg for the foreseeable future.

A bit off-topic. I had a similar situation with Teams & screen-sharing. I solved it by compiling a minimal ungoogled-chromium with the screencast flag (using pipewire as the screen grabbing backend) and it works wonders under Wayland. I had to create Teams as a Chromium app in its own window with an icon shortcut in my dock for this, but it's indistinguishable from the regular Teams client (which is IIRC built on top of Electron). ~~The only little annoyance is that it requires login + 2FA after restart the next day which the Electron version does not.~~ (I was able to fix it by allowing 3rd party cookies) Enabling WebRTCPipeWireCapturer in chrome://flags or using the startup option --enable-features is required.

I was plagued with this issue on a 5600xt, with this error happening within seconds of starting a game. I tried these various ppfeaturemasks and they seemed to help at times. However, I think I've narrowed it down to a power supply/ power connector issue. I reseated the GPU in the PCIE slot, removed and reattached the cables off the PSU. So far it has been going without a crash for 3 days, even with the latest kernel.

Your mileage may vary.

Well, just yesterday I removed my video card while changing CPU coolers, so I'm in the same situation now. Will report back if it seems to make any difference.

@nitro322 did it make a difference? It looks like the 7950X or AM5 is definitely doing something, since I'm also running a 7950X, with a 6800XT, but on B650, and otherwise, I'm in the same boat as you: I've tried high DPM performance level as well as a variety of mask values and none seem to help. I have also reseated and reconnected the power connector numerous times to swap to my backup graphics card (Intel A380). I'm going to try a PSU swap to see if that helps.

It seems like it may have. I haven't had any more crashes since that post. I'm currently running with amdgpu.ppfeaturemask=0xfffd3fff, but I had that in place before my last post and was still seeing periodic crashes.

Let me try removing amdgpu.ppfeaturemask=0xfffd3fff and run another few days, see what happens. Will report back.

Edit: One other note I should mention - I've also been running with video=3440x1400@120 since around the same time to deal with a high power consumption issue in amdgpu (mentioned here: #1301 (closed)). I wouldn't think it's related to this, but sharing just in case.

PSU swap seemed to help, but it started crashing again shortly after. I had an XMP profile active, so I reset to JEDEC speeds now, let's see if that helps. I'm still running 0xffffffff, if things remain broken I'll give 0xfffd3fff a try as well.

And it crashed again. Now trying 0xfffd3fff

And it crashed yet again, even with 0xfffd3fff. To say I'm disappointed with AMD is an understatement.

I haven't crashed in ages, I was getting a lot of crashes.

I'm using "3D Profile" in Corectrl, with the minimum bumped up to match the max. This doesn't use as much power as you would think. Maybe try setting up corectrl with the debug option here: https://gitlab.com/corectrl/corectrl/-/wikis/Setup#full-amd-gpu-controls

You can do this via CLI, but corectrl makes it easier.

@joshuataylorx thanks, I'll try that next. I decided to give echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level another shot since I hadn't done that since swapping PSUs and disabling XMP. It's actually stable so far. At this point my bar is so low that if it only crashes once a day I'll take it...

Reason I dislike high is because it consumes much more power (I think it was 25-30w?) than 3d profile (9-10w?). This saves about $20/yr :).

@joshuataylorx interestingly, I actually see pretty much the same power consumption with high set, it may be because I have a reference card with conservative cooling, clocks, and power limits. And high is actually the most stable it has been yet. So I think the issue ultimately lies with the DPM power saving features after all.

actually reading the source (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/pm/amdgpu_pm.c#n199) makes it seem like high is not functioning the way it is described. My clocks are not always at the max frequency according to corectrl and according to my wall outlet power meter as well.

So it seems the way high works is it just jumps between 0 and the max frequency

But it doesn't seem to have an impact on power usage: my idle usage was exactly the same with stock settings, and peak usage is capped at the power limit.

high forces the clocks to the max. The drops are likely due to throttling on the SMU or other features like gfxoff kicking in.
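
To see what the graphics clock is actually doing while a level is forced, the driver's pp_dpm_sclk file can be read; as far as I know it lists the available levels and marks the current one with an asterisk. A minimal sketch (active_level is a hypothetical helper, and card0 may be card1 on your system):

```shell
# Print the line marked active ('*') in pp_dpm_sclk-style output read from stdin.
active_level() {
    grep -F '*'
}

f=/sys/class/drm/card0/device/pp_dpm_sclk   # adjust the card index as needed
if [ -r "$f" ]; then
    active_level < "$f" || echo "no active level reported"
fi
```

Running this repeatedly (for example under watch) while the desktop is idle and under load should show whether the level stays pinned or drops.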

I decided to tweak the power_dpm_force_performance_level setting by switching high and low. After 2-3 switches, I had a crash the instant I shifted down to low. It seems there is definitely a bug with frequency scaling. @joshuataylorx I'm now giving your solution of setting min to max on corectrl a try, had to enable PP_OVERDRIVE_MASK for this of course. I expect things to still remain stable, since the bug seems to be the result of dynamic graphics clock frequency scaling, which points to masking PP_SCLK_DPM_MASK as being the final workaround. I will verify this is indeed the case, CC @agd5f

@agd5f I can reproduce this crash with power_dpm_force_performance_level=high set. Does that mean the problem (at least the one I'm seeing) is not in the DPM code?

Can you try with power_dpm_force_performance_level=low as well? If that doesn't help, then your issue would not likely be related to dynamic clocks.

Can you try with power_dpm_force_performance_level=low

Will try that if I get a crash with masking PP_SCLK_DPM i.e. 0xfffffffe (so far 5 days using this with no gfx timeout, but I've had it take longer than that before).

@rocketraman also check with cat /sys/class/drm/card0/device/power_dpm_force_performance_level after you echo into /sys/class/drm/card0/device/power_dpm_force_performance_level to ensure it's set properly. I was seeing in some cases that the value did not change, which could be because I had corectrl running in the background
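
The write-then-read-back check can be wrapped up as below. This is only a sketch: set_level is a made-up helper, the path is passed as an argument because the card index differs between systems, and writing to the real sysfs file needs root.

```shell
# Write a performance level and read it back; return non-zero if it did not stick.
# set_level is a hypothetical helper, not part of any existing tool.
set_level() {
    file=$1
    level=$2
    printf '%s\n' "$level" > "$file" || return 1
    current=$(cat "$file")
    if [ "$current" = "$level" ]; then
        printf 'ok: %s\n' "$current"
    else
        printf 'mismatch: wanted %s, got %s\n' "$level" "$current"
        return 1
    fi
}

# On a real system, as root:
# set_level /sys/class/drm/card0/device/power_dpm_force_performance_level high
```

A mismatch would point to something else (such as corectrl) resetting the value in the background.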

also check with cat /sys/class/drm/card0/device/power_dpm_force_performance_level after you echo into /sys/class/drm/card0/device/power_dpm_force_performance_level to ensure it's set properly

Yep, I did check that.

Just had my first crash since removing amdgpu.ppfeaturemask=0xfffd3fff 4 days ago. Same "[drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx_0.0.0 timeout, but soft recovered" error. "soft recovered", to be clear, does not mean my desktop session recovered. Had to kill X11 to get it usable again.

I'm back to running with "video=3440x1400@120 amdgpu.ppfeaturemask=0xfffd3fff" now. That was stable for me for over a week.

@agd5f

Can you try with power_dpm_force_performance_level=low as well?

I tested with power_dpm_force_performance_level set to low (with feature mask 0xffffffff), and the system crashed again. Same error messages, with the *ERROR* ring kiq_2.1.0 test failed (-110) message and everything.

If that doesn't help, then your issue would not likely be related to dynamic clocks.

Should I open a new issue then? And what should I try to debug/investigate next?

Switched to linux-lts 5.15 kernel. So far so good, no crashes. I think I will stick to this.

Still stable? I'm getting crashes under load still but idle crashes seem gone

Yep, still stable. Didn't have any crashes since the switch.

Same for idle, yeah; then I assume load crashes are a separate issue. Will stick to lts for now, thanks for the response

Can not absolutely confirm but yeah, might be. I haven't had any crashes in lts but again I haven't really used it under load. At the very least it's usable. I was getting constant crashes in v6, it was pretty unstable for me.

Have this crash with kernel 6.1.0 rc-3 and an ASUS Radeon RX 6600 XT. dmesg logs look just like the OP's, except that I also see a refcount underflow (use-after-free) warning from the kernel after the GPU reset:

Nov 01 12:52:29 kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset(2) succeeded!
Nov 01 12:52:29 kernel: [drm] Skip scheduling IBs!
Nov 01 12:52:29 kernel: ------------[ cut here ]------------
Nov 01 12:52:29 kernel: refcount_t: underflow; use-after-free.
Nov 01 12:52:29 kernel: WARNING: CPU: 6 PID: 968 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110
Nov 01 12:52:29 kernel: Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache netfs vhost_net vhost vhost_iotlb tap tun tls uinput rfcomm bnep snd_seq_dummy snd_hrtimer rpcrdma rdma_cm iw_cm ib_cm ib_core xt_MASQUERADE xt_CHECKSUM xt_polic>
Nov 01 12:52:29 kernel:  intel_rapl_msr iTCO_wdt intel_pmc_bxt iTCO_vendor_support mei_pxp mei_hdcp pmt_telemetry pmt_class intel_rapl_common intel_tcc_cooling x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass rapl intel_cstate in>
Nov 01 12:52:29 kernel:  snd_hda_codec snd_hda_core snd_hwdep rfkill snd_seq snd_seq_device snd_pcm snd_timer snd intel_vsec soundcore acpi_tad acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc zram hid_logitech_hidpp hid_logitech_dj amdgpu raid1 d>
Nov 01 12:52:29 kernel: CPU: 6 PID: 968 Comm: gfx_0.0.0 Kdump: loaded Not tainted 6.1.0-0.rc2.21.fc38.x86_64 #1
Nov 01 12:52:29 kernel: Hardware name: ASUS System Product Name/ROG MAXIMUS Z690 HERO, BIOS 1505 05/31/2022
Nov 01 12:52:29 kernel: RIP: 0010:refcount_warn_saturate+0xba/0x110
Nov 01 12:52:29 kernel: Code: 01 01 e8 6a bd 67 00 0f 0b c3 cc cc cc cc 80 3d bf 44 bd 01 00 75 85 48 c7 c7 d0 85 76 9a c6 05 af 44 bd 01 01 e8 47 bd 67 00 <0f> 0b c3 cc cc cc cc 80 3d 9a 44 bd 01 00 0f 85 5e ff ff ff 48 c7
Nov 01 12:52:29 kernel: RSP: 0018:ffffabb480fbfe98 EFLAGS: 00010286
Nov 01 12:52:29 kernel: RAX: 0000000000000026 RBX: ffff9a2ca39da400 RCX: 0000000000000000
Nov 01 12:52:29 kernel: RDX: 0000000000000001 RSI: ffffffff9a74efeb RDI: 00000000ffffffff
Nov 01 12:52:29 kernel: RBP: ffff9a214d769a40 R08: 0000000000000000 R09: ffffabb480fbfd38
Nov 01 12:52:29 kernel: R10: 0000000000000003 R11: ffffffff9b146248 R12: 0000000000000000
Nov 01 12:52:29 kernel: R13: ffff9a214d769bb8 R14: ffff9a2fefff4e40 R15: ffff9a214d769a40
Nov 01 12:52:29 kernel: FS:  0000000000000000(0000) GS:ffff9a406d380000(0000) knlGS:0000000000000000
Nov 01 12:52:29 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 01 12:52:29 kernel: CR2: 0000178015686000 CR3: 0000000305010001 CR4: 0000000000770ee0
Nov 01 12:52:29 kernel: PKRU: 55555554
Nov 01 12:52:29 kernel: Call Trace:
Nov 01 12:52:29 kernel:  <TASK>
Nov 01 12:52:29 kernel:  drm_sched_main+0x4c/0x410 [gpu_sched]
Nov 01 12:52:29 kernel:  ? dequeue_task_stop+0x70/0x70
Nov 01 12:52:29 kernel:  ? drm_sched_resubmit_jobs+0x10/0x10 [gpu_sched]
Nov 01 12:52:29 kernel:  kthread+0xe6/0x110
Nov 01 12:52:29 kernel:  ? kthread_complete_and_exit+0x20/0x20
Nov 01 12:52:29 kernel:  ret_from_fork+0x1f/0x30
Nov 01 12:52:29 kernel:  </TASK>
Nov 01 12:52:29 kernel: ---[ end trace 0000000000000000 ]---

Furthermore, echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level (in my case, card1) did not fix the problem.

Trying to understand the feature masking. Currently experimenting with amdgpu.ppfeaturemask=0xffffbfff with kernel 6.0.5. The output of cat /sys/class/drm/card1/device/pp_features is below, but I see features in that list that are not shown in https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/include/amd_shared.h#n199, and vice-versa:

# cat /sys/class/drm/card1/device/pp_features
features high: 0x00003763 low: 0xa37ffdfb
No. Feature               Bit : State
00. DPM_PREFETCHER       ( 0) : enabled
01. DPM_GFXCLK           ( 1) : enabled
02. DPM_GFX_GPO          ( 2) : disabled
03. DPM_UCLK             ( 3) : enabled
04. DPM_FCLK             ( 4) : enabled
05. DPM_SOCCLK           ( 5) : enabled
06. DPM_MP0CLK           ( 6) : enabled
07. DPM_LINK             ( 7) : enabled
08. DPM_DCEFCLK          ( 8) : enabled
09. DPM_XGMI             ( 9) : disabled
10. MEM_VDDCI_SCALING    (10) : enabled
11. MEM_MVDD_SCALING     (11) : enabled
12. DS_GFXCLK            (12) : enabled
13. DS_SOCCLK            (13) : enabled
14. DS_FCLK              (14) : enabled
15. DS_LCLK              (15) : enabled
16. DS_DCEFCLK           (16) : enabled
17. DS_UCLK              (17) : enabled
18. GFX_ULV              (18) : enabled
19. FW_DSTATE            (19) : enabled
20. GFXOFF               (20) : enabled
21. BACO                 (21) : enabled
22. MM_DPM_PG            (22) : enabled
23. PPT                  (24) : enabled
24. TDC                  (25) : enabled
25. APCC_PLUS            (26) : disabled
26. GTHR                 (27) : disabled
27. ACDC                 (28) : disabled
28. VR0HOT               (29) : enabled
29. VR1HOT               (30) : disabled
30. FW_CTF               (31) : enabled
31. FAN_CONTROL          (32) : enabled
32. THERMAL              (33) : enabled
33. GFX_DCS              (34) : disabled
34. RM                   (35) : disabled
35. LED_DISPLAY          (36) : disabled
36. GFX_SS               (37) : enabled
37. OUT_OF_BAND_MONITOR  (38) : enabled
38. TEMP_DEPENDENT_VMIN  (39) : disabled
39. MMHUB_PG             (40) : enabled
40. ATHUB_PG             (41) : enabled
41. APCC_DFLL            (42) : enabled
42. RSMU_SMN_CG          (44) : enabled

I'd like to help narrow down this problem because this issue is severe, but need guidance in terms of understanding how to correlate the enabled features with the feature mask kernel argument.

/sys/class/drm/card1/device/pp_features does not map 1:1 with amdgpu.ppfeaturemask. If you can narrow down which feature(s) are causing problems using amdgpu.ppfeaturemask that would be helpful.
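
Even though the two do not map 1:1, pp_features can still be used to compare the firmware-side state between boots with different masks. A small sketch that pulls out the disabled entries from output in the format shown above (disabled_features is a hypothetical helper; it does not translate anything into ppfeaturemask bits):

```shell
# List the names of features reported as disabled in pp_features output (stdin).
disabled_features() {
    awk '$NF == "disabled" { print $2 }'
}

f=/sys/class/drm/card1/device/pp_features   # card index as used in this thread
if [ -r "$f" ]; then
    disabled_features < "$f"
fi
```

Saving this list for each mask you boot with makes it easier to see what a given bit actually changed.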

@agd5f And what would be the best way to do that? You use feature(s) with an (s) appended, so that means the problem may very well be combinations of features. Just randomly trying all combinations of all features is not really possible.

Is it sane to do a binary search i.e. start with amdgpu.ppfeaturemask=0x0 and go from there? In other words, is it right to say that disabling all power features should solve the problem and if 0x0 does not solve the problem, then the problem is not with the power management features at all?

I would start with the DPM features (PP_SCLK_DPM_MASK, PP_MCLK_DPM_MASK, PP_PCIE_DPM_MASK, PP_SOCCLK_DPM_MASK, PP_DCEFCLK_DPM_MASK) and GFXOFF (PP_GFXOFF_MASK). Disable each one individually and see if any of those improve things. Next you could also try disabling various clockgating features See the AMD_CG_SUPPORT_* flags in https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/include/amd_shared.h#n118 and use them like a bit mask with the amdgpu.cg_mask module parameter.
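
A sketch of the "disable each one individually" step: it prints one candidate boot parameter per feature, each with only that feature's bit cleared. The bit values are my reading of enum PP_FEATURE_MASK in amd_shared.h, so please verify them against your kernel tree; single_feature_masks is a hypothetical helper, and the word splitting assumes sh or bash rather than zsh.

```shell
# Bit values as I read them from enum PP_FEATURE_MASK (amd_shared.h); verify before use.
features="PP_SCLK_DPM_MASK=0x1 PP_MCLK_DPM_MASK=0x2 PP_PCIE_DPM_MASK=0x4
PP_SOCCLK_DPM_MASK=0x1000 PP_DCEFCLK_DPM_MASK=0x2000 PP_GFXOFF_MASK=0x8000"

# Print one boot parameter per feature, with only that feature's bit cleared from the base mask.
single_feature_masks() {
    base=$1
    for f in $features; do
        name=${f%%=*}
        bit=${f#*=}
        printf 'amdgpu.ppfeaturemask=0x%08x  # %s off\n' "$(( $base & ~$bit ))" "$name"
    done
}

single_feature_masks 0xffffffff
```

The base mask is an argument so the same list can be generated on top of a mask that already has other bits cleared.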

Thanks. It sometimes takes a week for this issue to present itself, so this is going to be a long process. I wish there was a better way to find out where the problem is, like an instrumented kernel or something.

First report: 0xffffbfff (tried in vain because some people above reported success) does not work; it failed after about 5 days. I believe 0xffffbfff is the default anyway, so it makes sense that this does not work.

Now trying 0xfffffffe to turn off PP_SCLK_DPM.

For anyone else attempting this, here is a quick way to turn off the bits and print the flag value in the shell

printf 'amdgpu.ppfeaturemask=0x%x\n' "$((0xffffffff & ~0x4000 & ~0x1 & ~0x2 & ~0x4 & ~0x1000 & ~0x2000))"

Seeing more stability than usual with PP_PCIE_DPM_MASK disabled (amdgpu.ppfeaturemask=0xffffbffb)

Still going strong. I'll keep you posted on the status, but I think this might be it. It's night and day; before, things were extremely bad: it would crash in under 15 minutes all the time. Also, I don't have PCIe ASPM enabled in the BIOS; can that have an impact on the functioning of PCIe DPM?

ASPM is an independent feature; there are no dependencies.

and it has crashed once again, rip

@rocketraman do you also see [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110) in your logs?

@Ambyjkl No, I don't see anything from amdgpu_ring_test_helper.

it would crash in under 15 minutes all the time

@Ambyjkl I never crash this quickly. Makes me wonder if we're debugging the same issue or not. Are you doing anything special to make it crash? Or just normal desktop use / web browsing?

It's kinda sporadic actually, and happens a lot during regular use, typically in under an hour. But I have found a way to speed this up: run a VAAPI workload (video playback) on the side, and then continue regular usage like web browsing and with this config, the average survival time is only 15 minutes. Here is what a crash looks like to me:

And ring kiq_2.1.0 test failed is always present, just like in the original Dmesg log from the OP.

Here are my specs:

CPU: 7950X

GPU: RX 6800 XT

Motherboard: ASRock B650M PG Riptide

It's possible we are seeing two different issues

The message comes after the ring gfx_0.0.0 timeout and GPU reset, and so I suspect it's an unimportant side-effect. amdgpu_ring_test_helper (whatever that is) may just be having problems reconnecting to the graphics card, like everything else. What kernel version are you on? I'm trying 6.0.5 right now -- if you're on an earlier version perhaps amdgpu_ring_test_helper recovery was fixed.

@Ambyjkl Can confirm a similar issue happens here on a 3900X and RX 6900 XT with the same artifacting as you:

OS: Fedora 36
Kernel: 6.0.5-200.fc36.x86_64

WARNING: CPU: 7 PID: 651 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110
Modules linked in: tun tls uinput rfcomm snd_seq_dummy snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set nf_tables nfnetlink ip6table_filter iptable_filter qrtr bnep sunrpc binfmt_misc vfat fat snd_hda_codec_realtek iwlmvm snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel intel_rapl_msr snd_intel_dspcfg snd_intel_sdw_acpi intel_rapl_common mac80211 snd_hda_codec libarc4 snd_hda_core btusb edac_mce_amd snd_hwdep btrtl snd_seq iwlwifi btbcm kvm_amd snd_seq_device btintel snd_pcm btmtk kvm cfg80211 snd_timer bluetooth snd irqbypass rapl joydev gigabyte_wmi wmi_bmof pcspkr mxm_wmi i2c_piix4 k10temp soundcore rfkill acpi_cpufreq zram amdgpu
 drm_ttm_helper ttm iommu_v2 gpu_sched crct10dif_pclmul crc32_pclmul drm_buddy crc32c_intel polyval_clmulni polyval_generic drm_display_helper nvme ghash_clmulni_intel sp5100_tco igb ccp cec nvme_core r8169 dca nvme_common wmi ip6_tables ip_tables fuse
CPU: 7 PID: 651 Comm: gfx_0.0.0 Not tainted 6.0.5-200.fc36.x86_64 #1
Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS MASTER/X570 AORUS MASTER, BIOS F36f 07/20/2022
RIP: 0010:refcount_warn_saturate+0xba/0x110
Code: 01 01 e8 20 53 66 00 0f 0b e9 c2 7c 91 00 80 3d ab cd bd 01 00 75 85 48 c7 c7 f8 96 7c b8 c6 05 9b cd bd 01 01 e8 fd 52 66 00 <0f> 0b e9 9f 7c 91 00 80 3d 86 cd bd 01 00 0f 85 5e ff ff ff 48 c7
RSP: 0018:ffffbf0002673e98 EFLAGS: 00010286
RAX: 0000000000000026 RBX: ffffa0aa913f4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffb87b020a RDI: 00000000ffffffff
RBP: ffffa0a54c889628 R08: 0000000000000000 R09: ffffbf0002673d38
R10: 0000000000000003 R11: ffffffffb91461a8 R12: 0000000000000000
R13: ffffa0a54c8897a0 R14: ffffa0a775af49c0 R15: ffffa0a54c889628
FS:  0000000000000000(0000) GS:ffffa0ac5ebc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000561485c15090 CR3: 000000010f710000 CR4: 0000000000350ee0
Call Trace:
 <TASK>
 drm_sched_main+0x4f/0x410 [gpu_sched]
 ? dequeue_task_stop+0x70/0x70
 ? drm_sched_resubmit_jobs+0x10/0x10 [gpu_sched]
 kthread+0xe9/0x110
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork+0x22/0x30
 </TASK>

More people are having it here as well:
https://bugzilla.redhat.com/show_bug.cgi?id=2134683

@CodeDead I do see the refcount_warn_saturate error in my logs as well, about 2 seconds after the ring gfx_0.0.0 timeout. Do you get the ring gfx_0.0.0 timeout as well? My understanding was that the errors after ring gfx_0.0.0 timeout, such as refcount_warn_saturate, are downstream effects of the GPU reset rather than problems in and of themselves.

Not entirely sure @rocketraman . The problem is so random it is hard to diagnose. The only logs I have right now are the ones I provided. I'll be sure to take a closer look the next time it happens.

@CodeDead Based on your kernel, you're on Fedora. You can see the logs from the crashed boot by doing journalctl -b-1 where -1 is the previous boot. journalctl --list-boots if you need to go back farther than that.
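
To pull just the relevant lines out of the previous boot's kernel log, something like the following may help. It is a sketch: gpu_errors is a made-up helper, and the pattern only covers the messages quoted in this thread.

```shell
# Keep only the lines relevant to this issue from a kernel log read on stdin.
gpu_errors() {
    grep -E 'ring [a-z0-9_.]+ (timeout|test failed)|GPU reset|refcount_t:'
}

# Kernel messages (-k) from the previous boot (-b -1); may need root or journal group membership.
if command -v journalctl >/dev/null 2>&1; then
    journalctl -k -b -1 --no-pager 2>/dev/null | gpu_errors || true
fi
```

The order and timestamps of the matches show whether the refcount warning and the kiq test failure come before or after the gfx ring timeout.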

I'm on arch, my kernel is 6.0.7-arch1-1, have repro'd the exact issue in:

lts: 5.15.77 (which is before this regression, ruling it out: #2113 (closed))
drm-tip-git: 6.1.r1138119.ee91a500e2dc
also on Fedora Workstation 37 Live ISO running 5.19

@CodeDead you can also switch to a TTY or ssh in, only drm is borked, everything else should work

Yep. I generally switch to a TTY and do systemctl --user stop user.slice first to try and stop as many graphical programs as possible gracefully. Then if you aren't rebooting to try out a new DRM mask option (man, there has to be a better way!), you can even restart your display manager with systemctl restart sddm.

Here is my dmesg log for reference: amd.log

I also see refcount_warn_saturate and ring gfx_0.0.0 timeout

mentioned in issue #2233 (closed)

mentioned in issue #1915

mentioned in issue #2068

@agd5f A little more than a week running with 0xfffffffe to turn off PP_SCLK_DPM without the gfx timeout. I did reboot a couple of times, so it's not one week of uptime, but so far this is promising.
