[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout
On the Dell OptiPlex 5055 with AMD Ryzen 5 PRO 1500 Quad-Core Processor and [AMD/ATI] Cape Verde PRO / Venus LE / Tropo PRO-L [Radeon HD 8830M / R7 250 / R7 M465X] [1002:682b], Linux 5.10.93 sometimes logs:
[1363331.162546] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=2302264, emitted seq=2302266
[1363331.175960] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 3019 thread Xorg:cs0 pid 3020
[1363331.190214] amdgpu 0000:06:00.0: amdgpu: GPU recovery disabled.
[2416170.361756] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=2302264, emitted seq=2302266
[2416170.375186] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 3019 thread Xorg:cs0 pid 3020
[2416170.389436] amdgpu 0000:06:00.0: amdgpu: GPU recovery disabled.
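For anyone comparing logs across reports, the timeout lines above can be picked apart mechanically. The following is only a minimal sketch in Python: the function name `parse_ring_timeouts` and the `gap` field are made up for illustration, and the regular expression assumes the exact message shape quoted in this thread (it deliberately skips the "timeout, but soft recovered" variant, which carries no sequence numbers).

```python
import re

# Matches the "ring <name> timeout, signaled seq=..., emitted seq=..." message
# shape quoted in this thread. Other kernel versions may word it differently.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeouts(log_text):
    """Return one dict per matching ring timeout message found in log_text."""
    events = []
    for match in TIMEOUT_RE.finditer(log_text):
        signaled = int(match.group("signaled"))
        emitted = int(match.group("emitted"))
        events.append({
            "ring": match.group("ring"),
            "signaled": signaled,
            "emitted": emitted,
            # Difference between the two sequence numbers; presumably the
            # number of submissions still outstanding when the timeout fired.
            "gap": emitted - signaled,
        })
    return events


# Two messages copied from logs in this thread, plus one line that must not match.
SAMPLE = "\n".join([
    "[1363331.162546] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
    "signaled seq=2302264, emitted seq=2302266",
    "[110705.424250] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_high timeout, "
    "signaled seq=363925, emitted seq=363926",
    "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered",
])

if __name__ == "__main__":
    for event in parse_ring_timeouts(SAMPLE):
        print(event)
```

Feeding it the output of dmesg or journalctl gives a quick overview of which ring timed out in each report.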
With Linux 5.17-rc6 and the commit "drm/amdgpu: check vm ready by amdgpu_vm->evicting flag" reverted, I don't have the "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout" problem.
@youling257, thank you for the report. My report is for Linux 5.10.93 though, so from before that commit was merged. Can you please create a new issue with your findings?
lightdm started, and after showing the background the monitor went black. The display couldn't be made to receive a signal, and the Num Lock LED on the plugged-in USB keyboard did not light up when pressing the Num Lock key.
The system I experienced this on today (same model, Dell OptiPlex 5055, but a different machine) has a HiDPI Dell UP3214Q connected over DisplayPort. I wasn't able to reproduce the problem.
Lately I've been having this problem even when just using the GNOME desktop (no game or demanding app). I will roll back to 5.18.16 and see if it happens there too.
Desktop system with 1700 + 5700XT (Arch Linux on 5.19.3)
kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
I suggest creating a separate issue with a detailed description and all logs attached. Then the developers/maintainers need to decide whether it is the same issue.
Started running into this or something similar myself, and wanted to add my logs and context in case they are helpful. I don't want to open a new issue since I'm almost certain it would be a dupe and just clutter things up.
Linux devitra 6.0.1-x64v2-xanmod1 #0~20221012.gitf9885bf SMP PREEMPT_DYNAMIC Wed Oct 12 17:17:49 U x86_64 GNU/Linux
Extended renderer info (GLX_MESA_query_renderer):
    Vendor: AMD (0x1002)
    Device: AMD Radeon RX 5700 XT (navi10, LLVM 14.0.6, DRM 3.48, 6.0.1-x64v2-xanmod1) (0x731f)
    Version: 22.3.0
    Accelerated: yes
    Video memory: 8192MB
    Unified memory: no
    Preferred profile: core (0x1)
    Max core profile version: 4.6
    Max compat profile version: 4.6
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.2
OpenGL version string: 4.6 (Compatibility Profile) Mesa 22.3.0-devel (git-5fa7c53631)
Been getting strange video driver crashes and other issues since upgrading the CPU to a new AMD 7950X; same 5700 XT GPU though. The CPU has an integrated GPU, and I had to update to the latest linux-firmware to get it to boot. Something about that process has caused many bugs like this to start appearing, or else it's just some change to Mesa that happened to coincide.
I think I'm having the exact same problem: I am getting random crashes while playing Path of Exile. Opening and browsing the skill tree / Atlas skill tree in particular causes a crash every time. These are from journalctl:
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=144181, emitted seq=144183
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process PathOfExileStea pid 4011 thread dxvk-submit pid 4067
Running Arch, 6.0.8-arch1-1, KDE Plasma, AMD 3400G.
@agd5f I have also been seeing these ring gfx timeout issues for months with my Vega 56, sooner or later in every 3D game I own, e.g. Battlefield 1, Total War: Troy, Call of Juarez: Gunslinger. dmesg_crash.log
Got the same problem and I can reproduce it pretty well.
If I play Minecraft with Sodium Renderer and set Chunk Renderer to Multidraw (GL 4.3), then it takes up to five minutes until the entire screen changes between black screen and green-purple stripes over and over, which also does not recover by itself. With chunk renderer Oneshot (GL 3.0) and Oneshot (GL 2.0) everything runs stable.
Linux asus-h170-pro 6.0.8-300.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 11 15:09:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
OpenGL version string: 4.6 (Compatibility Profile) Mesa 22.2.3
Extended renderer info (GLX_MESA_query_renderer):
    Vendor: AMD (0x1002)
    Device: AMD Radeon RX 570 Series (polaris10, LLVM 15.0.0, DRM 3.48, 6.0.8-300.fc37.x86_64) (0x67df)
    Version: 22.2.3
    Accelerated: yes
    Video memory: 4096MB
    Unified memory: no
    Preferred profile: core (0x1)
    Max core profile version: 4.6
    Max compat profile version: 4.6
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.2
My journalctl also spits out a use after free call trace.
Dez 14 14:27:24 asus-h170-pro kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=47622, emitted seq=47624
Dez 14 14:27:24 asus-h170-pro kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process java pid 5518 thread java:cs0 pid 5540
Dez 14 14:27:24 asus-h170-pro kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset begin!
Dez 14 14:27:24 asus-h170-pro kernel: amdgpu: cp is busy, skip halt cp
Dez 14 14:27:25 asus-h170-pro kernel: amdgpu: rlc is busy, skip halt rlc
Dez 14 14:27:25 asus-h170-pro kernel: amdgpu 0000:01:00.0: amdgpu: BACO reset
Dez 14 14:27:25 asus-h170-pro kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset succeeded, trying to resume
Dez 14 14:27:25 asus-h170-pro kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400E80000).
Dez 14 14:27:25 asus-h170-pro kernel: [drm] VRAM is lost due to GPU reset!
Dez 14 14:27:25 asus-h170-pro kernel: [drm] UVD and UVD ENC initialized successfully.
Dez 14 14:27:25 asus-h170-pro kernel: [drm] VCE initialized successfully.
Dez 14 14:27:25 asus-h170-pro kernel: amdgpu 0000:01:00.0: amdgpu: recover vram bo from shadow start
Dez 14 14:27:25 asus-h170-pro kernel: amdgpu 0000:01:00.0: amdgpu: recover vram bo from shadow done
Dez 14 14:27:25 asus-h170-pro kernel: [drm] Skip scheduling IBs!
Dez 14 14:27:25 asus-h170-pro kernel: [drm] Skip scheduling IBs!
Dez 14 14:27:25 asus-h170-pro kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset(2) succeeded!
Dez 14 14:27:25 asus-h170-pro kernel: [drm] Skip scheduling IBs!
Dez 14 14:27:25 asus-h170-pro kernel: ------------[ cut here ]------------
Dez 14 14:27:25 asus-h170-pro kernel: refcount_t: underflow; use-after-free.
Dez 14 14:27:25 asus-h170-pro kernel: WARNING: CPU: 0 PID: 445 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110
Dez 14 14:27:25 asus-h170-pro kernel: Modules linked in: tls uinput snd_seq_dummy snd_hrtimer xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_nat_tftp nf_conntrack_tftp bridge stp llc nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set nf_tables nfnetlink ip6table_filter iptable_filter qrtr intel_rapl_msr sunrpc intel_rapl_common binfmt_misc intel_tcc_cooling x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel iTCO_wdt ee1004 intel_pmc_bxt iTCO_vendor_support mei_hdcp mei_wdt mei_pxp snd_hda_codec_hdmi kvm snd_hda_codec_realtek snd_hda_codec_generic irqbypass eeepc_wmi rapl snd_hda_intel asus_wmi ledtrig_audio sparse_keymap intel_cstate snd_intel_dspcfg snd_intel_sdw_acpi platform_profile rfkill
Dez 14 14:27:25 asus-h170-pro kernel: snd_hda_codec intel_uncore i2c_i801 mxm_wmi snd_usb_audio pcspkr i2c_smbus wmi_bmof snd_usbmidi_lib snd_hda_core snd_rawmidi snd_hwdep mc snd_seq joydev snd_seq_device ddcci_backlight(OE) snd_pcm snd_timer snd soundcore vfat fat mei_me acpi_pad mei zram amdgpu nvme drm_ttm_helper crct10dif_pclmul nvme_core ttm crc32_pclmul crc32c_intel polyval_clmulni polyval_generic iommu_v2 gpu_sched drm_buddy drm_display_helper r8169 ghash_clmulni_intel nvme_common cec wmi video scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables dm_multipath i2c_dev fuse ddcci(OE)
Dez 14 14:27:25 asus-h170-pro kernel: CPU: 0 PID: 445 Comm: gfx Tainted: G OE 6.0.8-300.fc37.x86_64 #1
Dez 14 14:27:25 asus-h170-pro kernel: Hardware name: System manufacturer System Product Name/H170-PRO, BIOS 3805 05/16/2018
Dez 14 14:27:25 asus-h170-pro kernel: RIP: 0010:refcount_warn_saturate+0xba/0x110
Dez 14 14:27:25 asus-h170-pro kernel: Code: 01 01 e8 90 4b 66 00 0f 0b c3 cc cc cc cc 80 3d dc c6 bd 01 00 75 85 48 c7 c7 20 9a 7c 9c c6 05 cc c6 bd 01 01 e8 6d 4b 66 00 <0f> 0b c3 cc cc cc cc 80 3d b7 c6 bd 01 00 0f 85 5e ff ff ff 48 c7
Dez 14 14:27:25 asus-h170-pro kernel: RSP: 0018:ffffb794c0e7fe98 EFLAGS: 00010286
Dez 14 14:27:25 asus-h170-pro kernel: RAX: 0000000000000026 RBX: ffff9c77bf3a8400 RCX: 0000000000000000
Dez 14 14:27:25 asus-h170-pro kernel: RDX: 0000000000000001 RSI: ffffffff9c7b0542 RDI: 00000000ffffffff
Dez 14 14:27:25 asus-h170-pro kernel: RBP: ffff9c75c89a9628 R08: 0000000000000000 R09: ffffb794c0e7fd38
Dez 14 14:27:25 asus-h170-pro kernel: R10: 0000000000000003 R11: ffffffff9d146328 R12: 0000000000000000
Dez 14 14:27:25 asus-h170-pro kernel: R13: ffff9c75c89a97a0 R14: ffff9c7722910c00 R15: ffff9c75c89a9628
Dez 14 14:27:25 asus-h170-pro kernel: FS: 0000000000000000(0000) GS:ffff9c78e5c00000(0000) knlGS:0000000000000000
Dez 14 14:27:25 asus-h170-pro kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dez 14 14:27:25 asus-h170-pro kernel: CR2: 000055b62568fb28 CR3: 000000016e010006 CR4: 00000000003706f0
Dez 14 14:27:25 asus-h170-pro kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dez 14 14:27:25 asus-h170-pro kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Dez 14 14:27:25 asus-h170-pro kernel: Call Trace:
Dez 14 14:27:25 asus-h170-pro kernel:
Dez 14 14:27:25 asus-h170-pro kernel: drm_sched_main+0x4c/0x410 [gpu_sched]
Dez 14 14:27:25 asus-h170-pro kernel: ? dequeue_task_stop+0x70/0x70
Dez 14 14:27:25 asus-h170-pro kernel: ? drm_sched_resubmit_jobs+0x10/0x10 [gpu_sched]
Dez 14 14:27:25 asus-h170-pro kernel: kthread+0xe6/0x110
Dez 14 14:27:25 asus-h170-pro kernel: ? kthread_complete_and_exit+0x20/0x20
Dez 14 14:27:25 asus-h170-pro kernel: ret_from_fork+0x1f/0x30
Dez 14 14:27:25 asus-h170-pro kernel:
Dez 14 14:27:25 asus-h170-pro kernel: ---[ end trace 0000000000000000 ]---
Dez 14 14:27:25 asus-h170-pro kernel: [drm] Skip scheduling IBs!
Dez 14 14:27:25 asus-h170-pro kernel: [drm] Skip scheduling IBs!
Dez 14 14:27:25 asus-h170-pro kernel: [drm] Skip scheduling IBs!
Dez 14 14:27:25 asus-h170-pro kernel: [drm] Skip scheduling IBs!
Dez 14 14:27:25 asus-h170-pro kernel: [drm] Skip scheduling IBs!
Dez 14 14:27:25 asus-h170-pro kernel: [drm] Skip scheduling IBs!
Dez 14 14:27:25 asus-h170-pro kernel: [drm] Skip scheduling IBs!
Dez 14 14:27:25 asus-h170-pro kernel: [drm] Skip scheduling IBs!
Dez 14 14:27:25 asus-h170-pro kernel: [drm] Skip scheduling IBs!
Dez 14 14:27:25 asus-h170-pro kernel: [drm] Skip scheduling IBs!
Dez 14 14:27:25 asus-h170-pro kernel: [drm] Skip scheduling IBs!
Dez 14 14:27:25 asus-h170-pro kernel: [drm] Skip scheduling IBs!
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: amdgpu_cs_query_fence_status failed.
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: amdgpu_cs_query_fence_status failed.
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: amdgpu_cs_query_fence_status failed.
Dez 14 14:27:25 asus-h170-pro kernel: amdgpu_cs_ioctl: 121 callbacks suppressed
Dez 14 14:27:25 asus-h170-pro kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: amdgpu_cs_query_fence_status failed.
Dez 14 14:27:25 asus-h170-pro kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dez 14 14:27:25 asus-h170-pro kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dez 14 14:27:25 asus-h170-pro kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dez 14 14:27:25 asus-h170-pro kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro firefox.desktop[4419]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro firefox.desktop[4419]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro firefox.desktop[4419]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro firefox.desktop[4419]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro firefox.desktop[4419]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro firefox.desktop[4419]: [GFX1-]: GFX: RenderThread detected a device reset in PostUpdate
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
Dez 14 14:27:25 asus-h170-pro gnome-shell[3289]: amdgpu: The CS has been cancelled because the context is lost.
I was having the same problem with Overwatch 2 installed through Lutris. It has been fixed by going into the GUI client and adding an environment variable. Navigate to Configure > System Options > Environment Variables > Add then add the RADV_DEBUG=llvm variable.
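Outside of Lutris, the same workaround amounts to starting the game with RADV_DEBUG=llvm in its environment (as far as I know this tells Mesa's RADV driver to use the LLVM shader compiler instead of ACO). A minimal Python sketch of that, assuming nothing about the actual launcher: `run_with_radv_llvm` is a made-up helper name, and the child process here is only a stand-in that echoes the variable back.

```python
import os
import subprocess
import sys


def run_with_radv_llvm(command):
    """Run command with RADV_DEBUG=llvm added to a copy of the current environment."""
    env = dict(os.environ)       # copy, so the parent environment is untouched
    env["RADV_DEBUG"] = "llvm"   # the variable suggested in the comment above
    return subprocess.run(command, env=env, capture_output=True, text=True, check=False)


if __name__ == "__main__":
    # Stand-in child that only prints the variable; replace with the real game command.
    child = [sys.executable, "-c", "import os; print(os.environ.get('RADV_DEBUG'))"]
    print(run_with_radv_llvm(child).stdout.strip())
```

Setting the variable per process like this keeps it from affecting every other Vulkan application on the system.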
It is early days, and I will update if the issue crops up again. Thanks again to everyone for investing time and effort into this!
System Specs
CPU: AMD Ryzen 5 3600 (12) @ 3.600GHz
GPU: AMD ATI Radeon 6700 XT
Resolution: 2560x1440
If anyone knows some tweaks that I can apply to get my frame rate up (outside of the in game settings) let me know =)
EDIT: The freeze just happened again... sigh... Good news is that I can play for a couple of hours instead of a couple of minutes. The search continues for a workaround.
EDIT: I thought that further testing might help. I installed Nobara and the AMD Pro drivers as well. I then tried to run Overwatch 2 on ACO with stock settings. It crashed with the same fault. Then I turned on the Pro drivers in the configuration menu. These changes seemed to extend the amount of time that I could go without a timeout error, but ultimately this was not a viable workaround since the same fault happened.
I have issues in both Xorg and Wayland where my PC will just lock up with visual glitches. It seems to happen at random and not necessarily while gaming. As a matter of fact, it's mostly happened outside of games for me.
If there's any way that I can help debug this I'd be more than happy to do so, but for now, this makes Linux desktop totally unusable for me since this issue causes me to hard-reset my PC every time it happens; it does not recover if I don't.
As it looks unrelated to this issue, I recommend creating a new issue with all the log files attached, and the model names and firmware versions of each component.
I have experienced this issue since I started testing kernel 6.3-rc5.
[110705.424250] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_high timeout, signaled seq=363925, emitted seq=363926
[110705.424700] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 2736 thread gnome-shel:cs0 pid 2767
[110705.425111] amdgpu 0000:04:00.0: amdgpu: GPU reset begin!
[110705.579614] amdgpu 0000:04:00.0: amdgpu: MODE2 reset
[110705.579700] amdgpu 0000:04:00.0: amdgpu: GPU reset succeeded, trying to resume
[110705.579854] [drm] PCIE GART of 1024M enabled.
[110705.579857] [drm] PTB located at 0x000000F41FC00000
[110705.579923] [drm] PSP is resuming...
[110706.302321] [drm] reserve 0x400000 from 0xf41f800000 for PSP TMR
[110706.583845] amdgpu 0000:04:00.0: amdgpu: RAS: optional ras ta ucode is not available
[110706.595135] amdgpu 0000:04:00.0: amdgpu: RAP: optional rap ta ucode is not available
[110706.595137] amdgpu 0000:04:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[110706.595140] amdgpu 0000:04:00.0: amdgpu: SMU is resuming...
[110706.595778] amdgpu 0000:04:00.0: amdgpu: SMU is resumed successfully!
[110706.596428] [drm] DMUB hardware initialized: version=0x0101001F
[110706.930413] [drm] kiq ring mec 2 pipe 1 q 0
[110706.933424] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[110706.933473] [drm] JPEG decode initialized successfully.
[110706.933477] amdgpu 0000:04:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
[110706.933481] amdgpu 0000:04:00.0: amdgpu: ring gfx_low uses VM inv eng 1 on hub 0
[110706.933482] amdgpu 0000:04:00.0: amdgpu: ring gfx_high uses VM inv eng 4 on hub 0
[110706.933483] amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 5 on hub 0
[110706.933484] amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 6 on hub 0
[110706.933485] amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 7 on hub 0
[110706.933486] amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 8 on hub 0
[110706.933487] amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 9 on hub 0
[110706.933488] amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 10 on hub 0
[110706.933489] amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 11 on hub 0
[110706.933490] amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 12 on hub 0
[110706.933491] amdgpu 0000:04:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 13 on hub 0
[110706.933492] amdgpu 0000:04:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
[110706.933493] amdgpu 0000:04:00.0: amdgpu: ring vcn_dec uses VM inv eng 1 on hub 1
[110706.933494] amdgpu 0000:04:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 4 on hub 1
[110706.933495] amdgpu 0000:04:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 5 on hub 1
[110706.933495] amdgpu 0000:04:00.0: amdgpu: ring jpeg_dec uses VM inv eng 6 on hub 1
[110706.936281] amdgpu 0000:04:00.0: amdgpu: recover vram bo from shadow start
[110706.936286] amdgpu 0000:04:00.0: amdgpu: recover vram bo from shadow done
[110706.936356] amdgpu 0000:04:00.0: amdgpu: GPU reset(2) succeeded!
[110706.937520] [drm] Skip scheduling IBs!
[110707.004236] amdgpu_cs_ioctl: 1 callbacks suppressed
[110707.004239] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[110707.017729] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[110707.187542] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
In most cases it recovers, but sometimes it seems to be stuck in a weird state where it seems to "repeat" the last animation or similar; it is hard to describe.
The system itself is unaffected, just GPU/display output are broken.
I'm more or less reliably able to reproduce this by quickly/heavily using Google Maps at zoomed-out zoom levels in Firefox on Ubuntu; the most recent crash happened with Google Earth Web in Chrome.
Symptoms are a crash to a screen that shows boot messages. I don't seem to be able to get a console or anything else useful from there and need to force-poweroff the machine.
kernel: [198871.116760] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_high timeout, signaled seq=3351772, emitted seq=3351774
kernel: [198871.117505] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 3623 thread gnome-shel:cs0 pid 3668
kernel: [198871.118214] amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
kernel: [198871.268814] [drm] psp gfx command UNLOAD_TA(0x2) failed and response status is (0x117)
kernel: [198871.295338] amdgpu 0000:07:00.0: amdgpu: MODE2 reset
kernel: [198871.295395] amdgpu 0000:07:00.0: amdgpu: GPU reset succeeded, trying to resume
kernel: [198871.295597] [drm] PCIE GART of 1024M enabled.
kernel: [198871.295599] [drm] PTB located at 0x000000F47FC00000
kernel: [198871.295660] [drm] PSP is resuming...
kernel: [198871.996967] [drm] reserve 0x400000 from 0xf47f800000 for PSP TMR
kernel: [198872.261894] amdgpu 0000:07:00.0: amdgpu: RAS: optional ras ta ucode is not available
kernel: [198872.272774] amdgpu 0000:07:00.0: amdgpu: RAP: optional rap ta ucode is not available
kernel: [198872.278755] [drm] psp gfx command LOAD_TA(0x1) failed and response status is (0x7)
kernel: [198872.278899] [drm] psp gfx command INVOKE_CMD(0x3) failed and response status is (0x4)
kernel: [198872.278906] amdgpu 0000:07:00.0: amdgpu: Secure display: Generic Failure.
kernel: [198872.278914] amdgpu 0000:07:00.0: amdgpu: SECUREDISPLAY: query securedisplay TA failed. ret 0x0
kernel: [198872.278921] amdgpu 0000:07:00.0: amdgpu: SMU is resuming...
kernel: [198872.279350] amdgpu 0000:07:00.0: amdgpu: SMU is resumed successfully!
kernel: [198872.279790] [drm] DMUB hardware initialized: version=0x01010026
kernel: [198872.627457] [drm] kiq ring mec 2 pipe 1 q 0
kernel: [198872.810879] amdgpu 0000:07:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
kernel: [198872.811161] [drm:amdgpu_gfx_enable_kcq [amdgpu]] *ERROR* KCQ enable failed
kernel: [198872.811379] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v9_0> failed -110
kernel: [198872.811597] amdgpu 0000:07:00.0: amdgpu: GPU reset(2) failed
kernel: [198872.811649] amdgpu 0000:07:00.0: amdgpu: GPU reset end with ret = -110
kernel: [198872.811652] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -110
rtkit-daemon[2054]: message repeated 3 times: [ Supervising 14 threads of 11 processes of 1 users.]
firefox_firefox.desktop[6647]: [GFX1-]: GFX: RenderThread detected a device reset in PostUpdate
google-chrome.desktop[5953]: [5992:5992:0525/212139.578910:ERROR:shared_context_state.cc(870)] SharedContextState context lost via ARB/EXT_robustness. Reset status = GL_INNOCENT_CONTEXT_RESET_KHR
google-chrome.desktop[5953]: [5992:5992:0525/212139.579172:ERROR:gpu_service_impl.cc(986)] Exiting GPU process because some drivers can't recover from errors. GPU process will restart shortly.
gnome-shell[3623]: amdgpu: The CS has been rejected (-125), but the context isn't robust.
gnome-shell[3623]: amdgpu: The process will be terminated.
kernel: [198872.823903] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
google-chrome.desktop[5953]: [113439:1:0525/212139.603449:ERROR:command_buffer_proxy_impl.cc(325)] GPU state invalid after WaitForGetOffsetInRange.
pavucontrol[21544]: Error reading events from display: Broken pipe
thunderbird[18560]: Error reading events from display: Broken pipe
Web Content[18719]: Error reading events from display: Broken pipe
WebExtensions[18799]: Error reading events from display: Broken pipe
gnome-calendar[5820]: Error reading events from display: Broken pipe
System data:
------------------------------------------------------------------------
Kernel (`uname -a`):
Linux hostname-redacted 6.2.0-20-generic #20-Ubuntu SMP PREEMPT_DYNAMIC Thu Apr 6 07:48:48 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
------------------------------------------------------------------------
Distro (`cat /etc/os-release`):
PRETTY_NAME="Ubuntu 23.04"
NAME="Ubuntu"
VERSION_ID="23.04"
VERSION="23.04 (Lunar Lobster)"
VERSION_CODENAME=lunar
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=lunar
LOGO=ubuntu-logo
------------------------------------------------------------------------
Display server (X11 or Wayland?)
$XDG_SESSION_TYPE: wayland
loginctl-based detection: wayland
------------------------------------------------------------------------
Audio (Pipewire or not?, `pactl info | grep ^Server Name`)
Server Name: PulseAudio (on PipeWire 0.3.65)
------------------------------------------------------------------------
Desktop environment:
$XDG_CURRENT_DESKTOP: ubuntu:GNOME
$XDG_SESSION_DESKTOP: ubuntu
$DESKTOP_SESSION: ubuntu
------------------------------------------------------------------------
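For readers puzzling over the bare numbers in the log above: -110 and -125 are negated Linux errno values. Here is a small Python sketch that decodes just those two. The table is hard-coded on purpose, because Python's errno module reports the numbering of whatever platform it runs on; `decode_kernel_error` and `LINUX_ERRNO` are illustrative names, not anything from the driver.

```python
# Linux errno numbers for the two codes that appear in the logs of this thread.
# Hard-coded because other platforms number these errors differently.
LINUX_ERRNO = {
    110: ("ETIMEDOUT", "Connection timed out"),
    125: ("ECANCELED", "Operation canceled"),
}


def decode_kernel_error(code):
    """Turn a kernel-style negative return code into a readable string."""
    name, text = LINUX_ERRNO.get(abs(code), ("UNKNOWN", "not in this table"))
    return f"{code}: {name} ({text})"


if __name__ == "__main__":
    # "GPU Recovery Failed: -110" and "Failed to initialize parser -125!"
    for code in (-110, -125):
        print(decode_kernel_error(code))
```

So the reset itself timed out (-110), and afterwards the kernel cancelled further submissions from the affected processes (-125).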
Thank you for your report. If you do not use the same hardware as I do, I recommend creating a separate report with the output of dmesg attached, and referencing the new issue here.
Got the crash again. Firefox always seems to be the application that triggers the crash. I'm also unable to get the computer to suspend, as it wakes up nearly instantly when suspend is selected.
Both issues started when I swapped my old Nvidia 1070 Ti for an AMD 6900 XT.
Here's a log where this happens with Firefox. It seems to keep repeating, so I'll cut it.
This seems newish for me on Fedora, since about 2 weeks ago.
I have seen this in other contexts too, when trying to screen-share from the Edge Flatpak via the Teams PWA. Not sure what the other cases triggering it were.
I'm experiencing a similar error when playing CS2 through Steam. The game freezes, holds for 5-10 seconds, then I hear some game sound before it freezes again completely. I then have to kill the cs2 process.
Before I killed the process, I checked dmesg and found this error:
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
..and nothing more.
uname -ra
Linux silje 6.5.6-300.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Oct 6 19:57:21 UTC 2023 x86_64 GNU/Linux
It's happening very rarely when playing certain games. I'm still not sure if it's Vulkan-specific or an issue with any game. I'm also not certain what the actual fix is. Trying every suggested solution or patch will take a long time due to how rare it is.
Here's the related crash message from journalctl: journalctl.txt
After the GPU is reset, the affected processes are no longer allowed to submit work to the GPU unless they recreate their contexts. This can be done using the OpenGL and Vulkan robustness features, but very few applications do so. The app tried to keep submitting work, but the kernel rejected it. Ideally the compositor would support the robustness extensions and recreate its context when it's lost, so you retain your desktop. This is how it works on other OSes, but at the moment most compositors don't support this functionality. Apps that use the GPU won't update until they are killed or they recreate their context. You can try restarting your desktop manager.
xfwm4 doesn't use OpenGL for compositing (only optionally for vblank); it relies on XRender for compositing, and as such cannot use the robustness features itself.
If you're using XPresent (the default in xfwm4), there is no GL context involved, so this is not applicable.
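To make the "recreate the context" idea concrete, here is a toy simulation in Python. It does not call any real OpenGL, Vulkan, or amdgpu API: `FakeGpu`, `FakeContext`, `ContextLost`, and `submit_robust` are all hypothetical names, and the only detail taken from this thread is that submissions from a stale context are rejected with -125 after a reset.

```python
ECANCELED = 125  # Linux errno; shows up as -125 in the logs in this thread


class ContextLost(Exception):
    """Raised by the fake driver when a stale context submits work."""


class FakeGpu:
    """Hypothetical stand-in for the device; only counts resets."""

    def __init__(self):
        self.reset_count = 0

    def reset(self):
        self.reset_count += 1


class FakeContext:
    """Hypothetical context that becomes invalid once the GPU has been reset."""

    def __init__(self, gpu):
        self.gpu = gpu
        self.created_at_reset = gpu.reset_count

    def submit(self, job):
        if self.created_at_reset != self.gpu.reset_count:
            # Mirrors "Failed to initialize parser -125!" from the logs.
            raise ContextLost(-ECANCELED)
        return f"done:{job}"


def submit_robust(gpu, ctx, job):
    """Submit job; if the context was lost, create a new one and retry once.

    Returns (context_to_use_from_now_on, result).
    """
    try:
        return ctx, ctx.submit(job)
    except ContextLost:
        ctx = FakeContext(gpu)  # a real app would also have to re-upload its resources
        return ctx, ctx.submit(job)


if __name__ == "__main__":
    gpu = FakeGpu()
    ctx = FakeContext(gpu)
    ctx, result = submit_robust(gpu, ctx, "frame-1")
    gpu.reset()  # simulated GPU reset
    ctx, result = submit_robust(gpu, ctx, "frame-2")
    print(result)
```

An application without such a retry path behaves like calling `submit` directly on the old context: every later submission fails, which matches the repeated -125 errors in the logs.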