Regression: Since Mesa 23.1.x, onboard AMD Radeon graphics APUs hang / freeze with "ring stalled" errors after a few seconds using GNOME Shell
Brief summary of the problem:
Since upgrading Mesa to version 23.1.3, this computer with onboard AMD graphics reliably and reproducibly hangs or freezes in various ways (e.g. input becomes unresponsive, animations stop, graphical corruption appears if you restart gdm) after a few seconds of using GNOME Shell (e.g. launching applications or trying to open the "Activities" overview) under Wayland (I did not test Xorg). Surprisingly, the hangs eventually recover after about a minute, rather than bringing down the whole machine or kernel; maybe because I'm running on Wayland?
Downgrading package versions to 23.0.1 "fixes" the problem.
The dmesg logs show the infamous "ring stalled" errors that have already been reported in other issues here.
...except that this time it has nothing to do with suspend/resume (it happens on a fresh boot, unlike issue #2252) and nothing to do with video or hardware-accelerated decoding as in issue #2440 (unless the fact that GNOME Remote Desktop is present and activated is related, but no remote connection is actually in use while this bug is occurring). It happens quickly after basic usage on a fresh boot, and it didn't before this update, so this may be an independent regression.
Hardware description:
- CPU: AMD Phenom 9600B Quad-Core, on a HP Compaq dc5850 Microtower (AP470US#ABC)
- GPU:
  01:05.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] RS780C [Radeon 3100] [1002:9611]
  physical id: 5
  bus info: pci@0000:01:05.0
  logical name: /dev/fb0
  version: 00
  width: 32 bits
  clock: 33MHz
  capabilities: pm msi vga_controller bus_master cap_list rom fb
  configuration: depth=32 driver=radeon latency=0 resolution=1600,900
  resources: irq:18 memory:e0000000-efffffff ioport:1100(size=256) memory:f0100000-f010ffff memory:f0000000-f00fffff memory:c0000-dffff
- System Memory: 4GiB DDR2
- Display(s): HP 2011x LCD TN display, 1600x900 resolution
- Type of Display Connection: VGA (D-Sub) (it's the only output on the motherboard, AFAICS)
System information:
- Distro name and Version: Fedora 38, 64-bit, GNOME on Wayland
- Kernel version: 6.3.8-200.fc38.x86_64
- Custom kernel: N/A
- AMD official driver version: N/A (default open source AMD/radeon Mesa Gallium driver provided by Fedora)
How to reproduce the issue:
On this kind of computer with onboard AMD graphics, a fully up-to-date Fedora 38 with GNOME is sufficient to trigger the issue on my end: log into a GNOME Wayland session, launch a few applications (such as Nautilus, Evolution, Firefox, etc.), and press the Super key (or use the Activities button) to try to open the overview. At that point, I experience a hang that can easily last a minute.
Attached files:
Log files
- Full dmesg log: attached as 2023-07-04_full_dmesg_on_radeon_APU_lockup_on_home_computer.txt
- Xorg log: not available, presumably because the computer is running Wayland; at least, I could not find any Xorg log on the system partition or in ~/.local/share/xorg/
- Any other log: see below
Watching the computer "live" over SSH with journalctl -f, this is what happens as soon as the GPU locks up, a few seconds after logging into a GNOME Shell Wayland session, when you mouse over some app launcher icons:
jui 04 13:01:18 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 10080msec
jui 04 13:01:18 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:19 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 10584msec
jui 04 13:01:19 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:19 desktop-house realmd[1331]: quitting realmd service after timeout
jui 04 13:01:19 desktop-house realmd[1331]: stopping service
jui 04 13:01:19 desktop-house systemd[1]: realmd.service: Deactivated successfully.
jui 04 13:01:19 desktop-house audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=realmd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
jui 04 13:01:19 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 11088msec
jui 04 13:01:19 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:20 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 11592msec
jui 04 13:01:20 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:20 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 12096msec
jui 04 13:01:20 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:21 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 12600msec
jui 04 13:01:21 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:21 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 13104msec
jui 04 13:01:21 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:22 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 13608msec
jui 04 13:01:22 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:22 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 14112msec
jui 04 13:01:22 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:23 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 14616msec
jui 04 13:01:23 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:23 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 15120msec
jui 04 13:01:23 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:24 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 15624msec
jui 04 13:01:24 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:24 desktop-house systemd[1]: systemd-hostnamed.service: Deactivated successfully.
jui 04 13:01:24 desktop-house audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
jui 04 13:01:24 desktop-house audit: BPF prog-id=75 op=UNLOAD
jui 04 13:01:24 desktop-house audit: BPF prog-id=74 op=UNLOAD
jui 04 13:01:24 desktop-house audit: BPF prog-id=73 op=UNLOAD
jui 04 13:01:24 desktop-house systemd[1]: systemd-localed.service: Deactivated successfully.
jui 04 13:01:24 desktop-house audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=systemd-localed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
jui 04 13:01:24 desktop-house audit: BPF prog-id=82 op=UNLOAD
jui 04 13:01:24 desktop-house audit: BPF prog-id=81 op=UNLOAD
jui 04 13:01:24 desktop-house audit: BPF prog-id=80 op=UNLOAD
jui 04 13:01:24 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 16128msec
jui 04 13:01:24 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:25 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 16632msec
jui 04 13:01:25 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:25 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 17136msec
jui 04 13:01:25 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:26 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 17640msec
jui 04 13:01:26 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:26 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 18144msec
jui 04 13:01:26 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:27 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 18648msec
jui 04 13:01:27 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:27 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 19152msec
jui 04 13:01:27 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:28 desktop-house chronyd[722]: Selected source 216.197.156.83 (2.fedora.pool.ntp.org)
jui 04 13:01:28 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 19656msec
jui 04 13:01:28 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:28 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 20160msec
jui 04 13:01:28 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:29 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 20664msec
jui 04 13:01:29 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:29 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 21168msec
jui 04 13:01:29 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:30 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 21672msec
jui 04 13:01:30 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:30 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 22176msec
jui 04 13:01:30 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:31 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 22680msec
jui 04 13:01:31 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:31 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 23184msec
jui 04 13:01:31 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:32 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 23688msec
jui 04 13:01:32 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:32 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 24193msec
jui 04 13:01:32 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:33 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 24696msec
jui 04 13:01:33 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:33 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 25200msec
jui 04 13:01:33 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:34 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 25704msec
jui 04 13:01:34 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:34 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 26208msec
jui 04 13:01:34 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:35 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 26712msec
jui 04 13:01:35 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:35 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 27216msec
jui 04 13:01:35 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:36 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 27720msec
jui 04 13:01:36 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:36 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 28224msec
jui 04 13:01:36 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:37 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 28728msec
jui 04 13:01:37 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:37 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 29232msec
jui 04 13:01:37 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:38 desktop-house kernel: radeon 0000:01:05.0: ring 0 stalled for more than 29736msec
jui 04 13:01:38 desktop-house kernel: radeon 0000:01:05.0: GPU lockup (current fence id 0x00000000000002ae last fence id 0x00000000000002af on ring 0)
jui 04 13:01:38 desktop-house kernel: radeon 0000:01:05.0: Saved 25 dwords of commands on ring 0.
jui 04 13:01:38 desktop-house kernel: radeon 0000:01:05.0: GPU softreset: 0x00000008
jui 04 13:01:38 desktop-house kernel: radeon 0000:01:05.0: R_008010_GRBM_STATUS = 0xA0002030
jui 04 13:01:38 desktop-house kernel: radeon 0000:01:05.0: R_008014_GRBM_STATUS2 = 0x00000003
jui 04 13:01:38 desktop-house kernel: radeon 0000:01:05.0: R_000E50_SRBM_STATUS = 0x20000040
jui 04 13:01:38 desktop-house kernel: radeon 0000:01:05.0: R_008674_CP_STALLED_STAT1 = 0x00000000
jui 04 13:01:38 desktop-house kernel: radeon 0000:01:05.0: R_008678_CP_STALLED_STAT2 = 0x00040000
jui 04 13:01:38 desktop-house kernel: radeon 0000:01:05.0: R_00867C_CP_BUSY_STAT = 0x00000000
jui 04 13:01:38 desktop-house kernel: radeon 0000:01:05.0: R_008680_CP_STAT = 0x80040000
jui 04 13:01:38 desktop-house kernel: radeon 0000:01:05.0: R_00D034_DMA_STATUS_REG = 0x44C83D57
jui 04 13:01:38 desktop-house kernel: radeon 0000:01:05.0: R_008020_GRBM_SOFT_RESET=0x00004001
jui 04 13:01:38 desktop-house kernel: radeon 0000:01:05.0: SRBM_SOFT_RESET=0x00000100
jui 04 13:01:38 desktop-house kernel: radeon 0000:01:05.0: R_008010_GRBM_STATUS = 0xA0003030
jui 04 13:01:38 desktop-house kernel: radeon 0000:01:05.0: R_008014_GRBM_STATUS2 = 0x00000003
jui 04 13:01:38 desktop-house kernel: radeon 0000:01:05.0: R_000E50_SRBM_STATUS = 0x20008040
jui 04 13:01:38 desktop-house kernel: radeon 0000:01:05.0: R_008674_CP_STALLED_STAT1 = 0x00000000
jui 04 13:01:38 desktop-house kernel: radeon 0000:01:05.0: R_008678_CP_STALLED_STAT2 = 0x00000000
jui 04 13:01:38 desktop-house kernel: radeon 0000:01:05.0: R_00867C_CP_BUSY_STAT = 0x00000000
jui 04 13:01:38 desktop-house kernel: radeon 0000:01:05.0: R_008680_CP_STAT = 0x80100000
jui 04 13:01:38 desktop-house kernel: radeon 0000:01:05.0: R_00D034_DMA_STATUS_REG = 0x44C83D57
jui 04 13:01:38 desktop-house kernel: radeon 0000:01:05.0: GPU reset succeeded, trying to resume
jui 04 13:01:38 desktop-house kernel: [drm] PCIE GART of 512M enabled (table at 0x00000000C0040000).
jui 04 13:01:38 desktop-house kernel: radeon 0000:01:05.0: WB enabled
jui 04 13:01:38 desktop-house kernel: radeon 0000:01:05.0: fence driver on ring 0 use gpu addr 0x00000000a0000c00
jui 04 13:01:38 desktop-house kernel: debugfs: File 'radeon_ring_gfx' in directory '0' already present!
jui 04 13:01:38 desktop-house kernel: [drm] ring test on 0 succeeded in 1 usecs
jui 04 13:01:39 desktop-house gnome-shell[1802]: radeon: The kernel rejected CS, see dmesg for more information (-16).
jui 04 13:01:39 desktop-house kernel: [drm:r600_ib_test [radeon]] *ERROR* radeon: fence wait timed out.
jui 04 13:01:39 desktop-house kernel: [drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed testing IB on GFX ring (-110).
jui 04 13:01:39 desktop-house gnome-shell[1802]: radeon: The kernel rejected CS, see dmesg for more information (-16).
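To make a long journal like the one above easier to compare between Mesa versions, the stall messages can be condensed into a short summary. This is only a minimal sketch: the function name `summarize_stalls` and the regular expressions are my own, written against the message format seen in this log, and they may need adjusting for other kernels or GPUs.

```python
import re

# Patterns written against the radeon kernel messages shown in the log above
# (assumption: other kernel versions may word these messages differently).
STALL_RE = re.compile(r"radeon (\S+): ring (\d+) stalled for more than (\d+)msec")
RESET_RE = re.compile(r"radeon (\S+): GPU reset succeeded")


def summarize_stalls(lines):
    """Return (stall_count, longest_stall_msec, reset_seen) for journal lines."""
    count = 0
    longest = 0
    reset_seen = False
    for line in lines:
        match = STALL_RE.search(line)
        if match:
            count += 1
            longest = max(longest, int(match.group(3)))
        elif RESET_RE.search(line):
            reset_seen = True
    return count, longest, reset_seen
```

For example, the output of `journalctl -k -b` could be saved to a file and its lines passed to `summarize_stalls`; on the excerpt above, the longest reported stall is 29736 msec, followed by a successful GPU reset.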
Additional troubleshooting and package versions info
Kernel downgrades between 6.3.8 and 6.3.5 made no difference, so it's not a kernel regression; neither did downgrading gnome-shell, mutter, or the amd-gpu-firmware Fedora package.
Doing a dnf downgrade *mesa* to 23.0.1 instead of 23.1.3 is what fixed it (no need to downgrade packages matching *radeon* and *amd*, according to my testing). These are the packages that get downgraded by the command, leading to a stable system again:
mesa-dri-drivers-23.0.1-2.fc38.x86_64
mesa-filesystem-23.0.1-2.fc38.x86_64
mesa-libEGL-23.0.1-2.fc38.x86_64
mesa-libGL-23.0.1-2.fc38.x86_64
mesa-libgbm-23.0.1-2.fc38.x86_64
mesa-libglapi-23.0.1-2.fc38.x86_64
mesa-libxatracker-23.0.1-2.fc38.x86_64
mesa-va-drivers-23.0.1-2.fc38.x86_64
mesa-vulkan-drivers-23.0.1-2.fc38.x86_64
With these, the system no longer experiences hangs/freezes, and the journalctl logs no longer show anything notable regarding the radeon graphics.
The faulty packages, according to a dnf upgrade without versionlock, would be:
mesa-dri-drivers 23.1.3-1.fc38
mesa-filesystem 23.1.3-1.fc38
mesa-libEGL 23.1.3-1.fc38
mesa-libGL 23.1.3-1.fc38
mesa-libgbm 23.1.3-1.fc38
mesa-libglapi 23.1.3-1.fc38
mesa-libxatracker 23.1.3-1.fc38
mesa-va-drivers 23.1.3-1.fc38
mesa-vulkan-drivers 23.1.3-1.fc38
I have now done a dnf versionlock *mesa*, which freezes the installed package versions to 23.0.1, to avoid accidental upgrades.
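To double-check that no Mesa package has slipped past the lock, the installed package names (for instance as printed by `rpm -qa 'mesa*'`) can be compared against the known-good version. The sketch below is illustrative only: the helper names are hypothetical, and it assumes purely numeric dotted versions such as 23.0.1, so pre-release version strings would need extra handling.

```python
def parse_nvra(package):
    """Split an RPM 'name-version-release.arch' string into its four parts."""
    nvr, arch = package.rsplit(".", 1)
    name, version, release = nvr.rsplit("-", 2)
    return name, version, release, arch


def version_tuple(version):
    """Turn '23.0.1' into (23, 0, 1); assumes numeric components only."""
    return tuple(int(part) for part in version.split("."))


def packages_newer_than(packages, max_version):
    """Return the packages whose version is strictly greater than max_version."""
    limit = version_tuple(max_version)
    return [p for p in packages if version_tuple(parse_nvra(p)[1]) > limit]
```

Calling `packages_newer_than(installed, "23.0.1")` on the list of installed Mesa packages should return an empty list while the lock is holding.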