I get random hard freezes of Wayland KDE Plasma. Sometimes it can be fine for weeks, and other times I get the issue multiple times a day. This typically occurs around suspend/thaw; however, the attached files cover a crash when attempting to move a window. I've had this for a number of months and attempted many fixes, but none has resolved the issue. This time I was able to get more info to help fix the problem. This issue was also experienced on Gnome 46, however I only used that for reproducing. The info below is for the issue on KDE.
In this specific scenario, the laptop was connected to the Lenovo TB3 dock (type 40AC) via a certified TB3 cable (99% of the time, the laptop is connected via the dock), and entered sleep. While the laptop was asleep, the USB-C/TB3 connection was removed. The laptop was woken up without any peripherals connected, a window was resized and the crash occurred. I haven't had any luck with the SysRq+REISUB method suggested online when trying to get more info on the issue, but SSH did work to obtain the following attachments.
Backtrace file should have 2 attempts; one before and one after installing the kwin-wayland-debuginfo package.
Recovery on the laptop is a press-and-hold of the power button until the laptop turns off, so essentially a 'hard crash'. I didn't think to try a graceful restart via SSH at the time, but I assume that would work since SSH responds.
The laptop is a Framework 13 (i5-1240P), running the latest Fedora 40 KDE spin.
$ uname -srvmo
Linux 6.10.11-200.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Sep 18 21:09:58 UTC 2024 x86_64 GNU/Linux
$ sudo lspci -vnn -d :*:0300
00:02.0 VGA compatible controller [0300]: Intel Corporation Alder Lake-P GT2 [Iris Xe Graphics] [8086:46a6] (rev 0c) (prog-if 00 [VGA controller])
    Subsystem: Framework Computer Inc. Device [f111:0002]
    Flags: bus master, fast devsel, latency 0, IRQ 163, IOMMU group 0
    Memory at 605c000000 (64-bit, non-prefetchable) [size=16M]
    Memory at 4000000000 (64-bit, prefetchable) [size=256M]
    I/O ports at 3000 [size=64]
    Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
    Capabilities: [40] Vendor Specific Information: Len=0c <?>
    Capabilities: [70] Express Root Complex Integrated Endpoint, IntMsgNum 0
    Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit-
    Capabilities: [d0] Power Management version 2
    Capabilities: [100] Process Address Space ID (PASID)
    Capabilities: [200] Address Translation Service (ATS)
    Capabilities: [300] Page Request Interface (PRI)
    Capabilities: [320] Single Root I/O Virtualization (SR-IOV)
    Kernel driver in use: i915
    Kernel modules: i915, xe
Edit: my apologies if the above is poorly formatted. At original posting time I couldn't edit, and now that I can, it doesn't look quite like I remember after using the code tags.
I believe I can reproduce this fairly frequently by sleeping the device with the dock (& external monitor) connected.
disconnect dock while asleep and then wake laptop up.
this seems to trigger the "window move" behaviour as it attempts to fit the new screen geometry.
@Tau512 , in dmesg.log I see suspend happening with only the monitor via the hub enabled (and eDP disabled) and resuming with the hub/monitor disconnected (matching your description of the scenario).
After system suspending/resuming the driver restores the output on the hub/monitor, where link training fails as expected, since the monitor is disconnected, but the output from source on pipe A is left enabled for userspace/kernel FB client as it was before suspend:
Then userspace/kernel FB client tries to enable eDP on the already active pipe A, which will fail as expected, since pipe A is still enabled for the hub/monitor (that got disconnected since):
Afterwards I can't see anything noteworthy in the log for the remaining ~30 sec (besides what is probably cursor movement w/o any errors). So I can't see any issues on the KMD side, it's doing what userspace requests.
Perhaps after the modeset errors at resume time, userspace gets confused somehow, leading to either the crash you mentioned, or an unexpected layout change? A dmesg from the exact same scenario, booting with drm.debug=0x15e would tell more about the exact userspace modesetting parameters, that could probably help root causing the issue in the userspace component.
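For reference, the drm.debug value is a bitmask of debug categories. Below is a minimal sketch for decoding such a value, assuming the DRM_UT_* bit assignments from the kernel's drm_print.h (worth double-checking against the kernel in use). The helper name decode_drm_debug is hypothetical.

```shell
#!/bin/sh
# Decode a drm.debug bitmask into category names.
# Assumption: bit values follow the DRM_UT_* categories in the kernel's
# drm_print.h; verify them against the kernel version you are running.
decode_drm_debug() {
    mask=$(( $1 ))
    out=""
    for entry in 1:CORE 2:DRIVER 4:KMS 8:PRIME 16:ATOMIC 32:VBL 64:STATE 128:LEASE 256:DP 512:DRMRES; do
        bit=${entry%%:*}     # numeric bit value before the colon
        name=${entry#*:}     # category name after the colon
        if [ $(( mask & bit )) -ne 0 ]; then
            out="${out:+$out }$name"
        fi
    done
    printf '%s\n' "$out"
}

# Usage: decode_drm_debug 0x15e
```

With these assumed bit values, 0x15e selects DRIVER, KMS, PRIME, ATOMIC, STATE and DP, and 0x15f additionally enables CORE. On Fedora the parameter can usually be added with grubby, e.g. sudo grubby --update-kernel=ALL --args="drm.debug=0x15e", followed by a reboot.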
Thank you for the investigation. Your analysis matches my "end user" experience, but there is a lot of "noise" in dmesg to the untrained eye!
I've reproduced the issue using 0x15e. Hopefully it highlights something.
dmesg0x15e.log
I believe the point of interest is after 230 secs.
This time the issue didn't occur on the first window move (KDE Konsole) but only after attempting to move another app's window (SuperSlicer AppImage).
I believe my last dmesg freeze was actually moving the Konsole window (but can't be 100% sure).
I've seen a freeze moving a Firefox window too, but that's historic, from before this report was raised.
Hm, I can't see the debug output of modeset parameters for all commits. Was this log taken after booting with the drm.debug kernel parameter? I also can't see the previous system suspend/resume events here; otherwise there are similar failing commits on eDP due to incorrect commit parameters.
To check where things in the
#0 0x00007fc6a9f25f2d in ioctl () from /lib64/libc.so.6
#1 0x00007fc68ea107a1 in i915_gem_create () from /usr/lib64/dri/iris_dri.so
#2 0x00007fc68e9f5a5e in alloc_fresh_bo ()
backtrace hang: could you add to your kernel parameters: 'log_buf_len=20M drm.debug=0x15f', boot, reproduce the problem, then as soon as possible do as root on an ssh console:
# echo -e 'l\nw\nt' > /proc/sysrq-trigger
capture the gdb backtrace as before of kwin_wayland (should contain again the same i915_gem_create() -> ioctl() calls on top of stack) and attach this and (compressed if needed) dmesg, which should be now untruncated containing the boot-up messages as well? Thanks.
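Before reproducing, it may be worth confirming that the requested parameters actually took effect. Below is a minimal sketch with two hypothetical helpers, cmdline_has and dmesg_from_boot. The second one assumes that an untruncated kernel log starts with the "Linux version" banner line.

```shell
#!/bin/sh
# Succeed if the kernel command line text contains the exact token.
# $1: parameter token, e.g. drm.debug=0x15f
# $2: command line text, e.g. the contents of /proc/cmdline
cmdline_has() {
    for tok in $2; do    # unquoted on purpose: split on whitespace
        [ "$tok" = "$1" ] && return 0
    done
    return 1
}

# Read dmesg text on stdin; succeed if the log still starts at boot.
# Assumption: the first line of a complete log carries "Linux version".
dmesg_from_boot() {
    head -n 1 | grep -q 'Linux version'
}

# Usage (as root):
#   cmdline_has log_buf_len=20M "$(cat /proc/cmdline)" || echo "log_buf_len missing"
#   cmdline_has drm.debug=0x15f "$(cat /proc/cmdline)" || echo "drm.debug missing"
#   dmesg | dmesg_from_boot || echo "dmesg is truncated"
```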
Thanks. Unfortunately the sysrq info I hoped for didn't show up in dmesg. Could you check if you have CONFIG_MAGIC_SYSRQ=y in your kernel's .config and if not rebuild your kernel with that? Having that you can confirm that you get the expected sysrq output before reproducing the problem by # echo -e 'l\nw\nt' > /proc/sysrq-trigger
with 1 backslash in each \n and no backslash before > (what you pasted has extra backslashes) after which you should see the stack traces of all processes in dmesg, i.e. something like:
# dmesg | grep sysrq:
[338733.417862] sysrq: Show backtrace of all active CPUs
[338733.419227] sysrq: Show Blocked State
[338733.420084] sysrq: Show State
After ensuring that sysrq works as above you'd need to reproduce the problem again and redo # echo -e 'l\nw\nt' > /proc/sysrq-trigger.
After that please also do # find /proc -name stack -exec sh -c 'echo {}; cat {}' \; > stack.txt
in case dmesg logging itself is stuck somehow in the failure state. Please also verify in advance that the above command provides the expected stack traces before reproducing the problem, should output something like
with the stacktraces for all processes/tasks on your system.
Then, I'd like to ask for a full gdb output that is everything output by the $ sudo gdb -p $(pidof kwin_wayland)
command in particular containing the
Attaching to process <pid of kwin_wayland>
and the [New LWP <task ID>]
lines, but to be sure just everything output by gdb.
Then please attach dmesg, stack.txt and the gdb output.
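The collection steps above can be sketched as one script, so that they can be run quickly from an SSH session once the freeze occurs. This is only a sketch under assumptions: the function names and the hang-report output directory are hypothetical, each sysrq key is written separately rather than in one echo, and gdb is run in batch mode (its output may differ slightly from an interactive session, so it is worth comparing the two once beforehand).

```shell
#!/bin/sh
# Read dmesg text on stdin; succeed only if the headers for all three
# sysrq keys (l, w, t) were logged.
sysrq_output_ok() {
    log=$(cat)
    for header in "Show backtrace of all active CPUs" "Show Blocked State" "Show State"; do
        printf '%s\n' "$log" | grep -qF "sysrq: $header" || return 1
    done
}

# Print every per-task kernel stack under a proc-like root (default
# /proc), each preceded by its path. Errors from tasks that exit while
# find is running are discarded.
dump_stacks() {
    find "${1:-/proc}" -name stack -exec sh -c 'echo "$1"; cat "$1"' sh {} \; 2>/dev/null
}

# Full collection run; must be executed as root on the affected machine.
collect_hang_info() {
    outdir=${1:-hang-report}
    mkdir -p "$outdir" || return 1
    for key in l w t; do
        printf '%s' "$key" > /proc/sysrq-trigger
    done
    dmesg > "$outdir/dmesg.log"
    dump_stacks /proc > "$outdir/stack.txt"
    # Batch mode avoids gdb waiting for input while the compositor is stopped.
    gdb -p "$(pidof kwin_wayland)" -batch -ex 'thread apply all bt' > "$outdir/gdb.txt" 2>&1
}

# Usage (as root):
#   dmesg | sysrq_output_ok && echo "sysrq output present"
#   collect_hang_info
```

Running dmesg | sysrq_output_ok after a test trigger, before reproducing the problem, confirms that the sysrq output really reaches the kernel log.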
This is on the drm-tip as Suresh requested 4 days ago (all my outputs are from this compile)
The echo -e 'l\nw\nt' > /proc/sysrq-trigger was run as you typed it. Not sure if copy & paste or GitLab messed with the formatting, but it was run with single \n.
Bonus info maybe... This is the first time I've seen this, which was on reboot from the frozen screen. The service stopped shortly after (around the 60 sec mark) and the reboot continued successfully. I never typically experience slow shutdowns/reboots, so this appears to be new. I've no idea if it's related to the reported issue, but just mentioning it in case it's helpful.
GDB now seems to be missing the debug symbols (Missing debuginfo for kwin-wayland), but let's assume kwin-wayland with PID 2079 got stuck in i915_gem_set_domain() -> intel_ioctl() -> __GI___ioctl() this time as before.
And after ~2 sec without any activity in dmesg the process seems to be still spinning around the same spot:
[ 213.354082] CPU: 4 UID: 1000 PID: 2079 Comm: kwin_wayland Tainted: G W 6.12.0-rc3+
The reboot delay could be related to the above stuck state.
Not sure about the reason for i915_gem_set_domain_ioctl() getting stuck, it's making progress (state: R) so could just take a long time to complete. Someone from GEM team should continue checking this.
I did notice the missing debuginfo packages. Not sure of the reason, but when I attempted to install them, no packages were found. I just tried a package search and it's on version 6.2.1.1-1.fc40, so already superseded.
I can easily re-run and fix those missing debuginfo packages if needed. I didn't think about it yesterday, but dropping the targeted version should avoid the missing package.
I'm looking at it. I've switched my KDE environment from X11 to Wayland and have been trying to reproduce the issue on my laptop with ADL-P graphics [8086:46a6] + Lenovo ThinkPad TBT 3 Dock [17ef:3082], running Arch Linux, kernel version 6.11.5-arch1-1, kwin_wayland version 6.2.1 -- no success so far.
@Tau512 on the call traces you provided I can see the i915 graphics driver competing for physical memory with other consumers, e.g., Firefox. Can you please retry with a limited set of running applications (e.g., no Firefox)?
this is from the drm-tip build used throughout this issue:
clean boot & login
close any windows at login (Firefox & OpenRGB app+tray icon were the only apps running)
start dolphin & konsole (2 of each)
sleep, wait 15-30 secs
d/c tb3 cable, wait another 15-30secs (total sleep approximately 45 secs)
wake & login
move windows - no reproduce
reconnect tb3 cable
move windows - FREEZE
I can't see why there'd be a physical memory issue (I'm not saying you're wrong, Janusz). Even when I use the laptop with many apps, I tend to max out around 18-20GB of non-cached mem usage. That's very rare, and I typically run around 8-10GB used.
I think it's been mentioned already, but the laptop has 32GB RAM. Based on how long it's taken me to find something to bug report, and since it appears not to affect a large number of users, I have considered it to be some weird hardware issue on the laptop itself.
@AthAshino thanks for reporting. Unfortunately your report is missing details we need. Could you please go through the former comments and try to follow Imre's instructions on how to collect the required information, then post that info here when ready?
I've asked on https://bugs.kde.org/show_bug.cgi?id=493277, where the issue was first discussed together with a discussion on similar erratic behavior of kwin_wayland on AMD graphics (closed on Oct 23 as upstream resolved), for some clarification.
... the commit thread prints that message when it doesn't receive a pageflip event for an atomic commit it did in 5s. It doesn't process drm events itself though, the main thread is responsible for doing that. So when the main thread hangs, the commit thread also prints this warning in Plasma 6.2 (which will be fixed in 6.3, where it'll properly detect hang vs pageflip timeout).
@Tau512 could you please repeat once more the reproduction steps with drm.debug=0x15e and also provide output of journalctl this time? I'd like to see how that "Pageflip timed out!" message is correlated in time with DRM debug messages, which should all hopefully land in the system journal.
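To locate the relevant part of the journal, the lines around each timeout message can be extracted. This is a minimal sketch, assuming the message text is exactly "Pageflip timed out!" as quoted above. The helper name pageflip_context is hypothetical, and its output only supplements the full journal requested here.

```shell
#!/bin/sh
# Read journal text on stdin and print each "Pageflip timed out!" line
# together with surrounding context.
# $1: lines of context before each match (default 20)
# $2: lines of context after each match (default 5)
# Exit status is non-zero if the message does not occur at all.
pageflip_context() {
    grep -F -B "${1:-20}" -A "${2:-5}" 'Pageflip timed out!'
}

# Usage: monotonic timestamps make it easier to line up journal entries
# with the [seconds.micros] timestamps in dmesg.
#   journalctl -b -o short-monotonic --no-pager > journal.txt
#   pageflip_context 200 20 < journal.txt > pageflip.txt
```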
After upgrading to 6.12, waking from hibernate shows a black screen for a certain time, then after pressing random keys and mouse clicks the login screen appears, if I type the password it takes much longer to load desktop and all hibernated apps, but even if I recover mouse pointer and can switch between apps they all perform slowly.
One solution to recover the speed of my system is to disable swap (sudo swapoff -a) then re-enable it again (sudo swapon -a).
NB: Hibernation was working fine with kernels 6.10/6.11.
Just dropping here to say that I'm facing the same issue. I have a triple screen setup (QHD@165hz 100% scale/UHD@240hz 150% scale/FHD@144hz 125% scale).
One of my monitors will freeze completely at least once or twice a day. The only solution is "sudo systemctl restart sddm", sometimes it also causes my other monitors to freeze, and I have to go into TTY.
For me, it's nothing to do with hibernation/waking up. All my monitors are connected using DisplayPort 1.4, so I don't use a Thunderbolt dock, etc.
I'm trying to generate a backtrace of my scenario to attach here, but as soon as I run sudo gdb -p $(pidof kwin_wayland) > gdb.txt my system freezes completely and I have to reboot. What should I run if it freezes again? Consider that I'll most likely run this from another TTY (Ctrl+Alt+F3) while one of the monitors is frozen.
Edit: I'm monitoring wayland with this, if it crashes I should hopefully have a backtrace:
I don't know if this is relevant or not, but I just found out I can recover from a complete freeze, by going to TTY3, logging in and running loginctl list-sessions and then loginctl unlock-session <SESSION_ID>.
If there is someone reading this thread who suffers from this (kwin_wayland) issue and is able to reproduce it, could you please try to revert that commit (git revert 96a5c186) and check if that helps?
Hi! I'm also suffering from this issue. Reverted the commit you've mentioned, but the issue persists, got a kwin_wayland "Pageflip timed out!" freeze after about 28 hours of uptime. Managed to recover without a reboot by running killall kwin_wayland -KILL though, if that's of any significance.
EDIT: Downgraded the kernel all the way down to linux-lts 5.15.94, the issue is still reproducible like this:
Clean boot with tb cable plugged in
Disconnect the cable, wait 15 seconds
Move window - no repro
Connect tb cable, wait for the displays to initialize and settle
Move window - FREEZE
Also tried downgrading kwin; the issue is reproducible all the way down to v6.2.0, and I'm unable to log in after downgrading to v6.1.* or older.