kwin_wayland_drm: Pageflip timed out! This is a kernel bug

added Community feature: GEM platform: ADL_P labels

I believe I can reproduce this fairly frequently by suspending the device with the dock (and external monitor) connected, disconnecting the dock while it is asleep, and then waking the laptop up. This seems to trigger the "window move" behaviour as it attempts to fit the new screen geometry.

External screen: 3440x1440, 21:9 @ 100% scale
Laptop screen: 2256x1504, 3:2 @ 150% scale

@Tau512 Please try with the latest drm-tip and attach full dmesg logs, booting with the kernel parameters drm.debug=0x1e log_buf_len=4M. Wiki: https://drm.pages.freedesktop.org/intel-docs/how-to-file-i915-bugs.html

Hopefully this is everything you need. Reproduced as per the steps in my last response, with it freezing on a manual window move.

FYI, I've compiled the required drm-tip at commit 01c7b2c084e5c84313f382734c10945b9aa49823

dmesg.log

# lspci -vnn -d :*:0300
00:02.0 VGA compatible controller [0300]: Intel Corporation Alder Lake-P GT2 [Iris Xe Graphics] [8086:46a6] (rev 0c) (prog-if 00 [VGA controller])
        Subsystem: Framework Computer Inc. Device [f111:0002]
        Flags: bus master, fast devsel, latency 0, IRQ 163, IOMMU group 0
        Memory at 605c000000 (64-bit, non-prefetchable) [size=16M]
        Memory at 4000000000 (64-bit, prefetchable) [size=256M]
        I/O ports at 3000 [size=64]
        Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
        Capabilities: [40] Vendor Specific Information: Len=0c <?>
        Capabilities: [70] Express Root Complex Integrated Endpoint, IntMsgNum 0
        Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit-
        Capabilities: [d0] Power Management version 2
        Capabilities: [100] Process Address Space ID (PASID)
        Capabilities: [200] Address Translation Service (ATS)
        Capabilities: [300] Page Request Interface (PRI)
        Capabilities: [320] Single Root I/O Virtualization (SR-IOV)
        Kernel driver in use: i915
        Kernel modules: i915, xe

# uname -srvmo
Linux 6.12.0-rc3+ #1 SMP PREEMPT_DYNAMIC Tue Oct 15 21:25:06 BST 2024 x86_64 GNU/Linux

@Tau512 , in dmesg.log I see suspend happening with only the monitor via the hub enabled (and eDP disabled) and resuming with the hub/monitor disconnected (matching your description of the scenario).

After system suspending/resuming the driver restores the output on the hub/monitor, where link training fails as expected, since the monitor is disconnected, but the output from source on pipe A is left enabled for userspace/kernel FB client as it was before suspend:

[  141.798159] PM: suspend entry (s2idle)
[  181.469123] i915 0000:00:02.0: [drm:intel_dp_link_train_phy [i915]] [CONNECTOR:259:DP-1][ENCODER:258:DDI TC1/PHY TC1][DPRX] Sink disconnected: Failed to enable link training
[  181.513810] i915 0000:00:02.0: [drm:intel_enable_transcoder [i915]] enabling pipe A
[  181.560547] PM: resume devices took 0.392 seconds

Then userspace/kernel FB client tries to enable eDP on the already active pipe A, which will fail as expected, since pipe A is still enabled for the hub/monitor (that got disconnected since):

[  182.087571] i915 0000:00:02.0: [drm:drm_atomic_helper_check_modeset] [CONNECTOR:241:eDP-1] using [ENCODER:240:DDI A/PHY A] on [CRTC:82:pipe A]
[  182.087993] i915 0000:00:02.0: [drm:intel_atomic_check [i915]] [ENCODER:240:DDI A/PHY A] rejecting invalid cloning configuration
[  182.088311] i915 0000:00:02.0: [drm:intel_crtc_state_dump [i915]] [CRTC:82:pipe A] enable: yes [failed]

This incorrect modeset is retried a few times, after which pipe A is disabled and eDP enabled on it with a 2256x1504 mode/FB (i.e. un-scaled config):

[  182.675856] i915 0000:00:02.0: [drm:intel_disable_transcoder [i915]] disabling pipe A
[  182.728466] i915 0000:00:02.0: [drm:intel_disable_shared_dpll [i915]] disabling TBT PLL
[  182.951100] i915 0000:00:02.0: [drm:intel_power_well_enable [i915]] enabling DDI_IO_A
[  182.960820] i915 0000:00:02.0: [drm:intel_enable_transcoder [i915]] enabling pipe A
[  182.649760] i915 0000:00:02.0: [drm:intel_crtc_state_dump [i915]] pipe mode: "2256x1504": 60 235690 2256 2304 2336 2536 1504 1507 1513 1549 0x40 0x9
[  182.650311] i915 0000:00:02.0: [drm:intel_crtc_state_dump [i915]] port clock: 216000, pipe src: 2256x1504+0+0, pixel rate 235690
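
For readers unfamiliar with the "pipe mode" dump above, here is a minimal sketch of how the numbers can be read. It assumes the usual DRM mode dump field order (name, refresh, clock in kHz, four horizontal timings, four vertical timings, type, flags); the function names are hypothetical and not part of any tool.

```python
def parse_drm_mode(line: str) -> dict:
    """Parse a DRM mode dump such as the 'pipe mode:' line in this thread.

    Assumed field order: "name": vrefresh clock_khz hdisplay hsync_start
    hsync_end htotal vdisplay vsync_start vsync_end vtotal type flags
    """
    start = line.index('"')                      # skip any log prefix
    name, _, rest = line[start + 1:].partition('":')
    nums = [int(field) for field in rest.split()[:10]]
    keys = ["vrefresh", "clock_khz", "hdisplay", "hsync_start", "hsync_end",
            "htotal", "vdisplay", "vsync_start", "vsync_end", "vtotal"]
    mode = dict(zip(keys, nums))
    mode["name"] = name
    return mode


def refresh_hz(mode: dict) -> float:
    """Refresh rate = pixel clock / (total pixels per frame, incl. blanking)."""
    return mode["clock_khz"] * 1000 / (mode["htotal"] * mode["vtotal"])
```

Fed the mode line from the log, this gives a refresh rate just under 60 Hz, consistent with the "60" reported in the dump, i.e. the eDP mode itself looks ordinary.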

Afterwards I can't see anything noteworthy in the log for the remaining ~30 sec (besides what is probably cursor movement, without any errors). So I can't see any issues on the KMD side; it's doing what userspace requests.

Perhaps after the modeset errors at resume time, userspace gets confused somehow, leading to either the crash you mentioned or an unexpected layout change? A dmesg from the exact same scenario, booting with drm.debug=0x15e, would tell more about the exact userspace modesetting parameters, which could probably help root-cause the issue in the userspace component.
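
As a side note on the drm.debug values used in this thread (0x1e, 0x15e, 0x15f): the value is a bitmask of debug categories. The sketch below decodes it, assuming the bit assignments I recall from the kernel's drm_debug_category enum; please verify the table against the kernel tree you are running.

```python
# ASSUMPTION: bit values as recalled from include/drm/drm_print.h
# (enum drm_debug_category); verify against your kernel source.
DRM_DEBUG_BITS = {
    0x001: "CORE",
    0x002: "DRIVER",
    0x004: "KMS",
    0x008: "PRIME",
    0x010: "ATOMIC",
    0x020: "VBL",
    0x040: "STATE",
    0x080: "LEASE",
    0x100: "DP",
    0x200: "DRMRES",
}


def decode_drm_debug(mask: int) -> list[str]:
    """Return the names of the debug categories enabled in a drm.debug mask."""
    return [name for bit, name in DRM_DEBUG_BITS.items() if mask & bit]


if __name__ == "__main__":
    for value in (0x1E, 0x15E, 0x15F):
        print(hex(value), decode_drm_debug(value))
```

Under that assumption, 0x15e adds the STATE and DP categories on top of 0x1e, which would explain why it exposes more of the modesetting parameters.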

Thank you for the investigation. Your analysis matches my "end user" experience, but there is a lot of "noise" in dmesg to the untrained eye!

I've reproduced the issue using 0x15e. Hopefully it highlights something.

dmesg0x15e.log

I believe the point of interest is after 230 secs.

This time the issue didn't occur on the first window move (KDE Konsole) but only after attempting to move another app's window (Superslicer appimage). I believe my last dmesg freeze was actually while moving the Konsole window (but can't be 100% sure). I've seen a freeze moving a Firefox window too, but that's historic, from before this report was raised.

Hm, I can't see the debug output of modeset parameters for all commits. Was this log taken after booting with the drm.debug kernel parameter? I also can't see the previous system suspend/resume events here; otherwise there are similar failing commits on eDP due to incorrect commit parameters.

To check where things in the

#0  0x00007fc6a9f25f2d in ioctl ()
   from /lib64/libc.so.6
#1  0x00007fc68ea107a1 in i915_gem_create ()
   from /usr/lib64/dri/iris_dri.so
#2  0x00007fc68e9f5a5e in alloc_fresh_bo ()

backtrace hang: could you add to your kernel parameters: 'log_buf_len=20M drm.debug=0x15f', boot, reproduce the problem, then as soon as possible do as root on an ssh console:

# echo -e 'l\nw\nt' > /proc/sysrq-trigger

capture the gdb backtrace as before of kwin_wayland (should contain again the same i915_gem_create() -> ioctl() calls on top of stack) and attach this and (compressed if needed) dmesg, which should be now untruncated containing the boot-up messages as well? Thanks.

I believe it was booted with drm.debug=0x15e. I'll double-check the GRUB boot cmdline and provide all the requested info tomorrow.

dmesg0x15e-2410190939.log.gz

Not sure about the timeframes with this one. Exact process to reproduce this time:

start laptop, login, open some windows
sleep
disconnect USBC/TB3 cable
wake & login
move some windows to reproduce; unable to reproduce.
reconnect USBC/TB3 cable. External monitor restores window locations
move some windows to reproduce; successful freeze.
as su, attempt echo -e 'l\\nw\\nt' \> /proc/sysrq-trigger. no visible change/output.
run gdb -p $(pidof kwin_wayland)

confirmation on the drm.debug=0x15f enablement:

kwin_wayland_backtrace_241019.txt

Thanks. Unfortunately the sysrq info I hoped for didn't show up in dmesg. Could you check if you have CONFIG_MAGIC_SYSRQ=y in your kernel's .config and if not rebuild your kernel with that? Having that you can confirm that you get the expected sysrq output before reproducing the problem by
# echo -e 'l\nw\nt' > /proc/sysrq-trigger
with 1 backslash in each \n and no backslash before > (what you pasted has extra backslashes) after which you should see the stack traces of all processes in dmesg, i.e. something like:

# dmesg | grep sysrq:
[338733.417862] sysrq: Show backtrace of all active CPUs
[338733.419227] sysrq: Show Blocked State
[338733.420084] sysrq: Show State
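
To make the "did sysrq actually work" check above less error-prone, here is a small sketch that scans saved dmesg text for the three header lines shown. The function name is hypothetical, and the expected strings are simply the ones quoted in the sample output above.

```python
# Header lines expected in dmesg after writing 'l', 'w' and 't' to
# /proc/sysrq-trigger, taken from the sample output quoted in this thread.
EXPECTED_SYSRQ = {
    "l": "Show backtrace of all active CPUs",
    "w": "Show Blocked State",
    "t": "Show State",
}


def missing_sysrq_markers(dmesg_text: str) -> list[str]:
    """Return the sysrq keys whose header line does not appear in dmesg_text."""
    seen = set()
    for line in dmesg_text.splitlines():
        if "sysrq: " in line:
            seen.add(line.split("sysrq: ", 1)[1].strip())
    return [key for key, message in EXPECTED_SYSRQ.items() if message not in seen]
```

An empty result means all three dumps were triggered; a non-empty one lists the keys that left no trace in the log.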

After ensuring that sysrq works as above you'd need to reproduce the problem again and redo
# echo -e 'l\nw\nt' > /proc/sysrq-trigger

After that please also do
# find /proc -name stack -exec sh -c 'echo {}; cat {}' \; > stack.txt
in case dmesg logging itself is stuck somehow in the failure state. Please also verify in advance, before reproducing the problem, that the above command provides the expected stack traces; it should output something like

/proc/1/task/1/stack
[<0>] do_epoll_wait+0x71c/0x990
...
/proc/1/stack
[<0>] do_epoll_wait+0x71c/0x990
...

with the stacktraces for all processes/tasks on your system.
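
If it helps when sifting through a large stack.txt, the format shown above (a /proc path line followed by its frames) can be split per task with a few lines of Python. This is only a sketch with hypothetical function names, assuming the file looks exactly like the sample output.

```python
def parse_stack_dump(text: str) -> dict[str, list[str]]:
    """Map each '/proc/.../stack' path to its list of kernel stack frames."""
    stacks: dict[str, list[str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("/proc/"):
            current = line
            stacks[current] = []
        elif line.startswith("[<") and current is not None:
            # "[<0>] do_epoll_wait+0x71c/0x990" -> "do_epoll_wait+0x71c/0x990"
            stacks[current].append(line.split("] ", 1)[1])
    return stacks


def tasks_in(stacks: dict[str, list[str]], symbol: str) -> list[str]:
    """Return the paths of tasks whose stack mentions the given symbol."""
    return [path for path, frames in stacks.items()
            if any(symbol in frame for frame in frames)]
```

For example, tasks_in(parse_stack_dump(open("stack.txt").read()), "i915") would list the tasks currently inside the i915 module.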

Then, I'd like to ask for a full gdb output that is everything output by the
$ sudo gdb -p $(pidof kwin_wayland)
command in particular containing the Attaching to process <pid of kwin_wayland>
and the
[New LWP <task ID>]
lines, but to be sure just everything output by gdb.

Then please attach dmesg, stack.txt and the gdb output.

Thanks.

CONFIG_MAGIC_SYSRQ is enabled

$ cat git/drm-tip/.config | grep CONFIG_MAGIC_SYSRQ
CONFIG_MAGIC_SYSRQ=y
CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE=0x0
CONFIG_MAGIC_SYSRQ_SERIAL=y
CONFIG_MAGIC_SYSRQ_SERIAL_SEQUENCE=""

This is on the drm-tip as Suresh requested 4 days ago (all my outputs are from this compile)

The echo -e 'l\nw\nt' > /proc/sysrq-trigger was run as you typed it. Not sure if copy & paste or GitLab messed with the formatting, but it was run with a single \n.

Fresh logs for the most recent reproducible freeze: gdb.txt dmesg2410191831.log.gz 1stack.txt 1task1stack.txt stack.txt

Bonus info, maybe: this is the first time I've seen this, and it happened on reboot from the frozen screen. The service stopped shortly after (around the 60 sec mark) and the reboot continued successfully. I never typically experience slow shutdowns/reboots, so this appears to be new. I've no idea if it's related to the reported issue, but I'm mentioning it in case it's helpful.

Thanks, seeing now the relevant stack traces.

GDB seems to be missing now the debug symbols (Missing debuginfo for kwin-wayland), but let's assume kwin-wayland with PID 2079 got stuck in
i915_gem_set_domain() -> intel_ioctl() -> __GI___ioctl()
this time as before.

In kernel in turn the above process is stuck at:

[  211.471607] CPU: 4 UID: 1000 PID: 2079 Comm: kwin_wayland Tainted: G        W          6.12.0-rc3+ #1
[  211.471610] RIP: 0010:clear_page_erms+0xb/0x20
[  211.471642]  shmem_get_folio_gfp+0x402/0x5f0
[  211.471656]  shmem_read_folio_gfp+0x3e/0x80
[  211.471658]  shmem_sg_alloc_table+0x196/0x300 [i915]
[  211.471783]  shmem_get_pages+0xdb/0x2e0 [i915]
[  211.471874]  __i915_gem_object_get_pages+0x38/0x50 [i915]
[  211.471962]  i915_gem_set_domain_ioctl+0x279/0x300 [i915]
[  211.472049]  ? __pfx_i915_gem_set_domain_ioctl+0x10/0x10 [i915]
[  211.472136]  drm_ioctl_kernel+0xad/0x100
[  211.472140]  drm_ioctl+0x288/0x530
[  211.486217] task:kwin_wayland    state:R  running task     stack:0     pid:2079  tgid:2079  ppid:2069   flags:0x00000000

And after ~2 sec without any activity in dmesg the process seems to be still spinning around the same spot:

[  213.354082] CPU: 4 UID: 1000 PID: 2079 Comm: kwin_wayland Tainted: G        W          6.12.0-rc3+

The reboot delay could be related to the above stuck state.

Not sure about the reason for i915_gem_set_domain_ioctl() getting stuck, it's making progress (state: R) so could just take a long time to complete. Someone from GEM team should continue checking this.
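
For anyone repeating this check on their own logs, the two observations above (task state R, and the same PID seen on a CPU in dumps about 2 sec apart) can be pulled out mechanically. A rough sketch, assuming the line formats match the 6.12 excerpts quoted here (older kernels print the CPU header without the UID field); the function names are hypothetical.

```python
import re

# "task:kwin_wayland    state:R  running task ... pid:2079  tgid:2079 ..."
_TASK_RE = re.compile(r"task:(\S+)\s+state:(\S)\s.*?\bpid:(\d+)")
# "[  211.471607] CPU: 4 UID: 1000 PID: 2079 Comm: kwin_wayland ..."
_CPU_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+CPU:\s*(\d+)\s+UID:\s*\d+\s+PID:\s*(\d+)\s+Comm:\s*(\S+)"
)


def parse_task_line(line: str):
    """Extract comm, scheduler state letter and pid from a sysrq-t task line."""
    m = _TASK_RE.search(line)
    if m is None:
        return None
    return {"comm": m.group(1), "state": m.group(2), "pid": int(m.group(3))}


def on_cpu_span(dmesg_text: str, pid: int) -> float:
    """Seconds between the first and last CPU backtrace header for this pid."""
    stamps = [float(m.group(1)) for m in _CPU_RE.finditer(dmesg_text)
              if int(m.group(3)) == pid]
    return max(stamps) - min(stamps) if stamps else 0.0
```

State R means the task is runnable rather than blocked (D would be uninterruptible sleep), which is why the trace above reads as slow progress and not a deadlock.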

I did notice the missing debuginfo packages. I'm not sure of the reason, but when I attempted to install them, no packages were found. I just tried a package search and it's on version 6.2.1.1-1.fc40, so already superseded.

I can easily re-run and fix those missing debuginfo packages if needed. I didn't think about it yesterday, but dropping the targeted version should avoid the missing package.

I'll wait for you guys to investigate...

Has anyone from the GEM team been able to look at this, or found anything that could be the cause?

Cc : @andi @tmistat

I'm looking at it. I've switched my KDE environment from X11 to Wayland and have been trying to reproduce the issue on my laptop with ADL-P graphics [8086:46a6] + Lenovo ThinkPad TBT 3 Dock [17ef:3082], running Arch Linux, kernel version 6.11.5-arch1-1, kwin_wayland version 6.2.1 -- no success so far.

@Tau512 on the call traces you provided I can see the i915 graphics driver competing for physical memory with other consumers, e.g., Firefox. Can you please retry with a limited set of running applications (e.g., no Firefox)?

this is from the drm-tip build used throughout this issue:

clean boot & login
close any windows at login (Firefox & OpenRGB app+tray icon were the only apps running)
start dolphin & konsole (2 of each)
sleep, wait 15-30 secs
d/c tb3 cable, wait another 15-30secs (total sleep approximately 45 secs)
wake & login
move windows - no reproduce
reconnect tb3 cable
move windows - FREEZE

I can't see why there'd be a physical memory issue (I'm not saying you're wrong, Janusz). Even when I use the laptop with many apps, I tend to max out around 18-20GB of non-cached memory usage. That's very rare, and I typically run at around 8-10GB used. I think it's been mentioned already, but the laptop has 32GB of RAM. Based on how long it's taken me to find something to bug-report, and since it appears not to affect a large number of users, I have considered that it might be some weird hardware issue on the laptop itself.

I've just retested a few things to satisfy myself, and the problem occurs with the following setups:

DP disconnected from dock. dock still active for charge & peripherals. connect HDMI cable direct from laptop to monitor & retest. Issue reproducible.

Disconnect tb3 cable to dock (laptop no dock or peripherals). test with direct HDMI cable. Issue reproducible.

While the logs don't suggest a cabling/hardware issue, these tests reconfirm that the problem is not related to docks, TB, connectivity type etc.

Hey, I can reproduce the bug really fast:

Turn on my laptop (1920x1080) with a hub connected to an ultrawide screen (5120x1440)
Open a window (anything)
Let the screen power off after the timeout (5 minutes in my case)
Bring the screen back using mouse/keyboard
Try to drag a window
Freeze!

@AthAshino thanks for reporting. Unfortunately your report is missing details we need. Could you please go through the former comments, try to follow Imre's instructions on how to collect the required information, and then post that info here when ready?

I've asked on https://bugs.kde.org/show_bug.cgi?id=493277, where the issue was first discussed together with a discussion on similar erratic behavior of kwin_wayland on AMD graphics (closed on Oct 23 as upstream resolved), for some clarification.

Response from https://bugs.kde.org/show_bug.cgi?id=493277 about the "Pageflip timed out!" message:

... the commit thread prints that message when it doesn't receive a pageflip event for an atomic commit it did in 5s. It doesn't process drm events itself though, the main thread is responsible for doing that. So when the main thread hangs, the commit thread also prints this warning in Plasma 6.2 (which will be fixed in 6.3, where it'll properly detect hang vs pageflip timeout).
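
To illustrate the mechanism described in that response, here is a deliberately simplified model in Python. It is not KWin code: the names are hypothetical, a threading.Event stands in for the pageflip event, and the 5 s timeout is shortened so the sketch runs quickly. The point is only that a waiter which depends on another thread to dispatch events cannot tell "kernel never sent the event" from "main thread is hung".

```python
import threading


def commit_and_wait(pageflip_seen: threading.Event, timeout_s: float) -> str:
    """Commit thread: wait for the main thread to report the pageflip event."""
    if pageflip_seen.wait(timeout_s):
        return "pageflip received"
    return "Pageflip timed out!"


def simulate(main_thread_hung: bool, timeout_s: float = 0.2) -> str:
    """Run one commit; the 'main thread' dispatches the event unless it is hung."""
    pageflip_seen = threading.Event()
    result: list[str] = []
    commit_thread = threading.Thread(
        target=lambda: result.append(commit_and_wait(pageflip_seen, timeout_s))
    )
    commit_thread.start()
    if not main_thread_hung:
        pageflip_seen.set()  # main thread processed the DRM event in time
    commit_thread.join()
    return result[0]
```

In this model simulate(main_thread_hung=True) yields the timeout message even though nothing is wrong on the "kernel" side, which matches the explanation that the warning in Plasma 6.2 can be a symptom of a hung main thread.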

@Tau512 could you please repeat once more the reproduction steps with drm.debug=0x15e and also provide output of journalctl this time? I'd like to see how that "Pageflip timed out!" message is correlated in time with DRM debug messages, which should all hopefully land in the system journal.

output log is from journalctl -b 0, booted with 0x15e and reproduced with the steps mentioned in my last response above (#12341 (comment 2657149))

journalctl.tar.gz

After upgrading to 6.12, waking from hibernation shows a black screen for a while; after pressing random keys and clicking the mouse, the login screen appears. After I type the password, it takes much longer to load the desktop and all hibernated apps, and even once I recover the mouse pointer and can switch between apps, they all perform slowly.

One solution to recover the speed of my system is to disable swap (sudo swapoff -a) then re-enable it again (sudo swapon -a).

NB: Hibernation was working fine with kernels 6.10/6.11.

journald.log

Operating System: Manjaro Linux rolling
KDE Plasma Version: 6.2.4
KDE Frameworks Version: 6.8.0
Qt Version: 6.8.0
Kernel Version: 6.12.1-4-MANJARO (64-bit)
Graphics Platform: Wayland
Processors: 2 × Pentium® Dual-Core CPU T4400 @ 2.20GHz
Memory: 5.6 GiB of RAM
Graphics Processor: Mesa Mobile Intel® GM45 Express Chipset

Just dropping here to say that I'm facing the same issue. I have a triple screen setup (QHD@165hz 100% scale/UHD@240hz 150% scale/FHD@144hz 125% scale).

One of my monitors will freeze completely at least once or twice a day. The only solution is "sudo systemctl restart sddm", sometimes it also causes my other monitors to freeze, and I have to go into TTY.

For me, it's nothing to do with hibernation/waking up. All my monitors are connected using DisplayPort 1.4, so I don't use a Thunderbolt dock, etc.

KDE Wayland Manjaro.

Operating System: Manjaro Linux
KDE Plasma Version: 6.2.4
KDE Frameworks Version: 6.9.0
Qt Version: 6.8.1
Kernel Version: 6.11.11-1-MANJARO (64-bit)
Graphics Platform: Wayland
Processors: 20 × 13th Gen Intel® Core™ i5-13600K
Memory: 62,6 GiB of RAM
Graphics Processor: NVIDIA GeForce RTX 4090/PCIe/SSE2
Manufacturer: Micro-Star International Co., Ltd.
Product Name: MS-7D42
System Version: 1.0

NVIDIA-SMI 565.77

I'm trying to generate a backtrace of my scenario to attach here, but as soon as I run sudo gdb -p $(pidof kwin_wayland) > gdb.txt my system freezes completely and I have to reboot. What should I run if it freezes again? Consider that I'll most likely run this from another TTY (Ctrl+Alt+F3) while one of the monitors is frozen.

Edit: I'm monitoring wayland with this, if it crashes I should hopefully have a backtrace:

sudo gdb -pid $(pidof kwin_wayland) \
-batch \
-ex "set logging file kwin_wayland.gdb" \
-ex "set logging enabled on" \
-ex "continue" \
-ex "thread apply all backtrace" \
-ex "quit"

I don't know if this is relevant or not, but I just found out I can recover from a complete freeze, by going to TTY3, logging in and running loginctl list-sessions and then loginctl unlock-session <SESSION_ID>.

Waking from hibernation is still broken with kernel 6.12.7 and also 6.11.11. But it works fine with kernel 6.6.68.

I suspect this issue may have a common root cause with #12941. I've bisected the latter and found it caused by commit 96a5c186 ("mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone").

If there is someone reading this thread who suffers from this (kwin_wayland) issue and is able to reproduce it, could you please try to revert that commit (git revert 96a5c186) and check if that helps?

Another factor that has some impact on #12941 is kernel config. That issue started appearing after our CI changed their kernel config in CI_DRM_15714. I tried to identify a specific setting responsible for that but I failed -- I think there may be a couple of settings that matter. To get an idea on what has changed, you may want to look at https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_15713/kconfig.txt and https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_15714/kconfig.txt for differences.
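
For comparing the two kconfig.txt files mentioned above, a plain text diff is noisy, so here is a small sketch that compares them option by option. It assumes the files use the standard .config syntax ("CONFIG_X=value" and "# CONFIG_X is not set"); the function names are hypothetical.

```python
import re

_NOT_SET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_kconfig(text: str) -> dict[str, str]:
    """Map option name to value; '# CONFIG_X is not set' is recorded as 'n'."""
    options: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        m = _NOT_SET.match(line)
        if m:
            options[m.group(1)] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            key, value = line.split("=", 1)
            options[key] = value
    return options


def diff_kconfig(old: dict[str, str], new: dict[str, str]) -> dict[str, tuple]:
    """Return {option: (old_value, new_value)} for every option that differs.

    An option that is absent from one side is reported as None there.
    """
    changes = {}
    for key in sorted(old.keys() | new.keys()):
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes
```

Usage would be along the lines of diff_kconfig(parse_kconfig(old_text), parse_kconfig(new_text)) with the contents of the CI_DRM_15713 and CI_DRM_15714 kconfig.txt files saved locally.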

Hi! I'm also suffering from this issue. Reverted the commit you've mentioned, but the issue persists, got a kwin_wayland "Pageflip timed out!" freeze after about 28 hours of uptime. Managed to recover without a reboot by running killall kwin_wayland -KILL though, if that's of any significance.

EDIT: Downgraded the kernel all the way down to linux-lts 5.15.94, the issue is still reproducible like this:

Clean boot with tb cable plugged in
Disconnect the cable, wait 15 seconds
Move window - no repro
Connect tb cable, wait for the displays to initialize and settle
Move window - FREEZE

Also tried downgrading kwin; the issue is reproducible all the way down to v6.2.0, and I'm unable to log in after downgrading to v6.1.* or older.

Hi @consoleaf, thank you for reporting. For completeness of the record, please provide information on your graphics adapter model.
