Blocked clients that received too many events get destroyed even though the Wayland connection is not broken
This problem is reproducible in every DE I tried: GNOME, Plasma, weston. The most common situation where this happens is in applications blocked due to disk I/O. As can be seen in https://bugreports.qt.io/browse/QTBUG-66997 and https://bugs.kde.org/show_bug.cgi?id=392376, developers have assumed it's the application's fault for doing blocking tasks in the main GUI thread.
However, the issue can be reproduced by simply following these steps:
- Open a Wayland application from a terminal.
- Suspend it with Ctrl+Z.
- Move the mouse over the application's window for a few seconds.
- Resume the application with 'fg' and witness how it dies.
Note: this seems to happen less often in GNOME. If you cannot reproduce it on the first attempt, try KWin or weston instead.
Obviously, it wouldn't make any sense if Wayland didn't allow suspending or debugging applications. So I began to debug what was happening in libwayland and observed the following. This is with the weston-eventdemo client on GNOME:
- Client resumes execution after being blocked.
- Client receives events from the server and dispatches them (
- Client flushes the connection (
- Server receives the events and begins processing them (
- At some point, the server finds an error and destroys the client. Only then is the client's window closed (that is, if the client never flushes its data, it is also never destroyed).
- Client notices an error in epoll_wait (EPOLLHUP) and exits.
Other applications may detect the error in a different way. Qt, for example, keeps reading from the socket until recvmsg returns the error Connection reset by peer (104). libwayland handles this error in read_events (https://gitlab.freedesktop.org/wayland/wayland/-/blob/cc8b6aa3d937dda055cb25fdbec55d0afb6be2a0/src/wayland-client.c#L1482). Then Qt aborts in https://code.qt.io/cgit/qt/qtwayland.git/tree/src/client/qwaylanddisplay.cpp?id=00390ccf893aa02c8f51e0887624455c7e8d111d#n177.
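The two client-side symptoms above (EPOLLHUP from epoll_wait, and Connection reset by peer from a read) can be observed outside Wayland with a plain socket pair. The following is only a minimal sketch, assuming Linux AF_UNIX stream socket behaviour: the "client" end writes a request, and the "server" end closes its socket without ever reading it, which is roughly what happens when the compositor destroys a client whose error flag is already set. The struct and function names are hypothetical and are not part of libwayland or Qt.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result type, used only for this illustration. */
struct teardown_result {
    uint32_t epoll_events; /* events reported by epoll_wait, 0 if none */
    int recv_errno;        /* errno from recv(), or 0 if recv() did not fail */
};

/*
 * The "client" flushes a request, then the "server" closes its end
 * without reading it. Returns 0 on success, -1 if the setup failed.
 */
static int observe_teardown(struct teardown_result *out)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    int client = sv[0], server = sv[1];

    /* Client flushes a request to the server. */
    const char req[] = "request";
    if (send(client, req, sizeof req, MSG_NOSIGNAL) < 0) {
        close(client);
        close(server);
        return -1;
    }

    /* Server drops the client with the request still unread. */
    close(server);

    int ep = epoll_create1(0);
    if (ep < 0) {
        close(client);
        return -1;
    }
    struct epoll_event ev;
    ev.events = EPOLLIN;
    ev.data.fd = client;
    if (epoll_ctl(ep, EPOLL_CTL_ADD, client, &ev) < 0) {
        close(ep);
        close(client);
        return -1;
    }

    /* First symptom: what the event loop sees. */
    struct epoll_event got;
    int n = epoll_wait(ep, &got, 1, 1000);
    out->epoll_events = (n == 1) ? got.events : 0;

    /* Second symptom: what a client that keeps reading sees. */
    char buf[64];
    ssize_t r = recv(client, buf, sizeof buf, MSG_DONTWAIT);
    out->recv_errno = (r < 0) ? errno : 0;

    close(ep);
    close(client);
    return 0;
}
```

Calling observe_teardown and inspecting the result shows both symptoms for the same event; which one an application reports depends only on whether it checks the epoll flags first or goes straight to reading.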
A debug plan for this issue can be:
- Start a Wayland session. There, open a terminal and run weston-eventdemo and hover the mouse over it for a few seconds.
- Switch to another tty/DE or a remote shell.
- Attach a debugger on the compositor. Set breakpoints on destroy_client_with_error. Let it continue.
- Attach a debugger on weston-eventdemo. Set a breakpoint on handle_display_data if you wish to inspect how it processes the received events. Otherwise, just let it continue. The compositor should hit a breakpoint after wl_display_flush. Make sure the breakpoint is hit because of weston-eventdemo and not another client.
What happens next varies among compositors:
- In GNOME, the breakpoint is hit with msg="invalid object %u".
- In weston and KWin, the breakpoint is hit at destroy_client_with_error. If you repeat the experiment with a breakpoint set on wl_client_connection_data, you'll notice that client->error is already set (that is, even before the client's data is read). Then, you can do the following:
  - set client->error=0 to dismiss the error flag.
  - watch -l client->error to break when the error is detected next time for this client.
- Delete the breakpoint on wl_client_connection_data, let the compositor continue and detach the debugger on
- Switch back to the DE, suspend weston-eventdemo again from the terminal and hover the mouse over it. The breakpoint will be hit while you do this, so be sure to have a remote shell from where you can run sudo chvt <tty>, or run sudo sleep 10 && sudo chvt <tty> before switching to the DE.
- In KWin, the breakpoint is hit at:
#0 handle_array (resource=resource@entry=0x559670da6a00, opcode=opcode@entry=2, args=args@entry=0x7ffded0f0640, send_func=0x7f157422d050 <wl_closure_send>) at ../wayland-1.18.0/src/wayland-server.c:231
#1 0x00007f157422779c in wl_resource_post_event_array (resource=resource@entry=0x559670da6a00, opcode=opcode@entry=2, args=args@entry=0x7ffded0f0640) at ../wayland-1.18.0/src/wayland-server.c:238
#2 0x00007f1574227891 in wl_resource_post_event (resource=0x559670da6a00, opcode=2) at ../wayland-1.18.0/src/wayland-server.c:253
#3 0x00007f1577630770 in () at /usr/lib/libKF5WaylandServer.so.5
#4 0x00007f1576ad2cde in QtPrivate::QSlotObjectBase::call(QObject*, void**) (a=0x7ffded0f0960, r=0x55966fe499d0, this=0x559670c6c240) at ../../include/QtCore/../../src/corelib/kernel/qobjectdefs_impl.h:398
#5 doActivate<false>(QObject*, int, void**) (sender=0x55966fd5f990, signal_index=8, argv=0x7ffded0f0960) at kernel/qobject.cpp:3870
#6 0x00007f1577600866 in KWayland::Server::SeatInterface::pointerPosChanged(QPointF const&) () at /usr/lib/libKF5WaylandServer.so.5
wl_closure_send; further inspection reveals that the error originates from Resource temporarily unavailable (11). Disabling the error setting after send_func (or conditioning it on errno != EAGAIN) seems to fix the issue, but is not necessarily a good solution. Another way would be unsetting wl_client_connection_data. In the end, if the client is able to message the server normally, I guess it doesn't deserve to be destroyed? (I say this without having any knowledge about the protocol or libwayland's invariants).
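To illustrate why EAGAIN alone does not mean the connection is broken, here is a minimal sketch using a plain AF_UNIX socket pair instead of libwayland. The "server" end keeps sending without blocking while the "client" end reads nothing (as a suspended client would), until send fails. The "client" then drains its queue and the "server" sends again successfully. The helper names, including send_error_is_fatal, are hypothetical and only illustrate the errno != EAGAIN condition mentioned above; this is not libwayland code, and it assumes Linux socket behaviour.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical policy: a full socket buffer is transient, not fatal. */
static bool send_error_is_fatal(int err)
{
    return err != EAGAIN && err != EWOULDBLOCK;
}

/* Send without blocking until send() fails; returns the errno seen. */
static int fill_until_error(int fd, size_t *bytes_queued)
{
    char chunk[4096];
    memset(chunk, 0, sizeof chunk);
    *bytes_queued = 0;
    for (;;) {
        ssize_t n = send(fd, chunk, sizeof chunk,
                         MSG_DONTWAIT | MSG_NOSIGNAL);
        if (n < 0)
            return errno;
        *bytes_queued += (size_t)n;
    }
}

/*
 * Simulates a server writing events to a client that is not reading,
 * then lets the client catch up. Returns 0 on success, -1 on setup failure.
 */
static int simulate_stalled_client(int *first_errno, bool *usable_after_drain)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    int server = sv[0], client = sv[1];

    /* The client is "suspended": nothing reads while the server sends. */
    size_t queued = 0;
    *first_errno = fill_until_error(server, &queued);

    /* The client "resumes" and drains everything that was queued. */
    char buf[4096];
    size_t drained = 0;
    while (drained < queued) {
        ssize_t n = recv(client, buf, sizeof buf, MSG_DONTWAIT);
        if (n <= 0)
            break;
        drained += (size_t)n;
    }

    /* The same connection accepts data again. */
    ssize_t n = send(server, "x", 1, MSG_DONTWAIT | MSG_NOSIGNAL);
    *usable_after_drain = (n == 1);

    close(server);
    close(client);
    return 0;
}
```

In this sketch the only thing wrong with the connection is that the reader fell behind. Whether a compositor should queue, drop, or disconnect in that situation is a separate design question that this example does not answer.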
This is all I could figure out. I hope it's useful.
Thank you very much!