Blocked clients that received too many events get destroyed even though the Wayland connection is not broken
This problem is reproducible in every DE I tried: GNOME, Plasma, weston. The most common situation where this happens is in applications blocked due to disk I/O. As can be seen in https://bugreports.qt.io/browse/QTBUG-66997 and https://bugs.kde.org/show_bug.cgi?id=392376, developers have assumed it's the application's fault for doing blocking tasks in the main GUI thread.
However, the issue can be reproduced by simply following these steps:
- Open a Wayland application from a terminal.
- Suspend it with Ctrl+Z.
- Move the mouse over the application's window for a few seconds.
- Resume the application with 'fg' and witness how it dies.
Note: this seems to happen less often in GNOME. If you cannot reproduce it on the first attempt, try KWin or weston instead.
Obviously, it wouldn't make any sense if Wayland didn't allow suspending or debugging applications. So I began to debug what was happening in libwayland and observed the following. This is with the `weston-eventdemo` client on GNOME:
- Client resumes execution after being blocked.
- Client receives events from the server and dispatches them (`handle_display_data`: https://gitlab.freedesktop.org/wayland/weston/-/blob/bac1a7a71f5d49451ad5c6655ef4e79334ba9c38/clients/window.c#L6201).
- Client flushes the connection (`display_run`: https://gitlab.freedesktop.org/wayland/weston/-/blob/bac1a7a71f5d49451ad5c6655ef4e79334ba9c38/clients/window.c#L6509).
- Server receives the events and begins processing them (`wl_client_connection_data`: https://gitlab.freedesktop.org/wayland/wayland/-/blob/cc8b6aa3d937dda055cb25fdbec55d0afb6be2a0/src/wayland-server.c#L323).
- At some point, the server finds an error and destroys the client. Only then is the client's window closed (that is, if the client never flushes its data, it is also never destroyed).
- Client notices an error in `epoll_wait` (`EPOLLHUP`) and exits.
Other applications may detect the error in a different way. Qt, for example, keeps reading from the socket until `recvmsg` returns the error `Connection reset by peer` (104). libwayland handles this error in `read_events` (https://gitlab.freedesktop.org/wayland/wayland/-/blob/cc8b6aa3d937dda055cb25fdbec55d0afb6be2a0/src/wayland-client.c#L1482). Then Qt aborts in https://code.qt.io/cgit/qt/qtwayland.git/tree/src/client/qwaylanddisplay.cpp?id=00390ccf893aa02c8f51e0887624455c7e8d111d#n177.
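
Both detection paths can be sketched outside Wayland with a plain Unix socket pair. This is only an illustration, not code from libwayland, weston or Qt: `observe_hangup` and `struct hangup_report` are hypothetical names, and it assumes (I have not verified this inside any compositor) that the reset is the usual Linux `AF_UNIX` behaviour when one end is closed while data from the other end is still unread.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* What the surviving end of the socket observes after its peer goes away. */
struct hangup_report {
    uint32_t epoll_events; /* events reported by epoll_wait(), 0 if none */
    int recv_errno;        /* errno of the first recvmsg(), 0 if it did not fail */
};

/*
 * Hypothetical helper: the "client" end sends one request, the "server" end
 * is closed without reading it (standing in for the compositor destroying
 * the client), and the client then polls and reads its socket.
 * Returns 0 on success, -1 if the setup itself failed.
 */
int observe_hangup(struct hangup_report *report)
{
    int fds[2]; /* fds[0] = client end, fds[1] = server end */
    const char request[] = "request";

    report->epoll_events = 0;
    report->recv_errno = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
        return -1;

    /* The client "flushes" a request that the server never reads. */
    if (send(fds[0], request, sizeof request, MSG_NOSIGNAL) < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* The server drops the connection with the request still queued. */
    close(fds[1]);

    int epfd = epoll_create1(0);
    if (epfd < 0) {
        close(fds[0]);
        return -1;
    }

    /* Path 1: an epoll-based main loop sees the hangup. */
    struct epoll_event ev = { .events = EPOLLIN, .data = { .fd = fds[0] } };
    struct epoll_event out;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) == 0 &&
        epoll_wait(epfd, &out, 1, 1000) == 1)
        report->epoll_events = out.events;

    /* Path 2: a reader that keeps calling recvmsg() sees the reset. */
    char buf[64];
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof buf };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
    if (recvmsg(fds[0], &msg, MSG_DONTWAIT) < 0)
        report->recv_errno = errno;

    close(epfd);
    close(fds[0]);
    return 0;
}
```

On Linux I would expect `epoll_events` to include `EPOLLHUP` and `recv_errno` to be `ECONNRESET`, which matches the two symptoms described above. Other kernels may report the closed peer differently.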
A debug plan for this issue can be:
- Start a Wayland session. There, open a terminal and run `weston-eventdemo`.
- Suspend `weston-eventdemo` and hover the mouse over it for a few seconds.
- Switch to another tty/DE or a remote shell.
- Attach a debugger to the compositor. Set breakpoints on `wl_resource_post_error_vargs` and `destroy_client_with_error`. Let it continue.
- Attach a debugger to `weston-eventdemo`. Set a breakpoint on `handle_display_data` if you wish to inspect how it processes the received events. Otherwise, just let it continue. The compositor should hit a breakpoint after `weston-eventdemo` performs `wl_display_flush`. Make sure the breakpoint is hit because of `weston-eventdemo` and not another client.
What happens next varies among compositors:
- In GNOME, the breakpoint is hit at `wl_resource_post_error_vargs` with `msg="invalid object %u"`.
- In weston and KWin, the breakpoint is hit at `destroy_client_with_error`. If you repeat the experiment with a breakpoint set on `wl_client_connection_data`, you'll notice that `client->error` is already set (that is, even before the client's data is read). Then, you can do the following:
  - `set client->error=0` to dismiss the error flag.
  - `watch -l client->error` to break when the error is detected next time for this client.
  - Delete the breakpoint on `wl_client_connection_data`, let the compositor continue and detach the debugger from `weston-eventdemo`.
  - Switch back to the DE, suspend `weston-eventdemo` again from the terminal and hover the mouse over it. The breakpoint will be hit while you do this, so be sure to have a remote shell from where you can run `sudo chvt <tty>`, or run `sudo sleep 10 && sudo chvt <tty>` before switching to the DE.
  - In KWin, the breakpoint is hit at:
        #0 handle_array (resource=resource@entry=0x559670da6a00, opcode=opcode@entry=2, args=args@entry=0x7ffded0f0640, send_func=0x7f157422d050 <wl_closure_send>) at ../wayland-1.18.0/src/wayland-server.c:231
        #1 0x00007f157422779c in wl_resource_post_event_array (resource=resource@entry=0x559670da6a00, opcode=opcode@entry=2, args=args@entry=0x7ffded0f0640) at ../wayland-1.18.0/src/wayland-server.c:238
        #2 0x00007f1574227891 in wl_resource_post_event (resource=0x559670da6a00, opcode=2) at ../wayland-1.18.0/src/wayland-server.c:253
        #3 0x00007f1577630770 in () at /usr/lib/libKF5WaylandServer.so.5
        #4 0x00007f1576ad2cde in QtPrivate::QSlotObjectBase::call(QObject*, void**) (a=0x7ffded0f0960, r=0x55966fe499d0, this=0x559670c6c240) at ../../include/QtCore/../../src/corelib/kernel/qobjectdefs_impl.h:398
        #5 doActivate<false>(QObject*, int, void**) (sender=0x55966fd5f990, signal_index=8, argv=0x7ffded0f0960) at kernel/qobject.cpp:3870
        #6 0x00007f1577600866 in KWayland::Server::SeatInterface::pointerPosChanged(QPointF const&) () at /usr/lib/libKF5WaylandServer.so.5
    `send_func` fails. `send_func` is actually `wl_closure_send`; further inspection reveals that the error originates from `sendmsg` failing with `Resource temporarily unavailable` (11). Disabling the error setting after `send_func` (or making it conditional on `errno != EAGAIN`) seems to fix the issue, but is not necessarily a good solution. Another way would be unsetting `client->error` in `wl_client_connection_data`. In the end, if the client is able to message the server normally, I guess it doesn't deserve being destroyed? (I say this without having any knowledge about the protocol or libwayland's invariants).
This is all I could figure out. I hope it's useful.
Thank you very much!