Blocked clients that received too many events get destroyed even though the Wayland connection is not broken

This problem is reproducible in every DE I tried: GNOME, Plasma, weston. The most common situation where this happens is in applications blocked due to disk I/O. As can be seen in https://bugreports.qt.io/browse/QTBUG-66997 and https://bugs.kde.org/show_bug.cgi?id=392376, developers have assumed it's the application's fault for doing blocking tasks in the main GUI thread.

However, the issue can be reproduced by simply following these steps:

1. Open a Wayland application from a terminal.
2. Suspend it with Ctrl+Z.
3. Move the mouse over the application's window for a few seconds.
4. Resume the application with 'fg' and watch it die.

Note: this seems to happen less often in GNOME. If you cannot reproduce it on the first attempt, try KWin or weston instead.

Obviously, it wouldn't make any sense if Wayland didn't allow suspending or debugging applications. So I began to debug what was happening in libwayland and observed the following. This is with the weston-eventdemo client on GNOME:

1. Client resumes execution after being blocked.
2. Client receives events from the server and dispatches them (handle_display_data: https://gitlab.freedesktop.org/wayland/weston/-/blob/bac1a7a71f5d49451ad5c6655ef4e79334ba9c38/clients/window.c#L6201).
3. Client flushes the connection (display_run: https://gitlab.freedesktop.org/wayland/weston/-/blob/bac1a7a71f5d49451ad5c6655ef4e79334ba9c38/clients/window.c#L6509).
4. Server receives the client's requests and begins processing them (wl_client_connection_data: https://gitlab.freedesktop.org/wayland/wayland/-/blob/cc8b6aa3d937dda055cb25fdbec55d0afb6be2a0/src/wayland-server.c#L323).
5. At some point, the server finds an error and destroys the client. Only then is the client's window closed (that is, if the client never flushes its data, it is never destroyed either).
6. Client notices an error in epoll_wait (EPOLLHUP) and exits.
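The last step can be illustrated outside of Wayland. The following is a minimal sketch, assuming Linux and a plain AF_UNIX stream socket pair standing in for the Wayland connection; the function name events_after_peer_close is hypothetical and none of this is libwayland or weston code. It only shows that closing one end makes epoll_wait report EPOLLHUP on the other end.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns the epoll event mask seen on the "client" end
 * of a Unix socket pair after the "server" end was closed, or 0 on failure. */
uint32_t events_after_peer_close(void)
{
    int sv[2];
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);
    int client_fd = sv[0], server_fd = sv[1];

    int ep = epoll_create1(0);
    assert(ep >= 0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = client_fd };
    rc = epoll_ctl(ep, EPOLL_CTL_ADD, client_fd, &ev);
    assert(rc == 0);

    /* Stands in for the compositor destroying the client. */
    close(server_fd);

    struct epoll_event out = { 0 };
    int n = epoll_wait(ep, &out, 1, 1000);
    close(ep);
    close(client_fd);
    return n == 1 ? out.events : 0;
}
```

Calling events_after_peer_close() from a small main and testing the result against EPOLLHUP is enough to see the behaviour; EPOLLHUP is reported even though only EPOLLIN was requested.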

Other applications may detect the error in a different way. Qt, for example, keeps reading from the socket until recvmsg returns the error Connection reset by peer (104). libwayland handles this error in read_events (https://gitlab.freedesktop.org/wayland/wayland/-/blob/cc8b6aa3d937dda055cb25fdbec55d0afb6be2a0/src/wayland-client.c#L1482). Then Qt aborts in https://code.qt.io/cgit/qt/qtwayland.git/tree/src/client/qwaylanddisplay.cpp?id=00390ccf893aa02c8f51e0887624455c7e8d111d#n177.
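One way a read can end in Connection reset by peer on a Unix socket is sketched below. This assumes Linux AF_UNIX stream semantics, where closing a socket that still holds unread data marks the peer with ECONNRESET. Whether this is exactly the path taken in the Qt case is an assumption, and recv_errno_after_unread_close is a hypothetical name, not part of Qt or libwayland.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns the errno reported by recv() on the client end
 * after the server end is closed while client data is still unread in the
 * server's receive queue (0 if recv() did not fail). */
int recv_errno_after_unread_close(void)
{
    int sv[2];
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);
    int client_fd = sv[0], server_fd = sv[1];

    /* The client flushes a request that the server never reads. */
    ssize_t sent = send(client_fd, "request", 7, 0);
    assert(sent == 7);

    /* The server drops the connection with the request still queued. */
    close(server_fd);

    char buf[16];
    errno = 0;
    ssize_t n = recv(client_fd, buf, sizeof buf, 0);
    int err = (n < 0) ? errno : 0;
    close(client_fd);
    return err;
}
```

If the server had read everything before closing, recv would return 0 (end of file) instead of failing, which may be why different clients observe the disconnection differently.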

One way to debug this issue:

1. Start a Wayland session. There, open a terminal and run weston-eventdemo.
2. Suspend weston-eventdemo and hover the mouse over it for a few seconds.
3. Switch to another tty/DE or a remote shell.
4. Attach a debugger to the compositor. Set breakpoints on wl_resource_post_error_vargs and destroy_client_with_error. Let it continue.
5. Attach a debugger to weston-eventdemo. Set a breakpoint on handle_display_data if you wish to inspect how it processes the received events. Otherwise, just let it continue. The compositor should hit a breakpoint after weston-eventdemo performs wl_display_flush. Make sure the breakpoint is hit because of weston-eventdemo and not another client.

What happens next varies among compositors:

In GNOME, the breakpoint is hit at wl_resource_post_error_vargs with msg="invalid object %u".
In weston and KWin, the breakpoint is hit at destroy_client_with_error. If you repeat the experiment with a breakpoint set on wl_client_connection_data, you'll notice that client->error is already set (that is, even before the client's data is read). Then, you can do the following:
1. set client->error=0 to dismiss the error flag.
2. watch -l client->error to break when the error is detected next time for this client.
3. Delete the breakpoint on wl_client_connection_data, let the compositor continue and detach the debugger on weston-eventdemo.
4. Switch back to the DE, suspend weston-eventdemo again from the terminal and hover the mouse over it. The watchpoint will be hit while you do this and the compositor will stop, so be sure to have a remote shell from which you can run sudo chvt <tty>, or run sudo sleep 10 && sudo chvt <tty> before switching to the DE.
5. In KWin, the watchpoint is hit at:
```
#0  handle_array (resource=resource@entry=0x559670da6a00, opcode=opcode@entry=2, args=args@entry=0x7ffded0f0640, send_func=0x7f157422d050 <wl_closure_send>) at ../wayland-1.18.0/src/wayland-server.c:231
#1  0x00007f157422779c in wl_resource_post_event_array (resource=resource@entry=0x559670da6a00, opcode=opcode@entry=2, args=args@entry=0x7ffded0f0640) at ../wayland-1.18.0/src/wayland-server.c:238
#2  0x00007f1574227891 in wl_resource_post_event (resource=0x559670da6a00, opcode=2) at ../wayland-1.18.0/src/wayland-server.c:253
#3  0x00007f1577630770 in  () at /usr/lib/libKF5WaylandServer.so.5
#4  0x00007f1576ad2cde in QtPrivate::QSlotObjectBase::call(QObject*, void**) (a=0x7ffded0f0960, r=0x55966fe499d0, this=0x559670c6c240) at ../../include/QtCore/../../src/corelib/kernel/qobjectdefs_impl.h:398
#5  doActivate<false>(QObject*, int, void**) (sender=0x55966fd5f990, signal_index=8, argv=0x7ffded0f0960) at kernel/qobject.cpp:3870
#6  0x00007f1577600866 in KWayland::Server::SeatInterface::pointerPosChanged(QPointF const&) () at /usr/lib/libKF5WaylandServer.so.5
```
The error flag is set here because send_func fails. send_func is actually wl_closure_send; further inspection reveals that the error originates from sendmsg failing with Resource temporarily unavailable (11, EAGAIN). Disabling the error setting after send_func (or making it conditional on errno != EAGAIN) seems to fix the issue, but is not necessarily a good solution. Another option would be to unset client->error in wl_client_connection_data. In the end, if the client is able to message the server normally, I guess it doesn't deserve to be destroyed? (I say this without any knowledge of the protocol or libwayland's invariants.)
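The EAGAIN failure itself is easy to reproduce without a compositor. The sketch below assumes Linux and a non-blocking send on an AF_UNIX stream socket whose peer never reads, which is roughly the position of a suspended client. The names errno_when_peer_stalls and send_error_is_fatal are hypothetical; the second one only illustrates the errno != EAGAIN condition mentioned above and is not libwayland's actual logic.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical policy helper, not libwayland code: treat a failed send as
 * fatal unless it only means the socket buffer is currently full. */
bool send_error_is_fatal(int err)
{
    return err != EAGAIN && err != EWOULDBLOCK;
}

/* Hypothetical helper: keeps sending to a socket whose peer never reads and
 * returns the errno of the first failed non-blocking send (0 if none). */
int errno_when_peer_stalls(void)
{
    int sv[2];
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);

    char chunk[4096];
    memset(chunk, 0, sizeof chunk);

    int err = 0;
    /* sv[1] plays the suspended client: nothing ever reads from it. */
    for (int i = 0; i < 100000 && err == 0; i++) {
        if (send(sv[0], chunk, sizeof chunk, MSG_DONTWAIT) < 0)
            err = errno;
    }
    close(sv[0]);
    close(sv[1]);
    return err;
}
```

Here the connection is perfectly healthy when the send fails: the data would go through as soon as the peer started reading again. A check like send_error_is_fatal would let the sender tell that case apart from a real failure such as EPIPE, although the sender would then still need some policy for the events it could not deliver.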

This is all I could figure out. I hope it's useful.

Thank you very much!

Edited Apr 27, 2020 by magiblot
