I'm experiencing the window manager (KWin) hanging on a recursive invocation of _XReply() that happens like this: _XReply()->handle_error()->...->_XSeqSyncFunction()->_XReply()->ConditionWait(); see https://bugs.kde.org/show_bug.cgi?id=405592 for some stack traces. This happens on an ARM device, and somehow it is only reproducible when using a touchscreen as the input. To be clear, the hang also happens in any desktop environment (checked XFCE, LXDE, etc.), so libX11 is likely the offender.
I made a dumb hack to libX11 to avoid the recursive _XReply() invocation, and it 100% fixes the problem.
First, a quick comment on the problem that your patch introduces:
Each request that libX11 sends to the X11 server (implicitly) gets a sequence number. The first request has number 1 (or 0? I'm not sure), the second one 2, and so on. Events and errors contain the sequence number of the request they refer to so that they can be matched. So, if you send five requests and the third one fails (while the others do not cause any replies), the sequence number tells you which request failed.
In the X11 protocol, sequence numbers are a 16-bit field. So after 65536 requests, the sequence number wraps around and you can no longer identify which request a reply/error/event refers to, because you only get the last 16 bits of the sequence number. Thus, after sending 65536 requests without getting a reply, libX11 has to send a request that generates a reply (e.g. GetInputFocus) to make sure the wrap-around of sequence numbers is handled correctly.
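To illustrate why that extra reply-generating request matters, here is a minimal standalone sketch of the wrap-around bookkeeping (the function and variable names are made up for illustration, this is not libX11 code): only the low 16 bits of a sequence number arrive on the wire, so the library has to widen them against the last full sequence number it knows, and that only works if fewer than 65536 requests are outstanding.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Widen a 16-bit on-the-wire sequence number to the full 64-bit
 * request count, given the full sequence number of the most recent
 * request we know about. Assumes the wire number refers to a request
 * no more than 65535 requests older than last_full. */
static uint64_t widen_sequence(uint64_t last_full, uint16_t wire)
{
    uint64_t full = (last_full & ~(uint64_t)0xffff) | wire;
    if (full > last_full)
        full -= 0x10000; /* wire number is from before the wrap */
    return full;
}

int main(void)
{
    /* No wrap between the two: the wire number maps directly. */
    assert(widen_sequence(0x10005, 0x0003) == 0x10003);
    /* After a wrap: wire 0xfffe belongs to the previous 64K window. */
    assert(widen_sequence(0x20001, 0xfffe) == 0x1fffe);
    puts("ok");
    return 0;
}

If more than 65535 requests were outstanding, two different requests would map to the same wire number and this widening would pick the wrong one; that is exactly the ambiguity the forced GetInputFocus round-trip prevents.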
Your patch removes that code, so sequence number wrap-arounds would no longer be handled correctly. I am not completely sure what the symptoms of this would be, but it certainly will cause problems.
Now to the actual bug here:
_XReply reads from the connection to the X11 server and handles incoming events, errors and replies. Since X11 is asynchronous, errors can refer to older requests, hence there is a loop handling all the events/errors it can find.
According to your backtrace, an error was received and handled. Since there was an error callback set, Xlib unlocks the display, runs the error callback, and then locks the display again. This goes through _XLockDisplay, which then calls _XSeqSyncFunction; I guess this part is not visible in your backtrace due to tail-call optimisation. Anyway, on this "lock the thing" path, Xlib notices that the sequence numbers are close to wrap-around and tries to send a GetInputFocus request. However, the earlier calls already registered themselves as "we are handling replies/errors, do not interfere!", so the code here waits for "that other thread" to be done before it continues. Except there is no other thread: it is this thread itself, and thus a deadlock follows.
I guess one has to teach this "send a sync" code to somehow not try to handle events/errors again if it was reached from that function... or something like that. I don't really know how to fix this.
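One possible shape of such a guard, as a minimal standalone sketch (the flag and function names here are illustrative inventions, not libX11's actual internals): the path that is already draining replies/errors sets a marker, and the sequence-sync path skips the sync instead of waiting on itself.

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* A plain global stands in for a per-display field. */
static bool in_reply_handling;
static int syncs_sent;

static void seq_sync(void)
{
    if (in_reply_handling)
        return;       /* would deadlock: we ARE the reply handler */
    syncs_sent++;     /* stand-in for sending GetInputFocus */
}

static void run_error_callback(void)
{
    /* Re-locking the display after the user callback would trigger
     * the sequence sync; with the guard set, it becomes a no-op. */
    seq_sync();
}

static void xreply(void)
{
    in_reply_handling = true;
    run_error_callback(); /* an error arrives while we hold the role */
    in_reply_handling = false;
}

int main(void)
{
    xreply();
    assert(syncs_sent == 0); /* guarded: no recursive sync attempted */
    seq_sync();
    assert(syncs_sent == 1); /* normal path still syncs */
    puts("ok");
    return 0;
}

The trade-off is that a sync skipped this way has to happen later on some other path, otherwise the wrap-around protection described above is silently lost.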
Here is a small program that reproduces some deadlock:
#include <X11/Xlib.h>
#include <X11/Xlib-xcb.h>
#include <stdio.h>
#include <assert.h>
#include <unistd.h>

static int handler(Display *dpy, XErrorEvent *ev)
{
    return 0;
}

int main()
{
    Window invalid_window = 42;
    Display *dpy;
    xcb_connection_t *con;

    XInitThreads();
    XSetErrorHandler(handler);
    dpy = XOpenDisplay(NULL);
    con = XGetXCBConnection(dpy);
    while (1) {
        printf("0x%lx 0x%lx\n", NextRequest(dpy), XNextRequest(dpy));
        for (int i = 0; i < (1 << 17); i++) {
            xcb_no_operation(con);
        }
        puts("done with forwarding of sequence");
        for (int i = 0; i < 10; i++) {
            XCreateSimpleWindow(dpy, invalid_window, 10, 10, 100, 100, 0, 0, 0);
        }
        puts("done with causing errors");
        //assert((NextRequest(dpy) & 0xffff) == 0xfffa);
        XFlush(dpy);
        sleep(1);
        for (int i = 0; i < 1000; i++) {
            XCreateSimpleWindow(dpy, invalid_window, 10, 10, 100, 100, 0, 0, 0);
        }
        puts("loop done");
    }
    XCloseDisplay(dpy);
    return 0;
}
I needed XCB for the reproducer. Without it, sync_hazard catches the problem earlier and no deadlock occurs. The backtrace for the deadlock looks like this:
(gdb) bt
#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x558bd69c42c8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x558bd69c35c0, cond=0x558bd69c42a0) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x558bd69c42a0, mutex=0x558bd69c35c0) at pthread_cond_wait.c:655
#3  0x00007f50b5375ff2 in _XReply () from /usr/lib/x86_64-linux-gnu/libX11.so.6
#4  0x00007f50b5378959 in ?? () from /usr/lib/x86_64-linux-gnu/libX11.so.6
#5  0x00007f50b537812e in _XError () from /usr/lib/x86_64-linux-gnu/libX11.so.6
#6  0x00007f50b5375077 in ?? () from /usr/lib/x86_64-linux-gnu/libX11.so.6
#7  0x00007f50b537511d in ?? () from /usr/lib/x86_64-linux-gnu/libX11.so.6
#8  0x00007f50b5376050 in _XReply () from /usr/lib/x86_64-linux-gnu/libX11.so.6
#9  0x00007f50b5378959 in ?? () from /usr/lib/x86_64-linux-gnu/libX11.so.6
#10 0x00007f50b537812e in _XError () from /usr/lib/x86_64-linux-gnu/libX11.so.6
#11 0x00007f50b5375077 in ?? () from /usr/lib/x86_64-linux-gnu/libX11.so.6
#12 0x00007f50b537511d in ?? () from /usr/lib/x86_64-linux-gnu/libX11.so.6
#13 0x00007f50b5375a55 in _XEventsQueued () from /usr/lib/x86_64-linux-gnu/libX11.so.6
#14 0x00007f50b53787f5 in _XGetRequest () from /usr/lib/x86_64-linux-gnu/libX11.so.6
#15 0x00007f50b5354bc2 in XCreateSimpleWindow () from /usr/lib/x86_64-linux-gnu/libX11.so.6
#16 0x0000558bd53ca1ad in main () at test.c:31
Also, here is my version of a patch that fixes the deadlock:
@psychon, I tested the patch and can confirm that it indeed fixes the problem (I added a printf for when the recursive _XSeqSyncFunction call occurs). Thank you very much; please get the fix applied.
I meant s/_XSeqSyncFunction/sync_while_locked/, of course. Also, to be clearer: I found a way to reproduce the bug more reliably with KWin and tested that it no longer hangs with your patch.
Feel free. I'd suggest to add a link to this issue for the origin of the patch.
Note that this flags field is public, so this more-or-less introduces new public API. If something else (ab)uses the flags field for its own purposes, that could lead to conflicts and problems. This is one of the reasons that I am not too happy with the flags solution. For me, it was mostly a quick check on "Did we understand this issue correctly?".
> Feel free. I'd suggest to add a link to this issue for the origin of the patch.
Thanks, I'll add the link.
> Note that this flags field is public, so this more-or-less introduces new public API. If something else (ab)uses the flags field for its own purposes, that could lead to conflicts and problems. This is one of the reason that I am not too happy with the flags solution. For me, it was mostly a quick check on "Did we understand this issue correctly?".
Oh okay, I didn't notice that the flags are part of the public API. I will take a closer look then.
I was able to compile the libX11-1.6.9 build from libX11-1.6.9.tar.gz at https://www.x.org/releases/individual/lib/ and freeze it by increasing the CPU load of the Qt application, quickly switching between two buttons in the HMI through touch.
Other related components on the machine are as follows: Qt 5.9.7, KWin 4.11.19-8, libXi 1.7.9-1, xinput 1.6.2 and x11-server 1.20.1-5.6.
[local@bues3-sun724 ~]$ pstack 2883
Thread 3 (Thread 0x7f994c3af700 (LWP 2885)):
#0 0x00007f995cafd20d in poll () from /lib64/libc.so.6
#1 0x00007f995539c082 in _xcb_conn_wait () from /lib64/libxcb.so.1
#2 0x00007f995539de6f in xcb_wait_for_event () from /lib64/libxcb.so.1
#3 0x00007f994eb6fe19 in QXcbEventReader::run() () from /lib64/libQt5XcbQpa.so.5
#4 0x00007f995d85be71 in QThreadPrivate::start(void*) () from /lib64/libQt5Core.so.5
#5 0x00007f995d2f6dd5 in start_thread () from /lib64/libpthread.so.0
#6 0x00007f995cb07ead in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f9949d6b700 (LWP 3041)):
#0 0x00007f995cafd20d in poll () from /lib64/libc.so.6
#1 0x00007f9958bf5c4c in g_main_context_iterate.isra.22 () from /lib64/libglib-2.0.so.0
#2 0x00007f9958bf5d7c in g_main_context_iteration () from /lib64/libglib-2.0.so.0
#4 0x00007f995da046db in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () from /lib64/libQt5Core.so.5
#5 0x00007f995d8577f8 in QThread::exec() () from /lib64/libQt5Core.so.5
#6 0x00007f994eabb3b5 in QDBusConnectionManager::run() () from /lib64/libQt5DBus.so.5
#7 0x00007f995d85be71 in QThreadPrivate::start(void*) () from /lib64/libQt5Core.so.5
#8 0x00007f995d2f6dd5 in start_thread () from /lib64/libpthread.so.0
#9 0x00007f995cb07ead in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f995ff27900 (LWP 2883)):
#0 0x00007f995d2fa965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f995603a685 in _XReply (dpy=dpy@entry=0xb3dd60, rep=rep@entry=0x7ffcea17fbc0, extra=extra@entry=0, discard=discard@entry=1) at xcb_io.c:603
#2 0x00007f995603cf33 in _XSeqSyncFunction (dpy=0xb3dd60) at XlibInt.c:224
#3 0x00007f995603c753 in _XError (dpy=dpy@entry=0xb3dd60, rep=rep@entry=0x7f9944004500) at XlibInt.c:1493
#4 0x00007f9956039867 in handle_error (dpy=0xb3dd60, err=0x7f9944004500, in_XReply=<optimized out>) at xcb_io.c:199
#5 0x00007f9956039915 in handle_response (dpy=0xb3dd60, response=0x7f9944004500, in_XReply=<optimized out>) at xcb_io.c:324
#6 0x00007f995603a75a in _XReply (dpy=0xb3dd60, rep=0x7ffcea17fdf0, extra=0, discard=0) at xcb_io.c:656
#7 0x00007f994e69e2fe in XIQueryDevice () from /lib64/libXi.so.6
#8 0x00007f994eb972bf in QXcbConnection::handleEnterEvent() () from /lib64/libQt5XcbQpa.so.5
#9 0x00007f994eb82c47 in QXcbWindow::handleEnterNotifyEvent(int, int, int, int, unsigned char, unsigned char, unsigned int) () from /lib64/libQt5XcbQpa.so.5
#10 0x00007f994eb83193 in QXcbWindow::handleXIEnterLeave(xcb_ge_event_t*) () from /lib64/libQt5XcbQpa.so.5
#11 0x00007f994eb99f8c in QXcbConnection::xi2HandleEvent(xcb_ge_event_t*) () from /lib64/libQt5XcbQpa.so.5
#12 0x00007f994eb6e2a2 in QXcbConnection::handleXcbEvent(xcb_generic_event_t*) () from /lib64/libQt5XcbQpa.so.5
#13 0x00007f994eb7019e in QXcbConnection::processXcbEvents() () from /lib64/libQt5XcbQpa.so.5
#14 0x00007f995da2f1de in QObject::event(QEvent*) () from /lib64/libQt5Core.so.5
#15 0x00007f995e3fdd8c in QApplicationPrivate::notify_helper(QObject*, QEvent*) () from /lib64/libQt5Widgets.so.5
#16 0x00007f995e404f68 in QApplication::notify(QObject*, QEvent*) () from /lib64/libQt5Widgets.so.5
#17 0x00007f995da05be6 in QCoreApplication::notifyInternal2(QObject*, QEvent*) () from /lib64/libQt5Core.so.5
#18 0x00007f995da08503 in QCoreApplicationPrivate::sendPostedEvents(QObject*, int, QThreadData*) () from /lib64/libQt5Core.so.5
#19 0x00007f995da54ab3 in postEventSourceDispatch(_GSource*, int (*)(void*), void*) () from /lib64/libQt5Core.so.5
#20 0x00007f9958bf5969 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#21 0x00007f9958bf5cc8 in g_main_context_iterate.isra.22 () from /lib64/libglib-2.0.so.0
#22 0x00007f9958bf5d7c in g_main_context_iteration () from /lib64/libglib-2.0.so.0
#23 0x00007f995da5445c in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /lib64/libQt5Core.so.5
#24 0x00007f995da046db in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () from /lib64/libQt5Core.so.5
#25 0x00007f995da0cc04 in QCoreApplication::exec() () from /lib64/libQt5Core.so.5
#26 0x00000000004263a9 in main ()
[local@bues3-sun724 ~]$
Yes, I ran make install and made sure the build was loaded into /lib64. I verified using lsof that the Qt application was definitely loading it. But I only installed libX11.so.6.3.0 on the target system, thinking that it alone was the culprit, as indicated in the pstack.
Manually copying the libs is not a correct way to install them; in that case you probably need to point the symlinks at the new libraries. Secondly, you need to install all the libs. Thirdly, the library with the fix should be libX11-xcb.so (IIUC).
Since the target machine is not connected to the internet (isolated network), I couldn't do as suggested for now. But I copied all the binaries from /home/local/libx11_src/libX11-1.6.9/install_pkg/lib/ to the /usr/lib64/ folder and restarted the machine. It was the Qt application's main thread that got deadlocked, waiting on the condition below inside the if (req != current && req->reply_waiter) branch:
ConditionWait(dpy, dpy->xcb->reply_notify);
Thread stack for reference:
Thread 1 (Thread 0x7f995ff27900 (LWP 2883)):
#0 0x00007f995d2fa965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f995603a685 in _XReply (dpy=dpy@entry=0xb3dd60, rep=rep@entry=0x7ffcea17fbc0, extra=extra@entry=0, discard=discard@entry=1) at xcb_io.c:603
#2 0x00007f995603cf33 in _XSeqSyncFunction (dpy=0xb3dd60) at XlibInt.c:224
#3 0x00007f995603c753 in _XError (dpy=dpy@entry=0xb3dd60, rep=rep@entry=0x7f9944004500) at XlibInt.c:1493
#4 0x00007f9956039867 in handle_error (dpy=0xb3dd60, err=0x7f9944004500, in_XReply=<optimized out>) at xcb_io.c:199
#5 0x00007f9956039915 in handle_response (dpy=0xb3dd60, response=0x7f9944004500, in_XReply=<optimized out>) at xcb_io.c:324
#6 0x00007f995603a75a in _XReply (dpy=0xb3dd60, rep=0x7ffcea17fdf0, extra=0, discard=0) at xcb_io.c:656
#7 0x00007f994e69e2fe in XIQueryDevice () from /lib64/libXi.so.6
#8 0x00007f994eb972bf in QXcbConnection::handleEnterEvent() () from /lib64/libQt5XcbQpa.so.5
But note that after this patch is applied, the issue has become a lot harder to reproduce (or the issue is rare). I even added a log as below to be sure it hits the newly modified code, and it did hit it:
if (req != current && req->reply_waiter) {
    sprintf(temp_string, "_XReply: dpy->xcb->reply_notify = 0x%lx", dpy->xcb->reply_notify);
    printf("%s\n", temp_string);
    printf("before ConditionWait(dpy, dpy->xcb->reply_notify);");
    fflush(stdout);
@sparaddi You could also try to compile #93 (comment 160109). If it works, then likely the proper library is being picked up, meaning that something else is probably causing your problem.
Isn't libX11-1.6.9.tar.gz at https://www.x.org/releases/individual/lib/ the result of the fix mentioned in your reply above? If so, that is what I have loaded. I even added a log as below to be sure it hits the newly modified code, and it did hit it:
if (req != current && req->reply_waiter)
{
    sprintf(temp_string, "_XReply: dpy->xcb->reply_notify = 0x%lx", dpy->xcb->reply_notify);
    printf("%s\n", temp_string);
    printf("before ConditionWait(dpy, dpy->xcb->reply_notify);");
    fflush(stdout);
It could be that there is some other problem in libX11 that causes a similar lockup, or maybe the current fix just doesn't cover all possible cases. You'll have to collect more details about the conditions under which the lockup happens, but first it would be good to ensure that you're using the proper library version by testing the reproducer from Uli.
Sorry, I couldn't understand "proper library version by testing the reproducer from Uli", particularly the "by testing the reproducer from Uli" part of it.
Yes, I am able to run it, and I see the following output continuously. I was also able to freeze the Qt HMI application while this test application was running; the test application itself didn't freeze and kept printing output as follows ("loop done"):
done with forwarding of sequence
done with causing errors
loop done
0x28d021b 0x28d021b
done with forwarding of sequence
done with causing errors
loop done
0x28f0610 0x28f0610
done with forwarding of sequence
done with causing errors
loop done
0x2910a05 0x2910a05
done with forwarding of sequence
done with causing errors
loop done
And the stack output of the Qt HMI application is as follows:
[local@bues3-sun724 ~]$ pstack 14821
Thread 3 (Thread 0x7f3373067700 (LWP 14823)):
#0 0x00007f33837b520d in poll () from /lib64/libc.so.6
#1 0x00007f337c054082 in _xcb_conn_wait () from /lib64/libxcb.so.1
#2 0x00007f337c055e6f in xcb_wait_for_event () from /lib64/libxcb.so.1
#3 0x00007f3375827e19 in QXcbEventReader::run() () from /lib64/libQt5XcbQpa.so.5
#4 0x00007f3384513e71 in QThreadPrivate::start(void*) () from /lib64/libQt5Core.so.5
#5 0x00007f3383faedd5 in start_thread () from /lib64/libpthread.so.0
#6 0x00007f33837bfead in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f3370a23700 (LWP 14824)):
#0 0x00007f33837b520d in poll () from /lib64/libc.so.6
#1 0x00007f337f8adc4c in g_main_context_iterate.isra.22 () from /lib64/libglib-2.0.so.0
#2 0x00007f337f8add7c in g_main_context_iteration () from /lib64/libglib-2.0.so.0
Looks like the real problem is that the recursive _XReply() handles requests in the wrong order, and that causes the lockup. dequeue_pending_request() suggests that requests should be handled in LIFO order and not FIFO.
I don't quite understand why _XErrorFunction() should be called from _XError() with the display unlocked. If the error handler tries to issue a request, then it should lock up by itself, like it happens in the case of the sync that occurs on the display's re-locking after _XErrorFunction() completes.
Thus I suppose that this should work for everyone:
Although, maybe it won't hurt to leave user_lock_display untouched. I'm not very familiar with that code and don't really know how the locking works in libX11; it would be nice if somebody more familiar with it all could take a look.
Interesting.. I just spotted InternalLockDisplay(), and the comment says it's useful for _XReply(). It's the same as LockDisplay() but without syncing :)
Oh, wait! The user's lock is actually taken and the internal lock is unlocked before invoking _XErrorFunction. So it looks like #93 (comment 257470) should be correct.
Ah, although it could be that the idea of having _XErrorFunction run unlocked is simply to allow other threads to do something while the error is handled. In that case everything should be okay.
@sparaddi Please let me know the final results of the testing. I'll also test it more thoroughly and then make a proper patch once the testing is done.
Additional testing has not resulted in the freeze so far. I will continue to test and update the results.
Note: before the fix #93 (comment 257470), during testing I observed that the Qt application was not responding to touch events (not sure whether the touch events were not being recognized, or processing was happening in the graphics thread), but I observed the following in the libX11 log:
_XReply: error->error_code = 0x9
The libX11 code that was instrumented for this was as follows:
sprintf(temp_string, "_XReply: error->error_code = 0x%x", error->error_code);
printf("%s\n", temp_string);
fflush(stdout);
/* do not die on "no such font", "can't allocate", "can't grab" failures */
switch (error->error_code) {
Can you please elaborate on what that error 9 means?
Also before this fix, it was observed that the Qt application, when left idle for a few days, had become very slow to respond to touch events. Again, I am not sure where the problem lies, but I thought to have it documented here, in case these symptoms are related to the changes being made. This use case will also be tested further after the fix #93 (comment 257470).
There is no need to take the user locks because they are already taken, and thus it's a no-op, since _XDisplayLockWait() checks whether the lock is held by the same thread and bails out if that's the case.
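As a rough standalone illustration of that same-thread bail-out (the struct, field, and function names are invented for the sketch; this is not libX11's actual _XDisplayLockWait):

#include <assert.h>
#include <pthread.h>
#include <stdio.h>

/* Hypothetical lock record tracking which thread currently owns it. */
struct display_lock {
    pthread_t owner;
    int       held;
};

static void lock_wait(struct display_lock *l)
{
    if (l->held && pthread_equal(l->owner, pthread_self()))
        return; /* this thread already owns the lock: no-op, no wait */
    /* ...otherwise we would block here until the owner releases it... */
    l->held = 1;
    l->owner = pthread_self();
}

int main(void)
{
    struct display_lock l = {0};
    lock_wait(&l); /* first acquisition */
    lock_wait(&l); /* re-entry from the same thread returns at once */
    assert(l.held == 1);
    puts("ok");
    return 0;
}

This is why taking the lock again on the error path doesn't deadlock by itself: the deadlock in this issue comes from the reply-waiter condition wait, not from the display lock.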
Is the following call to ConditionWait() in _XReply() required in the case where there is only one Qt application, with a single thread using the Display, on the whole computer?
if (req != current && req->reply_waiter)
{
    ConditionWait(dpy, dpy->xcb->reply_notify);
}
Basically, I am trying to ask: if the above code is removed from _XReply(), could an application freeze still occur in the use case where a single Qt application, with only one of its threads, is using the display?