I'm experiencing the window manager (KWin) hanging on a recursive invocation of _XReply() that happens like this: _XReply()->handle_error()->...->_XSeqSyncFunction()->_XReply()->ConditionWait(); see https://bugs.kde.org/show_bug.cgi?id=405592 for some stack traces. This happens on an ARM device, and somehow it is only reproducible when using a touchscreen as the input. To be clear, the hang also happens in any desktop environment (checked XFCE, LXDE, etc.), so libX11 is likely the offender.
I made a dumb hack to libX11 to avoid the recursive _XReply() invocation, and it 100% fixes the problem.
First, a quick comment on the problem that your patch introduces:
Each request that libX11 sends to the X11 server (implicitly) gets a sequence number. The first request has number 1 (or 0? I'm not sure), the second one 2, and so on. Events and errors contain the sequence number of the request they refer to so that they can be matched. So, if you send five requests and the third one fails (while the others do not cause any replies), the sequence number tells you which request failed.
In the X11 protocol, sequence numbers are a 16-bit field. So after 65536 requests, the sequence number wraps around and you can no longer identify which request a reply/error/event refers to, because you only get the last 16 bits of the sequence number. Thus, after sending 65536 requests without getting a reply, libX11 has to send a request that generates a reply (e.g. GetInputFocus) to make sure the wrap-around of sequence numbers is handled correctly.
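To illustrate why that extra reply-generating request matters, here is a minimal standalone sketch of the wrap-around bookkeeping (the function and variable names are made up for illustration, this is not libX11 code): only the low 16 bits of a sequence number arrive on the wire, so the library has to widen them against the last full sequence number it knows, and that only works if fewer than 65536 requests are outstanding.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Widen a 16-bit on-the-wire sequence number to the full 64-bit
 * request count, given the full sequence number of the most recent
 * request we know about. Assumes the wire number refers to a request
 * no more than 65535 requests older than last_full. */
static uint64_t widen_sequence(uint64_t last_full, uint16_t wire)
{
    uint64_t full = (last_full & ~(uint64_t)0xffff) | wire;
    if (full > last_full)
        full -= 0x10000; /* wire number is from before the wrap */
    return full;
}

int main(void)
{
    /* No wrap between the two: the wire number maps directly. */
    assert(widen_sequence(0x10005, 0x0003) == 0x10003);
    /* After a wrap: wire 0xfffe belongs to the previous 64K window. */
    assert(widen_sequence(0x20001, 0xfffe) == 0x1fffe);
    puts("ok");
    return 0;
}

If more than 65535 requests were outstanding, two different requests would map to the same wire number and this widening would pick the wrong one; that is exactly the ambiguity the forced GetInputFocus round-trip prevents.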
Your patch removes that code, so sequence number wrap-arounds would no longer be handled correctly. I am not completely sure what the symptoms of this would be, but it certainly will cause problems.
Now to the actual bug here:
_XReply reads from the connection to the X11 server and handles incoming events, errors and replies. Since X11 is asynchronous, errors can refer to older requests, hence there is a loop handling all the events/errors it can find.
According to your backtrace, an error was received and handled. Since there was an error callback set, Xlib unlocks the display, runs the error callback, and then locks the display again. This goes through _XLockDisplay, which then calls _XSeqSyncFunction; I guess this part is not visible in your backtrace due to tail-call optimisation. Anyway, on this "lock the thing" path, Xlib notices that the sequence numbers are close to wrap-around and tries to send a GetInputFocus request. However, the earlier calls already registered themselves as "we are handling replies/errors, do not interfere!", so the code here waits for "that other thread" to be done before it continues. Except there is no other thread: it is this thread itself, and thus a deadlock follows.
I guess one has to teach this "send a sync" code to somehow not try to handle events/errors again if it was reached from that function... or something like that. I don't really know how to fix this.
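One possible shape of such a guard, as a minimal standalone sketch (the flag and function names here are illustrative inventions, not libX11's actual internals): the path that is already draining replies/errors sets a marker, and the sequence-sync path skips the sync instead of waiting on itself.

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* A plain global stands in for a per-display field. */
static bool in_reply_handling;
static int syncs_sent;

static void seq_sync(void)
{
    if (in_reply_handling)
        return;       /* would deadlock: we ARE the reply handler */
    syncs_sent++;     /* stand-in for sending GetInputFocus */
}

static void run_error_callback(void)
{
    /* Re-locking the display after the user callback would trigger
     * the sequence sync; with the guard set, it becomes a no-op. */
    seq_sync();
}

static void xreply(void)
{
    in_reply_handling = true;
    run_error_callback(); /* an error arrives while we hold the role */
    in_reply_handling = false;
}

int main(void)
{
    xreply();
    assert(syncs_sent == 0); /* guarded: no recursive sync attempted */
    seq_sync();
    assert(syncs_sent == 1); /* normal path still syncs */
    puts("ok");
    return 0;
}

The trade-off is that a sync skipped this way has to happen later on some other path, otherwise the wrap-around protection described above is silently lost.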
Here is a small program that reproduces some deadlock:
#include <X11/Xlib.h>
#include <X11/Xlib-xcb.h>
#include <stdio.h>
#include <assert.h>
#include <unistd.h>

static int handler(Display *dpy, XErrorEvent *ev)
{
    return 0;
}

int main()
{
    Window invalid_window = 42;
    Display *dpy;
    xcb_connection_t *con;

    XInitThreads();
    XSetErrorHandler(handler);
    dpy = XOpenDisplay(NULL);
    con = XGetXCBConnection(dpy);
    while (1) {
        printf("0x%lx 0x%lx\n", NextRequest(dpy), XNextRequest(dpy));
        for (int i = 0; i < (1 << 17); i++) {
            xcb_no_operation(con);
        }
        puts("done with forwarding of sequence");
        for (int i = 0; i < 10; i++) {
            XCreateSimpleWindow(dpy, invalid_window, 10, 10, 100, 100, 0, 0, 0);
        }
        puts("done with causing errors");
        //assert((NextRequest(dpy) & 0xffff) == 0xfffa);
        XFlush(dpy);
        sleep(1);
        for (int i = 0; i < 1000; i++) {
            XCreateSimpleWindow(dpy, invalid_window, 10, 10, 100, 100, 0, 0, 0);
        }
        puts("loop done");
    }
    XCloseDisplay(dpy);
    return 0;
}
I needed XCB for the reproducer. Without it, sync_hazard catches the problem earlier and no deadlock occurs. The backtrace for the deadlock looks like this:
(gdb) bt
#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x558bd69c42c8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x558bd69c35c0, cond=0x558bd69c42a0) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x558bd69c42a0, mutex=0x558bd69c35c0) at pthread_cond_wait.c:655
#3  0x00007f50b5375ff2 in _XReply () from /usr/lib/x86_64-linux-gnu/libX11.so.6
#4  0x00007f50b5378959 in ?? () from /usr/lib/x86_64-linux-gnu/libX11.so.6
#5  0x00007f50b537812e in _XError () from /usr/lib/x86_64-linux-gnu/libX11.so.6
#6  0x00007f50b5375077 in ?? () from /usr/lib/x86_64-linux-gnu/libX11.so.6
#7  0x00007f50b537511d in ?? () from /usr/lib/x86_64-linux-gnu/libX11.so.6
#8  0x00007f50b5376050 in _XReply () from /usr/lib/x86_64-linux-gnu/libX11.so.6
#9  0x00007f50b5378959 in ?? () from /usr/lib/x86_64-linux-gnu/libX11.so.6
#10 0x00007f50b537812e in _XError () from /usr/lib/x86_64-linux-gnu/libX11.so.6
#11 0x00007f50b5375077 in ?? () from /usr/lib/x86_64-linux-gnu/libX11.so.6
#12 0x00007f50b537511d in ?? () from /usr/lib/x86_64-linux-gnu/libX11.so.6
#13 0x00007f50b5375a55 in _XEventsQueued () from /usr/lib/x86_64-linux-gnu/libX11.so.6
#14 0x00007f50b53787f5 in _XGetRequest () from /usr/lib/x86_64-linux-gnu/libX11.so.6
#15 0x00007f50b5354bc2 in XCreateSimpleWindow () from /usr/lib/x86_64-linux-gnu/libX11.so.6
#16 0x0000558bd53ca1ad in main () at test.c:31
Also, here is my version of a patch that fixes the deadlock:
@psychon, I tested the patch and can confirm that it indeed fixes the problem (I added a printf for when the recursive _XSeqSyncFunction call occurs). Thank you very much; please get the fix applied.
I meant s/_XSeqSyncFunction/sync_while_locked/, of course. Also, to be clearer: I found a way to reproduce the bug more reliably with KWin and tested that it no longer hangs with your patch.
Feel free. I'd suggest to add a link to this issue for the origin of the patch.
Note that this flags field is public, so this more-or-less introduces new public API. If something else (ab)uses the flags field for its own purposes, that could lead to conflicts and problems. This is one of the reasons that I am not too happy with the flags solution. For me, it was mostly a quick check on "Did we understand this issue correctly?".
> Feel free. I'd suggest to add a link to this issue for the origin of the patch.
Thanks, I'll add the link.
> Note that this flags field is public, so this more-or-less introduces new public API. If something else (ab)uses the flags field for its own purposes, that could lead to conflicts and problems. This is one of the reason that I am not too happy with the flags solution. For me, it was mostly a quick check on "Did we understand this issue correctly?".
Oh okay, I didn't notice that the flags are part of the public API. I will take a closer look then.
I was able to compile the libX11-1.6.9 build from libX11-1.6.9.tar.gz at https://www.x.org/releases/individual/lib/ and freeze it by increasing the CPU load of the Qt application, quickly switching between two buttons in the HMI through touch.
Other related components on the machine are as follows: Qt 5.9.7, KWin 4.11.19-8, libXi 1.7.9-1, xinput 1.6.2 and x11-server 1.20.1-5.6.
[local@bues3-sun724 ~]$ pstack 2883
Thread 3 (Thread 0x7f994c3af700 (LWP 2885)):
#0 0x00007f995cafd20d in poll () from /lib64/libc.so.6
#1 0x00007f995539c082 in _xcb_conn_wait () from /lib64/libxcb.so.1
#2 0x00007f995539de6f in xcb_wait_for_event () from /lib64/libxcb.so.1
#3 0x00007f994eb6fe19 in QXcbEventReader::run() () from /lib64/libQt5XcbQpa.so.5
#4 0x00007f995d85be71 in QThreadPrivate::start(void*) () from /lib64/libQt5Core.so.5
#5 0x00007f995d2f6dd5 in start_thread () from /lib64/libpthread.so.0
#6 0x00007f995cb07ead in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f9949d6b700 (LWP 3041)):
#0 0x00007f995cafd20d in poll () from /lib64/libc.so.6
#1 0x00007f9958bf5c4c in g_main_context_iterate.isra.22 () from /lib64/libglib-2.0.so.0
#2 0x00007f9958bf5d7c in g_main_context_iteration () from /lib64/libglib-2.0.so.0
#4 0x00007f995da046db in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () from /lib64/libQt5Core.so.5
#5 0x00007f995d8577f8 in QThread::exec() () from /lib64/libQt5Core.so.5
#6 0x00007f994eabb3b5 in QDBusConnectionManager::run() () from /lib64/libQt5DBus.so.5
#7 0x00007f995d85be71 in QThreadPrivate::start(void*) () from /lib64/libQt5Core.so.5
#8 0x00007f995d2f6dd5 in start_thread () from /lib64/libpthread.so.0
#9 0x00007f995cb07ead in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f995ff27900 (LWP 2883)):
#0 0x00007f995d2fa965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f995603a685 in _XReply (dpy=dpy@entry=0xb3dd60, rep=rep@entry=0x7ffcea17fbc0, extra=extra@entry=0, discard=discard@entry=1) at xcb_io.c:603
#2 0x00007f995603cf33 in _XSeqSyncFunction (dpy=0xb3dd60) at XlibInt.c:224
#3 0x00007f995603c753 in _XError (dpy=dpy@entry=0xb3dd60, rep=rep@entry=0x7f9944004500) at XlibInt.c:1493
#4 0x00007f9956039867 in handle_error (dpy=0xb3dd60, err=0x7f9944004500, in_XReply=<optimized out>) at xcb_io.c:199
#5 0x00007f9956039915 in handle_response (dpy=0xb3dd60, response=0x7f9944004500, in_XReply=<optimized out>) at xcb_io.c:324
#6 0x00007f995603a75a in _XReply (dpy=0xb3dd60, rep=0x7ffcea17fdf0, extra=0, discard=0) at xcb_io.c:656
#7 0x00007f994e69e2fe in XIQueryDevice () from /lib64/libXi.so.6
#8 0x00007f994eb972bf in QXcbConnection::handleEnterEvent() () from /lib64/libQt5XcbQpa.so.5
#9 0x00007f994eb82c47 in QXcbWindow::handleEnterNotifyEvent(int, int, int, int, unsigned char, unsigned char, unsigned int) () from /lib64/libQt5XcbQpa.so.5
#10 0x00007f994eb83193 in QXcbWindow::handleXIEnterLeave(xcb_ge_event_t*) () from /lib64/libQt5XcbQpa.so.5
#11 0x00007f994eb99f8c in QXcbConnection::xi2HandleEvent(xcb_ge_event_t*) () from /lib64/libQt5XcbQpa.so.5
#12 0x00007f994eb6e2a2 in QXcbConnection::handleXcbEvent(xcb_generic_event_t*) () from /lib64/libQt5XcbQpa.so.5
#13 0x00007f994eb7019e in QXcbConnection::processXcbEvents() () from /lib64/libQt5XcbQpa.so.5
#14 0x00007f995da2f1de in QObject::event(QEvent*) () from /lib64/libQt5Core.so.5
#15 0x00007f995e3fdd8c in QApplicationPrivate::notify_helper(QObject*, QEvent*) () from /lib64/libQt5Widgets.so.5
#16 0x00007f995e404f68 in QApplication::notify(QObject*, QEvent*) () from /lib64/libQt5Widgets.so.5
#17 0x00007f995da05be6 in QCoreApplication::notifyInternal2(QObject*, QEvent*) () from /lib64/libQt5Core.so.5
#18 0x00007f995da08503 in QCoreApplicationPrivate::sendPostedEvents(QObject*, int, QThreadData*) () from /lib64/libQt5Core.so.5
#19 0x00007f995da54ab3 in postEventSourceDispatch(_GSource*, int (*)(void*), void*) () from /lib64/libQt5Core.so.5
#20 0x00007f9958bf5969 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#21 0x00007f9958bf5cc8 in g_main_context_iterate.isra.22 () from /lib64/libglib-2.0.so.0
#22 0x00007f9958bf5d7c in g_main_context_iteration () from /lib64/libglib-2.0.so.0
#23 0x00007f995da5445c in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /lib64/libQt5Core.so.5
#24 0x00007f995da046db in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () from /lib64/libQt5Core.so.5
#25 0x00007f995da0cc04 in QCoreApplication::exec() () from /lib64/libQt5Core.so.5
#26 0x00000000004263a9 in main ()
[local@bues3-sun724 ~]$
Yes, I ran make install and made sure the build was loaded into /lib64. I verified using lsof that the Qt application was definitely loading it. But I only installed libX11.so.6.3.0 on the target system, thinking that it alone was the culprit, as indicated in the pstack.
Manually copying the libs is not a correct way to install them; in that case you probably need to point the symlinks at the new libraries. Secondly, you need to install all the libs. Thirdly, the library with the fix should be libX11-xcb.so (IIUC).
Since the target machine is not connected to the internet (isolated network), I couldn't do as suggested for now. But I copied all the binaries from /home/local/libx11_src/libX11-1.6.9/install_pkg/lib/ to the /usr/lib64/ folder and restarted the machine. It was the Qt application's main thread that got deadlocked, waiting on the condition below inside the if (req != current && req->reply_waiter) branch:
ConditionWait(dpy, dpy->xcb->reply_notify);
Thread stack for reference:
Thread 1 (Thread 0x7f995ff27900 (LWP 2883)):
#0 0x00007f995d2fa965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f995603a685 in _XReply (dpy=dpy@entry=0xb3dd60, rep=rep@entry=0x7ffcea17fbc0, extra=extra@entry=0, discard=discard@entry=1) at xcb_io.c:603
#2 0x00007f995603cf33 in _XSeqSyncFunction (dpy=0xb3dd60) at XlibInt.c:224
#3 0x00007f995603c753 in _XError (dpy=dpy@entry=0xb3dd60, rep=rep@entry=0x7f9944004500) at XlibInt.c:1493
#4 0x00007f9956039867 in handle_error (dpy=0xb3dd60, err=0x7f9944004500, in_XReply=<optimized out>) at xcb_io.c:199
#5 0x00007f9956039915 in handle_response (dpy=0xb3dd60, response=0x7f9944004500, in_XReply=<optimized out>) at xcb_io.c:324
#6 0x00007f995603a75a in _XReply (dpy=0xb3dd60, rep=0x7ffcea17fdf0, extra=0, discard=0) at xcb_io.c:656
#7 0x00007f994e69e2fe in XIQueryDevice () from /lib64/libXi.so.6
#8 0x00007f994eb972bf in QXcbConnection::handleEnterEvent() () from /lib64/libQt5XcbQpa.so.5
But note that after this patch is applied, the issue has become a lot harder to reproduce (or the issue is rare). I even added a log as below to be sure it hits the newly modified code, and it did hit it:
if (req != current && req->reply_waiter) {
    sprintf(temp_string, "_XReply: dpy->xcb->reply_notify = 0x%lx", dpy->xcb->reply_notify);
    printf("%s\n", temp_string);
    printf("before ConditionWait(dpy, dpy->xcb->reply_notify);");
    fflush(stdout);
@sparaddi You could also try to compile #93 (comment 160109). If it works, then likely the proper library is being picked up, meaning that something else is probably causing your problem.
Isn't libX11-1.6.9.tar.gz at https://www.x.org/releases/individual/lib/ the result of the fix mentioned in your reply above? If so, that is what I have loaded. I even added a log as below to be sure it hits the newly modified code, and it did hit it:
if (req != current && req->reply_waiter)
{
    sprintf(temp_string, "_XReply: dpy->xcb->reply_notify = 0x%lx", dpy->xcb->reply_notify);
    printf("%s\n", temp_string);
    printf("before ConditionWait(dpy, dpy->xcb->reply_notify);");
    fflush(stdout);
It could be that there is some other problem in libX11 that causes a similar lockup, or maybe the current fix just doesn't cover all possible cases. You'll have to collect more details about the conditions under which the lockup happens, but first it would be good to ensure that you're using the proper library version by testing the reproducer from Uli.
Sorry, I couldn't understand "proper library version by testing the reproducer from Uli", particularly the "by testing the reproducer from Uli" part of it.
Yes, I am able to run it, and I see the following output continuously. I was also able to freeze the Qt HMI application while this test application was running; the test application itself didn't freeze and kept printing output as follows ("loop done"):
done with forwarding of sequence
done with causing errors
loop done
0x28d021b 0x28d021b
done with forwarding of sequence
done with causing errors
loop done
0x28f0610 0x28f0610
done with forwarding of sequence
done with causing errors
loop done
0x2910a05 0x2910a05
done with forwarding of sequence
done with causing errors
loop done
And the stack output of the Qt HMI application is as follows:
[local@bues3-sun724 ~]$ pstack 14821
Thread 3 (Thread 0x7f3373067700 (LWP 14823)):
#0 0x00007f33837b520d in poll () from /lib64/libc.so.6
#1 0x00007f337c054082 in _xcb_conn_wait () from /lib64/libxcb.so.1
#2 0x00007f337c055e6f in xcb_wait_for_event () from /lib64/libxcb.so.1
#3 0x00007f3375827e19 in QXcbEventReader::run() () from /lib64/libQt5XcbQpa.so.5
#4 0x00007f3384513e71 in QThreadPrivate::start(void*) () from /lib64/libQt5Core.so.5
#5 0x00007f3383faedd5 in start_thread () from /lib64/libpthread.so.0
#6 0x00007f33837bfead in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f3370a23700 (LWP 14824)):
#0 0x00007f33837b520d in poll () from /lib64/libc.so.6
#1 0x00007f337f8adc4c in g_main_context_iterate.isra.22 () from /lib64/libglib-2.0.so.0
#2 0x00007f337f8add7c in g_main_context_iteration () from /lib64/libglib-2.0.so.0
Looks like the real problem is that the recursive _XReply() handles requests in the wrong order, and that causes the lockup. dequeue_pending_request() suggests that requests should be handled in LIFO order and not FIFO.
I don't quite understand why _XErrorFunction() should be called from _XError() with the display unlocked. If the error handler tries to issue a request, then it should lock up by itself, like it happens in the case of the sync that occurs on the display's re-locking after _XErrorFunction() completes.
Thus I suppose that this should work for everyone:
Although, maybe it won't hurt to leave user_lock_display untouched. I'm not very familiar with that code and don't really know how the locking works in libX11; it would be nice if somebody more familiar with it all could take a look.
Interesting.. I just spotted InternalLockDisplay(), and the comment says it's useful for _XReply(). It's the same as LockDisplay() but without syncing :)
Oh, wait! The user's lock is actually taken and the internal lock is unlocked before invoking _XErrorFunction. So it looks like #93 (comment 257470) should be correct.
Ah, although it could be that the idea of having _XErrorFunction run unlocked is simply to allow other threads to do something while the error is handled. In that case everything should be okay.
@sparaddi Please let me know the final results of the testing. I'll also test it more thoroughly and then make a proper patch once the testing is done.
Additional testing has not resulted in the freeze so far. I will continue to test and update the results.
Note: before the fix #93 (comment 257470), during testing I observed that the Qt application was not responding to touch events (not sure whether the touch events were not being recognized, or processing was happening in the graphics thread), but I observed the following in the libX11 log:
_XReply: error->error_code = 0x9
The libX11 code that was instrumented for this was as follows:
sprintf(temp_string, "_XReply: error->error_code = 0x%x", error->error_code);
printf("%s\n", temp_string);
fflush(stdout);
/* do not die on "no such font", "can't allocate", "can't grab" failures */
switch (error->error_code) {
Can you please elaborate on what that error 9 means?
Also before this fix, it was observed that the Qt application, when left idle for a few days, had become very slow to respond to touch events. Again, I am not sure where the problem lies, but I thought to have it documented here, in case these symptoms are related to the changes being made. This use case will also be tested further after the fix #93 (comment 257470).
There is no need to take the user locks because they are already taken, and thus it's a no-op, since _XDisplayLockWait() checks whether the lock is held by the same thread and bails out if that's the case.
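As a rough standalone illustration of that same-thread bail-out (the struct, field, and function names are invented for the sketch; this is not libX11's actual _XDisplayLockWait):

#include <assert.h>
#include <pthread.h>
#include <stdio.h>

/* Hypothetical lock record tracking which thread currently owns it. */
struct display_lock {
    pthread_t owner;
    int       held;
};

static void lock_wait(struct display_lock *l)
{
    if (l->held && pthread_equal(l->owner, pthread_self()))
        return; /* this thread already owns the lock: no-op, no wait */
    /* ...otherwise we would block here until the owner releases it... */
    l->held = 1;
    l->owner = pthread_self();
}

int main(void)
{
    struct display_lock l = {0};
    lock_wait(&l); /* first acquisition */
    lock_wait(&l); /* re-entry from the same thread returns at once */
    assert(l.held == 1);
    puts("ok");
    return 0;
}

This is why taking the lock again on the error path doesn't deadlock by itself: the deadlock in this issue comes from the reply-waiter condition wait, not from the display lock.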
Is the following call to ConditionWait() in _XReply() required in the case where there is only one Qt application, with a single thread using the Display, on the whole computer?
if (req != current && req->reply_waiter)
{
    ConditionWait(dpy, dpy->xcb->reply_notify);
}
Basically, I am trying to ask: if the above code is removed from _XReply(), could an application freeze still occur in the use case where a single Qt application, with only one of its threads, is using the display?