Godot 4.X (engine, editor, as well as released apps) and other applications crash with libX11-1.8.7 (or later). Downgrading to libX11-1.8.4 seems to work around the problem. Tested on Fedora 38 and 39; also reported on Ubuntu 23.04.
Reported to impact RustDesk as well, as mentioned here:
[xcb] Unknown sequence number while processing queue
[xcb] You called XInitThreads, this is not your fault
[xcb] Aborting, sorry about that.
godot.linuxbsd.editor.x86_64: xcb_io.c:278: poll_for_event: Assertion `!xcb_xlib_threads_sequence_lost' failed.
This is quite a serious problem: it prevents all (new and old) Godot-based apps from running on systems with libX11-1.8.7 or later. Because Fedora 39 and other similarly recent distros no longer support earlier versions of libX11, this keeps a lot of apps from running on many Linux systems.
Hello, Godot maintainer here. We still get regular bug reports about this, and it doesn't seem to be fixed in the latest libX11, 1.8.9.
Every version since 1.8.3 seems to have exposed new, seemingly random threading-related crashes.
Here are three Fedora automated crash reports for Godot, all crashing in libX11 code. I would really appreciate some input from upstream maintainers to understand whether this is something we need to work around in Godot (and how), or whether this will eventually be fixed upstream.
Hi, Godot user here on Arch Linux. I'm able to reproduce this fairly reliably (within a few minutes), but only with the official Arch Linux libx11 binary package (version 1.8.9). If I build that package myself, using the exact same PKGBUILD recipe, the problem disappears, so I have to rely on gdb and can't add any kind of logging.
The root cause seems to be a lack of locking somewhere.
The queue head pointer dpy->xcb->pending_requests is read into a local variable req, and there is no code that modifies the req pointer in the meantime. Then, if there is actually a pending request and some other conditions hold, the pending request is dequeued:
dequeue_pending_request(dpy, req);
And the first thing that function does is check the assertion that fails here:
if (req != dpy->xcb->pending_requests)
    throw_thread_fail_assert("Unknown request in queue while "
                             "dequeuing", xcb_xlib_unknown_req_in_deq);
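For context, here is roughly what the rest of that function does (a simplified sketch based on my reading, not the verbatim libX11 source): after the assertion it unlinks the head of the queue and frees it, which matters for the observation below.

static void dequeue_pending_request(Display *dpy, PendingRequest *req)
{
    if (req != dpy->xcb->pending_requests)
        throw_thread_fail_assert("Unknown request in queue while "
                                 "dequeuing", xcb_xlib_unknown_req_in_deq);
    /* unlink the head of the pending-request queue */
    dpy->xcb->pending_requests = req->next;
    if (!req->next)
        dpy->xcb->pending_requests_tail = &dpy->xcb->pending_requests;
    /* the node is released here, so a stale req ends up pointing at freed memory */
    free(req);
}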
Since req is a local variable and hasn't been changed, this must mean that dpy->xcb->pending_requests has been changed in the meantime. The culprit must have been either some invalid memory access on the same thread, or a race condition from a different thread. My money is on the latter. (It could theoretically also have been some callback that performed a reentrant libx11 call, but I don't see any place where callbacks are invoked here; also, it would imply a lack of locking somewhere, same as a threading issue.)
Indeed, the debugger shows values in *req that are clearly bogus, suggesting that it's been freed and the memory has been overwritten:
It should be noted that we are in an XFlush() call, which is a critical section: it calls LockDisplay() at the start and UnlockDisplay() at the end. So if this is a threading issue, we'd want to look for places that modify pending_requests without holding that lock.
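For reference, that critical section looks roughly like this (simplified, not the verbatim libX11 source):

int XFlush(Display *dpy)
{
    LockDisplay(dpy);      /* take the Display mutex */
    _XFlush(dpy);          /* internal flush; the dequeue above happens somewhere below this */
    UnlockDisplay(dpy);    /* release it */
    return 1;
}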
There are only two such places that matter: append_pending_request and dequeue_pending_request. So I set a conditional breakpoint on both, with the condition dpy->lock->mutex->__data->__owner == 0 (relying on some pthreads internals to detect when the mutex is not held; the exact gdb commands are shown after the trace below). After a few minutes, the breakpoint is hit, yielding the following stack trace:
#0  dequeue_pending_request (dpy=dpy@entry=0x55555cd1fde0, req=req@entry=0x55556a1df6f0) at /usr/src/debug/libx11/libX11-1.8.9/src/xcb_io.c:174
#1  0x00007ffff7103343 in _XReply (dpy=0x55555cd1fde0, rep=0x7fffffffdb00, extra=0, discard=0) at /usr/src/debug/libx11/libX11-1.8.9/src/xcb_io.c:736
#2  0x00007ffff70e40f4 in XGetWindowProperty (dpy=0x55555cd1fde0, w=25165826, property=372, offset=0, length=32, delete=<optimized out>, req_type=4, actual_type=0x7fffffffdbb8, actual_format=0x7fffffffdbb4, nitems=0x7fffffffdbc0, bytesafter=0x7fffffffdbc8, prop=0x7fffffffdbd0) at /usr/src/debug/libx11/libX11-1.8.9/src/GetProp.c:69
#3  0x0000555555af1360 in DisplayServerX11::_window_minimize_check (this=this@entry=0x55555ccfc9f0, p_window=p_window@entry=0) at platform/linuxbsd/x11/display_server_x11.cpp:2375
#4  0x0000555555af167f in DisplayServerX11::window_get_mode (this=0x55555ccfc9f0, p_window=0) at platform/linuxbsd/x11/display_server_x11.cpp:2705
#5  0x0000555555aeba48 in DisplayServerX11::can_any_window_draw (this=0x55555ccfc9f0) at platform/linuxbsd/x11/display_server_x11.cpp:2912
#6  0x0000555555b45426 in Main::iteration () at main/main.cpp:3685
#7  0x0000555555ad7311 in OS_LinuxBSD::run (this=this@entry=0x7fffffffddb0) at platform/linuxbsd/os_linuxbsd.cpp:958
#8  0x0000555555ac5176 in main (argc=<optimized out>, argv=0x7fffffffe398) at platform/linuxbsd/godot_linuxbsd.cpp:74
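For anyone trying to reproduce this, the breakpoints mentioned above were set roughly like this (the __data/__owner path relies on glibc internals and may differ on other systems):

(gdb) break append_pending_request if dpy->lock->mutex->__data->__owner == 0
(gdb) break dequeue_pending_request if dpy->lock->mutex->__data->__owner == 0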
When continuing the program after the breakpoint is hit, it sometimes immediately aborts with the aforementioned message Unknown request in queue while dequeuing, but sometimes the breakpoint is triggered a second time before the abort actually happens.
The API function XGetWindowProperty called from Godot does lock the mutex, but _XReply transiently unlocks it for a while. And apparently, by the time dequeue_pending_request is called here, the mutex is somehow not locked.
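My mental model of that pattern, heavily simplified and not the verbatim libX11 source, is something like this:

Status _XReply(Display *dpy, xReply *rep, int extra, Bool discard)
{
    /* the caller (here XGetWindowProperty) has already taken the Display lock */
    PendingRequest *req = dpy->xcb->pending_requests;
    /* ... */
    UnlockDisplay(dpy);   /* lock transiently dropped while waiting for the reply */
    /* ... block in xcb until the server's reply arrives ... */
    LockDisplay(dpy);     /* re-acquired (through an internal variant in the real code) */
    /* ... */
    dequeue_pending_request(dpy, req);   /* frames #1 -> #0 in the trace above */
    /* ... */
}

If that picture is right, the question is how we can reach the dequeue with the mutex not held: either the re-lock is skipped on some path, or something unlocks the Display again in between.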
This is as far as I got. I tried setting more breakpoints and dprintfs in _XReply to find out where exactly the lock is lost, but these seem to interfere with my ability to trigger the crash.
I hope some hero from the libx11 team can figure out what's going on here!