Legacy GUI element, forwarded via X2Go, causes X server crash
Accessing a particular application remotely with X2Go (a scientific app called ncview, which uses the legacy Xaw toolkit rather than GTK+ or Qt) and then clicking a certain button causes the entire X11 server to crash. It is hard to describe, so please see this representation in animated GIF form. Note that the crash seems to happen when the mouse button is pressed, not when it is released; in fact, if one reconnects with X2Go, the "Set range" window is still open (whereas the "Cancel" button should have closed it).
Here is a backtrace, with a link to the full X server log below:
(EE) Backtrace:
(EE) 0: /usr/bin/X (xorg_backtrace+0x55) [0x55a7b7f59165]
(EE) 1: /usr/bin/X (0x55a7b7da8000+0x1b4de9) [0x55a7b7f5cde9]
(EE) 2: /lib64/libpthread.so.0 (0x7f467110f000+0xf5d0) [0x7f467111e5d0]
(EE) 3: /usr/bin/X (miHandleValidateExposures+0x29) [0x55a7b7f51989]
(EE) 4: /usr/bin/X (miMoveWindow+0x1c5) [0x55a7b7f51c15]
(EE) 5: /usr/bin/X (0x55a7b7da8000+0xe448c) [0x55a7b7e8c48c]
(EE) 6: /usr/bin/X (ConfigureWindow+0x5be) [0x55a7b7e2f1ae]
(EE) 7: /usr/bin/X (0x55a7b7da8000+0x569b8) [0x55a7b7dfe9b8]
(EE) 8: /usr/bin/X (0x55a7b7da8000+0x5c35b) [0x55a7b7e0435b]
(EE) 9: /usr/bin/X (0x55a7b7da8000+0x603aa) [0x55a7b7e083aa]
(EE) 10: /lib64/libc.so.6 (__libc_start_main+0xf5) [0x7f4670d643d5]
(EE) 11: /usr/bin/X (0x55a7b7da8000+0x4a4ce) [0x55a7b7df24ce]
(EE)
(EE) Segmentation fault at address 0x0
And a truncated but detailed one based on my adventures with gdb (more on this later):
0 0x00007f945b7d7207 in raise () from /lib64/libc.so.6
1 0x00007f945b7d88f8 in abort () from /lib64/libc.so.6
2 0x00005592f1c83bfa in OsAbort () at utils.c:1350
3 0x00005592f1c89793 in AbortServer () at log.c:877
4 0x00005592f1c8a5dd in FatalError (f=f@entry=0x5592f1cba9d0 "Caught signal %d (%s). Server aborting\n") at log.c:1015
5 0x00005592f1c80e69 in OsSigHandler (signo=11, sip=<optimized out>, unused=<optimized out>) at osinit.c:156
6 <signal handler called>
7 0x00005592f1c759a9 in RegionNil (reg=<optimized out>) at ../include/regionstr.h:74
8 RegionNotEmpty (_pReg=0x5592f49f54a0) at ../include/regionstr.h:182
9 miHandleValidateExposures (pWin=0x5592f4925310) at miwindow.c:221
10 0x00005592f1c75c35 in miMoveWindow (pWin=0x5592f49f6e60, x=<optimized out>, y=265, pNextSib=<optimized out>, kind=VTOther) at miwindow.c:296
11 0x00005592f1bb04ac in compMoveWindow (pWin=0x5592f49f6e60, x=<optimized out>, y=<optimized out>, pSib=<optimized out>, kind=<optimized out>)
[... snip ...]
At my workplace, a number of people have been affected by the same problem. Everything up to and including the miHandleValidateExposures function call is consistent across computers with Nvidia, Intel, and AMD graphics cards, though the lower-level calls (e.g., RegionNil) vary slightly. The same problem even affects the bundled X server which is provided with TigerVNC. The problem has also appeared on the CentOS forums. CentOS 7.6 (and Red Hat Enterprise Linux, where I captured this backtrace) uses the X.org server version 1.20.1 with backports.
After analyzing core dumps, I posted my findings to the CentOS bug linked above, but I knew it would be more helpful to pursue a live debugging session. Although I had trouble launching Xorg proper with gdb, I was able to do so with TigerVNC's Xvnc server. Given the identical crash patterns, I believe that my results are valid for both implementations.
As noted previously, the miHandleValidateExposures function is always implicated, but I believe that the bug occurs long before it is called. This is my (admittedly limited) understanding of the call path which leads to my crash:
- the incoming X11 request is of this type, with value-mask = 0x10 for changing the border-width of an X window (although I had some trouble understanding the protocol document and could be mistaken)
- the key argument to this function is pWin of type WindowPtr, which I will call the problem window
- the problem window has several children in a nested-linked-list structure (accessible by firstChild and nextSib pointers)
- all of the remaining functions in this sequence are called from, and return back to, this function
- this has something to do with figuring out which windows are on top of other windows
- the problem window and every one of its descendants each has its own _Validate union, comprising a before structure and an after structure — by the time this function finishes, every window's _Validate has been created with before semantics
- this function, and the ones it calls, include code which could be accessing the after semantics of the _Validate union described previously
- however, strangely, the _Validate unions for the problem window and its descendants do not seem to be modified during this; indeed, they are never modified between miMarkOverlappedWindows (when they are set using before semantics!) and the crash during miHandleValidateExposures
- this function does a depth-first search through the problem window and its descendants, and performs some operations related to "exposure regions"
- inevitably, one of these windows has a garbage pointer in its _Validate union which produces a segfault; almost all of the time it is the first child of the first child of the problem window, but occasionally it has been a different sibling or "generation"
- for reference, _Validate is defined in mi/mivalidate.h and the problematic member (which is frequently a bad pointer) is after.borderExposed.data, i.e. the data pointer of the borderExposed region inside the after structure
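To make the tree walk in the list above concrete, here is a minimal sketch of a non-recursive, depth-first traversal over firstChild and nextSib links. It is only an illustration under my own assumptions: Win and walk_tree are hypothetical names, the struct carries just the three links and an id, and the real server code does more work at each window than recording an id.

```c
#include <stddef.h>

/* Hypothetical stand-in for the server's window record: only the tree
 * links discussed above, plus an id so the visit order can be observed. */
typedef struct Win {
    int id;
    struct Win *parent;
    struct Win *firstChild;
    struct Win *nextSib;
} Win;

/* Visit root and all of its descendants depth-first (parent before
 * children) without recursion. Stores up to cap ids in out and returns
 * the total number of windows visited. */
size_t walk_tree(Win *root, int *out, size_t cap)
{
    size_t n = 0;
    Win *w = root;

    for (;;) {
        if (n < cap)
            out[n] = w->id;
        n++;

        /* Descend first, so the first child of the first child is
         * reached before any sibling of the first child. */
        if (w->firstChild) {
            w = w->firstChild;
            continue;
        }

        /* No children: climb until a sibling exists or we reach root. */
        while (!w->nextSib && w != root)
            w = w->parent;
        if (w == root)
            break;
        w = w->nextSib;
    }
    return n;
}
```

With a root that has two children, where the first child has one child of its own, the visit order is root, first child, grandchild, second child. In a walk of this shape the first child of the first child is among the earliest windows touched.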
It seems strange that the _Validate is written with before semantics and then accessed later, unmodified, using after semantics. According to Git history, none of these functions have been updated in years, which leaves me stumped. Given the unusual (but completely reliable!) steps to reproduce, I am willing to debug this further, but where should I look next?
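To illustrate what "written with before semantics, read with after semantics" means at the byte level, here is a small sketch of a union shaped like my understanding of _Validate. Everything in it is an assumption for illustration: the Box, Region, Point and Validate types are simplified stand-ins rather than the real headers, the field types are approximations, and the function names are hypothetical. It only compares bytes and never follows the pointer.

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-ins (assumed shapes, not the real X server headers). */
typedef struct { short x1, y1, x2, y2; } Box;
typedef struct { Box extents; void *data; } Region;
typedef struct { short x, y; } Point;

typedef union Validate {
    struct {
        Point   oldAbsCorner;
        Region *borderVisible;
        int     resized;
    } before;
    struct {
        Region exposed;
        Region borderExposed;
    } after;
} Validate;

/* Returns 1 if after.borderExposed.data starts at or beyond the end of
 * the bytes occupied by the before view of the union. */
int data_lies_outside_before(void)
{
    return offsetof(Validate, after.borderExposed.data)
           >= sizeof(((Validate *)0)->before);
}

/* Fill the union with a poison pattern (standing in for uninitialised
 * heap memory), write every before field, then check whether the bytes
 * of after.borderExposed.data still hold the poison. Strictly, C leaves
 * those bytes unspecified after a store to another member; common
 * compilers simply leave them untouched. */
int data_left_stale_after_before_writes(void)
{
    Validate v;
    void *poison;

    memset(&v, 0xAA, sizeof v);
    memset(&poison, 0xAA, sizeof poison);

    v.before.oldAbsCorner.x = 0;
    v.before.oldAbsCorner.y = 0;
    v.before.borderVisible = NULL;
    v.before.resized = 0;

    /* Compare representations only; do not dereference the pointer. */
    return memcmp(&v.after.borderExposed.data, &poison, sizeof poison) == 0;
}
```

If the real union has a similar layout, filling in the before fields would not give after.borderExposed.data a meaningful value. Something between the two steps would have to convert each window's record to the after form, which is the step I cannot find happening for the affected windows.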