Storing pointer to temporary value inside the Iris driver.
System information
- OS: Ubuntu 20.04
- GPU: Intel Corporation UHD Graphics 620
- Kernel version: Linux 5.4.0-56-generic # 62-Ubuntu SMP Mon Nov 23 19:20:19 UTC 2020 x86_64 GNU/Linux
- Mesa version: 4.6 (Compatibility Profile) Mesa 20.0.8
- Xserver version (if applicable): 1.20.8
- Desktop manager and compositor: Mate (defaults for Ubuntu Mate)
Describe the issue
I am developing a language, and while working on its graphics library I started experiencing crashes inside the file /usr/lib/x86_64-linux-gnu/dri/iris_dri.so after upgrading to Ubuntu 20.04. Given where the crash occurred, I suspected it was due to Ubuntu switching the default driver for my Intel graphics from the old i965 driver to the new iris driver. Indeed, forcing Mesa to use the old driver with MESA_LOADER_DRIVER_OVERRIDE=i965 solves the problem.
The above observation alone does not necessarily mean that the bug is not in my code; it might just as well be a concurrency bug or something similar. So I continued to investigate.
The relevant details of the environment I call OpenGL from are as follows:
- I'm using Cairo to render using their experimental OpenGL backend (as such, I don't exclude a usage error there, but it seems stable enough so far).
- The language implements its own user level threads (or green threads). As such, the runtime will allocate and deallocate stacks from time to time, even though code is technically running on the same Linux thread (I will note why this is important below).
After digging around some more, I found that the crash always occurred at the same point inside Mesa (using the debug symbols from the Ubuntu repository), and always as Cairo was calling the function glDrawArrays with the following stack trace inside Mesa:
#2 0x00007ffff7c473c0 in <signal handler called> () at /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x00007fffdd47a6ec in color_equals (a=0x7fffe596da40, b=0x7fffbcc95e60) at ../src/gallium/drivers/iris/iris_border_color.c:56
#4 0x00007fffdcdf1e62 in hash_table_search (ht=0x7fffe01d8a20, hash=hash@entry=2382506810, key=key@entry=0x7fffe596da40) at ../src/util/hash_table.c:276
#5 0x00007fffdcdf2749 in _mesa_hash_table_search_pre_hashed (ht=<optimized out>, hash=hash@entry=2382506810, key=key@entry=0x7fffe596da40) at ../src/util/hash_table.c:307
#6 0x00007fffdd47a8ea in iris_upload_border_color (ice=ice@entry=0x7fffe0029f70, color=0x7fffe596da40) at ../src/gallium/drivers/iris/iris_border_color.c:139
#7 0x00007fffdd4ab4d2 in iris_upload_sampler_states (ice=ice@entry=0x7fffe0029f70, stage=stage@entry=MESA_SHADER_FRAGMENT) at ../src/gallium/drivers/iris/iris_state.c:2060
#8 0x00007fffdd4b978b in iris_upload_dirty_render_state (draw=<optimized out>, batch=<optimized out>, ice=<optimized out>) at ../src/gallium/drivers/iris/iris_state.c:5598
#9 iris_upload_render_state (ice=<optimized out>, batch=<optimized out>, draw=<optimized out>) at ../src/gallium/drivers/iris/iris_state.c:6244
#10 0x00007fffdd7b3fef in iris_simple_draw_vbo (draw=0x7fffe596e500, ice=0x7fffe0029f70) at ../src/gallium/drivers/iris/iris_draw.c:221
#11 iris_draw_vbo (ctx=0x7fffe0029f70, info=0x7fffe596e500) at ../src/gallium/drivers/iris/iris_draw.c:270
#12 0x00007fffdd00229b in u_vbuf_draw_vbo (mgr=<optimized out>, info=<optimized out>) at ../src/gallium/auxiliary/util/u_vbuf.c:1512
#13 0x00007fffdcaa7bc7 in st_draw_vbo (ctx=<optimized out>, prims=0x7fffe596e760, nr_prims=<optimized out>, ib=0x0, index_bounds_valid=<optimized out>, min_index=<optimized out>, max_index=<optimized out>, tfb_vertcount=0x0, stream=0, indirect=0x0)
at ../src/mesa/state_tracker/st_draw.c:268
#14 0x00007fffdcce848d in _mesa_draw_arrays (drawID=0, baseInstance=<optimized out>, numInstances=<optimized out>, count=<optimized out>, start=<optimized out>, mode=<optimized out>, ctx=<optimized out>) at ../src/mesa/main/draw.c:374
#15 _mesa_draw_arrays (ctx=<optimized out>, mode=<optimized out>, start=<optimized out>, count=<optimized out>, numInstances=<optimized out>, baseInstance=<optimized out>, drawID=0) at ../src/mesa/main/draw.c:351
#16 0x00007fffdcce8547 in _mesa_DrawArrays (mode=4, start=0, count=252) at ../src/mesa/main/draw.c:531
#17 0x00007fffe780c97c in _cairo_gl_composite_draw_triangles (ctx=0x7fffe0070f00, count=252) at cairo-gl-composite.c:900
#18 _cairo_gl_composite_draw_triangles (count=252, ctx=0x7fffe0070f00) at cairo-gl-composite.c:896
#19 _cairo_gl_composite_draw_triangles_with_clip_region (count=252, ctx=0x7fffe0070f00) at cairo-gl-composite.c:932
#20 _cairo_gl_composite_flush (ctx=0x7fffe0070f00) at cairo-gl-composite.c:958
#21 0x00007fffe780d58c in _cairo_gl_composite_setup_clipping (vertex_size=16, ctx=0x7fffe0070f00, setup=0x7fffe596e900) at cairo-gl-composite.c:739
#22 _cairo_gl_composite_begin (setup=setup@entry=0x7fffe596e900, ctx_out=ctx_out@entry=0x7fffe596e8f0) at cairo-gl-composite.c:861
#23 0x00007fffe7810a83 in render_glyphs (dst=<optimized out>, dst_x=0, dst_y=dst_y@entry=0, op=<optimized out>, source=<optimized out>, info=info@entry=0x7fffe596ebc0, has_component_alpha=0x7fffe596ea94, clip=0x0) at cairo-gl-glyphs.c:294
#24 0x00007fffe7811302 in _cairo_gl_composite_glyphs_with_clip (clip=0x0, info=0x7fffe596ebc0, dst_y=0, dst_x=<optimized out>, src_y=<optimized out>, src_x=<optimized out>, _src=<optimized out>, op=<optimized out>, _dst=<optimized out>) at cairo-gl-glyphs.c:464
#25 _cairo_gl_composite_glyphs_with_clip (_dst=<optimized out>, op=<optimized out>, _src=<optimized out>, src_x=<optimized out>, src_y=<optimized out>, dst_x=<optimized out>, dst_y=0, info=0x7fffe596ebc0, clip=0x0) at cairo-gl-glyphs.c:434
#26 0x00007fffe7811378 in _cairo_gl_composite_glyphs (_dst=<optimized out>, op=<optimized out>, _src=<optimized out>, src_x=<optimized out>, src_y=<optimized out>, dst_x=<optimized out>, dst_y=0, info=0x7fffe596ebc0) at cairo-gl-glyphs.c:482
#27 0x00007fffe77d5532 in clip_and_composite
In this particular example, the issue is usually triggered when I am rendering text. Also note that line numbers here are for whatever version of the source that Ubuntu ships, which differs slightly from master.
After a while I realized that the following sequence triggers the bug:
1. One user level thread renders a frame.
2. That user level thread finishes its work and is destroyed. Thus, that stack is deallocated.
3. Another user level thread renders another frame. The crash above occurs. It is caused by a segmentation fault: Mesa attempts to read an address from the stack of the user level thread that terminated in step 2 (as these are user level threads, they will appear as the same thread to Mesa, except for a different stack pointer, which can happen anyway due to different call stack depths).
From these observations I suspect that the Iris driver is accidentally saving a pointer to a stack-allocated variable somewhere and attempts to access it at a later point. By further inspecting the stack trace and reading the code (taken from master as of yesterday, so the bug is still there), I found an issue in src/gallium/drivers/iris/iris_state.c, in the function iris_upload_sampler_states. In most cases, it takes colors (union pipe_color_union) and stores them in a hash map through the function iris_upload_border_color. These are stored by pointer in the hash map. This is fine, since the colors are stored inside some other structure that is allocated separately. However, there is logic to swizzle colors in case some special formats are used. This logic copies the color into a temporary variable allocated on the stack, and then passes a pointer to that temporary to iris_upload_border_color, which stores that pointer in the hash table. At this point, the hash table no longer works as intended, as the temporary value will be overwritten in the next iteration of the loop. This may lead to incorrect results, but will likely not result in anything worse than wasted upload slots. However, since the hash table used by iris_upload_border_color is not cleared, later lookups will try to read the stack-allocated temporary during future calls to glDrawArrays (or other entry points), which causes the crash in my case, since the previous stack has been deallocated. This also explains why I only saw the issue when drawing text: I believe the affected formats (faked A/LA) are only used by Cairo when drawing text (as it is monochrome).
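To illustrate the pattern described above outside of Mesa, here is a minimal sketch in plain C. It is not the driver's code: struct color, struct ptr_table and all function names are hypothetical stand-ins, and the toy table is a linear list rather than a real hash table. The temporary is deliberately kept alive so the example has no undefined behavior; it only shows that a key stored by pointer silently changes when the temporary is reused.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for union pipe_color_union: four float channels. */
struct color { float ch[4]; };

/* Toy table that keeps only the caller's key pointer, not a copy of the key.
 * This mirrors the behaviour described in the report, not Mesa's real table. */
#define MAX_KEYS 8
struct ptr_table {
    const struct color *keys[MAX_KEYS];
    size_t count;
};

/* Every lookup dereferences all stored pointers, so each of them must
 * still point at valid, unchanged memory. */
static bool table_contains(const struct ptr_table *t, const struct color *c)
{
    for (size_t i = 0; i < t->count; i++)
        if (memcmp(t->keys[i], c, sizeof(*c)) == 0)
            return true;
    return false;
}

/* Stores the pointer itself if no equal color is present yet. */
static bool table_insert(struct ptr_table *t, const struct color *c)
{
    if (table_contains(t, c) || t->count == MAX_KEYS)
        return false;
    t->keys[t->count++] = c;
    return true;
}

/* Inserts red through a temporary, then reuses the temporary for green.
 * Returns true if red can still be found afterwards. */
static bool red_survives_temp_reuse(void)
{
    struct ptr_table t = { .count = 0 };
    const struct color red   = { { 1.0f, 0.0f, 0.0f, 1.0f } };
    const struct color green = { { 0.0f, 1.0f, 0.0f, 1.0f } };

    struct color tmp = red;     /* plays the role of the swizzled copy */
    table_insert(&t, &tmp);
    tmp = green;                /* the next loop iteration reuses the temporary */

    return table_contains(&t, &red);
}
```

In this sketch red_survives_temp_reuse() returns false: the table still has one entry, but its key now reads as green. If the memory behind tmp had been unmapped instead of overwritten, the same lookup would fault, which matches the crash in color_equals in the stack trace.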
This is why the crash is a bit difficult to reproduce and seems sporadic. It requires the following to happen at around the same time:
- The last few states of a previous frame need to involve a texture with a faked background color (e.g. text)
- The next state needs to involve a texture with a background color that hashes to the same bucket as the problematic one from before (e.g. the same color)
- The previous stack must have been deleted between these two calls (note: the current implementation may still produce incorrect results even if this is not true).
So to summarize: iris_upload_sampler_states passes a pointer to a temporary variable to iris_upload_border_color, which saves this pointer in a hash table. This means the hash table will contain stale pointers, which usually only produces incorrect results, but crashes in my case since I do user level stack switching (I understand that user level stack switching might not be a supported use case, but the problem is still worth fixing, as the behavior is incorrect in the general case as well).
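One possible direction for a fix, sketched here in plain C and under the assumption that the table can own its keys, is to copy each color into storage that lives as long as the table and to compare against those copies. This is only an illustration: struct color, struct color_pool and pool_upload are hypothetical names, and I am not claiming this is how Mesa should or does implement it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for union pipe_color_union: four float channels. */
struct color { float ch[4]; };

/* The pool owns a copy of every color it has seen, so lookups never
 * depend on memory belonging to the caller. */
#define POOL_SIZE 8
struct color_pool {
    struct color slots[POOL_SIZE];
    size_t count;
};

/* Returns the slot index holding a color equal to *c, adding a copy if
 * needed. Returns -1 if the pool is full. */
static int pool_upload(struct color_pool *p, const struct color *c)
{
    for (size_t i = 0; i < p->count; i++)
        if (memcmp(&p->slots[i], c, sizeof(*c)) == 0)
            return (int)i;
    if (p->count == POOL_SIZE)
        return -1;
    p->slots[p->count] = *c;    /* copy by value; the caller's pointer is not kept */
    return (int)p->count++;
}

/* Uploads red and green through one reused temporary, then looks red up
 * again. Returns true if red still maps to its original slot. */
static bool same_slot_after_temp_reuse(void)
{
    struct color_pool p = { .count = 0 };
    const struct color red   = { { 1.0f, 0.0f, 0.0f, 1.0f } };
    const struct color green = { { 0.0f, 1.0f, 0.0f, 1.0f } };

    struct color tmp = red;
    int first = pool_upload(&p, &tmp);
    tmp = green;                /* reusing the temporary no longer matters */
    int second = pool_upload(&p, &tmp);
    int again = pool_upload(&p, &red);

    return first == 0 && second == 1 && again == first;
}
```

Because the pool compares against its own copies, it makes no difference whether the caller's temporary is overwritten or its stack is freed after pool_upload returns.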
Regression
As noted above, the i965 driver works without problem, pointing towards an issue in the iris driver only.
Log files as attachment
I have a stack trace provided above, and a fairly detailed breakdown of the cause of the crash.
Any extra information would be greatly appreciated
I am able to reproduce this issue fairly reliably (it is a bit timing dependent, as steps 2 and 3 need to happen in the right order, but it happens at least 9 times out of 10). The code for this is, however, quite large, as I previously failed to create a minimal reproduction (this would require porting the user level scheduler into a separate project). Hopefully, the above description of my findings is enough. Otherwise, I can provide a (large) reproduction upon request (it contains some material whose copyright status I am not sure about, which is why I am not posting it right away).