Random failures in piglit tests
For a long time I've been observing random failures in piglit tests that are very difficult to reproduce. On a given piglit run with a few thousand tests, one or more will fail at random. If I run the same suite multiple times, without recompiling mesa or changing anything else, and regardless of whether the board is rebooted between runs, a different set of tests fails each time. The failing tests include both vs- and fs-focused tests (which use a trivial fs or vs shader, respectively), so it doesn't look like an issue in either ppir or gpir.
This is what running piglit 20 times in a loop looks like on my Pine64 (search for "fail"):
Result summary over the 20 runs (unstable1 … unstable20), changes only:

    all                               1152/1334 (1151/1334 in runs 4, 8, 11, 13, 14, 15)
      spec                            1152/1334 (1151/1334 in runs 4, 8, 11, 13, 14, 15)
        glsl-1.10                     1152/1334 (1151/1334 in runs 4, 8, 11, 13, 14, 15)
          execution                   1105/1286 (1104/1286 in runs 4, 8, 13, 14, 15)
            built-in-functions        990/1098 (989/1098 in runs 4, 13, 14, 15)
              fs-length-float         pass (fail in run 14)
              fs-op-uplus-vec4        pass (fail in run 13)
              vs-op-add-vec3-float    pass (fail in run 15)
              vs-op-uplus-ivec2       pass (fail in run 4)
            samplers                  9/35 (8/35 in run 8)
              uniform-structs         pass (fail in run 8)
          linker                      31/31 (30/31 in run 11)
            override-builtin-uniform-02  pass (fail in run 11)
I have tried multiple things to debug it:
- Comparing LIMA_DEBUG logs from a failing run and a passing run: they look exactly the same (e.g. the input shaders and the compiled shaders are identical).
- Running piglit tests in parallel or sequentially: no difference.
- Running different kernels going back to 5.2: apparently no difference.
- Running different mesa versions from 19.1 to current master: apparently no difference.
- Running all tests under valgrind and comparing the valgrind output of a randomly failing test with that of successful runs: no diff in the valgrind results.
- Running all tests with the command stream dumped to stdout and comparing the dumps of a randomly failing test with those of successful runs: no diff.
- Running all tests with the ppir spilling stack fix applied (that bug could have been corrupting the tests that followed): apparently no difference.
- Locally fixing static analysis reports about uninitialized variables in corner cases that aren't supposed to happen: apparently no difference.
My last attempt was to hack shader_runner to output a png image of the result after every probe, on every shader_runner test, so I could see what the buffer looks like when a probe fails. Running piglit a few times and reproducing the issue, the dumped image looks correct. So I suspect it has something to do with glReadPixels at the time the probe commands execute.
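For illustration, this is roughly the kind of hack I mean. It is a hypothetical sketch rather than the actual shader_runner patch: dump_probe_result() and write_png() are made-up names standing in for the real code and for whatever image writer is handy.

    #include <stdio.h>
    #include <stdlib.h>
    #include <GL/gl.h>

    /* Hypothetical helper (names are made up): after a probe executes, read
     * the framebuffer back with glReadPixels and dump it to disk, so the
     * buffer contents can be inspected whenever a probe fails. */
    static void
    dump_probe_result(const char *test_name, int probe_index, int width, int height)
    {
       GLubyte *pixels = malloc((size_t)width * height * 4);
       char filename[1024];

       /* Read the whole framebuffer back, just like a probe would. */
       glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, pixels);

       snprintf(filename, sizeof(filename), "%s-probe-%03d.png",
                test_name, probe_index);
       write_png(filename, pixels, width, height); /* stand-in image writer */

       free(pixels);
    }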
Adding a sleep(1) in lima_transfer_map, right after lima_bo_wait, makes the problem go away: I finally seem to be unable to reproduce the random failures anymore.
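For reference, the hack looks roughly like this. This is a minimal sketch rather than the actual lima_resource.c code: the function arguments, the bo variable and the lima_bo_wait() argument list are simplified, and everything not relevant to the hack is elided.

    #include <unistd.h> /* sleep() */

    /* Simplified sketch of the hacked lima_transfer_map(); the real function
     * lives in the lima gallium driver and takes the usual transfer_map
     * arguments, which are elided here. */
    static void *
    lima_transfer_map(/* ... */)
    {
       /* ... look up the BO backing the resource being mapped ... */

       /* Wait for the GPU to finish writing the BO before the CPU maps it
        * for read-back (argument list simplified, effectively infinite timeout). */
       lima_bo_wait(bo, LIMA_GEM_WAIT_WRITE, ~0ull);

       /* Debugging hack: stall for a second after the wait; with this in
        * place the random probe failures no longer reproduce. */
       sleep(1);

       /* ... map the BO and return the CPU pointer ... */
    }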
The same 20-run loop as above, with just that sleep(1) added:
Result summary
No changes
Maybe we need some additional synchronization in either lima_bo_wait or lima_transfer_map before reading the buffer?