Random failures in piglit tests
For a long time I've been observing random failures in piglit tests that are very difficult to reproduce. On a given piglit run with a few thousand tests, one or more will fail at random. If I run the same suite multiple times, without recompiling mesa or changing anything else, and regardless of whether the board is rebooted between runs, a different set of tests fails each time. The failing tests include both vs- and fs-focused tests (which use a trivial fs or vs shader, respectively), so it doesn't look like an issue in either ppir or gpir.
This is what running piglit 20 times in a loop looks like on my Pine64 (search for "fail"):
Result summary over the 20 runs (unstable1 … unstable20), changes only:

    all                               1152/1334 (1151/1334 in runs 4, 8, 11, 13, 14, 15)
      spec                            1152/1334 (1151/1334 in runs 4, 8, 11, 13, 14, 15)
        glsl-1.10                     1152/1334 (1151/1334 in runs 4, 8, 11, 13, 14, 15)
          execution                   1105/1286 (1104/1286 in runs 4, 8, 13, 14, 15)
            built-in-functions        990/1098 (989/1098 in runs 4, 13, 14, 15)
              fs-length-float         pass (fail in run 14)
              fs-op-uplus-vec4        pass (fail in run 13)
              vs-op-add-vec3-float    pass (fail in run 15)
              vs-op-uplus-ivec2       pass (fail in run 4)
            samplers                  9/35 (8/35 in run 8)
              uniform-structs         pass (fail in run 8)
          linker                      31/31 (30/31 in run 11)
            override-builtin-uniform-02  pass (fail in run 11)
I have tried multiple things to debug it:
- Comparing LIMA_DEBUG logs from a failing run and a passing run: they look exactly the same (e.g. the input shaders and the compiled shaders are identical).
- Running piglit tests in parallel or sequentially: no difference.
- Running different kernels going back to 5.2: apparently no difference.
- Running different mesa versions from 19.1 to current master: apparently no difference.
- Running all tests under valgrind and comparing the valgrind output of a randomly failing test with that of successful runs: no diff in the valgrind results.
- Running all tests with the command stream dumped to stdout and comparing the dumps of a randomly failing test with those of successful runs: no diff.
- Running all tests with the ppir spilling stack fix applied (that bug could have been corrupting the tests that followed): apparently no difference.
- Locally fixing static analysis reports about uninitialized variables in corner cases that aren't supposed to happen: apparently no difference.
My last attempt was to hack shader_runner to output a png image of the result after every probe, on every shader_runner test, so I could see what the buffer looks like when a probe fails. Running piglit a few times and reproducing the issue, the dumped image looks correct. So I suspect it has something to do with glReadPixels at the time the probe commands execute.
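For illustration, this is roughly the kind of hack I mean. It is a hypothetical sketch rather than the actual shader_runner patch: dump_probe_result() and write_png() are made-up names standing in for the real code and for whatever image writer is handy.

    #include <stdio.h>
    #include <stdlib.h>
    #include <GL/gl.h>

    /* Hypothetical helper (names are made up): after a probe executes, read
     * the framebuffer back with glReadPixels and dump it to disk, so the
     * buffer contents can be inspected whenever a probe fails. */
    static void
    dump_probe_result(const char *test_name, int probe_index, int width, int height)
    {
       GLubyte *pixels = malloc((size_t)width * height * 4);
       char filename[1024];

       /* Read the whole framebuffer back, just like a probe would. */
       glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, pixels);

       snprintf(filename, sizeof(filename), "%s-probe-%03d.png",
                test_name, probe_index);
       write_png(filename, pixels, width, height); /* stand-in image writer */

       free(pixels);
    }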
Adding a sleep(1) in lima_transfer_map, right after lima_bo_wait, makes the problem go away: I finally seem to be unable to reproduce the random failures anymore.
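For reference, the hack looks roughly like this. This is a minimal sketch rather than the actual lima_resource.c code: the function arguments, the bo variable and the lima_bo_wait() argument list are simplified, and everything not relevant to the hack is elided.

    #include <unistd.h> /* sleep() */

    /* Simplified sketch of the hacked lima_transfer_map(); the real function
     * lives in the lima gallium driver and takes the usual transfer_map
     * arguments, which are elided here. */
    static void *
    lima_transfer_map(/* ... */)
    {
       /* ... look up the BO backing the resource being mapped ... */

       /* Wait for the GPU to finish writing the BO before the CPU maps it
        * for read-back (argument list simplified, effectively infinite timeout). */
       lima_bo_wait(bo, LIMA_GEM_WAIT_WRITE, ~0ull);

       /* Debugging hack: stall for a second after the wait; with this in
        * place the random probe failures no longer reproduce. */
       sleep(1);

       /* ... map the BO and return the CPU pointer ... */
    }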
The same 20-run loop as above, with just that sleep(1) added:
Result summary
No changes
Maybe we need some additional synchronization in either lima_bo_wait or lima_transfer_map before reading the buffer?