iris: fix race condition during busy tracking

Paulo Zanoni requested to merge pzanoni/mesa:iris-bo-deps-lock-fix into main

The Iris code that deals with implicit tracking is protected by bufmgr->bo_deps_lock. Before this patch, we take this lock in update_batch_syncobjs() but release it before we actually submit the batch through the execbuf ioctl. This can lead to the following race condition:

  • Context C1 generates a batch B1 that signals syncobj S1.
  • Context C2 generates a batch B2 that depends on something that B1 from C1 is using, so we mark B2 as having to wait on syncobj S1.
  • C2 calls submit_batch() before C1 does.
  • The kernel detects that it was told to wait on syncobj S1, which was never even submitted, so it returns EINVAL to the execbuf ioctl.
  • We run abort() at the end of _iris_batch_flush().
    • If DEBUG is defined, we also print: iris: Failed to submit batchbuffer: Invalid argument
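
The interleaving above can be replayed deterministically in a small stand-alone model. This is only a sketch and not Mesa code: every name prefixed with race_ is hypothetical, and the fake execbuf simply encodes the behavior described in the list (waiting on a syncobj that was never submitted fails with EINVAL).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum { RACE_MAX_SYNCOBJS = 4 };

/* Hypothetical stand-in for kernel state: which syncobjs were submitted. */
struct race_kernel {
    bool submitted[RACE_MAX_SYNCOBJS];
};

/* Hypothetical batch: signals one syncobj, optionally waits on another. */
struct race_batch {
    int signal_syncobj;
    int wait_syncobj;   /* -1 means "no dependency" */
};

/* Models the execbuf ioctl as described above: waiting on a syncobj
 * that has not been submitted yet is rejected with -EINVAL. */
static int race_execbuf(struct race_kernel *k, const struct race_batch *b)
{
    assert(b->signal_syncobj >= 0 && b->signal_syncobj < RACE_MAX_SYNCOBJS);
    assert(b->wait_syncobj < RACE_MAX_SYNCOBJS);

    if (b->wait_syncobj >= 0 && !k->submitted[b->wait_syncobj])
        return -EINVAL;

    k->submitted[b->signal_syncobj] = true;
    return 0;
}

/* Replays the problematic ordering: dependencies are computed for B1
 * and then B2, but B2 reaches the kernel first.
 * Returns the result of C2's submission. */
static int race_replay_bad_interleaving(void)
{
    struct race_kernel k = { { false } };

    /* C1 builds B1, which signals S1. */
    struct race_batch b1 = { .signal_syncobj = 1, .wait_syncobj = -1 };
    /* C2 builds B2, which was marked as waiting on S1. */
    struct race_batch b2 = { .signal_syncobj = 2, .wait_syncobj = 1 };

    /* C2 submits before C1 does. */
    int ret_c2 = race_execbuf(&k, &b2);
    /* C1 submits afterwards, too late to help B2. */
    int ret_c1 = race_execbuf(&k, &b1);
    (void)ret_c1;

    return ret_c2;
}
```

In this model race_replay_bad_interleaving() returns -EINVAL, which corresponds to the failure that makes _iris_batch_flush() abort.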

I couldn't figure out a way to reproduce this issue with real workloads, but I was able to write a small reproducer to trigger this. Basically it's a little GL program that has lots of contexts running in different threads submitting compute shaders that keep using the same SSBOs. I'll submit this as a piglit test.

The solution itself is quite simple: just keep bo_deps_lock held all the way from update_batch_syncobjs() until ioctl(). In order to make that easier we just call update_batch_syncobjs() a little later. We have to drop the lock as soon as the ioctl returns because removing the references on the buffers would trigger other functions to try to grab the lock again, leading to deadlocks.
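
A minimal sketch of that locking pattern, assuming a simplified model rather than the actual Iris code: the model_ structs and functions are hypothetical, only the name bo_deps_lock and the ordering (update the dependencies, submit, unlock, then drop references) come from the description above.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

enum { MODEL_MAX_SYNCOBJS = 8 };

/* Hypothetical buffer manager: one shared buffer plus fake kernel state. */
struct model_bufmgr {
    pthread_mutex_t bo_deps_lock;
    bool submitted[MODEL_MAX_SYNCOBJS]; /* syncobjs the "kernel" has seen */
    int last_user_syncobj;              /* last batch using the buffer, -1 if none */
};

struct model_batch {
    int signal_syncobj;
    int wait_syncobj;                   /* -1 means "no dependency" */
};

static void model_bufmgr_init(struct model_bufmgr *m)
{
    pthread_mutex_init(&m->bo_deps_lock, NULL);
    for (int i = 0; i < MODEL_MAX_SYNCOBJS; i++)
        m->submitted[i] = false;
    m->last_user_syncobj = -1;
}

/* Stand-in for the execbuf ioctl: rejects waits on unsubmitted syncobjs. */
static int model_execbuf(struct model_bufmgr *m, const struct model_batch *b)
{
    if (b->wait_syncobj >= 0 && !m->submitted[b->wait_syncobj])
        return -EINVAL;
    m->submitted[b->signal_syncobj] = true;
    return 0;
}

/* Stand-in for update_batch_syncobjs(); caller must hold bo_deps_lock. */
static void model_update_batch_syncobjs(struct model_bufmgr *m,
                                        struct model_batch *b)
{
    b->wait_syncobj = m->last_user_syncobj;
    m->last_user_syncobj = b->signal_syncobj;
}

static int model_submit_batch(struct model_bufmgr *m, struct model_batch *b)
{
    assert(b->signal_syncobj >= 0 && b->signal_syncobj < MODEL_MAX_SYNCOBJS);

    /* Hold the lock from the dependency update until the submission
     * returns, so no other context can submit a batch that waits on
     * our syncobj before we have submitted it ourselves. */
    pthread_mutex_lock(&m->bo_deps_lock);
    model_update_batch_syncobjs(m, b);
    int ret = model_execbuf(m, b);
    pthread_mutex_unlock(&m->bo_deps_lock);

    /* Buffer references would be dropped here, after the unlock,
     * because the unreference path may take bo_deps_lock again. */
    return ret;
}
```

Because recording a dependency and submitting the batch form a single critical section in this model, any batch that ends up waiting on a syncobj was necessarily processed after the batch that signals it.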

Thanks to Kenneth Graunke for pointing out this issue.

Cc: mesa-stable
Fixes: 89a34cb8 ("iris: switch to explicit busy tracking")
Signed-off-by: Paulo Zanoni
