llvmpipe: sampler matrix cache is slow because of mutex lock: 650% speedup by using RCU
Overview
get_sample_function in lp_texture_handle is very slow because of the simple_mtx overhead. This cost is paid once during the first pipeline execution, when the sampling functions are first jit'ed, but it's still way too much overhead.
I've implemented an RCU-like trick to remove the lock when reading the sample function cache, since it's a "mostly read" hash table. The table is still updated under the lock, but a reader only needs to load an atomic pointer, which an updating thread swaps to point at a newer version of the table. Old tables are disposed of only when the cache is cleared and no readers can possibly remain, to avoid deleting a hash table that might still be in use.
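To make the idea concrete, here is a minimal sketch of the pattern in C11, not the actual patch. All names (fn_cache, snapshot, CACHE_SLOTS, and the three functions) are hypothetical, a fixed-size array stands in for the real hash table, and copying the whole table on every update is an assumption made to keep the example short. Readers do a single atomic load; writers take the mutex, build a new snapshot, publish it, and park the old one on a retired list that is only freed when the caller knows no reader can still hold it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

#define CACHE_SLOTS 64  /* hypothetical fixed size; the real cache is a hash table */

/* One published version of the cache. Readers never see its entries change. */
struct snapshot {
    struct snapshot *next_retired;  /* links old snapshots awaiting disposal */
    void *fn[CACHE_SLOTS];          /* cached function pointers, NULL if absent */
};

struct fn_cache {
    _Atomic(struct snapshot *) current;  /* loaded by readers without a lock */
    pthread_mutex_t lock;                /* serializes writers and disposal */
    struct snapshot *retired;            /* old snapshots, freed on clear */
};

static void fn_cache_init(struct fn_cache *c)
{
    struct snapshot *s = calloc(1, sizeof(*s));
    assert(s);
    atomic_init(&c->current, s);
    pthread_mutex_init(&c->lock, NULL);
    c->retired = NULL;
}

/* Fast path: one acquire load, no lock. */
static void *fn_cache_lookup(struct fn_cache *c, unsigned key)
{
    assert(key < CACHE_SLOTS);
    struct snapshot *s = atomic_load_explicit(&c->current, memory_order_acquire);
    return s->fn[key];
}

/* Slow path: copy the table, modify the copy, publish it, retire the old one. */
static void fn_cache_insert(struct fn_cache *c, unsigned key, void *fn)
{
    assert(key < CACHE_SLOTS);
    pthread_mutex_lock(&c->lock);
    struct snapshot *old = atomic_load_explicit(&c->current, memory_order_relaxed);
    struct snapshot *fresh = malloc(sizeof(*fresh));
    assert(fresh);
    *fresh = *old;
    fresh->next_retired = NULL;
    fresh->fn[key] = fn;
    /* Release store: readers that load the new pointer also see its contents. */
    atomic_store_explicit(&c->current, fresh, memory_order_release);
    /* The old snapshot may still be in use by a reader, so do not free it here. */
    old->next_retired = c->retired;
    c->retired = old;
    pthread_mutex_unlock(&c->lock);
}

/* Call only at a point where no reader can still hold an old snapshot
 * (in the description above: when the cache is cleared).
 * Returns the number of snapshots freed. */
static unsigned fn_cache_free_retired(struct fn_cache *c)
{
    unsigned freed = 0;
    pthread_mutex_lock(&c->lock);
    while (c->retired) {
        struct snapshot *s = c->retired;
        c->retired = s->next_retired;
        free(s);
        freed++;
    }
    pthread_mutex_unlock(&c->lock);
    return freed;
}
```

The trade-off in this sketch is that every insert allocates and copies a full table, which only pays off when lookups vastly outnumber inserts, as they do once the sampling functions have been compiled.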
Test environment
MacBook Pro with M3 Pro, macOS Sonoma 14.5
OpenUSD on the feature-hgi-vulkan branch, with some public (but not yet merged) changes for macOS and Lavapipe support.
Test case
Enabling dome lighting in USDView and waiting for the change to take effect.
Results
libvulkan_lvp.dylib`get_sample_function
- simple_mtx: 752'888 samples, ~130s realtime, ~66% CPU
- rcu: 8'794 samples, ~20s realtime, ~95% CPU
sampling rate: 997 Hz
realtime speedup: 6.5x
PS: I also tried a rwlock; it's actually slower because it has more locking overhead!