gallium: implement async pbo shader compiles
this solves two problems
Problem 1: compile speed
the compute-based pbo download shaders are much faster than the fragment-based shaders in some cases, and much faster than the cpu fallback in many cases, but they have always had a huge drawback: generating the nir for the shaders and then compiling them can take several seconds because of how long some of the basic nir passes take
to avoid this, I've finally gotten around to implementing the following handling:
- add a pipe_screen method for punting nir creation to the driver's shader compiler thread
- utilize parallel shader compile functionality if available
combined, there is no longer even the slightest stutter when these ubershaders are used, and unit tests which previously would hang while compiling the shaders to do a single download now just use the cpu fallback path and complete instantly
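as a rough illustration of the behavior described above (not the actual code in this commit), the download path can be picked by checking whether the background compile has finished, and falling back to the cpu path if it hasn't. all names here (async_shader, pick_download_path, etc.) are hypothetical, and the ready flag stands in for whatever the driver's compiler thread would signal:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bookkeeping for one asynchronously compiled shader. */
struct async_shader {
   atomic_bool ready; /* set by the compiler thread once compilation is done */
};

enum download_path {
   PATH_CPU_FALLBACK,
   PATH_COMPUTE_SHADER,
};

/* Would be called from the compiler thread when the shader is usable. */
static void
async_shader_mark_ready(struct async_shader *s)
{
   atomic_store_explicit(&s->ready, true, memory_order_release);
}

/* Called on the download path: never blocks waiting for the compile. */
static enum download_path
pick_download_path(struct async_shader *s)
{
   if (atomic_load_explicit(&s->ready, memory_order_acquire))
      return PATH_COMPUTE_SHADER;
   return PATH_CPU_FALLBACK;
}
```

the point of the sketch is only that the caller never waits: a one-off download completes via the cpu fallback, and later downloads pick up the compute shader once it exists.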
Problem 2: execution speed
in cases (like CTS) where many, many different types of pbo download occur only a few times, it's desirable to have fewer shader variants that can handle many download operations in order to keep compile times down, thus speeding up runtimes
in cases (like games) where only 1-2 types of specialized pbo download occur very frequently, however, it's much better to have the most specialized shader possible so that optimizations can speed up the downloads even further, as the ubershaders have thousands upon thousands of ssa values along with dozens of branching conditionals
thus, I've implemented ubershader specialization, where initially the download ubershader will be created as usual, but then if it's used a certain number of times for a specific download (currently 5), a specialized variant will be asynchronously compiled and then utilized for future download operations
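the specialization policy can be sketched roughly as follows. this is a minimal illustration under assumptions, not the code in this commit: the struct, enum, and function names are hypothetical, and it assumes the async compile is kicked off on the 5th use of a given download type (the message only says "a certain number of times ... currently 5"):

```c
#include <stdbool.h>
#include <stdint.h>

/* Threshold taken from the commit message; the exact trigger is assumed. */
#define SPECIALIZE_THRESHOLD 5

enum variant_state {
   VARIANT_NONE,      /* no specialized variant requested yet */
   VARIANT_COMPILING, /* async compile in flight */
   VARIANT_READY,     /* specialized variant can be used */
};

/* Hypothetical per-download-type bookkeeping. */
struct download_entry {
   uint32_t use_count;
   enum variant_state state;
};

enum shader_choice {
   USE_UBERSHADER,
   USE_SPECIALIZED,
};

/* Decide which shader serves this download. Sets *start_async_compile
 * exactly once, when the use count first reaches the threshold. */
static enum shader_choice
choose_download_shader(struct download_entry *e, bool *start_async_compile)
{
   *start_async_compile = false;

   if (e->state == VARIANT_READY)
      return USE_SPECIALIZED;

   e->use_count++;
   if (e->state == VARIANT_NONE && e->use_count >= SPECIALIZE_THRESHOLD) {
      e->state = VARIANT_COMPILING;
      *start_async_compile = true;
   }

   /* Until the specialized variant is ready, keep using the ubershader. */
   return USE_UBERSHADER;
}
```

in this sketch the compiler thread would flip the state to VARIANT_READY when it finishes, so rarely-used download types (the CTS case) never pay for a specialized compile, while hot ones (the games case) switch over after a few uses.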