gallium: implement async pbo shader compiles

Mike Blumenkrantz requested to merge zmike/mesa:pbo-async into main

this solves two problems

Problem 1: compile speed

the compute-based pbo download shaders are much faster than the fragment-based shaders in some cases, and much faster than the cpu fallback in many cases, but they have always had a huge drawback: generating the nir for the shaders and then compiling them can take several seconds because of how long some of the basic nir passes take

to avoid this, I've finally gotten around to implementing the following handling:

  • add a pipe_screen method for punting nir creation to the driver's shader compiler thread
  • utilize parallel shader compile functionality if available

combined, there is no longer even the slightest stuttering when these ubershaders are used, and unit tests which previously would hang while compiling the shaders to do a single download now just use the cpu fallback path and complete instantly
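as a rough illustration of the "use the cpu fallback until the async compile finishes" behavior described above, here is a minimal sketch. every name in it (`async_shader`, `choose_download_path`, `compile_job_finish`) is hypothetical and not taken from the actual patches; the real code goes through a pipe_screen method and the driver's compiler thread, which this sketch only models with a flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; none of these are Mesa APIs. */
enum download_path { PATH_CPU_FALLBACK, PATH_COMPUTE_SHADER };

struct async_shader {
   atomic_bool ready; /* set by the compiler thread once the shader is usable */
   bool queued;       /* only touched by the thread issuing downloads */
};

/* Stand-in for the end of the job that the compiler thread would run. */
static void compile_job_finish(struct async_shader *shader)
{
   atomic_store_explicit(&shader->ready, true, memory_order_release);
}

/* Never blocks: returns the shader path only once the compile has finished. */
static enum download_path choose_download_path(struct async_shader *shader)
{
   assert(shader != NULL);
   if (atomic_load_explicit(&shader->ready, memory_order_acquire))
      return PATH_COMPUTE_SHADER;
   if (!shader->queued)
      shader->queued = true; /* a real driver would enqueue the compile job here */
   return PATH_CPU_FALLBACK;
}
```

the point of the design is that the download call itself never waits on the compiler: a test that does a single download takes the cpu path and returns, while a long-running app switches to the shader once it is ready.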

Problem 2: execution speed

in cases (like CTS) where many, many different types of pbo download occur only a few times, it's desirable to have fewer shader variants that can handle many download operations in order to keep compile times down, thus speeding up runtimes

in cases (like games) where only 1-2 types of specialized pbo download occur very frequently, however, it's much better to have the most specialized shader possible so that optimizations can speed up the downloads even further, since the ubershaders have thousands upon thousands of ssa values along with dozens of branching conditionals

thus, I've implemented ubershader specialization, where initially the download ubershader will be created as usual, but then if it's used a certain number of times for a specific download (currently 5), a specialized variant will be asynchronously compiled and then utilized for future download operations
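the specialization policy above can be sketched as a per-download use counter. this is only an illustration under stated assumptions: the names (`download_variant`, `pick_shader`, `SPECIALIZE_THRESHOLD`) are hypothetical, the sketch assumes the specialized compile is requested on the use that reaches the threshold, and it is single-threaded (a real implementation would need the ready flag to be set safely from the compiler thread).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The value 5 comes from the description; the macro name is hypothetical. */
#define SPECIALIZE_THRESHOLD 5

enum shader_kind { SHADER_UBER, SHADER_SPECIALIZED };

/* Hypothetical per-download-type bookkeeping. */
struct download_variant {
   unsigned use_count;        /* uses counted until specialization is requested */
   bool specialize_requested; /* async specialized compile has been queued */
   bool specialized_ready;    /* would be set when that compile completes */
};

/* Picks the shader for one download and updates the bookkeeping. */
static enum shader_kind pick_shader(struct download_variant *v)
{
   assert(v != NULL);
   if (v->specialized_ready)
      return SHADER_SPECIALIZED;
   /* Count uses only until the request has been made, then stop counting. */
   if (!v->specialize_requested && ++v->use_count >= SPECIALIZE_THRESHOLD)
      v->specialize_requested = true; /* a real driver would queue the compile here */
   return SHADER_UBER;
}
```

with this shape, rarely used download types (the CTS case) never pay for a specialized compile, while a frequently repeated download (the game case) keeps using the ubershader until its specialized variant is ready.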
