anv: Add support for pre-gather of UBO constant data
Performance of constant data on Intel is extremely dependent on our push hardware. We've done a number of things over the years to try and use the push hardware more and more, including pushing chunks of UBOs. This patch goes a whole level further. With this patch, we can start pushing data from an effectively arbitrary number of UBOs and pack it tightly into the shader at a DWORD granularity. This has a few advantages over previous methods:
 1. We're no longer limited by the number of 3DSTATE_CONSTANT_* ranges in the hardware packet. Instead, our only limit is the number of DWORDs we can fit in 64 EU registers. This also means we could potentially use it to start pushing UBO data into compute shaders, though that isn't implemented in this version.
 2. We can push UBO data from higher in the buffer binding. Our old range-based pushing code had 8 bits for the push range start, counted in 32B units (256 * 32B = 8K). This was entirely a software limitation but it meant that anything above 8K in the UBO binding was unpushable.
 3. It compacts the data pushed into the shader. Since the previous method operated on 32B ranges, if you had a UBO laid out with std140 and an array of floats, each of those floats would be vec4 aligned and each 32B range would only contain two used dwords. This bloats the pushed data and can lead to a lot of unnecessary spilling. By compacting things down, we can make much more efficient use of the register file.
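To put numbers on the compaction point, here is a minimal sketch that compares how many dwords get pushed for a std140 float array under 32B-range pushing versus dword-granular gather. The helper names are hypothetical and not part of this patch; the sketch assumes the array starts at the beginning of a 128B window, so only the first 8 elements are considered.

```c
#include <stdint.h>

/* Hypothetical helper: dword mask of a std140 float array within one 128B
 * (32-dword) window. std140 gives each array element a 16B stride, so
 * element i lives in dword 4*i. Only the first 8 elements fit. */
static uint32_t
std140_float_array_mask(unsigned count)
{
   uint32_t mask = 0;
   for (unsigned i = 0; i < count && i < 8; i++)
      mask |= 1u << (i * 4);
   return mask;
}

/* Dwords pushed by range-based pushing: every 32B (8-dword) range that
 * contains at least one used dword is pushed whole. */
static unsigned
range_push_dwords(uint32_t mask)
{
   unsigned dwords = 0;
   for (unsigned r = 0; r < 4; r++) {
      if ((mask >> (r * 8)) & 0xffu)
         dwords += 8;
   }
   return dwords;
}

/* Dwords pushed by gather: exactly the used dwords (population count). */
static unsigned
gather_push_dwords(uint32_t mask)
{
   unsigned dwords = 0;
   for (; mask; mask &= mask - 1)
      dwords++;
   return dwords;
}
```

For an 8-element float array the mask is 0x11111111, so range-based pushing spends 32 dwords where gather spends 8.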
The way this works is that we first have a NIR pass which analyzes the shader and constructs a set of brw_ubo_gather structures which describe the gather. We then run a heuristic on it to decide if gather is actually worth bothering with or not. If we do decide to gather, the NIR shader is re-written to use push data rather than the UBO and the gather information is saved in the shader binary.
When we go to do the draw call, we kick off a gather shader if we haven't already done so. The gather shader is a vertex shader with rasterization disabled which takes a stream of uvec4, each of which describes a gather operation which may gather up to 128B of data per invocation. (We may want to tune this some.) The gather operation is described by a pair of 48-bit addresses (source and destination) along with a 32-bit bitfield of dwords to copy. The shader then walks this bitfield, fetches up to 4 dwords at a time as scalars, and then writes them into the destination with a vector write. The 3DPRIMITIVE used to kick off the gather shader is actually an indirect draw because this lets us append stuff to its gather list on future draws without having to re-emit the gather shader and stall the pipeline again.
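As an illustration of the gather operation described above, here is a CPU-side sketch of one invocation. It is not the shader in this patch: the struct and function names are hypothetical, the exact bit layout of the two 48-bit addresses inside the uvec4 is an assumption, and so are the conventions that bit i of the mask selects dword i relative to the source address and that gathered dwords land contiguously at the destination. Addresses are treated as byte offsets into a flat buffer standing in for GPU memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One gather operation packed into a uvec4 (layout is an assumption):
 *   dw[0]        src[31:0]
 *   dw[1][15:0]  src[47:32]
 *   dw[1][31:16] dst[15:0]
 *   dw[2]        dst[47:16]
 *   dw[3]        bitfield of source dwords to copy */
struct gather_op {
   uint32_t dw[4];
};

static struct gather_op
gather_op_pack(uint64_t src, uint64_t dst, uint32_t mask)
{
   /* Both addresses must fit in 48 bits. */
   assert((src >> 48) == 0 && (dst >> 48) == 0);

   struct gather_op op;
   op.dw[0] = (uint32_t)src;
   op.dw[1] = (uint32_t)(src >> 32) | ((uint32_t)dst << 16);
   op.dw[2] = (uint32_t)(dst >> 16);
   op.dw[3] = mask;
   return op;
}

/* Reference version of one invocation. `mem` stands in for GPU memory and
 * the addresses are byte offsets into it. Returns the dwords written. */
static unsigned
gather_op_exec(const struct gather_op *op, uint8_t *mem)
{
   const uint64_t src = op->dw[0] | ((uint64_t)(op->dw[1] & 0xffffu) << 32);
   const uint64_t dst = (op->dw[1] >> 16) | ((uint64_t)op->dw[2] << 16);
   const uint32_t mask = op->dw[3];

   uint32_t vec[4];
   unsigned n = 0, written = 0;
   for (unsigned bit = 0; bit < 32; bit++) {
      if (!(mask & (1u << bit)))
         continue;

      /* Scalar fetch of one selected dword. */
      memcpy(&vec[n++], mem + src + 4 * (uint64_t)bit, 4);

      /* Vector write once four dwords have been collected. */
      if (n == 4) {
         memcpy(mem + dst + 4 * (uint64_t)written, vec, 16);
         written += 4;
         n = 0;
      }
   }

   /* Flush a final partial vector, if any. */
   if (n > 0) {
      memcpy(mem + dst + 4 * (uint64_t)written, vec, 4 * n);
      written += n;
   }
   return written;
}
```

With a mask of 0x00011111 (five floats from a std140 array), the five selected dwords end up back to back at the destination: one full four-dword write followed by a one-dword write.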
Fossil-db stats on ICL:
    Instructions in all programs: 263079329 -> 244988798 (-6.9%)
    SENDs in all programs: 15039902 -> 13178594 (-12.4%)
    Loops in all programs: 149754 -> 149754 (+0.0%)
    Cycles in all programs: 84824941178 -> 82360372531 (-2.9%)
    Spills in all programs: 200694 -> 184079 (-8.3%)
    Fills in all programs: 279768 -> 281042 (+0.5%)