anv: Add support for pre-gather of UBO constant data

Performance of constant data on Intel is extremely dependent on our push
hardware.  We've done a number of things over the years to try and use
the push hardware more and more, including pushing chunks of UBOs.  This
patch goes a step further.  With this patch, we can start pushing data
from an effectively arbitrary number of UBOs and pack it tightly into
the shader at a DWORD granularity.  This has a few advantages over
previous methods:

 1. We're no longer limited by the number of 3DSTATE_CONSTANT_* ranges
    in the hardware packet.  Instead, our only limit is the number of
    DWORDs we can fit in 64 EU registers.  This also means we could
    potentially use it to start pushing UBO data into compute shaders
    though that isn't implemented in this version.

 2. We can push UBO data from higher in the buffer binding.  Our old
    range-based pushing code had 8 bits for the push range start, in
    units of 32B.  This was entirely a software limitation but it
    meant that anything above 8K in the UBO binding was unpushable.

 3. It compacts the data pushed into the shader.  Since the previous
    method operated on 32B ranges, if you had a UBO laid out with std140
    and an array of floats, each of those floats would be vec4 aligned
    and each 32B range would only contain two used dwords.  This bloats
    the shader space and can lead to a lot of unnecessary spilling.  By
    compacting things down, we can make much more efficient use of the
    register file.
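
The arithmetic behind points 2 and 3 can be checked with a small
sketch.  The helper names below are made up for illustration and are
not part of the driver; the only inputs taken from the text above are
the 32B range size, the 8-bit range start, and the vec4 (16B) array
stride that std140 imposes on a float array.

```c
#include <assert.h>
#include <stdint.h>

#define RANGE_BYTES         32u /* granularity of the old range-based push */
#define DWORD_BYTES          4u
#define STD140_ARRAY_STRIDE 16u /* std140 rounds array strides up to a vec4 */

/* Hypothetical helper: first byte offset that can no longer be reached
 * when the range start is stored in start_bits bits of 32B units. */
static uint32_t max_pushable_offset(unsigned start_bits)
{
    assert(start_bits < 27); /* keep the multiply below within 32 bits */
    return (1u << start_bits) * RANGE_BYTES;
}

/* Hypothetical helper: dwords of push space consumed by a std140
 * float[n] when whole 32B ranges are pushed. */
static uint32_t range_push_dwords(uint32_t n)
{
    uint32_t bytes  = n * STD140_ARRAY_STRIDE;
    uint32_t ranges = (bytes + RANGE_BYTES - 1) / RANGE_BYTES;
    return ranges * (RANGE_BYTES / DWORD_BYTES);
}

/* Hypothetical helper: dwords consumed by the same array when only the
 * dwords that are actually read get packed. */
static uint32_t gathered_push_dwords(uint32_t n)
{
    return n;
}
```

For a float[16], the range-based path would occupy 64 dwords of push
space while the compacted form needs 16, and an 8-bit start in 32B
units tops out at 8192 bytes, matching the 8K limit described above.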

The way this works is that we first have a NIR pass which analyzes the
shader and constructs a set of brw_ubo_gather structures which describe
the gather.  We then run a heuristic on them to decide whether gather
is actually worth bothering with.  If we do decide to gather, the
NIR shader is re-written to use push data rather than the UBO and the
gather information is saved in the shader binary.
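
As a rough illustration of the analysis step, the sketch below turns a
list of UBO dword offsets read by a shader into per-128B-block masks
and assigns each used dword a tightly packed push slot.  This is only
a sketch under assumptions: the struct and function names are invented
here, the real brw_ubo_gather layout is not shown in this message, and
the heuristic that decides whether to gather is left out entirely.

```c
#include <assert.h>
#include <stdint.h>

#define GATHER_BLOCK_DWORDS 32u /* one 32-bit mask covers 128B of UBO */

/* Hypothetical stand-in for a gather descriptor. */
struct ubo_gather_block {
    uint32_t block;      /* index of the 128B block in the UBO binding */
    uint32_t mask;       /* dwords of that block the shader reads */
    uint32_t push_start; /* first packed push dword for this block */
};

static unsigned popcount32(uint32_t v)
{
    unsigned n = 0;
    while (v) {
        v &= v - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Collect the accessed dword offsets into block masks, then hand out
 * packed push slots in block order.  Returns the number of blocks. */
static unsigned build_gather(const uint32_t *offs, unsigned count,
                             struct ubo_gather_block *out, unsigned max_out)
{
    unsigned num = 0;
    for (unsigned i = 0; i < count; i++) {
        uint32_t block = offs[i] / GATHER_BLOCK_DWORDS;
        uint32_t bit = 1u << (offs[i] % GATHER_BLOCK_DWORDS);
        unsigned b;
        for (b = 0; b < num; b++) {
            if (out[b].block == block)
                break;
        }
        if (b == num) {
            assert(num < max_out);
            out[num].block = block;
            out[num].mask = 0;
            out[num].push_start = 0;
            num++;
        }
        out[b].mask |= bit;
    }

    uint32_t next = 0;
    for (unsigned b = 0; b < num; b++) {
        out[b].push_start = next;
        next += popcount32(out[b].mask);
    }
    return num;
}

/* Push dword that a given UBO dword maps to, or UINT32_MAX if that
 * dword is not gathered.  A rewrite of the shader would use this
 * index in place of the original UBO load. */
static uint32_t push_index(const struct ubo_gather_block *blocks,
                           unsigned num, uint32_t dword_offset)
{
    uint32_t block = dword_offset / GATHER_BLOCK_DWORDS;
    uint32_t bit = 1u << (dword_offset % GATHER_BLOCK_DWORDS);
    for (unsigned b = 0; b < num; b++) {
        if (blocks[b].block == block && (blocks[b].mask & bit))
            return blocks[b].push_start + popcount32(blocks[b].mask & (bit - 1));
    }
    return UINT32_MAX;
}
```

Keeping one mask per 128B block mirrors the 32-bit dword bitfield that
the gather shader consumes, so each descriptor in this sketch would map
onto one gather operation.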

When we go to do the draw call, we kick off a gather shader if we
haven't already done so.  The gather shader is a vertex shader with
rasterization disabled which takes a stream of uvec4, each of which
describes a gather operation which may gather up to 128B of data per
invocation.  (We may want to tune this some.)  The gather operation is
described by a pair of 48-bit addresses (source and destination) along
with a 32-bit bitfield of dwords to copy.  The shader then walks this
bitfield, fetches up to 4 dwords at a time as scalars, and then writes
them into the destination with a vector write.  The 3DPRIMITIVE used to
kick off the gather shader is actually an indirect draw because this
lets us append stuff to its gather list on future draws without having
to re-emit the gather shader and stall the pipeline again.
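
A CPU-side model of one gather operation may make the description
above more concrete.  The packing of the two 48-bit addresses and the
mask into the four components of the uvec4 is an assumption made for
this sketch, as are all of the names; the message only says that the
three fields share one uvec4.  The copy loop follows the description:
walk the mask, collect up to 4 dwords, and write them out together.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ADDR48_MASK ((UINT64_C(1) << 48) - 1)

/* Hypothetical uvec4 layout for one gather operation:
 *   v[0] = src[31:0]
 *   v[1] = src[47:32] | dst[15:0] << 16
 *   v[2] = dst[47:16]
 *   v[3] = bitfield of dwords to copy (32 dwords = 128B)
 */
struct gather_op {
    uint32_t v[4];
};

static struct gather_op gather_op_pack(uint64_t src, uint64_t dst,
                                       uint32_t mask)
{
    struct gather_op op;
    assert((src & ~ADDR48_MASK) == 0 && (dst & ~ADDR48_MASK) == 0);
    op.v[0] = (uint32_t)src;
    op.v[1] = (uint32_t)(src >> 32) | ((uint32_t)(dst & 0xffff) << 16);
    op.v[2] = (uint32_t)(dst >> 16);
    op.v[3] = mask;
    return op;
}

static uint64_t gather_op_src(const struct gather_op *op)
{
    return (uint64_t)op->v[0] | ((uint64_t)(op->v[1] & 0xffff) << 32);
}

static uint64_t gather_op_dst(const struct gather_op *op)
{
    return (uint64_t)(op->v[1] >> 16) | ((uint64_t)op->v[2] << 16);
}

/* CPU model of one invocation.  src points at the 128B source block
 * and dst at the packed destination.  Returns the dwords written. */
static unsigned gather_run(const uint32_t *src, uint32_t *dst, uint32_t mask)
{
    uint32_t vec[4] = { 0, 0, 0, 0 };
    unsigned n = 0, written = 0;

    for (unsigned bit = 0; bit < 32; bit++) {
        if (!(mask & (1u << bit)))
            continue;
        vec[n++] = src[bit]; /* scalar fetch of one selected dword */
        if (n == 4) {        /* full vector: one 4-dword write */
            memcpy(dst + written, vec, sizeof(vec));
            written += 4;
            n = 0;
        }
    }
    memcpy(dst + written, vec, n * sizeof(uint32_t)); /* partial tail */
    return written + n;
}
```

With a mask of 0x11111111, the pattern a std140 float array produces,
the model reads one dword out of every vec4 and writes 8 contiguous
dwords to the destination.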

Fossil-db stats on ICL:

    Instructions in all programs: 263079329 -> 244988798 (-6.9%)
    SENDs in all programs: 15039902 -> 13178594 (-12.4%)
    Loops in all programs: 149754 -> 149754 (+0.0%)
    Cycles in all programs: 84824941178 -> 82360372531 (-2.9%)
    Spills in all programs: 200694 -> 184079 (-8.3%)
    Fills in all programs: 279768 -> 281042 (+0.5%)