v3d: use unifa/ldunifa for UBO loads from uniform addresses
This improves performance of the UE4 Shooter demo by ~20%-30%.
What we do here is write the uniform buffer address we want to load from to unifa and then, after a 3-slot delay, read up to a vec4 by issuing consecutive ldunifa signals. This lets us read a vec4 in a total of 8 cycles while also freeing up space in the TMU FIFOs, leading to fewer stalls from TMU FIFO overflows now that we have pipelining.
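As a rough illustration of where the 8-cycle figure could come from, here is a minimal sketch of the slot accounting. It assumes one slot for the unifa write, the 3-slot delay mentioned above, and one slot per ldunifa signal; that 1 + 3 + 4 breakdown is an assumption made for illustration, and the constant and function names are hypothetical, not taken from the driver.

```c
#include <assert.h>

/* Hypothetical constants for illustration only; not driver code. */
enum {
    UNIFA_WRITE_SLOTS = 1,           /* assumed: one slot to write the address to unifa */
    UNIFA_DELAY_SLOTS = 3,           /* delay before the first ldunifa, as described above */
    LDUNIFA_SLOTS_PER_COMPONENT = 1  /* assumed: one ldunifa signal per 32-bit component */
};

/* Slots needed to load `components` (1..4) consecutive 32-bit values
 * under this simplified model. */
static int
ubo_load_slots(int components)
{
    assert(components >= 1 && components <= 4);
    return UNIFA_WRITE_SLOTS + UNIFA_DELAY_SLOTS +
           components * LDUNIFA_SLOTS_PER_COMPONENT;
}
```

Under these assumptions a full vec4 costs `ubo_load_slots(4)`, which matches the 8 cycles quoted above, and the fixed 4-slot setup cost is shared by every component read from the same address.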
Edited by Iago Toral