gallivm/nir: Fix scalar load and broadcast logic, speed up deqp-vk by doing it in more places
Ignore the first commit here and look at !14994 (merged) for that.
For the rest of this: clean up the gallivm logic that handles "my memory access offset is uniform, so I could just read a scalar and broadcast it", then extend it to more memory accesses. The payoff is a 24.4002% +/- 1.94375% (n=7) reduction in the runtime of one of our slowest VK tests.
Be extra suspicious of a57cd6e0 -- I've done my best to figure out which masks contribute, but I may have missed something.
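
For reviewers who want the idea in miniature, here is a plain-C sketch of the scalar-load-and-broadcast pattern. It is not the gallivm code: `LANES`, `load_per_lane`, `load_scalar_broadcast` and the single 8-bit `exec_mask` are hypothetical names made up for illustration, and the real code builds LLVM IR and may combine several masks. The sketch assumes the caller already knows the offset is the same in every lane.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8 /* hypothetical SIMD width, for illustration only */

/* Reference path: every active lane does its own load (a gather).
 * Inactive lanes are zeroed so results are easy to compare. */
static void load_per_lane(const uint32_t *mem, const uint32_t offsets[LANES],
                          uint8_t exec_mask, uint32_t out[LANES])
{
   assert(mem);
   for (int i = 0; i < LANES; i++)
      out[i] = (exec_mask & (1u << i)) ? mem[offsets[i]] : 0;
}

/* Fast path: the offset is uniform, so read one scalar and broadcast it.
 * The load is skipped entirely when no lane is active, since the offset
 * may not be safe to dereference in that case (assumption of this sketch). */
static void load_scalar_broadcast(const uint32_t *mem, uint32_t offset,
                                  uint8_t exec_mask, uint32_t out[LANES])
{
   assert(mem);
   uint32_t value = exec_mask ? mem[offset] : 0;
   for (int i = 0; i < LANES; i++)
      out[i] = (exec_mask & (1u << i)) ? value : 0;
}
```

With a uniform offset, both functions should give the same result for any mask while the fast path issues at most one load. In this sketch a missed mask would show up as a load that runs when no lane should have been reading.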