broadcom/compiler: don't try too hard to hide TMU latency
Hiding latency is a balancing game: if you do too little you end up stalling on TMU reads, but if you do too much you end up postponing critical paths for no benefit. The thread switching model in V3D is already quite effective at hiding latency and we don't usually need to do a lot more to hide TMU latency effectively.
Based on empirical testing with a bunch of Vulkan samples it seems we have been trying too hard to hide latency until now, to the point where it was detrimental to performance. This series has two patches: one for the NIR scheduler to reduce the estimated latency of the intrinsics that generate TMU reads and another for the QPU scheduler to stop scheduling TMU setup/reads to hide latency. Combined they show a small but consistent improvement in all samples we tested, even if we force all shaders to only use 2 threads!
I also confirmed that even if we don't need to do a lot for latency, we still benefit from doing something: reducing the estimated latency for TMU intrinsics to 1 when scheduling NIR does cause worse performance, meaning that the thread switching mechanism is not enough to hide all latency on its own, as we already knew.
Here are the results of a few samples for reference:
UE shooter (high): 18.390548 fps -> 19.028731 fps
UE shooter (low): 29.309201 fps -> 31.393861 fps
UE vehicle: 12.708788 fps -> 13.027761 fps
UE Sun Temple: 25.823155 fps -> 26.039792 fps
VkQuake: 57.167518 fps -> 58.380585 fps
RBDoom3-BFG: 13.483675 fps -> 13.632225 fps
Sponza: 30.34 fps -> 31.15 fps
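For a quick read on the relative gains, the fps numbers above can be turned into percentages. This is only a minimal sketch: the helper name `gain_percent` is made up for illustration, and the values are copied by hand from the list above.

```python
# (sample, fps before, fps after), copied from the results listed above.
results = [
    ("UE shooter (high)", 18.390548, 19.028731),
    ("UE shooter (low)", 29.309201, 31.393861),
    ("UE vehicle", 12.708788, 13.027761),
    ("UE Sun Temple", 25.823155, 26.039792),
    ("VkQuake", 57.167518, 58.380585),
    ("RBDoom3-BFG", 13.483675, 13.632225),
    ("Sponza", 30.34, 31.15),
]


def gain_percent(before, after):
    """Relative fps change in percent (hypothetical helper, not part of Mesa)."""
    return (after - before) / before * 100.0


for name, before, after in results:
    print(f"{name:20s} {gain_percent(before, after):+.2f}%")
```

Every sample comes out positive, with UE shooter (low) showing the largest relative gain.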
And some shader-db stats:
total instructions in shared programs: 12639282 -> 12457633 (-1.44%)
instructions in affected programs: 6433993 -> 6252344 (-2.82%)
helped: 30169
HURT: 6306
total threads in shared programs: 416378 -> 416366 (<.01%)
threads in affected programs: 2802 -> 2790 (-0.43%)
helped: 463
HURT: 469
total uniforms in shared programs: 3710998 -> 3704041 (-0.19%)
uniforms in affected programs: 279670 -> 272713 (-2.49%)
helped: 1512
HURT: 2621
total max-temps in shared programs: 2158527 -> 2147809 (-0.50%)
max-temps in affected programs: 480629 -> 469911 (-2.23%)
helped: 7212
HURT: 10248
total spills in shared programs: 3231 -> 2202 (-31.85%)
spills in affected programs: 2541 -> 1512 (-40.50%)
helped: 80
HURT: 2
total fills in shared programs: 4658 -> 3059 (-34.33%)
fills in affected programs: 3679 -> 2080 (-43.46%)
helped: 89
HURT: 1
total sfu-stalls in shared programs: 34450 -> 21227 (-38.38%)
sfu-stalls in affected programs: 20157 -> 6934 (-65.60%)
helped: 8186
HURT: 1015
total inst-and-stalls in shared programs: 12673732 -> 12478860 (-1.54%)
inst-and-stalls in affected programs: 6460462 -> 6265590 (-3.02%)
helped: 30523
HURT: 6021
total nops in shared programs: 321687 -> 315134 (-2.04%)
nops in affected programs: 107039 -> 100486 (-6.12%)
helped: 10703
HURT: 9474
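If you want to sanity-check the percentages in a report like the one above, something along these lines should work. It is a rough sketch that assumes the line layout shown here ("<stat> in shared/affected programs: A -> B (P%)"); the regular expression and the function names are my own and are not part of shader-db.

```python
import re

# Matches report lines in the layout used above (an assumption about the format).
LINE_RE = re.compile(
    r"^(?:total )?(?P<stat>\S+) in (?P<scope>shared|affected) programs: "
    r"(?P<before>\d+) -> (?P<after>\d+) \((?P<reported>[^)]+)\)$"
)


def percent_change(before, after):
    """Signed change from before to after, in percent."""
    return (after - before) / before * 100.0


def parse_stat_line(line):
    """Return (stat, scope, before, after, reported) or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return (
        m.group("stat"),
        m.group("scope"),
        int(m.group("before")),
        int(m.group("after")),
        m.group("reported"),
    )


samples = [
    "total instructions in shared programs: 12639282 -> 12457633 (-1.44%)",
    "total spills in shared programs: 3231 -> 2202 (-31.85%)",
    "sfu-stalls in affected programs: 20157 -> 6934 (-65.60%)",
]

for line in samples:
    stat, scope, before, after, reported = parse_stat_line(line)
    change = percent_change(before, after)
    print(f"{stat} ({scope}): recomputed {change:+.2f}%, reported {reported}")
```

Lines such as "helped:" and "HURT:" do not match the pattern and are simply skipped by returning None.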