freedreno/a6xx: redesign tex prefetch
Move tex prefetch from a nir pass to an ir3 pass that runs after the pre-RA sched pass. This should bring several benefits:
- should prioritize tex fetches that kill a varying (since the pre-RA sched tries to minimize register pressure)
- can do a better job of not moving a tex fetch ahead of a kill, i.e. it won't just rely on the order of instructions in nir, but on their actual dependency graph
- I think we want to balance the number of prefetches against shader size. The blob is definitely doing something other than "prefetch as much as possible", and there are some scenarios (like gl_fill2) where we are better off prefetching fewer than 4.
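
The selection logic described above could look roughly like the following. This is a standalone toy sketch, not the actual ir3 code: every `toy_*` name, the struct layout, and the `budget` parameter are hypothetical, and the cap of 4 is only inferred from the "fewer than 4" remark above. It assumes the instruction list is already in pre-RA sched order and that a dependency walk has already marked which instructions are ordered after a kill.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for ir3 instructions. */
enum toy_op { TOY_ALU, TOY_TEX, TOY_KILL };

struct toy_instr {
    enum toy_op op;
    bool coord_from_varying; /* tex coords come straight from a varying */
    bool ordered_after_kill; /* dependency graph orders this after a kill */
    bool prefetch;           /* output: selected for prefetch */
};

/* Assumed upper bound on prefetches per shader. */
#define TOY_MAX_PREFETCH 4

/*
 * Walk the instructions in scheduled order and mark up to `budget`
 * eligible tex fetches for prefetch. Returns the number selected.
 * A lower `budget` models "prefetch less" for small shaders.
 */
static unsigned
toy_select_prefetch(struct toy_instr *instrs, size_t count, unsigned budget)
{
    unsigned selected = 0;

    if (budget > TOY_MAX_PREFETCH)
        budget = TOY_MAX_PREFETCH;

    for (size_t i = 0; i < count && selected < budget; i++) {
        struct toy_instr *instr = &instrs[i];

        if (instr->op != TOY_TEX)
            continue;
        /* Only fetches fed directly by a varying can be prefetched. */
        if (!instr->coord_from_varying)
            continue;
        /* Decided by dependencies, not by position in the list. */
        if (instr->ordered_after_kill)
            continue;

        instr->prefetch = true;
        selected++;
    }

    return selected;
}
```

Because the walk follows the scheduled order, fetches that the scheduler pulled early to free a varying are considered first. A tex that merely appears after a kill in the list stays eligible as long as it is not marked as ordered after that kill.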