freedreno/a6xx: redesign tex prefetch

Move tex prefetch from a nir pass to an ir3 pass that runs after the pre-RA sched pass. This should bring several benefits:

- It should prioritize tex fetches which kill a varying (since pre-RA sched tries to minimize register pressure).
- It can do a better job of not moving a tex fetch ahead of a kill, i.e. it won't just rely on the order of instructions in nir, but on their actual dependency graph.
- I think we want to balance the number of prefetches against shader size? The blob is definitely doing something other than "prefetch as much as possible", and there are some scenarios (like gl_fill2) where we are better off prefetching fewer than 4.
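
To make the idea in the list above concrete, here is a rough sketch of what the selection step could look like once instructions are in post-sched order. Everything in it is an assumption for illustration: the `toy_*` types and `toy_pick_prefetch()` are hypothetical and are not the real ir3 structs or API, the "stop at the first kill" rule is just one conservative way to avoid hoisting a fetch ahead of a kill, and `budget` stands in for whatever heuristic ends up balancing prefetch count against shader size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified instruction model; not the real ir3 structs. */
enum toy_op { TOY_ALU, TOY_TEX, TOY_KILL };

struct toy_instr {
    enum toy_op op;
    bool coord_from_varying; /* tex only: coords are an unmodified varying */
};

/* Upper bound taken from the "prefetching fewer than 4" remark above. */
#define TOY_MAX_PREFETCH 4

/*
 * Walk the instructions in post-sched order and record the indices of tex
 * fetches that are prefetch candidates. The walk stops at the first kill
 * (assumed rule: never hoist a fetch ahead of a kill) or once `budget`
 * candidates have been picked. Returns the number of indices written.
 */
static size_t
toy_pick_prefetch(const struct toy_instr *instrs, size_t count,
                  size_t budget, size_t *picked)
{
    size_t n = 0;

    assert(budget <= TOY_MAX_PREFETCH);

    for (size_t i = 0; i < count && n < budget; i++) {
        if (instrs[i].op == TOY_KILL)
            break;
        if (instrs[i].op == TOY_TEX && instrs[i].coord_from_varying)
            picked[n++] = i;
    }
    return n;
}
```

Because the walk follows the scheduled order, fetches that the scheduler pulled early (for example because they free up a varying) are naturally picked first, and lowering `budget` below 4 is the knob for cases like gl_fill2.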
