aco: try to group together VMEM loads of the same resource

This tries to place VMEM loads of the same resource back to back, without any instruction in between. I think this is faster because it prevents the hardware from switching to another wave in the middle of the group, which should reduce cache thrashing, and because the later loads are started earlier. Another theory is that it allows the loads to be coalesced.
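To illustrate the idea only, here is a toy model in Python. It is not the ACO scheduler code: the `Instr` type, the `group_vmem_loads` function, and the register and resource names are all made up for this sketch, and it only tracks register dependencies (it assumes there are no stores or other memory hazards between the loads). A later load is hoisted up behind an earlier load of the same resource when it does not depend on anything it would be moved across.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Instr:
    """Hypothetical toy instruction, not an ACO data structure."""
    name: str
    defs: frozenset = frozenset()    # registers written
    uses: frozenset = frozenset()    # registers read
    resource: Optional[str] = None   # descriptor of a VMEM load, None otherwise


def group_vmem_loads(instrs: List[Instr]) -> List[Instr]:
    """Move loads of the same resource directly behind each other when legal."""
    result = list(instrs)
    i = 0
    while i < len(result):
        cur = result[i]
        if cur.resource is None:
            i += 1
            continue
        insert_at = i + 1        # end of the group that starts at i
        skipped_defs = set()     # registers written by instructions we hoist over
        skipped_uses = set()     # registers read by instructions we hoist over
        j = i + 1
        while j < len(result):
            cand = result[j]
            independent = (
                not (cand.uses & skipped_defs)       # no read-after-write
                and not (cand.defs & skipped_uses)   # no write-after-read
                and not (cand.defs & skipped_defs)   # no write-after-write
            )
            if cand.resource == cur.resource and independent:
                # Hoist the load to the end of the current group.
                result.insert(insert_at, result.pop(j))
                insert_at += 1
            else:
                # The instruction stays where it is; later loads must respect it.
                skipped_defs |= cand.defs
                skipped_uses |= cand.uses
            j += 1
        i = insert_at            # continue after the group we just formed
    return result


def regs(*names: str) -> frozenset:
    return frozenset(names)


program = [
    Instr("ld_a", defs=regs("v0"), uses=regs("v10"), resource="tex0"),
    Instr("mul", defs=regs("v1"), uses=regs("v0")),
    Instr("ld_b", defs=regs("v2"), uses=regs("v11"), resource="tex0"),
    Instr("ld_c", defs=regs("v3"), uses=regs("v1"), resource="buf1"),
    Instr("ld_d", defs=regs("v4"), uses=regs("v1"), resource="tex0"),
]

grouped = group_vmem_loads(program)
print([ins.name for ins in grouped])
```

In this example `ld_b` is moved up next to `ld_a` because both use `tex0` and `ld_b` does not read the result of `mul`. `ld_d` also uses `tex0` but reads `v1`, which `mul` writes, so it stays in place.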

some future ideas:

  • see if s_clause is needed to get a performance improvement on Navi
  • experiment with grouping together SMEM loads (because we can combine additions into SMEM instructions, this already happens sometimes on Vega)
  • experiment with grouping together stores

benchmark results on a Vega 64 (tested with load/store vectorizer):

This depends on !2364 (merged)

