This tries to group VMEM loads of the same resource together, with no other instructions in between. I think this is faster because it avoids switching to another wave partway through, which reduces cache thrashing, and because it starts the later loads earlier. Another theory is that it allows the loads to be coalesced.
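To make the idea concrete, here is a minimal sketch of the grouping on a toy instruction list. It is not the code in this MR: `Instr`, `can_hoist` and `group_vmem_loads` are hypothetical names, the model only tracks register dependencies, and it assumes there are no stores or other memory-ordering hazards between the loads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str            # label used for display only
    kind: str            # "vmem_load" or "alu" in this toy model
    dst: str             # register written
    srcs: tuple = ()     # registers read
    resource: str = ""   # descriptor the load reads from (loads only)


def can_hoist(load, skipped):
    """True if `load` may move above every instruction in `skipped`."""
    for other in skipped:
        if other.dst in load.srcs:   # load needs a value produced by `other`
            return False
        if load.dst in other.srcs:   # `other` still reads the old value of load.dst
            return False
        if load.dst == other.dst:    # both write the same register
            return False
    return True


def group_vmem_loads(instrs):
    """Move later loads of the same resource up to directly follow the first."""
    out = list(instrs)
    i = 0
    while i < len(out):
        first = out[i]
        if first.kind != "vmem_load":
            i += 1
            continue
        end = i + 1  # index just past the group built so far
        j = i + 1
        while j < len(out):
            cand = out[j]
            if (cand.kind == "vmem_load"
                    and cand.resource == first.resource
                    and can_hoist(cand, out[end:j])):
                out.insert(end, out.pop(j))  # place it at the end of the group
                end += 1
            j += 1
        i = end
    return out


program = [
    Instr("load_a", "vmem_load", "v0", ("v10",), "tex0"),
    Instr("mul",    "alu",       "v1", ("v0", "v11")),
    Instr("load_b", "vmem_load", "v2", ("v12",), "tex0"),
    Instr("add",    "alu",       "v3", ("v1", "v2")),
    Instr("load_c", "vmem_load", "v4", ("v3",), "tex0"),
]
grouped = [ins.name for ins in group_vmem_loads(program)]
print(grouped)  # ['load_a', 'load_b', 'mul', 'add', 'load_c']
```

In this example `load_b` moves up next to `load_a`, while `load_c` stays where it is because its address depends on the result of `add`.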
some future ideas:
- see if `s_clause` is needed to get a performance improvement on Navi
- experiment with grouping together SMEM loads (because we can combine additions into SMEM, we happen to already do this sometimes on Vega)
- experiment with grouping together stores
benchmark results on a Vega 64 (tested with load/store vectorizer): https://gitlab.freedesktop.org/snippets/695
This depends on !2364 (merged)