ac/nir: Remove byte permute from prefix sum of the repack sequence.
This is a prerequisite for being able to use ac_nir_lower_ngg with the LLVM backends. Unfortunately, LLVM 12 and older don't expose the v_perm_b32 instruction for generic byte-permute, so I had to think of a good alternative. The new sequence can hopefully benefit RadeonSI too when it starts using ac_nir_lower_ngg.
It turns out that this alternative is slightly better than the original, too.
Some ACO cleanup is also included here to patch up a few things I noticed along the way.
For GPUs that support v_dot (Navi 2x)
When v_dot4_u32_u8 is available, we right-shift a series of 0x01 bytes. This yields 0x01 at the wanted byte positions and 0x00 at the unwanted positions, so v_dot multiplies the unneeded values by zero and they drop out of the sum. This sequence is preferable because it better hides the latency of the LDS.
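The idea above can be modelled on the CPU with a minimal sketch. It is not the NIR or ISA sequence from this MR: `udot4_u8` and `prefix_sum_dot` are hypothetical helper names, only the single-dword case (up to 4 waves) is covered, and the byte layout (wave 0 in the lowest byte) is an assumption made for illustration.

```c
#include <stdint.h>

/* Scalar model of an unsigned 4x8-bit dot product with accumulate.
 * Hypothetical helper, standing in for what v_dot4_u32_u8 computes. */
static uint32_t udot4_u8(uint32_t a, uint32_t b, uint32_t acc)
{
   for (unsigned i = 0; i < 4; i++)
      acc += ((a >> (8 * i)) & 0xff) * ((b >> (8 * i)) & 0xff);
   return acc;
}

/* Sum of the lowest n bytes (n in 0..4) of `packed`.
 * Assumption: each byte holds one wave's count, wave 0 in the lowest byte. */
static uint32_t prefix_sum_dot(uint32_t packed, unsigned n)
{
   /* Total right shift is 32 - 8*n bits. It is split into two equal halves
    * because a single 32-bit shift by 32 is undefined behaviour in C. */
   unsigned half = 16 - 4 * n;
   uint32_t mask = (0x01010101u >> half) >> half;

   /* mask has 0x01 in the n wanted byte positions and 0x00 elsewhere,
    * so the dot product adds up only the wanted bytes. */
   return udot4_u8(packed, mask, 0);
}
```

For example, with `packed = 0x04030201` the mask for `n = 2` is `0x00000101`, so only the two lowest bytes contribute. Note that the mask depends only on `n`, not on the packed data.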
Wave64 with up to 4 waves (same sequence used for Wave32 with up to 4 waves):
Wave32 with 5-8 waves:
For GPUs that don't support v_dot (Navi 10)
If the v_dot instruction can't be used, we left-shift the packed bytes instead. This shifts out the unneeded bytes and shifts in zeroes in their place, and we then sum the remaining bytes using v_sad_u8.
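The fallback can be sketched the same way as a scalar model. Again, this is not the sequence from this MR: `sad_u8` and `prefix_sum_sad` are hypothetical helper names, only the single-dword case (up to 4 waves) is covered, and the byte layout (wave 0 in the lowest byte) is an assumption.

```c
#include <stdint.h>

/* Scalar model of a 4x8-bit sum of absolute differences with accumulate.
 * Hypothetical helper, standing in for what v_sad_u8 computes. */
static uint32_t sad_u8(uint32_t a, uint32_t b, uint32_t acc)
{
   for (unsigned i = 0; i < 4; i++) {
      uint32_t x = (a >> (8 * i)) & 0xff;
      uint32_t y = (b >> (8 * i)) & 0xff;
      acc += x > y ? x - y : y - x;
   }
   return acc;
}

/* Sum of the lowest n bytes (n in 0..4) of `packed`.
 * Assumption: each byte holds one wave's count, wave 0 in the lowest byte. */
static uint32_t prefix_sum_sad(uint32_t packed, unsigned n)
{
   /* Total left shift is 32 - 8*n bits, split into two equal halves
    * because a single 32-bit shift by 32 is undefined behaviour in C. */
   unsigned half = 16 - 4 * n;
   uint32_t kept = (packed << half) << half;

   /* The unneeded high bytes were shifted out and zeroes shifted in, so the
    * absolute difference against 0 simply adds up the remaining bytes. */
   return sad_u8(kept, 0, 0);
}
```

Unlike the dot-product variant, the shift here is applied to the packed data itself, so it can only start once that data is available.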
Wave64 with up to 4 waves (same sequence used for Wave32 with up to 4 waves):
Wave32 with 5-8 waves: