ac/nir: Remove byte permute from prefix sum of the repack sequence. (!12786)

Timur Kristóf requested to merge Venemo/mesa:radv-repack-no-byte-permute into main Sep 09, 2021

This is a prerequisite for being able to use ac_nir_lower_ngg with the LLVM backends. Unfortunately, LLVM 12 and older don't expose the v_perm_b32 instruction for generic byte-permute, so I had to think of a good alternative. The new sequence can hopefully benefit RadeonSI too when it starts using ac_nir_lower_ngg.

It turns out that this alternative is slightly better than the original, too.

Some ACO cleanup is also included here, to patch up a few things I noticed along the way.

For GPUs that support v_dot (Navi 2x)

When v_dot4_u32_u8 is available, we right-shift a series of 0x01 bytes. This yields 0x01 at the wanted byte positions and 0x00 at the unwanted positions, so v_dot multiplies the unneeded values by zero and sums only the rest. This sequence is preferable because it better hides the latency of the LDS: the mask does not depend on the loaded data, so it can be computed while the LDS load is still in flight.
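The idea above can be sketched on the CPU as follows. This is only an illustration, not the code from this MR: the function names, the `wave_id` parameter, the shift amount, and the assumed layout (one count byte per wave, wave 0 in the lowest byte of a single dword) are all hypothetical.

```c
#include <stdint.h>

/* Software stand-in for v_dot4_u32_u8:
 * acc + sum over the four bytes of a[i] * b[i]. */
static uint32_t dot4_u32_u8(uint32_t a, uint32_t b, uint32_t acc)
{
    for (int i = 0; i < 4; i++)
        acc += ((a >> (8 * i)) & 0xffu) * ((b >> (8 * i)) & 0xffu);
    return acc;
}

/* Right-shift a series of 0x01 bytes so that only the bytes belonging to
 * waves below wave_id keep 0x01 (assumed layout; wave_id is 0..4). */
static uint32_t mask_below(unsigned wave_id)
{
    unsigned shift = (4u - wave_id) * 8u;
    /* Shifting a 32-bit value by 32 is undefined in C, so handle it here. */
    return shift >= 32u ? 0u : 0x01010101u >> shift;
}

/* Exclusive prefix sum of the packed per-wave counts for one wave.
 * Passing wave_id == 4 gives the total over all four waves. */
static uint32_t prefix_sum_dot(uint32_t packed_counts, unsigned wave_id)
{
    return dot4_u32_u8(packed_counts, mask_below(wave_id), 0u);
}
```

For example, with the counts 1, 2, 3, 4 packed as 0x04030201, `prefix_sum_dot(0x04030201, 2)` gives 3 (the counts of waves 0 and 1), and a `wave_id` of 4 gives the total of 10. Note that `mask_below` never touches the packed data, which is what lets the real sequence overlap the shift with the LDS load.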

Wave64 with up to 4 waves (same sequence used for Wave32 with up to 4 waves):

Wave32 with 5-8 waves:

For GPUs that don't support v_dot (Navi 10)

If the v_dot instruction can't be used, we left-shift the packed bytes. This will shift out the unneeded bytes and shift in zeroes instead, then we sum them using v_sad_u8.
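A CPU sketch of this fallback, under the same caveats as any illustration: the function names, the `wave_id` parameter, the shift amount, and the assumed layout (one count byte per wave, wave 0 in the lowest byte of a single dword) are hypothetical and not taken from this MR.

```c
#include <stdint.h>

/* Software stand-in for v_sad_u8:
 * acc + sum over the four bytes of |a[i] - b[i]|. */
static uint32_t sad_u8(uint32_t a, uint32_t b, uint32_t acc)
{
    for (int i = 0; i < 4; i++) {
        int x = (int)((a >> (8 * i)) & 0xffu);
        int y = (int)((b >> (8 * i)) & 0xffu);
        acc += (uint32_t)(x > y ? x - y : y - x);
    }
    return acc;
}

/* Exclusive prefix sum of the packed per-wave counts for one wave.
 * Left-shifting drops the bytes of waves at or above wave_id and shifts in
 * zeroes; the sum of absolute differences against 0 then adds up the bytes
 * that remain. Passing wave_id == 4 gives the total over all four waves. */
static uint32_t prefix_sum_sad(uint32_t packed_counts, unsigned wave_id)
{
    unsigned shift = (4u - wave_id) * 8u;
    /* Shifting a 32-bit value by 32 is undefined in C, so handle it here. */
    uint32_t kept = shift >= 32u ? 0u : packed_counts << shift;
    return sad_u8(kept, 0u, 0u);
}
```

With the counts 1, 2, 3, 4 packed as 0x04030201, `prefix_sum_sad(0x04030201, 2)` gives 3. Unlike the v_dot variant, the shift here operates on the loaded data itself, so it cannot start until the LDS load has returned.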

Wave64 with up to 4 waves (same sequence used for Wave32 with up to 4 waves):

Wave32 with 5-8 waves:
