ac/nir: Remove byte permute from prefix sum of the repack sequence.

Timur Kristóf requested to merge Venemo/mesa:radv-repack-no-byte-permute into main

This is a prerequisite for using ac_nir_lower_ngg with the LLVM backends. Unfortunately, LLVM 12 and older don't expose the v_perm_b32 instruction for generic byte permute, so I had to think of a good alternative. The new sequence can hopefully benefit RadeonSI too when it starts using ac_nir_lower_ngg.

It turns out that this alternative is slightly better than the original, too.

Some ACO cleanup is also included here, to patch up a few things I noticed along the way.

For GPUs that support v_dot (Navi 2x)

When v_dot4_u32_u8 is available, we right-shift a series of 0x01 bytes. This will yield 0x01 at wanted byte positions and 0x00 at unwanted positions, therefore v_dot can get rid of the unneeded values. This sequence is preferable because it better hides the latency of the LDS.
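
The mask trick can be modelled on the CPU. The following is a minimal C sketch, not the NIR code from this MR: `udot4_u8` is a hypothetical software stand-in for v_dot4_u32_u8 with a zero accumulator, and `prefix_sum_dot` with its `keep` parameter (how many low bytes should be summed) are names made up for illustration. It assumes a single dword of packed per-wave counts with wave 0 in the lowest byte. The shift is split into two halves because a 32-bit value can't be shifted by 32 in one go.

```c
#include <stdint.h>

/* Hypothetical software model of v_dot4_u32_u8 with a zero accumulator:
 * multiplies the four byte pairs and adds up the products. */
static uint32_t udot4_u8(uint32_t a, uint32_t b)
{
    uint32_t sum = 0;
    for (int i = 0; i < 4; i++)
        sum += ((a >> (8 * i)) & 0xffu) * ((b >> (8 * i)) & 0xffu);
    return sum;
}

/* Sums the lowest `keep` bytes (0..4) of `packed`.
 * Assumption: wave 0's count sits in the lowest byte. */
static uint32_t prefix_sum_dot(uint32_t packed, unsigned keep)
{
    /* Total shift is 8 * (4 - keep) bits, which can be 32,
     * so it is applied as two half-sized shifts. */
    unsigned half = 4 * (4 - keep);

    /* Right-shifting the 0x01 bytes leaves 0x01 at wanted byte
     * positions and 0x00 at unwanted ones. */
    uint32_t mask = (0x01010101u >> half) >> half;

    /* The dot product multiplies unwanted bytes by zero. */
    return udot4_u8(packed, mask);
}
```

With the packed counts 0x04030201 (counts 1, 2, 3, 4 for waves 0 to 3), `prefix_sum_dot(0x04030201u, 2)` sums the two lowest bytes and returns 3, while `keep = 4` returns the total of 10. Note that in this model the mask does not depend on the packed value at all, which is presumably why this variant can overlap more work with the LDS load.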

Wave64 with up to 4 waves (same sequence used for Wave32 with up to 4 waves):

Screenshot_from_2021-09-09_10-15-00

Wave32 with 5-8 waves:

Screenshot_from_2021-09-09_10-42-33

For GPUs that don't support v_dot (Navi 10)

If the v_dot instruction can't be used, we left-shift the packed bytes instead. This shifts out the unneeded bytes and shifts in zeroes, and we then sum the remaining bytes using v_sad_u8.
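
The fallback can be modelled on the CPU in the same spirit. This is a minimal C sketch, not the NIR code from this MR: `sad_u8` is a hypothetical software stand-in for v_sad_u8 (sum of absolute byte differences plus an accumulator), and `prefix_sum_sad` with its `keep` parameter (how many low bytes should be summed) are names made up for illustration. It assumes a single dword of packed per-wave counts with wave 0 in the lowest byte.

```c
#include <stdint.h>

/* Hypothetical software model of v_sad_u8: sum of absolute
 * differences of the four byte pairs, plus an accumulator. */
static uint32_t sad_u8(uint32_t a, uint32_t b, uint32_t acc)
{
    for (int i = 0; i < 4; i++) {
        uint32_t x = (a >> (8 * i)) & 0xffu;
        uint32_t y = (b >> (8 * i)) & 0xffu;
        acc += x > y ? x - y : y - x;
    }
    return acc;
}

/* Sums the lowest `keep` bytes (0..4) of `packed`.
 * Assumption: wave 0's count sits in the lowest byte. */
static uint32_t prefix_sum_sad(uint32_t packed, unsigned keep)
{
    /* Total shift is 8 * (4 - keep) bits, which can be 32,
     * so it is applied as two half-sized shifts. */
    unsigned half = 4 * (4 - keep);

    /* Left-shifting pushes the unneeded high bytes out and
     * shifts zeroes in at the bottom. */
    uint32_t shifted = (packed << half) << half;

    /* SAD against zero adds up the remaining bytes. */
    return sad_u8(shifted, 0, 0);
}
```

With the packed counts 0x04030201, `prefix_sum_sad(0x04030201u, 2)` shifts the value to 0x02010000 and returns 3. Unlike the mask in the v_dot variant, the shift here operates on the loaded value itself, so in this model none of it can start before the packed counts are available.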

Wave64 with up to 4 waves (same sequence used for Wave32 with up to 4 waves):

Screenshot_from_2021-09-09_10-16-02

Wave32 with 5-8 waves:

Screenshot_from_2021-09-09_10-16-54