
aco: Use packed bytes approach to optimize the workgroup scan further.

Timur Kristóf requested to merge Venemo/mesa:aco-packed-wg-scan into main

In the first version of the workgroup scan, each wave stored a single dword to LDS, then each lane loaded the dword of the corresponding wave, and we used DPP to get the reduction and scan result. I changed this to the current SALU approach on the assumption that DPP is costly and readlane+SALU is cheap.

However, the current approach in main also turns out to be suboptimal because:

  • for Wave32 it requires a TON of SALU instructions
  • according to RGP (Radeon GPU Profiler), readlane is not so cheap when the SALU has to wait for it
  • each wave has at most 64 (Wave64) or 32 (Wave32) live vertices, so its count fits in a byte, yet we spend a whole dword of LDS space on it
  • each wave has to wait immediately after the LDS read

Screenshot_from_2021-05-31_15-52-38

About the new version proposed in this MR:

Compared to the current version in main, about 20 SALU instructions are removed at the cost of only 2 more VALU instructions.

  • uses only 1 byte of LDS per wave, so Wave64 needs just 1 LDS dword in total (and Wave32 needs 2 dwords)
  • can do some of the processing before the wait for the LDS read
  • has overall fewer instructions and lower latency
  • each lane calculates the reduction result for the corresponding wave, but this is done using byte and lane permute instructions instead of DPP additions.
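
The LDS saving in the first bullet is simple arithmetic, sketched below. The helper name `lds_dwords` is made up for illustration. The limit of 4 waves for Wave64 comes from this description; the limit of 8 waves for Wave32 is an assumption inferred from the 2 dwords quoted above.

```python
def lds_dwords(max_waves, bytes_per_wave):
    """Number of 4-byte LDS dwords needed to hold one value per wave."""
    total_bytes = max_waves * bytes_per_wave
    return (total_bytes + 3) // 4  # round up to whole dwords

WAVE64_MAX_WAVES = 4  # stated in this description
WAVE32_MAX_WAVES = 8  # assumption, inferred from "2 dwords"

for name, waves in (("Wave64", WAVE64_MAX_WAVES), ("Wave32", WAVE32_MAX_WAVES)):
    old = lds_dwords(waves, 4)  # one dword per wave
    new = lds_dwords(waves, 1)  # one byte per wave
    print(f"{name}: {old} dwords before, {new} after")
```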

Screenshot_from_2021-06-09_11-06-25

NOTE: On the RGP screenshots, ignore the high latency around s_barrier and the instructions near it - that is just happenstance, not related to any change made here.

Explanation of the sequence:

  1. Each wave stores its result in a single byte of LDS. In Wave64 there are up to 4 waves, so all results fit in 1 dword.
  2. Each wave loads that 1 dword back from LDS.
  3. Compute a mask to be used for the byte-permute instruction v_perm_b32
  4. Each lane computes the scan result for the corresponding wave by masking out the unneeded bytes using v_perm_b32, and then horizontally adding the remaining bytes using v_sad_u8.
  5. Use v_readlane_b32 to read the current wave's result and the total.
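
The Wave64 sequence above can be modelled on the CPU with the rough Python sketch below. It is not the ACO implementation. The helpers `perm_b32` and `sad_u8` are simplified stand-ins for the hardware instructions, the selector construction in `lane_selector` is a hypothetical way of building the mask, and the selector value `0x0C` for a constant zero byte is an assumption that should be checked against the ISA documentation. The sketch also assumes the scan is exclusive, with the total read from the lane just past the last wave.

```python
SEL_ZERO = 0x0C  # assumed selector value meaning "emit a 0x00 byte"

def pack_counts(counts):
    """Step 1: each wave writes its count into its own byte of one dword."""
    assert len(counts) <= 4 and all(0 <= c <= 64 for c in counts)
    packed = 0
    for wave, count in enumerate(counts):
        packed |= count << (8 * wave)
    return packed

def lane_selector(lane):
    """Step 3 (hypothetical): keep bytes below `lane`, zero the rest."""
    sel = 0
    for j in range(4):
        sel |= (j if j < lane else SEL_ZERO) << (8 * j)
    return sel

def perm_b32(src, sel):
    """Simplified byte permute: selectors 0..3 pick a byte of src."""
    out = 0
    for j in range(4):
        s = (sel >> (8 * j)) & 0xFF
        byte = 0 if s == SEL_ZERO else (src >> (8 * s)) & 0xFF
        out |= byte << (8 * j)
    return out

def sad_u8(a, b, acc):
    """Sum of absolute byte differences plus acc; with b == 0 it sums a's bytes."""
    diffs = (abs(((a >> (8 * j)) & 0xFF) - ((b >> (8 * j)) & 0xFF)) for j in range(4))
    return acc + sum(diffs)

def workgroup_scan(counts):
    packed = pack_counts(counts)  # steps 1 and 2
    # Step 4: lane i computes the sum of the counts of waves 0..i-1.
    per_lane = [sad_u8(perm_b32(packed, lane_selector(lane)), 0, 0) for lane in range(5)]
    # Step 5: "readlane" each wave's own lane, and the lane holding the total.
    num_waves = len(counts)
    return per_lane[:num_waves], per_lane[num_waves]
```

For example, `workgroup_scan([10, 64, 0, 37])` yields the per-wave offsets `[0, 10, 74, 74]` and the total `111`.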

NOTE: in Wave32 the basic idea is the same, but it's a bit more complicated because there are two dwords.
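
One way the two-dword case could work is sketched below, again as a CPU model in Python rather than the actual ACO code. It assumes up to 8 waves whose bytes are split across a low and a high dword, masks each dword separately, and chains two `sad_u8` calls through the accumulator operand. All helper names are hypothetical, and the `0x0C` zero selector is an assumption to be checked against the ISA documentation.

```python
SEL_ZERO = 0x0C  # assumed selector value meaning "emit a 0x00 byte"

def perm_b32(src, sel):
    """Simplified byte permute: selectors 0..3 pick a byte of src."""
    out = 0
    for j in range(4):
        s = (sel >> (8 * j)) & 0xFF
        byte = 0 if s == SEL_ZERO else (src >> (8 * s)) & 0xFF
        out |= byte << (8 * j)
    return out

def sad_u8(a, b, acc):
    """Sum of absolute byte differences plus acc; with b == 0 it sums a's bytes."""
    diffs = (abs(((a >> (8 * j)) & 0xFF) - ((b >> (8 * j)) & 0xFF)) for j in range(4))
    return acc + sum(diffs)

def keep_low_bytes_selector(n):
    """Selector keeping the n lowest bytes (n clamped to 0..4), zeroing the rest."""
    n = max(0, min(4, n))
    sel = 0
    for j in range(4):
        sel |= (j if j < n else SEL_ZERO) << (8 * j)
    return sel

def wave32_scan(counts):
    assert len(counts) <= 8 and all(0 <= c <= 32 for c in counts)
    padded = list(counts) + [0] * (8 - len(counts))
    lo = sum(c << (8 * j) for j, c in enumerate(padded[:4]))  # waves 0..3
    hi = sum(c << (8 * j) for j, c in enumerate(padded[4:]))  # waves 4..7
    per_lane = []
    for lane in range(9):
        kept_lo = perm_b32(lo, keep_low_bytes_selector(lane))
        kept_hi = perm_b32(hi, keep_low_bytes_selector(lane - 4))
        partial = sad_u8(kept_lo, 0, 0)               # sum of kept low bytes
        per_lane.append(sad_u8(kept_hi, 0, partial))  # add kept high bytes
    num_waves = len(counts)
    return per_lane[:num_waves], per_lane[num_waves]
```

With counts `[1, 2, 3, 4, 5, 6, 7, 8]` this model returns the offsets `[0, 1, 3, 6, 10, 15, 21, 28]` and the total `36`.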

