aco: Use packed bytes approach to optimize the workgroup scan further.
In the first version of the workgroup scan, each wave stored a single dword to LDS, then each lane loaded the dword from the corresponding wave and we used DPP to get the reduction and scan result. I changed this to the current SALU approach, on the assumption that DPP is costly while readlane+SALU is cheap.
However, the current approach in main also turns out to be suboptimal because:
- for Wave32 it requires a TON of SALU instructions
- according to RGP, readlane is not so cheap when the SALU has to wait for its result
- each wave only really has up to 64 (or 32) live vertices, so we waste a dword of LDS space when the count would fit in a byte
- each wave has to stall immediately after the LDS read
About the new version proposed in this MR:
- uses only 1 byte per wave, meaning Wave64 gets away with only 1 LDS dword in total (Wave32: 2 dwords)
- can do some of the processing before the wait for the LDS read
- has fewer instructions and lower latency overall
- each lane still calculates the reduction result for the corresponding wave, but this is done using byte and lane permute instructions instead of DPP additions

Compared to the current version in main, about 20 SALU instructions are removed at the cost of only 2 more VALU instructions.
NOTE: On the RGP screenshots, ignore the high latency around s_barrier and the instructions near it - that is just happenstance, not related to any change made here.
Explanation of the sequence:
- Each wave stores its result in a single byte - for Wave64 there are up to 4 waves, so 1 dword
- Load the 1 dword
- Compute a mask to be used for the byte-permute instruction `v_perm_b32`.
- Each lane computes the scan result for the corresponding wave by masking out the unneeded bytes using `v_perm_b32`, and then horizontally adding the remaining bytes using `v_sad_u8`.
- Use `v_readlane_b32` to read the current wave's result and the total.
NOTE: in Wave32 the basic idea is the same, but it's a bit more complicated because there are two dwords.