aco: Use packed bytes approach to optimize the workgroup scan further. (!11072) · Merge requests · Mesa / mesa

Timur Kristóf requested to merge Venemo/mesa:aco-packed-wg-scan into main May 28, 2021

In the first version of the workgroup scan, we had each wave store a single dword to LDS, then each lane loaded the dword from the corresponding wave and we used DPP to get the reduction and scan result. I changed this to the current SALU approach, where I assumed that DPP is costly and readlane+SALU is cheap.

However, the current approach in main also turns out to be suboptimal because:

- for Wave32 it requires a TON of SALU instructions
- according to RGP, readlane is not so cheap when the SALU has to wait for it
- each wave only really has up to 64 (or 32) live vertices, so we waste a dword of LDS space when it could fit in a byte
- each wave has to wait immediately after the LDS read

About the new version proposed in this MR:

Compared to the current version in main, about 20 SALU instructions are removed at the cost of only 2 more VALU instructions.

- uses only 1 byte / wave, meaning: Wave64 gets away with only 1 LDS dword (and Wave32: 2 dwords) in total
- can do some of the processing before the wait for the LDS read
- has overall fewer instructions and lower latency
- each lane calculates the reduction result for the corresponding wave, but this is done using byte and lane permute instructions instead of DPP additions
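To make the LDS saving concrete, here is a small arithmetic sketch. It assumes a maximum workgroup size of 256 invocations, which is inferred from the "up to 4 waves" figure for Wave64 and is not stated explicitly in this MR; `lds_bytes` is a hypothetical helper name used only for illustration.

```python
def lds_bytes(workgroup_size, wave_size, bytes_per_wave):
    """LDS bytes needed for the per-wave counts, rounded up to whole dwords."""
    num_waves = -(-workgroup_size // wave_size)  # ceiling division
    total = num_waves * bytes_per_wave
    return -(-total // 4) * 4                    # round up to a multiple of 4 bytes

# Assumed maximum workgroup size (not stated in the MR).
MAX_WORKGROUP_SIZE = 256

for wave_size in (64, 32):
    old = lds_bytes(MAX_WORKGROUP_SIZE, wave_size, 4)  # one dword per wave
    new = lds_bytes(MAX_WORKGROUP_SIZE, wave_size, 1)  # one byte per wave
    print(f"Wave{wave_size}: {old} bytes -> {new} bytes")
```

Under that assumption, the packed layout needs a quarter of the LDS space in both wave sizes.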

NOTE: On the RGP screenshots, ignore the high latency around s_barrier and the instructions near it - that is just happenstance, not related to any change made here.

Explanation of the sequence:

1. Each wave stores its result in a single byte - for Wave64 there are up to 4 waves, so 1 dword
2. Load the 1 dword
3. Compute a mask to be used for the byte-permute instruction v_perm_b32
4. Each lane computes the scan result for the corresponding wave by masking out the unneeded bytes using v_perm_b32, and then horizontally adding the remaining bytes using v_sad_u8.
5. Use v_readlane_b32 to read the current wave's result and the total.
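The steps above can be emulated on the CPU roughly as follows. This is only an illustrative sketch, not the ACO code from this MR: `perm_b32` and `sad_u8` are simplified Python models of v_perm_b32 and v_sad_u8 (only selector values 0-7 and the zero constant 0x0c are modeled), and the choice that lane i keeps bytes 0..i-1, with the lane after the last wave holding the total, is an assumption about how the mask is laid out.

```python
ZERO_SEL = 0x0C  # selector value modeled here as "produce a constant 0x00 byte"

def perm_b32(s0, s1, sel):
    """Simplified v_perm_b32 model: each selector byte picks a byte of {s0, s1} or zero."""
    src = s1.to_bytes(4, "little") + s0.to_bytes(4, "little")  # bytes 0-3 = s1, 4-7 = s0
    out = 0
    for j in range(4):
        s = (sel >> (8 * j)) & 0xFF
        if s <= 7:
            byte = src[s]
        elif s == ZERO_SEL:
            byte = 0
        else:
            raise ValueError("selector value not modeled in this sketch")
        out |= byte << (8 * j)
    return out

def sad_u8(s0, s1, acc):
    """v_sad_u8 model: acc plus the sum of absolute differences of the four byte pairs."""
    return acc + sum(
        abs(((s0 >> (8 * j)) & 0xFF) - ((s1 >> (8 * j)) & 0xFF)) for j in range(4)
    )

def workgroup_scan_wave64(counts):
    """counts: one value per wave (at most 4 waves, each value at most 64)."""
    assert len(counts) <= 4 and all(0 <= c <= 64 for c in counts)
    # Step 1: each wave stores one byte, so all waves share a single LDS dword.
    lds_dword = 0
    for wave, count in enumerate(counts):
        lds_dword |= count << (8 * wave)
    # Steps 2-4: every lane loads the dword, keeps only the bytes of the waves
    # before it, and horizontally adds them (SAD against zero is a byte sum).
    lane_results = []
    for lane in range(len(counts) + 1):
        sel = 0
        for j in range(4):
            sel |= (j if j < lane else ZERO_SEL) << (8 * j)
        kept = perm_b32(0, lds_dword, sel)
        lane_results.append(sad_u8(kept, 0, 0))
    # Step 5: wave w reads lane w (its exclusive prefix); the lane after the
    # last wave holds the workgroup total.
    return lane_results[:len(counts)], lane_results[len(counts)]

prefixes, total = workgroup_scan_wave64([64, 10, 0, 33])
print(prefixes, total)  # [0, 64, 74, 74] 107
```

Note that a single count fits in a byte, but the total can reach 256, which is why the horizontal add produces a full dword rather than another byte.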

NOTE: in Wave32 the basic idea is the same, but it's a bit more complicated because there are two dwords.
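One way the two-dword case could work is sketched below. This is an assumption for illustration, not the sequence actually emitted by this MR: the byte masking is modeled with a plain bit mask (`keep_low_bytes`, a hypothetical stand-in for the v_perm_b32 step), and the two partial sums are chained through the accumulator operand of v_sad_u8.

```python
def sad_u8(s0, s1, acc):
    """v_sad_u8 model: acc plus the sum of absolute differences of the four byte pairs."""
    return acc + sum(
        abs(((s0 >> (8 * j)) & 0xFF) - ((s1 >> (8 * j)) & 0xFF)) for j in range(4)
    )

def keep_low_bytes(dword, n):
    """Zero all but the lowest n bytes (n clamped to 0..4); stands in for the permute."""
    n = max(0, min(4, n))
    return dword & ((1 << (8 * n)) - 1)

def workgroup_scan_wave32(counts):
    """counts: one value per wave (at most 8 waves, each value at most 32)."""
    assert len(counts) <= 8 and all(0 <= c <= 32 for c in counts)
    # Eight one-byte counts occupy two LDS dwords.
    packed = 0
    for wave, count in enumerate(counts):
        packed |= count << (8 * wave)
    lo, hi = packed & 0xFFFFFFFF, packed >> 32
    results = []
    for lane in range(len(counts) + 1):
        # Sum the kept bytes of the low dword, then feed that sum in as the
        # accumulator while summing the kept bytes of the high dword.
        partial = sad_u8(keep_low_bytes(lo, lane), 0, 0)
        results.append(sad_u8(keep_low_bytes(hi, lane - 4), 0, partial))
    # Wave w reads lane w; the lane after the last wave holds the total.
    return results[:len(counts)], results[len(counts)]

prefixes, total = workgroup_scan_wave32([32, 1, 2, 3, 4, 5, 6, 7])
print(prefixes, total)  # [0, 32, 33, 35, 38, 42, 47, 53] 60
```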

Edited Jun 09, 2021 by Timur Kristóf
