aco: Slight optimization to workgroup exclusive scan and reduction
The exclusive workgroup scan is used by NGG GS (when the number of vertices and primitives isn't constant). I plan to use the same logic in NGG culling for repacking the vertices. Reading RadeonSI's code gave me the idea to get rid of the LDS bank conflicts here.
The exclusive scan works like this:
1. Every wave calculates the number of live vertices (by counting the 1 bits in a 64-bit lane mask)
2. The first lane of every wave writes this number to LDS
3. Every wave reads all the values from LDS, and uses them to produce two outputs:
   - Total number of vertices in the subgroup (workgroup reduction)
   - Which invocation will export the current thread's vertex (workgroup exclusive scan)
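To make the steps above concrete, here is a minimal CPU-side sketch in Python. It is only a model of the data flow, not the ACO code: the function names, the use of a plain list in place of LDS, and the reading of "which invocation will export" as "wave base plus the number of live lanes below this one" are all assumptions made for illustration.

```python
WAVE_SIZE = 64  # lanes per wave, matching the 64-bit lane mask described above


def popcount(mask):
    """Count the 1 bits in a lane mask."""
    return bin(mask).count("1")


def workgroup_exclusive_scan(live_masks):
    """Model the three steps for one workgroup (hypothetical helper).

    live_masks holds one live-lane mask per wave. Returns (total,
    export_index), where export_index[w][lane] is the invocation assumed
    to export the vertex of that lane, or None if the lane has no live
    vertex.
    """
    # Steps 1 and 2: each wave counts its live vertices and its first lane
    # stores the count in LDS (modelled here as a plain list).
    lds = [popcount(mask) for mask in live_masks]

    # Step 3: every wave reads all counts back.
    total = sum(lds)  # workgroup reduction
    export_index = []
    for wave_id, mask in enumerate(live_masks):
        # Workgroup exclusive scan: live vertices in all earlier waves.
        wave_base = sum(lds[:wave_id])
        row = []
        for lane in range(WAVE_SIZE):
            if (mask >> lane) & 1:
                # Live lanes below this one within the same wave.
                below = popcount(mask & ((1 << lane) - 1))
                row.append(wave_base + below)
            else:
                row.append(None)
        export_index.append(row)
    return total, export_index
```

For example, with two waves whose live masks are 0b1011 and 0b0110, the total is 5, the live lanes of the first wave map to invocations 0, 1 and 2, and those of the second wave map to 3 and 4.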
This MR improves the 3rd step.
Previously, every wave activated the first N lanes (N = number of waves in the workgroup), which read from LDS what was stored in the 2nd step. Then, the data was processed by VALU using DPP instructions.
With this MR, to avoid bank conflicts, the first lane reads all the data. It then uses v_readlane to move the data into scalar registers and processes it using SALU instructions only. Although the instruction count is slightly higher, the SALU instructions require fewer cycles, and this approach also eliminates branches.
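As a rough illustration of how scalar-only processing can avoid branches, the sketch below computes both outputs for one wave from the per-wave counts using a compare-and-select style update. This is an assumption about the general shape of such code, not the instruction sequence this MR emits; `reduce_and_scan_branchless` and its parameters are hypothetical names.

```python
def reduce_and_scan_branchless(wave_counts, wave_id):
    """Return (total, base) for one wave (hypothetical helper).

    wave_counts models the per-wave values after they have been moved into
    scalar registers; wave_id is the index of the current wave.
    """
    total = 0
    base = 0
    # On hardware the number of waves is known at compile time, so this
    # loop would be fully unrolled and is not a runtime branch.
    for i, count in enumerate(wave_counts):
        total += count
        # int(i < wave_id) is 1 for earlier waves and 0 otherwise, so the
        # multiply stands in for a scalar compare followed by a select.
        base += count * int(i < wave_id)
    return total, base
```

With counts [3, 2, 4, 1], wave 2 gets a total of 10 and a base of 5, while wave 0 gets the same total and a base of 0.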