aco: Slight optimization to workgroup exclusive scan and reduction

Timur Kristóf requested to merge Venemo/mesa:aco-ngg-wgscan-opt into master

The exclusive workgroup scan is used by NGG GS (when the number of vertices and primitives isn't constant). I plan to use the same logic in NGG culling to repack the vertices. Reading RadeonSI's code gave me the idea to get rid of the LDS bank conflicts here.

The exclusive scan works like this:

  1. Every wave calculates the number of live vertices (by counting 1 bits in a 64-bit lane mask)

  2. The first lane of every wave writes this number to LDS

  3. Every wave reads all the values from LDS and uses them to produce two outputs:

    • Total number of vertices in the workgroup (workgroup reduction)
    • Which invocation will export the current thread's vertex (workgroup exclusive scan)
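For illustration, the three steps above can be modelled on the CPU roughly as follows. This is only a sketch, not the ACO code: the names `wg_scan_result`, `write_wave_counts` and `scan_for_lane` are hypothetical, `lds` is a plain array standing in for LDS, and adding the lane's offset inside its own wave is an assumption about how the per-lane export index is completed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type for this sketch. */
struct wg_scan_result {
    uint32_t total;        /* workgroup reduction: live vertices in all waves */
    uint32_t export_index; /* workgroup exclusive scan for one lane */
};

/* Count the set bits of a 64-bit lane mask. */
static uint32_t popcount64(uint64_t x)
{
    uint32_t n = 0;
    while (x) {
        x &= x - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Steps 1 and 2: every wave counts its live lanes and stores the count. */
static void write_wave_counts(const uint64_t *live_masks, uint32_t num_waves,
                              uint32_t *lds)
{
    for (uint32_t w = 0; w < num_waves; w++)
        lds[w] = popcount64(live_masks[w]);
}

/* Step 3: combine the per-wave counts into the two outputs. */
static struct wg_scan_result scan_for_lane(const uint64_t *live_masks,
                                           const uint32_t *lds,
                                           uint32_t num_waves,
                                           uint32_t wave_id, uint32_t lane_id)
{
    assert(wave_id < num_waves && lane_id < 64);

    struct wg_scan_result r = {0, 0};
    for (uint32_t w = 0; w < num_waves; w++) {
        r.total += lds[w];
        if (w < wave_id)
            r.export_index += lds[w]; /* vertices of all earlier waves */
    }

    /* Assumption: live lanes below this lane in the same wave come first. */
    uint64_t below = live_masks[wave_id] & ((1ull << lane_id) - 1);
    r.export_index += popcount64(below);
    return r;
}
```

With live masks `0xF`, `0x5` and `0x0` for three waves, lane 2 of wave 1 sees a total of 6 and an export index of 5 in this model.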

This MR improves the 3rd step.

Previously, every wave activated the first N lanes (N = number of waves in the workgroup), which read from LDS what was stored in the 2nd step. Then, the data was processed by VALU using DPP instructions.

With this MR, to avoid bank conflicts, the first lane reads all the data. Then it uses v_readlane to load the data into scalar registers and processes it using SALU instructions only. Although the instruction count is slightly higher, the SALU instructions require fewer cycles, and this approach also eliminates branches.
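The branch-free scalar idea can be illustrated with a small CPU model. This is a sketch under assumptions, not the instruction sequence ACO emits: `MAX_WAVES`, `wg_scalar_result` and `scalar_scan` are hypothetical names, the fixed-length loop stands in for an unrolled run of scalar instructions, and the mask-and-add stands in for SALU select-style operations.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_WAVES 4 /* assumed upper bound on waves, for this sketch only */

/* Hypothetical result type for this sketch. */
struct wg_scalar_result {
    uint32_t total;     /* workgroup reduction */
    uint32_t wave_base; /* exclusive scan value for the given wave */
};

/* Process all per-wave counts with uniform (scalar) arithmetic and no
 * data-dependent branches: every slot is always visited, and a mask
 * decides whether its count contributes to each output. */
static struct wg_scalar_result scalar_scan(const uint32_t lds[MAX_WAVES],
                                           uint32_t num_waves,
                                           uint32_t wave_id)
{
    assert(num_waves <= MAX_WAVES && wave_id < num_waves);

    struct wg_scalar_result r = {0, 0};
    for (uint32_t w = 0; w < MAX_WAVES; w++) {
        uint32_t count = lds[w];
        /* All ones when the condition holds, zero otherwise. */
        uint32_t in_workgroup = 0u - (uint32_t)(w < num_waves);
        uint32_t before_me = 0u - (uint32_t)(w < wave_id);
        r.total += count & in_workgroup;
        r.wave_base += count & before_me;
    }
    return r;
}
```

The loop bound is a compile-time constant, so the trip count does not depend on the number of waves; unused slots are simply masked out.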
