aco: Implement NGG GS
This MR makes it possible for ACO to support NGG GS and enables it by default in RADV. To achieve this goal, some new features are added to NIR, some existing code is reorganized in ACO, and finally the actual NGG GS specifics are added.
- NIR gains the ability to count emitted primitives, and the vertices in each primitive.
- NIR can now also filter out incomplete GS primitives. The NGG HW doesn't accept incomplete primitives, so I'm really happy that this could be done in NIR and that ACO doesn't have to deal with it.
- Loop creation helpers are added to ACO, thanks to Rhys.
- Some of the ACO NGG VS/TES code is reorganized to be easier to follow and be more consistent with the new code.
- GS threads store their output vertices in LDS, then at the end, vertex compaction is done and each thread exports one vertex (and maybe a primitive). For the vertex compaction, ACO now implements a workgroup reduction / scan.
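To illustrate the vertex compaction step, here is a minimal CPU-side sketch of how an exclusive scan over per-slot "live" flags yields the compacted index of each vertex. This is not the ACO implementation: the function names `exclusive_scan_flags` and `compact_vertices` are hypothetical, vertices are stood in for by plain `int` values, and the sequential loop only models what the workgroup scan computes in parallel across threads.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not an ACO function): exclusive prefix sum over
// per-slot "live" flags. offsets[i] is the number of live slots before
// slot i, i.e. the compacted index of slot i if it is live.
std::vector<uint32_t> exclusive_scan_flags(const std::vector<uint8_t>& live)
{
    std::vector<uint32_t> offsets(live.size());
    uint32_t running = 0;
    for (std::size_t i = 0; i < live.size(); ++i) {
        offsets[i] = running;
        running += live[i] ? 1u : 0u;
    }
    return offsets;
}

// Hypothetical helper: move every live vertex to its compacted position.
// On the GPU each slot would be handled by its own thread; here a loop
// stands in for the threads of the workgroup.
std::vector<int> compact_vertices(const std::vector<int>& vertices,
                                  const std::vector<uint8_t>& live)
{
    const std::vector<uint32_t> offsets = exclusive_scan_flags(live);

    // Total live count = last offset + last flag (0 for empty input).
    std::size_t total = 0;
    if (!live.empty())
        total = offsets.back() + (live.back() ? 1u : 0u);

    std::vector<int> compacted(total);
    for (std::size_t i = 0; i < vertices.size() && i < live.size(); ++i) {
        if (live[i])
            compacted[offsets[i]] = vertices[i];
    }
    return compacted;
}
```

With live flags `{1, 0, 1, 1, 0}`, the scan produces offsets `{0, 1, 1, 2, 3}`, so the three live vertices land in slots 0, 1 and 2 and each of the first three threads has exactly one vertex to export.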
A few optimizations are included:
- Early GS space allocation when we know the vertex and primitive counts in advance.
- Use the trick from RadeonSI and RADV/LLVM which helps reduce LDS bank conflicts.
Future work, not part of this MR:
- Allocate less GS space by compacting primitives (not just vertices).
- Combine LDS loads and stores when possible.
- Use the new NIR GS options in RADV/LLVM and possibly RadeonSI.
ACO is still missing NGG streamout, but RADV doesn't use NGG anyway when streamout is enabled. With this MR, the CTS results are the same as before.