radv: Implement NGG culling
NGG culling: Quick summary
This MR implements NGG culling on top of the NGG lowering that was merged in MR !10740.
What is NGG?
"NGG" (Next Generation Geometry) is a feature on Navi GPUs which adds a GS-like hardware stage that is directly connected to the parameter cache (where PS inputs are loaded from). We use this HW stage to run VS, TES or GS (and in the future, MS). The hardware makes it possible for shaders running on this stage to modify not only vertices but also primitives (eg. triangles).
What is NGG culling?
Culling is the process of removing unneeded triangles. This is traditionally done by the rasterizer, but since that is fixed-function hardware, it can become a bottleneck when an app creates a large number of primitives. A large number of tiny triangles in particular reduces the rasterization efficiency of GCN/RDNA GPUs. NGG makes it possible for the shader to check primitives and delete those that it can prove are not needed.
How to test
Set RADV_PERFTEST=nggc to enable the feature. It is not enabled by default.
What is supported by the RADV implementation in this MR?
- Vertex and tess eval shaders
- All triangle-based primitive topologies (triangle lists, strips, fans, etc.)
- Culling algorithms:
- Front/back face culling: removes triangles based on whether their front or back side is shown to the camera.
- Frustum culling: removes triangles that are completely outside the viewport.
- Small primitive culling: removes triangles that are so small that they wouldn't be visible in any pixel.
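The three tests listed above can be illustrated with a small CPU-side sketch. This is not the code from this MR; it only shows the idea behind each test. The function names, the counter-clockwise front face default, the [-1, 1] NDC range, the restriction to X and Y, and the pixel centers at half-integer coordinates are all assumptions made for the illustration.

```python
import math

def signed_area2(a, b, c):
    """Twice the signed area of triangle abc; positive means counter-clockwise."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])

def cull_face(tri, cull_front, cull_back, front_is_ccw=True):
    """Face culling: reject a triangle based on which side faces the camera."""
    area = signed_area2(*tri)
    if area == 0:
        return True  # assumption: zero-area triangles cover nothing, so drop them
    is_front = (area > 0) == front_is_ccw
    return cull_front if is_front else cull_back

def cull_frustum(tri_ndc):
    """Frustum culling: reject only if all three vertices lie beyond the same edge."""
    for axis in (0, 1):  # assumption: only X and Y are checked in this sketch
        if all(v[axis] < -1.0 for v in tri_ndc) or all(v[axis] > 1.0 for v in tri_ndc):
            return True
    return False

def cull_small(tri_px):
    """Small primitive culling: reject if the bounding box contains no pixel
    center along at least one axis (centers assumed at integer + 0.5)."""
    for axis in (0, 1):
        lo = min(v[axis] for v in tri_px)
        hi = max(v[axis] for v in tri_px)
        if math.floor(hi - 0.5) < math.ceil(lo - 0.5):
            return True
    return False
```

Each function returns True when the triangle can be removed. All three tests are conservative: a triangle is only rejected when the test proves it cannot contribute to any pixel.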
The basic algorithm is the same as what RadeonSI also uses. This is called "deferred attribute shading". It not only implements culling, but also reduces bandwidth use, because only the vertices of the primitives that survive the culling will compute outputs other than position, and only they execute any memory load needed for these other output attributes.
- The VS (or TES) is butchered into two parts:
  1. Top part: ES vertex threads compute only the position output and store that to LDS.
  2. Culling code: GS threads load the positions of each vertex that belongs to their triangle and decide whether to accept or cull the triangle.
  3. Surviving vertices are repacked if needed.
  4. Bottom part: ES threads of the repacked, surviving vertices compute the other outputs.
- The difference between this and RadeonSI's implementation is that the compiled shaders always contain all the culling code, and check at runtime what they should do. They can either run the culling algorithms or skip the culling code completely.
- A new shader argument is added that tells the shaders what to do.
- When culling is not enabled, the shaders not only jump over the culling code (steps 2 and 3 above) but also do not require the LDS space that the culling algorithm needs.
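The flow described above (position-only top part, culling, repacking, attribute-computing bottom part, plus a runtime switch that skips the culling) can be modelled on the CPU roughly as follows. This is a simplified sketch and not the shader code of this MR: `run_ngg_wave` and its parameters are hypothetical names, Python lists stand in for LDS and for the threads of a wave, and `accept_fn` stands in for whichever culling tests are active.

```python
def run_ngg_wave(vertices, triangles, position_fn, attributes_fn,
                 accept_fn, culling_enabled):
    """Model one wave: returns (repacked triangles, per-vertex outputs)."""
    # Top part: every vertex computes only its position.
    positions = [position_fn(v) for v in vertices]

    if culling_enabled:
        # Culling code: each triangle looks up its three positions and
        # decides whether it survives.
        kept = [t for t in triangles
                if accept_fn(tuple(positions[i] for i in t))]
        # Only vertices referenced by a surviving triangle stay alive.
        alive = sorted({i for t in kept for i in t})
    else:
        # Runtime switch: skip culling and repacking, keep everything.
        kept = list(triangles)
        alive = list(range(len(vertices)))

    # Repacking: surviving vertices get new, dense indices.
    remap = {old: new for new, old in enumerate(alive)}
    out_tris = [tuple(remap[i] for i in t) for t in kept]

    # Bottom part: only surviving vertices compute the other outputs.
    out_verts = [(positions[old], attributes_fn(vertices[old])) for old in alive]
    return out_tris, out_verts
```

In this model, `attributes_fn` is called once per surviving vertex, which is where the bandwidth saving of deferred attribute shading comes from: culled vertices never run the loads needed for their non-position outputs.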
The advantage of this approach is that we can get away with compiling only 1 shader variant which "does it all".
- No need to compile and cache multiple shader variants for each combination of culling settings.
- No need to compile shaders in draw calls, or even asynchronously.
- The trade-off is that the resulting shader is slightly less efficient than the original when culling is disabled.
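To illustrate how a single shader variant can serve every combination of culling settings, the sketch below packs the settings into one integer that could be passed as a shader argument and checked at runtime. The bit layout and the names `CullSettings`, `pack_settings` and `culling_needed` are purely hypothetical and do not reflect the actual argument layout used by RADV.

```python
from enum import IntFlag

class CullSettings(IntFlag):
    """Hypothetical bit layout for the runtime culling settings."""
    NONE = 0
    CULL_FRONT = 1
    CULL_BACK = 2
    FRONT_IS_CCW = 4
    FRUSTUM = 8
    SMALL_PRIMS = 16

def pack_settings(cull_front=False, cull_back=False, front_is_ccw=False,
                  frustum=False, small_prims=False):
    """Driver side: turn the current pipeline state into one integer."""
    flags = CullSettings.NONE
    if cull_front:
        flags |= CullSettings.CULL_FRONT
    if cull_back:
        flags |= CullSettings.CULL_BACK
    if front_is_ccw:
        flags |= CullSettings.FRONT_IS_CCW
    if frustum:
        flags |= CullSettings.FRUSTUM
    if small_prims:
        flags |= CullSettings.SMALL_PRIMS
    return int(flags)

def culling_needed(settings):
    """Shader side: decide at runtime whether to run or skip the culling code."""
    active = (CullSettings.CULL_FRONT | CullSettings.CULL_BACK |
              CullSettings.FRUSTUM | CullSettings.SMALL_PRIMS)
    return (settings & int(active)) != 0
```

Because the settings arrive as data rather than being baked into the shader, changing the cull mode between draws only changes the value of this argument and never triggers a recompile.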
A new debug environment variable
RADV_PERFTEST=nggc is added, which allows RADV to compile culling-capable shaders.