New Tessellation Control Shader Optimizations for NIR: Initial Spec

nir_restructure_tcs_flow

This pass restructures a tessellation control shader by dividing it into two parts:

gl_TessLevel computation
All other outputs

The motivation is to skip computing all the other outputs when we know that the patch is going to be culled. Final form:

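{
   // Part 1: tessellation levels, visible to every invocation.
   float gl_TessLevels[6] = get_tess_levels_for_all_invocations();
   store_output_tess_levels(gl_TessLevels);

   // The cut-off depends on the TES spacing mode.
   bool not_killed =
      spacing == equal ? all(greaterThanEqual(gl_TessLevels, 0.5)) :
                         all(greaterThan(gl_TessLevels, 0));
   // Part 2: all other outputs, only for patches that survive.
   if (not_killed) {
      compute_outputs_for_tes();
      store_outputs_for_tes();
   }
}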

Tessellation levels are computed first and broadcast to all invocations. If any tessellation level is so low that it kills the patch, the rest of the shader is skipped.

Sometimes the tessellation levels are computed by all invocations and don't have to be broadcast, which saves us a barrier.

Output stores are moved to the end of the conditional block to make them faster, since some hardware benefits when stores are next to each other.

tess_primitive_mode from TES determines how many tessellation level elements to compare. gl_tess_spacing from TES determines the cut-off value for culling patches. Both must be known at compile time of TCS, though worst-case values can be assumed or guessed.
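As a rough illustration of how the two TES properties feed the culling test, here is a minimal CPU-side sketch in C. The enums and function names are hypothetical stand-ins, not Mesa's real definitions. The cut-off values are the ones given in this spec. Checking only the outer levels that the primitive mode uses is an assumption of the sketch (it follows the OpenGL rule that patches are discarded based on relevant outer levels); the spec text above does not say which elements take part.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the TES properties named in the text.
 * These are NOT Mesa's real enum definitions. */
enum sketch_prim_mode { SKETCH_ISOLINES, SKETCH_TRIANGLES, SKETCH_QUADS };
enum sketch_spacing {
   SKETCH_EQUAL,
   SKETCH_FRACTIONAL_ODD,
   SKETCH_FRACTIONAL_EVEN
};

/* Number of outer tessellation levels the primitive mode uses. */
static unsigned sketch_num_outer_levels(enum sketch_prim_mode mode)
{
   switch (mode) {
   case SKETCH_ISOLINES:  return 2;
   case SKETCH_TRIANGLES: return 3;
   case SKETCH_QUADS:     return 4;
   }
   /* Not reached for valid values; comparing nothing never kills a patch. */
   return 0;
}

/* Returns true if the patch survives. Cut-offs follow this spec:
 * >= 0.5 for equal spacing, > 0 otherwise. A NaN level fails both
 * comparisons, so the patch is treated as killed. */
static bool sketch_patch_not_killed(const float outer[4],
                                    enum sketch_prim_mode mode,
                                    enum sketch_spacing spacing)
{
   unsigned count = sketch_num_outer_levels(mode);

   for (unsigned i = 0; i < count; i++) {
      bool ok = spacing == SKETCH_EQUAL ? outer[i] >= 0.5f
                                        : outer[i] > 0.0f;
      if (!ok)
         return false;
   }
   return true;
}
```

With levels {1, 1, 1, 0.25}, this sketch kills a quad patch under equal spacing, but keeps it under fractional spacing, and also keeps a triangle patch because the fourth outer level is not compared.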

If memory stores are present, the pass will give up.

The implementation can follow ac_nir_lower_ngg_nogs, which does something similar to compute the VS position, cull in the shader, and then compute non-position outputs conditionally. It works by cloning the whole shader at its end, removing undesirable output stores in each part, and running DCE to end up with a separate position shader and varyings shader in different blocks. In this case, we should end up with a separate gl_TessLevel shader in its own block.

nir_opt_barriers - necessary changes

If the above optimization uses cloning and DCE, it may end up with unnecessary barriers that should be eliminated.

The basic idea is as follows, applied separately to each part. If an output load is preceded by two barriers and there are no output loads between those two barriers, the first barrier should be removed. If there is no output load after the last barrier, that barrier should also be removed. Cross-invocation stores might suppress this.
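One possible reading of this rule, sketched in C on a toy straight-line instruction list: a barrier is kept only if at least one output load appears between it and the next barrier (or the end of the part). The types and names are hypothetical and are not part of NIR. The sketch deliberately ignores control flow and the cross-invocation store case mentioned above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy instruction kinds; real NIR instructions are richer. */
enum sketch_op { SKETCH_OTHER, SKETCH_BARRIER, SKETCH_OUTPUT_LOAD };

/* Marks removable barriers in one part of the shader.
 * remove[i] is set to true for every barrier that has no output load
 * between it and the next barrier (or the end of the list).
 * Returns the number of barriers marked for removal. */
static size_t sketch_mark_redundant_barriers(const enum sketch_op *ops,
                                             size_t count, bool *remove)
{
   size_t removed = 0;
   /* True if an output load was seen after the current position and
    * before the next barrier. */
   bool load_seen = false;

   /* Walk backwards so each barrier already knows what follows it. */
   for (size_t i = count; i-- > 0;) {
      remove[i] = false;

      if (ops[i] == SKETCH_OUTPUT_LOAD) {
         load_seen = true;
      } else if (ops[i] == SKETCH_BARRIER) {
         if (!load_seen) {
            remove[i] = true;
            removed++;
         }
         load_seen = false;
      }
   }
   return removed;
}
```

For the sequence barrier, other, barrier, output load, barrier, other, the sketch marks the first and last barriers and keeps the middle one, which is the only barrier that protects an output load.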

nir_inline_vs_in_tcs

If the number of input and output control points is the same between VS and TCS, VS can be inlined into TCS.

The motivation is to save memory bandwidth: VS output computations, and the VBO and UBO loads they depend on, can be skipped for patches that are culled by gl_TessLevel, as long as they don't contribute to the computation of gl_TessLevel. The idea is to run this pass first, and then run the restructuring pass (nir_restructure_tcs_flow) to move the VS code into the conditional block at the end of TCS.

It's assumed that all VS resources (VBOs, UBOs, etc.) are available in TCS. The intrinsics accessing those resources will need a flag indicating that they should use VS bindings, not TCS bindings. load_input will be legal in TCS, indicating a VS input load.

TCS inputs are not always indexed by gl_InvocationID. Such cross-invocation accesses will be handled by passing those values via new TCS outputs.

We should end up with an empty shader in the VS stage (or just passing VS-only system values to TCS if they are not available in TCS). TCS might end up with more outputs if it had cross-invocation access of TCS inputs at the beginning.

Edited Sep 18, 2024 by Marek Olšák
