New Tessellation Control Shader Optimizations for NIR: Initial Spec
nir_restructure_tcs_flow
This pass restructures a tessellation control shader by dividing it into 2 parts:
- gl_TessLevel computation
- All other outputs
The motivation is to skip computing all the other outputs when we know that the patch is going to be culled. Final form:
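{
   /* Part 1: compute the tessellation levels and make them visible to all invocations. */
   float gl_TessLevels[6] = get_tess_levels_for_all_invocations();
   store_output_tess_levels(gl_TessLevels);

   /* The patch survives only if every compared level passes the cut-off. */
   bool not_killed =
      spacing == equal ? all(greaterThanEqual(gl_TessLevels, 0.5)) :
                         all(greaterThan(gl_TessLevels, 0));

   /* Part 2: all other outputs, skipped when the patch is culled. */
   if (not_killed) {
      compute_outputs_for_tes();
      store_outputs_for_tes();
   }
}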
Tessellation levels are computed first and broadcast to all invocations. If any tessellation level is so low that it kills the patch, the rest of the shader is skipped.
Sometimes the tessellation levels are computed by all invocations and don't have to be broadcast, which saves us a barrier.
Output stores are moved to the end of the conditional block so that they sit next to each other, which is faster on some hardware.
tess_primitive_mode from TES determines how many tessellation level elements to compare. gl_tess_spacing from TES determines the cut-off value for culling patches. Both must be known at compile time of TCS, though worst-case values can be assumed or guessed.
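As a rough illustration of how the primitive mode and spacing feed the cull test, here is a minimal CPU-side sketch in C. The enums and function names are hypothetical and are not Mesa definitions. The per-mode element counts assume that only the outer levels are compared (2 for isolines, 3 for triangles, 4 for quads), which is an assumption of this sketch rather than something this spec states. The cut-off values are the ones from the pseudocode above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical enums for this sketch; not the Mesa definitions. */
enum sketch_prim_mode {
   SKETCH_PRIM_ISOLINES,
   SKETCH_PRIM_TRIANGLES,
   SKETCH_PRIM_QUADS,
};

enum sketch_spacing {
   SKETCH_SPACING_EQUAL,
   SKETCH_SPACING_FRACTIONAL_ODD,
   SKETCH_SPACING_FRACTIONAL_EVEN,
};

/* Assumption: only outer tessellation levels take part in the comparison. */
static unsigned
sketch_num_levels_to_compare(enum sketch_prim_mode mode)
{
   switch (mode) {
   case SKETCH_PRIM_ISOLINES:  return 2;
   case SKETCH_PRIM_TRIANGLES: return 3;
   case SKETCH_PRIM_QUADS:     return 4;
   }
   /* Unknown mode: compare nothing, so this sketch never culls. */
   return 0;
}

/* Returns true if the patch survives, mirroring "not_killed" above.
 * NaN levels fail both comparisons and therefore kill the patch. */
static bool
sketch_patch_not_killed(const float *outer_levels,
                        enum sketch_prim_mode mode,
                        enum sketch_spacing spacing)
{
   unsigned n = sketch_num_levels_to_compare(mode);
   assert(n <= 4);

   for (unsigned i = 0; i < n; i++) {
      bool ok = spacing == SKETCH_SPACING_EQUAL ? outer_levels[i] >= 0.5f
                                                : outer_levels[i] > 0.0f;
      if (!ok)
         return false;
   }
   return true;
}
```

Because the element count comes from the TES primitive mode, a level that is irrelevant for the current mode (for example the fourth outer level with triangles) cannot cull the patch by accident.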
If memory stores are present, the pass will give up.
The implementation can follow ac_nir_lower_ngg_nogs, which does something similar to compute the VS position, cull in the shader, and then compute non-position outputs conditionally. It works by cloning the whole shader at its end, removing undesirable output stores in each part, and running DCE to end up with a separate position shader and varyings shader in different blocks. In this case, we should end up with a separate gl_TessLevel shader in its own block.
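The clone, strip stores, and DCE idea can be illustrated on a toy instruction list. The sketch below is in C and does not use any NIR API: the instruction struct, the kinds, and the function name are all hypothetical. Instead of physically cloning, it computes for one part which instructions would survive if every store of the other kind were removed and dead code were eliminated. Running it once per store kind gives the two parts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NO_SRC (-1)

/* Hypothetical instruction kinds for this toy model. */
enum sketch_kind {
   SKETCH_ALU,
   SKETCH_STORE_TESS_LEVEL,
   SKETCH_STORE_OTHER_OUTPUT,
};

struct sketch_instr {
   enum sketch_kind kind;
   int src[2]; /* indices of earlier instructions, or SKETCH_NO_SRC */
};

/* Marks in live[] the instructions that remain in the part that keeps only
 * stores of kind "keep". Returns the number of live instructions.
 * Sources always precede their users, so one reverse pass is enough. */
static size_t
sketch_extract_part(const struct sketch_instr *prog, size_t count,
                    enum sketch_kind keep, bool *live)
{
   size_t num_live = 0;

   /* Step 1: the kept stores are the roots; everything else starts dead. */
   for (size_t i = 0; i < count; i++)
      live[i] = prog[i].kind == keep;

   /* Step 2: DCE by propagating liveness from users to their sources. */
   for (size_t i = count; i-- > 0;) {
      if (!live[i])
         continue;
      num_live++;
      for (int s = 0; s < 2; s++) {
         int src = prog[i].src[s];
         if (src != SKETCH_NO_SRC) {
            assert(src >= 0 && (size_t)src < i);
            live[src] = true;
         }
      }
   }
   return num_live;
}
```

Computations feeding both kinds of stores end up live in both parts, which matches what cloning followed by DCE would produce: shared work is duplicated, and everything else lands in exactly one part.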
nir_opt_barriers - necessary changes
If the above optimization uses cloning and DCE, it may end up with unnecessary barriers that should be eliminated.
The basic idea is this. Separately for each part: If an output load is preceded by 2 barriers and there are no output loads between the 2 barriers, the first barrier should be removed. If there is no output load after the last barrier, that barrier should also be removed. Cross-invocation stores might suppress this.
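Both removal rules reduce to a single condition: a barrier is worth keeping only if an output load appears after it and before the next barrier. The following C sketch applies that condition to a flat list of operations. It is a simplified model with hypothetical names. It ignores control flow and does not model the cross-invocation store case mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical operation kinds for a straight-line model of one shader part. */
enum sketch_op {
   SKETCH_OP_OTHER,
   SKETCH_OP_BARRIER,
   SKETCH_OP_LOAD_OUTPUT,
};

/* Compacts ops[] in place, dropping every barrier that is not followed by an
 * output load before the next barrier (or the end). Returns the new length. */
static size_t
sketch_remove_redundant_barriers(enum sketch_op *ops, size_t count)
{
   size_t out = 0;

   for (size_t i = 0; i < count; i++) {
      if (ops[i] == SKETCH_OP_BARRIER) {
         bool needed = false;

         /* Look ahead until the next barrier or the end of the part. */
         for (size_t j = i + 1; j < count && ops[j] != SKETCH_OP_BARRIER; j++) {
            if (ops[j] == SKETCH_OP_LOAD_OUTPUT) {
               needed = true;
               break;
            }
         }
         if (!needed)
            continue; /* drop this barrier */
      }
      /* out never exceeds i, so this write cannot disturb the look-ahead. */
      assert(out <= i);
      ops[out++] = ops[i];
   }
   return out;
}
```

For the sequence other, barrier, other, barrier, load, barrier, other, the first barrier is dropped because another barrier follows before any output load, and the last barrier is dropped because no output load follows it.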
nir_inline_vs_in_tcs
If the number of input and output control points is the same between VS and TCS, VS can be inlined into TCS.
The motivation is that VS output computations, and the associated VBO and UBO loads in VS, that don't contribute to the computation of gl_TessLevel can be skipped for patches that are culled by gl_TessLevel, which saves memory bandwidth. The idea is to run this pass first, and then run the restructuring pass to move the VS code into the conditional block at the end of TCS.
It's assumed that all VS resources (VBOs, UBOs, etc.) are available in TCS. The intrinsics accessing those resources will need a flag indicating that they should use VS bindings, not TCS bindings. load_input will be legal in TCS, indicating a VS input load.
TCS inputs might not always be accessed by gl_InvocationID. Such cross-invocation access will be handled by passing those values via new TCS outputs.
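To make the two conditions concrete, here is a small C sketch of the bookkeeping such a pass might do before inlining. All names are hypothetical, and the sketch assumes that one new TCS output is needed per input slot that is read with a vertex index other than gl_InvocationID. That mapping is an assumption for illustration, not something this spec fixes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One TCS input read in this toy model (hypothetical). */
struct sketch_input_access {
   unsigned slot;                /* input slot being read, below 64 here */
   bool vertex_is_invocation_id; /* true if indexed by gl_InvocationID */
};

/* Inlining is only considered when the control point counts match. */
static bool
sketch_can_inline_vs(unsigned input_control_points,
                     unsigned output_control_points)
{
   return input_control_points == output_control_points;
}

/* Counts the distinct input slots that are read cross-invocation and would
 * therefore have to be passed through a new TCS output (assumption: one new
 * output per slot). */
static unsigned
sketch_count_new_outputs(const struct sketch_input_access *accesses,
                         size_t count)
{
   unsigned long long seen = 0;
   unsigned num_new = 0;

   for (size_t i = 0; i < count; i++) {
      if (accesses[i].vertex_is_invocation_id)
         continue; /* becomes a plain VS computation after inlining */

      assert(accesses[i].slot < 64);
      unsigned long long bit = 1ull << accesses[i].slot;
      if (!(seen & bit)) {
         seen |= bit;
         num_new++;
      }
   }
   return num_new;
}
```

Reads indexed by gl_InvocationID need no extra output because the inlined VS code computes that value in the same invocation. Only the cross-invocation reads add to the TCS output count.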
We should end up with an empty shader in the VS stage (or just passing VS-only system values to TCS if they are not available in TCS). TCS might end up with more outputs if it had cross-invocation access of TCS inputs at the beginning.