NIR: New Varying Linking Optimization Pass - Design Document
Disclaimer
This is a new varying linking/optimization pass that does a lot more than the passes everybody currently uses. It can be used as an alternative to 4+ existing passes, and it only operates on store_output/load_input intrinsics and friends.
I'm pretty far in the implementation and everything is done except cross-shader code motion, which I've started and know how to do. shader-db results so far are promising. This exact text will be in the code.
The first MR will only enable it for radeonsi and it will be executed by the GLSL linker after varying linking as a pure optimization pass. Further integration into the GLSL linker can be done in the future.
Some aspects may be geared towards AMD hw, and possible modifications to accommodate other drivers may follow after merging.
The functionality described here is what is planned to appear in the first MR.
Introduction
This pass optimizes varyings between 2 shaders, which means dead input/
output removal, constant and uniform load propagation, deduplication,
compaction, and cross-shader code motion. This is used during the shader
linking process.
Notes on behavior
The pass operates on scalar varyings using 32-bit and 16-bit types.
Vector varyings are not allowed.
Indirectly-indexed varying slots (not vertices) are not optimized or
compacted, but unused slots of indirectly-indexed varyings will still
be filled with directly-indexed varyings during compaction. Indirectly-
indexed varyings are still removed if they are unused by the other shader.
Indirectly-indexed vertices are always optimized. They are common in TCS,
TES, and GS. More on that below.
Transform feedback never prevents any optimizations such as constant
propagation and compaction.
All legacy varying slots are optimized when it's allowed.
Convergence property of shader outputs
If an output stores an SSA value that is convergent, all stores of that
output appear in unconditional blocks or in conditional blocks with
a convergent entry condition, the shader is not a GS, and the output is
not flat, then all vertices of that output have the same value, so
the output can be promoted to flat because all interpolation
modes lead to the same result as flat. Such outputs are opportunistically
compacted with both flat and non-flat varyings based on whichever has
unused slots in their vec4s. This pass refers to such inputs, outputs, and
varyings as "convergent". The only case when it doesn't work is when
vertices of the same primitive are stored by different subgroups and a vote
instruction is used to determine the output value. In that case, the output
will be treated as divergent by this pass even if NIR sees it as convergent.
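The flat promotion above rests on a simple fact: barycentric weights sum to 1, so if all vertices of a primitive store the same value, every interpolation mode returns that value. A small Python sketch (illustrative only, not part of the pass):

```python
# Sketch: why a convergent output can be promoted to flat.
# Barycentric weights (i, j, 1 - i - j) always sum to 1, so if all
# vertices store the same value v, interpolation returns exactly v.

def interp(v0, v1, v2, i, j):
    """Barycentric interpolation at coordinates (i, j)."""
    return v0 * i + v1 * j + v2 * (1.0 - i - j)

def flat(v0, v1, v2):
    """Flat shading: take the provoking vertex's value."""
    return v0

# A convergent output: all vertices store the same value.
v = 3.5
for (i, j) in [(0.2, 0.3), (0.0, 1.0), (0.6, 0.1)]:
    assert abs(interp(v, v, v, i, j) - flat(v, v, v)) < 1e-9
```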
Optimization steps
1. Determine which varying slots can be optimized and how.
- When a varying is said to be "optimized" in the following text, it
means all optimizations are performed, such as removal, constant
propagation, and deduplication.
- All VARn, PATCHn, and FOGC varyings are always optimized and
compacted.
- PRIMITIVE_ID is treated as VARn in (GS, FS).
- TEXn are removed if they are dead (except TEXn inputs, which can't be
removed due to being affected by the coord replace state). TEXn also
can't be optimized or compacted due to being affected by the coord
replace state.
- COLn and BFCn only propagate constants if they are between 0 and 1
because of the clamp vertex color state, and they are only
deduplicated and compacted among themselves because they are affected
by the flat shade, provoking vertex, two-side color selection, and
clamp vertex color states.
- All system value outputs like POS, PSIZ, CLIP_DISTn, etc. can't be
removed, but they are demoted to sysval-only outputs by setting
the no_varying bit (i.e. they can be removed as varyings), so drivers
should look at the no_varying bit. If an output is not a sysval output
in a specific stage (such as POS in TCS), it's fully optimized and
compacted.
- TESS_LEVEL_* inputs in TES can't be touched if TCS is missing.
2. Remove unused inputs and outputs
- Outputs not used in the next shader are removed.
- Inputs not initialized by the previous shader are replaced with undef.
- Output loads that have no output stores anywhere in the shader are
replaced with undef. (for TCS, though it works with any shader)
- Output stores with transform feedback are preserved, but get
the "no_varying" flag. Later, transform-feedback-only varyings are
compacted (relocated) such that they are always last.
3. Constant, uniform, UBO load, and uniform expression propagation
- Define "uniform expressions" as ALU expressions only sourcing
constants, uniforms, and UBO loads.
- Constants, uniforms, UBO loads, and uniform expressions stored
in outputs are moved into the next shader, and the outputs are removed.
- The same propagation is done from output stores to output loads.
(for TCS, though it works with any shader)
- If there are multiple stores to the same output, all such stores
should store the same constant, uniform, UBO load, or uniform
expression for the expression to be propagated. If an output has
multiple vertices, all vertices should store the same expression.
- nir->options has callbacks that drivers can set to estimate the cost
of uniform expressions, controlling the complexity of the uniform
expressions that will be propagated. This is to ensure that we don't
increase the GPU overhead measurably by moving code across pipeline
stages that amplify GPU work.
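The "all stores must agree" rule can be sketched in a few lines of Python (a toy model of the decision, not the pass; the helper name is hypothetical):

```python
# Toy sketch of constant propagation across a shader boundary: if every
# store of an output writes the same constant, the consumer's load is
# replaced with that constant and the output is removed; otherwise the
# output is kept.

def propagate_constants(output_stores):
    """output_stores: output name -> list of stored constants (one per
    store / per vertex). Returns (kept outputs, name -> constant)."""
    kept, replaced = {}, {}
    for name, values in output_stores.items():
        if len(set(values)) == 1:  # all stores agree on one constant
            replaced[name] = values[0]
        else:
            kept[name] = values
    return kept, replaced

stores = {"out0": [1.0, 1.0], "out1": [2.0, 3.0]}
kept, replaced = propagate_constants(stores)
assert replaced == {"out0": 1.0}  # out0 removed, constant moved forward
assert list(kept) == ["out1"]     # out1 stores disagree, so it stays
```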
4. Remove duplicated output components
- By comparing SSA defs.
- If there are multiple stores to the same output, all such stores
should store the same SSA as all stores of another output for
the output to be considered duplicated. If an output has multiple
vertices, all vertices should store the same SSA.
5. Backward cross-shader code motion
Fragment shader example: result = input0 * uniform + input1;
The computation of "result" in the above example is moved into
the previous shader and both inputs are replaced with a new input
holding the value of "result", thus reducing the number of inputs by
one.
Such code motion can be performed for any expression sourcing only
inputs, constants, and uniforms except for fragment shaders, which can
also do it but with the following limitations:
- Only these transformations, and any composition of them (such as
lerp), can be performed with interpolated inputs; they can all be
proven mathematically:
  - interp(x, i, j) + interp(y, i, j) = interp(x + y, i, j)
  - interp(x, i, j) + convergent_expr = interp(x + convergent_expr, i, j)
  - interp(x, i, j) * convergent_expr = interp(x * convergent_expr, i, j)
  - all of these transformations are considered "inexact" in NIR
  - interp interpolates an input according to the barycentric
coordinates (i, j), which are different for perspective,
noperspective, center, centroid, sample, at_offset, and at_sample
modes.
  - convergent_expr is any expression sourcing only constants,
uniforms, and convergent inputs. The only requirement on
convergent_expr is that it doesn't vary between vertices of
the same primitive, but it can vary between primitives.
- If inputs are flat or convergent, there are no limitations on
expressions that can be moved.
- Interpolated and flat inputs can't mix in the same expression, but
convergent inputs can mix with both.
- The interpolation qualifier of the new input is inherited from
the removed non-convergent inputs, which should all have the same (i, j).
If there are no non-convergent inputs, then the new input is declared
as flat (for simplicity; we can't choose the barycentric coordinates
at random because AMD doesn't like it when there are multiple sets of
barycentric coordinates in the same shader unnecessarily).
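The three identities above can be checked numerically (a sketch only; note they hold exactly in real arithmetic, which is why NIR marks them "inexact" for floats):

```python
# Numeric sanity check of the three interpolation identities that make
# backward code motion legal for interpolated FS inputs.

def interp(v0, v1, v2, i, j):
    # Barycentric interpolation at coordinates (i, j).
    return v0 * i + v1 * j + v2 * (1.0 - i - j)

x = (1.0, 2.0, 3.0)  # per-vertex values of input x
y = (4.0, 5.0, 6.0)  # per-vertex values of input y
c = 2.5              # convergent_expr: same for all vertices of a primitive
i, j = 0.25, 0.5

# interp(x, i, j) + interp(y, i, j) == interp(x + y, i, j)
lhs = interp(*x, i, j) + interp(*y, i, j)
rhs = interp(*(a + b for a, b in zip(x, y)), i, j)
assert abs(lhs - rhs) < 1e-9

# interp(x, i, j) + c == interp(x + c, i, j)   (weights sum to 1)
assert abs(interp(*x, i, j) + c - interp(*(a + c for a in x), i, j)) < 1e-9

# interp(x, i, j) * c == interp(x * c, i, j)
assert abs(interp(*x, i, j) * c - interp(*(a * c for a in x), i, j)) < 1e-9
```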
The general idea of the algorithm:
- We need a post-dominator tree that is constructed from a graph where
vertices are instructions and directed edges going into them are
their source operands. This is different from how NIR dominance
works, which represents all instructions within a basic block as
a linear chain of vertices in the graph. That representation doesn't
work here. We don't need to consider blocks in the graph at all.
In our graph, all loads without source operands and all constants are
entry nodes, and all stores are exit nodes. Each shader can have
multiple disjoint graphs.
- Given the above definition, the instruction whose result is the best
candidate for a new input is the farthest instruction that
post-dominates one or more inputs.
The disadvantage of this algorithm is that e.g. if 3 inputs are
transitively used by only 2 instructions, we could replace the 3 inputs
with 2 inputs containing the results of those instructions, but
the post-dominator search won't be able to find them.
The advantage of this algorithm is that it's straightforward and well
defined mathematically (it's easy to prove its correctness).
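A toy Python illustration of the graph and the post-dominance test (illustrative only; the node names model the fragment shader example result = input0 * uniform + input1, and the real pass builds a proper post-dominator tree instead of this naive path check):

```python
# Toy graph for backward code motion: vertices are instructions, edges
# go from sources to their users, loads/constants are entries and
# stores are exits. An instruction p post-dominates n if every path
# from n to an exit passes through p.

# Hypothetical tiny IR for: result = input0 * uniform + input1
# name -> list of users (forward edges).
users = {
    "input0":  ["fmul"],
    "uniform": ["fmul"],
    "input1":  ["fadd"],
    "fmul":    ["fadd"],
    "fadd":    ["store"],
    "store":   [],
}

def post_dominates(p, n):
    """True if every path from n to an exit node goes through p."""
    if n == p:
        return True
    succ = users[n]
    if not succ:  # n is an exit other than p: a path escaped p
        return False
    return all(post_dominates(p, s) for s in succ)

# "fadd" post-dominates both inputs, so its result ("result") is a
# candidate for a new input replacing input0 and input1.
assert all(post_dominates("fadd", i) for i in ["input0", "input1"])
# "fmul" does not post-dominate input1, so it only covers one input.
assert not post_dominates("fmul", "input1")
```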
6. Forward cross-shader code motion
Vertex shader example:
output0 = value + 1;
output1 = value * 2;
Both outputs can be replaced by 1 output storing "value", and both ALU
operations can be moved into the next shader.
The same dominance algorithm as in the previous optimization is used,
except that:
- Instead of inputs, we use outputs.
- Instead of a post-dominator tree, we use a dominator tree of the exact
same graph.
The full definition is: If an instruction dominates 2 output stores,
the result of that instruction can become a new output, and the dominated
instructions can be moved into the next shader.
The limitations on instructions that can be moved between shaders across
interpolated loads are exactly the same.
nir->options has callbacks that are used to estimate the cost of
expressions that drivers can set to control the complexity of
expressions that will be moved. This is to ensure that we don't increase
the GPU overhead measurably by moving code across pipeline stages that
amplify GPU work.
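The vertex shader example above can be sketched as a before/after equivalence (a toy model, not the pass; producer_before, producer_after, and consumer_after are hypothetical names):

```python
# Forward code motion sketch: instead of shipping "value + 1" and
# "value * 2" as two outputs, ship "value" as one output and do the ALU
# work in the consumer.

def producer_before(value):
    return {"output0": value + 1, "output1": value * 2}

def producer_after(value):
    return {"new_output": value}  # "value" dominates both stores

def consumer_after(inputs):
    v = inputs["new_output"]
    # The dominated instructions, moved into the next shader:
    return {"output0": v + 1, "output1": v * 2}

for value in (0.0, -3.0, 7.5):
    assert consumer_after(producer_after(value)) == producer_before(value)
```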
7. Compaction to vec4 slots
First, varyings are divided into these groups, and each group is compacted
separately with some exceptions listed below:
Non-FS groups (patch and non-patch are packed separately):
- 32-bit flat
- 16-bit flat
FS groups:
- 32-bit interpolated (always FP32)
- 32-bit flat
- 32-bit convergent (always FP32)
- 16-bit interpolated (always FP16)
- 16-bit flat
- 16-bit convergent (always FP16)
- 32-bit transform feedback only
- 16-bit transform feedback only
Then, all scalar varyings are relocated into new slots, starting from
VAR0.x and increasing the scalar slot offset in 32-bit or 16-bit
increments. Rules:
- Both 32-bit and 16-bit flat varyings are packed in the same vec4.
- Convergent varyings can be packed with interpolated varyings of
the same type or flat. The group to pack with is chosen based on
whichever has unused scalar slots because we want to reduce the total
number of vec4s. After filling all unused scalar slots, the remaining
convergent varyings are packed as flat.
- Transform-feedback-only slots are last, and 32-bit and 16-bit slots
are packed in the same vec4.
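A toy model of the slot assignment (illustrative only; it counts slots in 16-bit units with a vec4 holding 8 of them, and keeps 32-bit scalars 32-bit aligned, which is an assumption on top of the text):

```python
# Toy scalar slot assignment: a vec4 holds 8 16-bit units, a 32-bit
# scalar takes 2 units, a 16-bit scalar takes 1, starting at VAR0.x.

HALFS_PER_VEC4 = 8  # 4 channels x 2 16-bit halves

def assign_slots(sizes_in_bits):
    """Return (vec4 slot index, 16-bit offset in slot) per scalar."""
    offset = 0  # running offset in 16-bit units
    out = []
    for bits in sizes_in_bits:
        units = bits // 16
        # Assumption: keep a 32-bit scalar on a 32-bit boundary.
        if units == 2 and offset % 2:
            offset += 1
        out.append((offset // HALFS_PER_VEC4, offset % HALFS_PER_VEC4))
        offset += units
    return out

# Three 32-bit flats and two 16-bit flats share one vec4 (6 + 2 units).
assert assign_slots([32, 32, 32, 16, 16]) == \
       [(0, 0), (0, 2), (0, 4), (0, 6), (0, 7)]
```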
Colors are compacted the same but separately because they can't be mixed
with VARn.
To facilitate driver-specific output merging, color channels are
assigned in a rotated order depending on what the first unused VARn
channel is. For example, if the first unused VARn channel is VAR0.z,
color channels will be assigned in this order:
COL0.z, COL0.w, COL0.x, COL0.y, COL1.z, COL1.w, COL1.x, COL1.y
The reason is that some drivers merge outputs if each output sets
different components, for example 2 outputs defining VAR0.xy and COL0.z.
If drivers do interpolation in the fragment shader and color
interpolation can differ for each component, VAR0.xy and COL0.z can be
merged into 1 output in the producer, and the consumer can load VAR0
and COL0 from the same varying.
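The rotated channel order can be computed with a few lines of Python (a sketch; the helper name is hypothetical, and the output below matches the COL0.z example above):

```python
# Sketch of the rotated color-channel order: start assigning COLn
# channels at the first unused VARn channel and wrap within each vec4.

CHANNELS = "xyzw"

def rotated_color_order(first_unused_channel, num_colors=2):
    start = CHANNELS.index(first_unused_channel)
    order = []
    for color in range(num_colors):
        for c in range(4):
            order.append("COL%u.%s" % (color, CHANNELS[(start + c) % 4]))
    return order

# With VAR0.z as the first unused channel, this matches the example:
assert rotated_color_order("z") == [
    "COL0.z", "COL0.w", "COL0.x", "COL0.y",
    "COL1.z", "COL1.w", "COL1.x", "COL1.y",
]
```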
Usage
It requires scalarized load_input and store_output intrinsics.
64-bit varyings should be lowered to 32 bits.
shader_info should be up to date.
It's recommended to run this pass on all shader pairs in ascending order
from the first shader to the last (to propagate constants etc.). If
the optimization of stages (S1, S2) leads to changes in S1, remember
the highest such S1. Then re-run the pass on all shader pairs in
descending order from S1 down to the VS.
NIR optimizations should be performed after every run that changes the IR.
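The recommended run order can be sketched as follows (a toy driver loop under stated assumptions: link_varyings is a hypothetical helper returning whether the producer changed, and the sketch omits the NIR optimization runs between passes):

```python
# Sketch of the recommended run order: a forward sweep over all pairs,
# then a backward sweep from the highest changed producer to the VS.

def optimize_all(stages, link_varyings):
    highest_changed = -1
    # Forward sweep: first shader to last, propagating constants etc.
    for s in range(len(stages) - 1):
        if link_varyings(stages[s], stages[s + 1]):
            highest_changed = s
    # Backward sweep: re-run pairs below the highest changed producer.
    for s in range(highest_changed, 0, -1):
        link_varyings(stages[s - 1], stages[s])

# Toy usage: record the order of visited pairs.
visited = []
def fake_link(producer, consumer):
    visited.append((producer, consumer))
    return True  # pretend every producer changed

optimize_all(["VS", "TCS", "TES", "GS", "FS"], fake_link)
assert visited == [("VS", "TCS"), ("TCS", "TES"), ("TES", "GS"),
                   ("GS", "FS"),                       # forward sweep
                   ("TES", "GS"), ("TCS", "TES"),
                   ("VS", "TCS")]                      # backward sweep
```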
Alternative applications
This performs enough analysis of varyings that we can use it to add
an analysis pass that decides whether a separate shader has
the potential to benefit from full draw-time linking. This way we
could perform the linking only if we are confident that it would help
performance.