NIR: New Varying Linking Optimization Pass - Design Document
Introduction
This pass optimizes varyings between 2 shaders, which means dead input/output removal, constant and uniform load propagation, deduplication, compaction, and inter-shader code motion. This is used during the shader linking process.
Notes on behavior
The pass operates on scalar varyings using 32-bit and 16-bit types. Vector varyings are not allowed.
Indirectly-indexed varying slots (not vertices) are not optimized or compacted, but unused slots of indirectly-indexed varyings are still filled with directly-indexed varyings during compaction. Indirectly-indexed varyings are still removed if they are unused by the other shader.
Indirectly-indexed vertices don't disallow optimizations, but compromises are made depending on how they are accessed. They are common in TCS, TES, and GS, so there is a desire to optimize them as much as possible. More on that in various sections below.
Transform feedback doesn't prevent most optimizations such as constant propagation and compaction. Shaders can be left with output stores that set the no_varying flag, meaning the output is not consumed by the next shader: the optimizations did their job and the output is now consumed only by transform feedback.
All legacy varying slots are optimized when it's allowed.
Convergence property of shader outputs
When an output stores an SSA that is convergent, all stores of that output appear in unconditional blocks or conditional blocks with a convergent entry condition, and the shader is not GS, then all vertices of that output have the same value; therefore the output can be promoted to flat, because all interpolation modes lead to the same result as flat. Such outputs are opportunistically compacted with both flat and non-flat varyings based on whichever has unused slots in their vec4s. This pass refers to such inputs, outputs, and varyings as "convergent" (meaning all vertices are always equal).
Flat varyings are the only ones that are never considered convergent because we want the flexibility to pack convergent varyings with both flat and non-flat varyings, and since flat varyings can contain integers and doubles, we can never interpolate them as FP32 or FP16. Optimizations start with separate interpolated, flat, and convergent groups of varyings, and they choose whether they want to promote convergent to interpolated or flat, or whether to leave that decision to the end when the compaction happens.
TES patch inputs are always convergent because they are uniform within a primitive.
When a vote instruction is used to determine an output value, nir_divergence_analysis evaluates it as convergent, but nir_vertex_divergence_analysis evaluates it as divergent because vertices of the same primitive can be stored by different subgroups.
The nir_vertex_divergence_analysis pass is used to determine which outputs are convergent.
Optimization steps
- Determine which varying slots can be optimized and how.
- When a varying is said to be "optimized" in the following text, it means all optimizations are performed, such as removal, constant propagation, and deduplication.
- All VARn, PATCHn, and FOGC varyings are always optimized and compacted.
- PRIMITIVE_ID is treated as VARn in (GS, FS).
- TEXn are removed if they are dead, except TEXn inputs, which can't be removed because they are affected by the coord replace state. For the same reason, TEXn also can't be optimized or compacted. TEXn not consumed by FS are treated as VARn.
- COLn and BFCn only propagate constants if they are between 0 and 1 because of the clamp vertex color state, and they are only deduplicated and compacted among themselves because they are affected by the flat shade, provoking vertex, two-side color selection, and clamp vertex color states. COLn and BFCn not consumed by FS are treated as VARn.
- All system value outputs like POS, PSIZ, CLIP_DISTn, etc. can't be removed, but they are demoted to sysval-only outputs by setting the "no_varying" flag (i.e. they can be removed as varyings), so drivers should look at the "no_varying" flag. If an output is not a sysval output in a specific stage (such as POS in TCS), it's treated as VARn.
- TESS_LEVEL_* inputs in TES can't be touched if TCS is missing.
- Remove unused inputs and outputs
- Outputs not used in the next shader are removed.
- Inputs not initialized by the previous shader are replaced with undef, except:
- LAYER and VIEWPORT are replaced with 0 in FS.
- TEXn.xy is untouched because the coord replace state can set it, and TEXn.zw is replaced by (0, 1), which is equal to the coord replace value.
- Output loads that have no output stores anywhere in the shader are replaced with undef. (for TCS, though it works with any shader)
- Output stores with transform feedback are preserved, but get the "no_varying" flag, meaning they are not consumed by the next shader stage. Later, transform-feedback-only varyings are compacted (relocated) such that they are always last.
- TCS outputs that are read by TCS, but not used by TES get the "no_varying" flag to indicate that they are only read by TCS and not consumed by TES. Later, such TCS outputs are compacted (relocated) such that they are always last to keep all outputs consumed by TES consecutive without holes.
- Constant, uniform, UBO load, and uniform expression propagation
- Define “uniform expressions” as ALU expressions only sourcing constants, uniforms, and UBO loads.
- Constants, uniforms, UBO loads, and uniform expressions stored in outputs are moved into the next shader, and the outputs are removed.
- The same propagation is done from output stores to output loads. (for TCS, though it works with any shader)
- If there are multiple stores to the same output, all such stores should store the same constant, uniform, UBO load, or uniform expression for the expression to be propagated. If an output has multiple vertices, all vertices should store the same expression.
- nir->options has callbacks that are used to estimate the cost of uniform expressions that drivers can set to control the complexity of uniform expressions that are propagated. This is to ensure that we don't increase the GPU overhead measurably by moving code across pipeline stages that amplify GPU work.
- Special cases:
- Constant COLn and BFCn are propagated only if the constants are in the [0, 1] range because of the clamp vertex color state. If both COLn and BFCn are written, they must write the same constant. If BFCn is written but not COLn, the constant is propagated from BFCn to COLn.
- TEXn.xy is untouched because of the coord replace state. If TEXn.zw is (0, 1), only those constants are propagated because they match the coord replace values.
- CLIP_DISTn, LAYER and VIEWPORT are always propagated. Eliminated output stores get the "no_varying" flag if they are also xfb stores or write sysval outputs.
- Remove duplicated output components
- By comparing SSA defs.
- If there are multiple stores to the same output, all such stores should store the same SSA as all stores of another output for the output to be considered duplicated. If an output has multiple vertices, all vertices should store the same SSA.
- Deduplication can only be done between outputs of the same category. Those are: interpolated, patch, flat, interpolated color, flat color, and conditionally interpolated color based on the flat shade state.
- Everything is deduplicated except TEXn due to the coord replace state.
- Eliminated output stores get the "no_varying" flag if they are also xfb stores or write sysval outputs.
- Backward inter-shader code motion
"Backward" refers to moving code in the opposite direction that shaders are executed, i.e. moving code from the consumer to the producer.
Fragment shader example:
result = input0 * uniform + input1 * constant + UBO.variable;
The computation of "result" in the above example can be moved into the previous shader and both inputs can be replaced with a new input holding the value of "result", thus making the shader smaller and possibly reducing the number of inputs, uniforms, and UBOs by 1.
Such code motion can be performed for any expression sourcing only inputs, constants, and uniforms except for fragment shaders, which can also do it but with the following limitations:
- Only these transformations can be performed with interpolated inputs, and any composition of these transformations (such as lerp), which can all be proven mathematically:
- interp(x, i, j) + interp(y, i, j) = interp(x + y, i, j)
- interp(x, i, j) + convergent_expr = interp(x + convergent_expr, i, j)
- interp(x, i, j) * convergent_expr = interp(x * convergent_expr, i, j)
- all of these transformations are considered "inexact" in NIR
- interp interpolates an input according to the barycentric coordinates (i, j), which are different for perspective, noperspective, center, centroid, sample, at_offset, and at_sample modes.
- convergent_expr is any expression sourcing only constants, uniforms, and convergent inputs. The only requirement on convergent_expr is that it doesn't vary between vertices of the same primitive, but it can vary between primitives.
- If inputs are flat or convergent, there are no limitations on expressions that can be moved.
- Interpolated and flat inputs can't mix in the same expression, but convergent inputs can mix with both.
- The interpolation qualifier of the new input is inherited from the removed non-convergent inputs that should all have the same (i, j). If there are no non-convergent inputs, then the new input is declared as flat (for simplicity; we can't choose the barycentric coordinates at random because AMD doesn't like when there are multiple sets of barycentric coordinates in the same shader unnecessarily).
- Inf values break code motion across interpolation. See the section discussing how we handle it near the end.
The above rules also apply to open-coded TES input interpolation, which is handled the same as FS input interpolation. The only differences are:
- Open-coded TES input interpolation must match one of the allowed equations. Different interpolation equations are treated the same as different interpolation qualifiers in FS.
- Patch varyings are always treated as convergent.
Prerequisites:
- We need a post-dominator tree that is constructed from a graph where vertices are instructions and directed edges going into them are the values of their source operands. This is different from how NIR dominance works, which represents all instructions within a basic block as a linear chain of vertices in the graph. In our graph, all loads without source operands and all constants are entry nodes in the graph, and all stores and discards are exit nodes in the graph. Each shader can have multiple disjoint graphs where the Lowest Common Ancestor of 2 instructions doesn't exist.
- Given the above definition, the instruction whose result is the best candidate for a new input is the farthest instruction that post-dominates one or more inputs and is movable between shaders.
Algorithm Idea Part 1: Search
- Pick any input load that is hypothetically movable and call it the iterator.
- Get the immediate post-dominator of the iterator, and if it's movable, replace the iterator with it.
- Repeat the previous step until the obtained immediate post-dominator is not movable.
- The iterator now contains the farthest post-dominator that is movable.
- Gather all input loads that the post-dominator consumes.
- For each of those input loads, all matching output stores must be in the same block (because they will be replaced by a single store).
Algorithm Idea Part 2: Code Motion
- Clone the post-dominator in the producer except input loads, which should be replaced by stored output values. Uniform and UBO loads, if any, should be cloned too.
- Remove the original output stores.
- Replace the post-dominator from the consumer with a new input load.
- The step above makes the post-dominated input load that we picked at the beginning dead, but other input loads used by the post-dominator might still have other uses (shown in the example below).
Example SSA-use graph - initial shader and the result:
      input0 input1               input0 input1
         \   /  \                    |     \
     constant  alu  ...   ======>    |     ...
          \   /
           alu
     (post-dominator)
Description: On the right, the algorithm moved the constant and both ALU opcodes into the previous shader and input0 now contains the value of the post-dominator. input1 stays the same because it still has one use left. If input1 hadn't had the other use, it would have been removed.
If the algorithm moves any code, the algorithm is repeated until there is no code that it can move.
Which shader pairs are supported:
- (VS, FS), (TES, FS): yes, fully
- Limitation: If Infs must be preserved, no code is moved across interpolation, so only flat varyings are optimized.
- (VS, TCS), (VS, GS), (TES, GS): no, but possible -- TODO
- Current behavior:
- Per-vertex inputs are rejected.
- Possible solution:
- All input loads used by an accepted post-dominator must use the same vertex index. The post-dominator must use all loads with that vertex index.
- If a post-dominator is found for an input load from a specific slot, all other input loads from that slot must also have an accepted post-dominator, and all such post-dominators should be identical expressions.
- (TCS, TES), (VS, TES): yes, with limitations
- Limitations:
- Only 1 store and 1 load per slot allowed.
- No output loads allowed.
- All stores used by an accepted post-dominator must be in the same block.
- TCS barriers don't matter because there are no output loads.
- Patch varyings are handled trivially with the above constraints.
- Per-vertex outputs should only be indexed by gl_InvocationID.
- An interpolated TES load is any ALU instruction that computes the result of linear interpolation of per-vertex inputs from the same slot using gl_TessCoord. If such an ALU instruction is found, it must be the only one, and all per-vertex input loads from that slot must feed into it. The interpolation equation must be equal to one of the allowed equations. Then the same rules as for interpolated FS inputs are used, treating different interpolation equations just like different interpolation qualifiers.
- Patch inputs are treated as convergent, which means they are allowed to be in the same movable expression as interpolated TES inputs, and the same rules as for convergent FS inputs apply.
- (GS, FS), (MS, FS): no
- Workaround: Add a passthrough VS between GS/MS and FS, run the pass on the (VS, FS) pair to move code out of FS, and inline that VS at the end of your hw-specific GS/MS if it's possible.
- (TS, MS): no
The disadvantage of using the post-dominator tree is that it's a tree, which means there is only 1 post-dominator of each input. This example shows a case that could be optimized by replacing 3 inputs with 2 inputs, reducing the number of inputs by 1, but the immediate post-dominator of all input loads is NULL:
temp0 = input0 + input1 + input2;
temp1 = input0 + input1 * const1 + input2 * const2;
If there is a graph algorithm that returns the best solution to the above case (which is temp0 and temp1 to replace all 3 inputs), let us know.
- Forward inter-shader code motion
TODO: Not implemented. The text below is a draft of the description.
"Forward" refers to moving code in the direction that shaders are executed, i.e. moving code from the producer to the consumer.
Vertex shader example:
output0 = value + 1;
output1 = value * 2;
Both outputs can be replaced by 1 output storing "value", and both ALU operations can be moved into the next shader.
The same dominance algorithm as in the previous optimization is used, except that:
- Instead of inputs, we use outputs.
- Instead of a post-dominator tree, we use a dominator tree of the exact same graph.
The algorithm idea is: for each pair of output stores, find their Lowest Common Ancestor in the dominator tree; that LCA is a candidate for a new output. All movable loads like load_const should be removed from the graph, otherwise the LCA wouldn't exist.
The limitations on instructions that can be moved between shaders across interpolated loads are exactly the same as the previous optimization.
nir->options has callbacks that are used to estimate the cost of expressions that drivers can set to control the complexity of expressions that can be moved to later shaders. This is to ensure that we don't increase the GPU overhead measurably by moving code across pipeline stages that amplify GPU work.
- Compaction to vec4 slots (AKA packing)
First, varyings are divided into these groups, and each group is compacted separately with some exceptions listed below:
Non-FS groups (patch and non-patch are packed separately):
- 32-bit flat
- 16-bit flat
- 32-bit no-varying (TCS outputs read by TCS but not TES)
- 16-bit no-varying (TCS outputs read by TCS but not TES)
FS groups:
- 32-bit interpolated (always FP32)
- 32-bit flat
- 32-bit convergent (always FP32)
- 16-bit interpolated (always FP16)
- 16-bit flat
- 16-bit convergent (always FP16)
- 32-bit transform feedback only
- 16-bit transform feedback only
Then, all scalar varyings are relocated into new slots, starting from VAR0.x and increasing the scalar slot offset in 32-bit or 16-bit increments. Rules:
- Both 32-bit and 16-bit flat varyings are packed in the same vec4.
- Convergent varyings can be packed with interpolated varyings of the same type or flat. The group to pack with is chosen based on whichever has unused scalar slots because we want to reduce the total number of vec4s. After filling all unused scalar slots, the remaining convergent varyings are packed as flat.
- Transform-feedback-only slots and no-varying slots are packed last, so that they are consecutive and not intermixed with varyings consumed by the next shader stage, and 32-bit and 16-bit slots are packed in the same vec4. This allows reducing memory for outputs by ignoring the trailing outputs that the next shader stage doesn't read.
In the end, we should end up with these groups for FS:
- 32-bit interpolated (always FP32) on separate vec4s
- 16-bit interpolated (always FP16) on separate vec4s
- 32-bit flat and 16-bit flat, mixed in the same vec4
- 32-bit and 16-bit transform feedback only, sharing vec4s with flat
Colors are compacted the same but separately because they can't be mixed with VARn. Colors are divided into 3 FS groups. They are:
- 32-bit maybe-interpolated (affected by the flat-shade state)
- 32-bit interpolated (not affected by the flat-shade state)
- 32-bit flat (not affected by the flat-shade state)
To facilitate driver-specific output merging, color channels are assigned in a rotated order depending on which VARn channel is the first unused one. For example, if the first unused VARn channel is VAR0.z, color channels are allocated in this order: COL0.z, COL0.w, COL0.x, COL0.y, COL1.z, COL1.w, COL1.x, COL1.y. The reason is that some drivers merge outputs if each output sets different components, for example 2 outputs defining VAR0.xy and COL0.z. If drivers do interpolation in the fragment shader and color interpolation can differ for each component, VAR0.xy and COL0.z can be stored in the same output storage slot, and the consumer can load VAR0 and COL0 from the same slot.
If COLn, BFCn, and TEXn are transform-feedback-only, they are moved to VARn. PRIMITIVE_ID in (GS, FS) and FOGC in (xx, FS) are always moved to VARn for better packing.
Issue: Interpolation converts Infs to NaNs
Interpolation converts Infs to NaNs, i.e. interp(Inf, i, j) = NaN, which impacts and limits backward inter-shader code motion, uniform expression propagation, and compaction.
When we decide not to interpolate a varying, we need to convert Infs to NaNs manually. Infs can be converted to NaNs like this: x*0 + x (suggested by Ian Romanick, the multiplication must be "exact")
Changes to optimizations:
- When we propagate a uniform expression and NaNs must be preserved, convert Infs in the result to NaNs using "x*0 + x" in the consumer.
- When we change interpolation to flat for convergent varyings and NaNs must be preserved, apply "x*0 + x" to the stored output value in the producer.
- There is no solution for backward inter-shader code motion with interpolation if Infs must be preserved. As an alternative, we can allow code motion across interpolation only for specific shader hashes in can_move_alu_across_interp. We can use shader-db to automatically produce a list of shader hashes that benefit from this optimization.
Usage
Requirements:
- ALUs should be scalarized
- Dot products and other vector opcodes should be lowered (recommended)
- Input loads and output stores should be scalarized
- 64-bit varyings should be lowered to 32 bits
- nir_vertex_divergence_analysis must be called on the producer if the consumer is a fragment shader
It's recommended to first run this on all shader pairs in pipeline order, from the first shader to the last (to propagate constants etc.). If the optimization of the (S1, S2) pair leads to changes in S1, remember the highest such S1. Then re-run this on all shader pairs in descending order from S1 down to VS.
NIR optimizations should be performed after every run that changes the IR.
Analyzing the optimization potential of linking separate shaders
We can use this pass in an analysis pass that decides whether a separate shader has the potential to benefit from full draw-time linking. The way it would work is that we would create a passthrough shader adjacent to the separate shader, run this pass on both shaders, and check if the number of varyings decreased. This way we can decide to perform the draw-time linking only if we are confident that it would help performance.
TODO: not implemented, mention the pass that implements it