# NIR: New Varying Linking Optimization Pass - Design Document

# Disclaimer

This is a new varying linking/optimization pass that does a lot more than
the passes everybody currently uses. It can be used as an alternative to
4+ existing passes, and it only operates on `store_output`/`load_input`
intrinsics and friends.

I'm pretty far in the implementation and everything is done except cross-shader code motion, which I've started and know how to do. shader-db results so far are promising. This exact text will be in the code.

The first MR will only enable it for radeonsi and it will be executed by the GLSL linker after varying linking as a pure optimization pass. Further integration into the GLSL linker can be done in the future.

Some aspects may be geared towards AMD hw, and possible modifications to accommodate other drivers may follow after merging.

The functionality described here is what is planned to appear in the first MR.

# Introduction

This pass optimizes varyings between 2 shaders, which means dead
input/output removal, constant and uniform load propagation,
deduplication, compaction, and cross-shader code motion. It is used
during the shader linking process.

# Notes on behavior

The pass operates on scalar varyings using 32-bit and 16-bit types.
Vector varyings are not allowed.

Indirectly-indexed varying slots (not vertices) are not optimized or
compacted, but unused slots of indirectly-indexed varyings will still
be filled with directly-indexed varyings during compaction.
Indirectly-indexed varyings are still removed if they are unused by the
other shader.

Indirectly-indexed vertices are always optimized. They are common in
TCS, TES, and GS. More on that below.

Transform feedback never prevents any optimizations such as constant
propagation and compaction.

All legacy varying slots are optimized when it's allowed.

# Convergence property of shader outputs

When an output stores an SSA def that is convergent, all stores of that
output appear in unconditional blocks or in conditional blocks with a
convergent entry condition, the shader is not a GS, and the output is
not flat, then all vertices of that output have the same value, and the
output can be promoted to flat because all interpolation modes lead to
the same result as flat. Such outputs are opportunistically compacted
with both flat and non-flat varyings based on whichever has unused
slots in their vec4s. This pass refers to such inputs, outputs, and
varyings as "convergent". The only case when this doesn't work is when
vertices of the same primitive are stored by different subgroups and a
vote instruction is used to determine the output value. In that case,
the output will be treated as divergent by this pass even if NIR sees
it as convergent.

# Optimization steps

**1. Determine which varying slots can be optimized and how.**

- When a varying is said to be "optimized" in the following text, it
  means all optimizations are performed, such as removal, constant
  propagation, and deduplication.
- All VARn, PATCHn, and FOGC varyings are always optimized and
  compacted.
- PRIMITIVE_ID is treated as VARn in (GS, FS).
- TEXn are removed if they are dead (except TEXn inputs, which can't be
  removed because they are affected by the coord replace state). TEXn
  also can't be optimized or compacted, for the same reason.
- COLn and BFCn only propagate constants if they are between 0 and 1
  because of the clamp vertex color state, and they are only
  deduplicated and compacted among themselves because they are affected
  by the flat shade, provoking vertex, two-side color selection, and
  clamp vertex color states.
- All system value outputs like POS, PSIZ, CLIP_DISTn, etc. can't be
  removed, but they are demoted to sysval-only outputs by setting the
  no_varying bit (i.e. they can be removed as varyings), so drivers
  should look at the no_varying bit. If an output is not a sysval
  output in a specific stage (such as POS in TCS), it's fully optimized
  and compacted.
- TESS_LEVEL_* inputs in TES can't be touched if TCS is missing.

**2. Remove unused inputs and outputs**

- Outputs not used in the next shader are removed.
- Inputs not initialized by the previous shader are replaced with undef.
- Output loads that have no output stores anywhere in the shader are
  replaced with undef. (for TCS, though it works with any shader)
- Output stores with transform feedback are preserved, but get the
  "no_varying" flag. Later, transform-feedback-only varyings are
  compacted (relocated) such that they are always last.

**3. Constant, uniform, UBO load, and uniform expression propagation**

- Define "uniform expressions" as ALU expressions only sourcing
  constants, uniforms, and UBO loads.
- Constants, uniforms, UBO loads, and uniform expressions stored in
  outputs are moved into the next shader, and the outputs are removed.
- The same propagation is done from output stores to output loads.
  (for TCS, though it works with any shader)
- If there are multiple stores to the same output, all such stores must
  store the same constant, uniform, UBO load, or uniform expression for
  the expression to be propagated. If an output has multiple vertices,
  all vertices must store the same expression.
- nir->options has callbacks that drivers can set to estimate the cost
  of uniform expressions, controlling the complexity of the uniform
  expressions that will be propagated. This ensures that we don't
  measurably increase GPU overhead by moving code across pipeline
  stages that amplify GPU work.

**4. Remove duplicated output components**

- By comparing SSA defs.
- If there are multiple stores to the same output, all such stores must
  store the same SSA def as all stores of another output for the output
  to be considered duplicated. If an output has multiple vertices, all
  vertices must store the same SSA def.

**5. Backward cross-shader code motion**

Fragment shader example:

    result = input0 * uniform + input1;

The computation of "result" in the above example is moved into the
previous shader, and both inputs are replaced with a new input holding
the value of "result", thus reducing the number of inputs by one.

Such code motion can be performed for any expression sourcing only
inputs, constants, and uniforms, except for fragment shaders, which can
also do it but with the following limitations:

- Only these transformations, and any composition of them (such as
  lerp), can be performed with interpolated inputs; all of them can be
  proven mathematically:
  - interp(x, i, j) + interp(y, i, j) = interp(x + y, i, j)
  - interp(x, i, j) + convergent_expr = interp(x + convergent_expr, i, j)
  - interp(x, i, j) * convergent_expr = interp(x * convergent_expr, i, j)
  - all of these transformations are considered "inexact" in NIR
  - interp interpolates an input according to the barycentric
    coordinates (i, j), which are different for perspective,
    noperspective, center, centroid, sample, at_offset, and at_sample
    modes.
  - convergent_expr is any expression sourcing only constants, uniforms,
    and convergent inputs. The only requirement on convergent_expr is
    that it doesn't vary between vertices of the same primitive, but it
    can vary between primitives.

- If inputs are flat or convergent, there are no limitations on
  expressions that can be moved.
- Interpolated and flat inputs can't mix in the same expression, but
  convergent inputs can mix with both.
- The interpolation qualifier of the new input is inherited from the
  removed non-convergent inputs, which should all have the same (i, j).
  If there are no non-convergent inputs, then the new input is declared
  as flat (for simplicity; we can't choose the barycentric coordinates
  at random because AMD doesn't like when there are multiple sets of
  barycentric coordinates in the same shader unnecessarily).
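These identities can be checked directly: interpolation is an affine
combination of the per-vertex values (shown here for a triangle with
per-vertex values x0, x1, x2; the same argument works for any convex
combination):

```latex
\begin{align*}
\mathrm{interp}(x, i, j) &= i\,x_0 + j\,x_1 + (1 - i - j)\,x_2 \\
\mathrm{interp}(x,i,j) + \mathrm{interp}(y,i,j)
  &= i(x_0 + y_0) + j(x_1 + y_1) + (1 - i - j)(x_2 + y_2) \\
  &= \mathrm{interp}(x + y, i, j) \\
\mathrm{interp}(x,i,j) + c
  &= i(x_0 + c) + j(x_1 + c) + (1 - i - j)(x_2 + c)
   = \mathrm{interp}(x + c, i, j) \\
\mathrm{interp}(x,i,j) \cdot c
  &= i(x_0 c) + j(x_1 c) + (1 - i - j)(x_2 c)
   = \mathrm{interp}(x \cdot c, i, j)
\end{align*}
```

The last two lines use that the barycentric weights sum to 1 and that a
convergent expression c takes the same value at every vertex. Over the
reals these are exact; the "inexact" flag only accounts for
floating-point re-association changing rounding.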

The general idea of the algorithm:

- We need a post-dominator tree constructed from a graph where vertices
  are instructions and the directed edges going into them are their
  source operands. This is different from how NIR dominance works,
  which represents all instructions within a basic block as a linear
  chain of vertices in the graph; that representation doesn't work
  here, and we don't need to consider blocks in the graph at all. In
  our graph, all loads without source operands and all constants are
  entry nodes, and all stores are exit nodes. Each shader can have
  multiple disjoint graphs.
- Given the above definition, the instruction whose result is the best
  candidate for a new input is the farthest instruction that
  post-dominates one or more inputs.

The disadvantage of this algorithm is that, for example, if 3 inputs
are transitively used by only 2 instructions, we could replace the 3
inputs with 2 inputs containing the results of those instructions, but
the post-dominator search won't be able to find them.

The advantage of this algorithm is that it's straightforward and
mathematically well defined (it's easy to prove its correctness).

**6. Forward cross-shader code motion**

Vertex shader example:

    output0 = value + 1;
    output1 = value * 2;

Both outputs can be replaced by 1 output storing "value", and both ALU
operations can be moved into the next shader.

The same dominance algorithm as in the previous optimization is used,
except that:

- Instead of inputs, we use outputs.
- Instead of a post-dominator tree, we use a dominator tree of the
  exact same graph.

The full definition is: if an instruction dominates 2 output stores,
the result of that instruction can become a new output, and the
dominated instructions can be moved into the next shader.

The limitations on instructions that can be moved between shaders
across interpolated loads are exactly the same.

nir->options has callbacks that drivers can set to estimate the cost
of expressions, controlling the complexity of the expressions that will
be moved. This ensures that we don't measurably increase GPU overhead
by moving code across pipeline stages that amplify GPU work.

**7. Compaction to vec4 slots**

First, varyings are divided into these groups, and each group is
compacted separately, with some exceptions listed below:

Non-FS groups (patch and non-patch are packed separately):

- 32-bit flat
- 16-bit flat

FS groups:

- 32-bit interpolated (always FP32)
- 32-bit flat
- 32-bit convergent (always FP32)
- 16-bit interpolated (always FP16)
- 16-bit flat
- 16-bit convergent (always FP16)
- 32-bit transform feedback only
- 16-bit transform feedback only

Then, all scalar varyings are relocated into new slots, starting from
VAR0.x and increasing the scalar slot offset in 32-bit or 16-bit
increments. Rules:

- Both 32-bit and 16-bit flat varyings are packed in the same vec4.
- Convergent varyings can be packed with interpolated varyings of the
  same type or with flat varyings. The group to pack with is chosen
  based on whichever has unused scalar slots, because we want to reduce
  the total number of vec4s. After filling all unused scalar slots, the
  remaining convergent varyings are packed as flat.
- Transform-feedback-only slots are last, and 32-bit and 16-bit slots
  are packed in the same vec4.

Colors are compacted the same but separately because they can't be
mixed with VARn.

To facilitate driver-specific output merging, color channels are
assigned in a rotated order depending on what the first unused VARn
channel is. For example, if the first unused VARn channel is VAR0.z,
color channels will be assigned in this order:

` COL0.z, COL0.w, COL0.x, COL0.y, COL1.z, COL1.w, COL1.x, COL1.y `

The reason is that some drivers merge outputs if each output sets
different components, for example 2 outputs defining VAR0.xy and
COL0.z. If drivers do interpolation in the fragment shader and color
interpolation can differ for each component, VAR0.xy and COL0.z can be
merged into 1 output in the producer, and the consumer can load VAR0
and COL0 from the same varying.

# Usage

It requires scalarized load_input and store_output intrinsics. 64-bit
varyings should be lowered to 32 bits. shader_info should be up to
date.

It's recommended to run this for all shader pairs from the first shader
to the last shader first (to propagate constants etc.). If the
optimization of stages (S1, S2) leads to changes in S1, remember the
highest such S1. Then re-run this for all shader pairs in descending
order from S1 down to the VS.

NIR optimizations should be performed after every run that changes the
IR.

# Alternative applications

This performs enough analysis of varyings that we can use it to add an
analysis pass that decides whether a separate shader has the potential
to benefit from full draw-time linking. This way we could perform the
linking only if we are confident that it would help performance.