nir,glsl: add nir_opt_varyings, new varying linking optimization pass, enable it in the linker for radeonsi
The first 6 commits moved to !26918 (merged).
This adds:
- nir_vertex_divergence_analysis (via !26918 (merged)) - reuses nir_divergence_analysis, but computes divergence within a primitive instead of within a subgroup; used by nir_opt_varyings
- Computation and queries of a post-dominator tree of SSA uses - same algorithm as nir_dominance.c, but the graph is defined differently and it computes post-dominance instead of dominance; it's explained in the source file; used by nir_opt_varyings for backward inter-shader code motion
- nir_opt_varyings and a lot of NIR tests for it
- radeonsi preparation changes for nir_opt_varyings (it must be done before enabling it in the GLSL linker)
- st/mesa and GLSL linker changes to enable nir_opt_varyings if lower_io_variables is true; this only enables it for radeonsi because no other driver sets that option
- Total: 35 files changed, 8478 insertions(+), 107 deletions(-)
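To illustrate what a post-dominator tree of SSA uses means, here is a minimal Python sketch. It is a toy model and not the code in this MR: the graph encoding (`uses` mapping each value to its users), the virtual `EXIT` node, and the instruction names are all assumptions made for the example. It also uses a plain set-based fixed-point iteration rather than the algorithm in nir_dominance.c. The idea it shows is that if every use chain starting at an input load passes through one instruction, that instruction post-dominates the load, which is the kind of query backward code motion relies on.

```python
EXIT = "<exit>"  # hypothetical virtual sink joining all values that have no users


def post_dominators(uses):
    """uses: {value: [values that consume it]}.
    Returns {node: set of nodes that post-dominate it, including itself}."""
    nodes = set(uses)
    for users in uses.values():
        nodes.update(users)
    # Values without users flow into the virtual exit node.
    succs = {n: (list(uses.get(n, [])) or [EXIT]) for n in nodes}
    everything = nodes | {EXIT}
    pdom = {n: set(everything) for n in nodes}
    pdom[EXIT] = {EXIT}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            # A node is post-dominated by itself plus whatever
            # post-dominates every one of its users.
            new = {n} | set.intersection(*(pdom[s] for s in succs[n]))
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom


def immediate_post_dominator(pdom, node):
    """Closest strict post-dominator of node, or None for the exit node."""
    strict = pdom[node] - {node}
    for cand in strict:
        if pdom[cand] == strict:
            return cand
    return None


# Toy consumer shader: two input loads feed a multiply whose result
# reaches the output store through two different paths.
uses = {
    "load_in0": ["mul"],
    "load_in1": ["mul"],
    "mul": ["add", "sub"],
    "add": ["max"],
    "sub": ["max"],
    "max": ["store_out"],
    "store_out": [],
}
pdom = post_dominators(uses)
print(immediate_post_dominator(pdom, "load_in0"))
print(immediate_post_dominator(pdom, "mul"))
```

In this toy graph both loads are post-dominated by `mul`, so a pass could in principle compute the multiply in the producer and pass one value instead of two. Which instructions the real pass is actually allowed to move is described in the source file, not here.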
Optimizations performed by nir_opt_varyings:
- Dead input/output removal
- Propagation of constants, uniforms, UBO loads, and ALU expressions that use them from shader outputs to later shaders
- Output deduplication
- Backward inter-shader code motion (it moves code from the consumer to the producer)
- Compaction
- The optimizations are pretty thorough and support all shader stages. The exception is backward inter-shader code motion, which only handles a subset of shader pairs (e.g. TCS->TES, VS->FS, TES->FS)
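As a rough illustration of how four of the listed optimizations compose, here is a toy Python model. It is only a sketch under simplifying assumptions: the `link_varyings` function, the `("const", x)` / `("var", name)` value encoding, and the integer slot numbers are invented for this example and do not reflect how the pass represents outputs in NIR. Backward inter-shader code motion is not modeled.

```python
def link_varyings(producer_outputs, consumer_reads):
    """Toy model of varying linking between one producer and one consumer.

    producer_outputs: {slot: ("const", value) or ("var", name)}
    consumer_reads:   set of slots the consumer actually loads
    Returns (new_outputs, remap, constants):
      new_outputs: compacted {new_slot: value} the producer still writes
      remap:       {old_slot: new_slot} for the consumer's remaining loads
      constants:   {old_slot: value} loads the consumer replaces by a constant
    """
    # 1. Dead output removal: drop outputs the consumer never reads.
    live = {s: v for s, v in producer_outputs.items() if s in consumer_reads}

    # 2. Constant propagation: constant outputs move into the consumer.
    constants = {s: v[1] for s, v in live.items() if v[0] == "const"}
    live = {s: v for s, v in live.items() if v[0] != "const"}

    # 3. Output deduplication: slots carrying the same value share one slot.
    first_slot_for_value = {}
    alias = {}
    for slot in sorted(live):
        alias[slot] = first_slot_for_value.setdefault(live[slot], slot)

    # 4. Compaction: renumber the surviving slots without gaps.
    unique = sorted(set(alias.values()))
    new_index = {slot: i for i, slot in enumerate(unique)}
    new_outputs = {new_index[slot]: live[slot] for slot in unique}
    remap = {slot: new_index[alias[slot]] for slot in alias}
    return new_outputs, remap, constants


producer = {
    0: ("var", "pos_x"),
    1: ("const", 1.0),
    2: ("var", "color"),
    3: ("var", "color"),   # duplicate of slot 2
    5: ("var", "unused"),  # never read by the consumer
}
new_outputs, remap, constants = link_varyings(producer, {0, 1, 2, 3})
print(new_outputs)
print(remap)
print(constants)
```

In this example five producer outputs shrink to two: slot 5 is dead, slot 1 becomes a constant in the consumer, and slots 2 and 3 are merged before the survivors are renumbered.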
The complete description of the behavior of nir_opt_varyings is in the source file and here: #8841
This uncovers incorrect expectations in dEQP and GLCTS tests described here: #10361
STATS FOR AFFECTED SHADERS (16009/58918) (AMD terminology)
TCS inputs: 475 -> 379 (-20.21 %) (= LS outputs)
TES inputs: 478 -> 366 (-23.43 %) (= HS outputs)
TES patch inputs: 234 -> 232 (-0.85 %) (= HS patch outputs)
GS inputs: 168 -> 115 (-31.55 %) (= ES outputs)
FS inputs after GS: 67 -> 61 (-8.96 %) (= GS param exports)
FS inputs after VS and TES: 31988 -> 28495 (-10.92 %) (= VS/TES param exports)
Code Size: 24606160 -> 24320676 (-1.16 %) bytes
Max Waves: 242634 -> 243876 (0.51 %)
I also noticed that GLCTS finished 30% faster with this on Radeon 7600, probably because the pass moves a lot of code from FS to VS (including slow FP64 code) due to how the tests are written.