intel/brw: Treat convergent values as SIMD8

Ian Romanick requested to merge idr/mesa:review/fs-scalar into main

(The first 7 commits in this MR are from !30251 (merged).)

This is the bulk of "treat convergent values as SIMD8." This series lays the groundwork to treat convergent ALU operations and many convergent "leaf" operations as SIMD8 in all dispatch modes. In SIMD16 this saves register space. In SIMD32 it saves register space and instructions. Leaf operations are operations that are always the leaves of NIR ALU expression trees. Load constant, load UBO, etc. are all leaf operations.
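To make the register-space claim concrete, here is a minimal sketch of the arithmetic. It assumes a 32-byte GRF (true of Meteor Lake, but an assumption as far as this description goes), and `regs_for_value` is a hypothetical helper, not a function from the driver.

```cpp
// Assumption: one GRF holds 32 bytes, i.e. eight 32-bit channels.
constexpr unsigned grf_size_bytes = 32;

// Hypothetical helper: number of whole GRFs needed to hold one value of
// `type_bytes` bytes per channel at the given SIMD width.
constexpr unsigned regs_for_value(unsigned simd_width, unsigned type_bytes)
{
    return (simd_width * type_bytes + grf_size_bytes - 1) / grf_size_bytes;
}

// A 32-bit value needs one GRF at SIMD8, two at SIMD16, and four at SIMD32.
static_assert(regs_for_value(8, 4) == 1, "SIMD8 float fits in one GRF");
static_assert(regs_for_value(16, 4) == 2, "SIMD16 float needs two GRFs");
static_assert(regs_for_value(32, 4) == 4, "SIMD32 float needs four GRFs");
```

Under that assumption, keeping a convergent 32-bit value as SIMD8 frees one register per value in a SIMD16 shader and three in a SIMD32 shader.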

Overall, this MR is -158 lines of code because it deletes all of the resource rematerialization code. The resource rematerialization code used a similar technique, but it was more limited in scope. This MR does everything that the resource rematerialization code did, and much, much more.

The basic idea is that values that are not marked as divergent in NIR get an is_scalar flag set. These values are generated using SIMD8 dispatch with NoMask set. When used as sources to SIMD16 instructions, the values are accessed using <0,1,0>. When used as sources to SIMD8 instructions, the values can be accessed as either <0,1,0> or <8,8,1>. The latter is helpful for instructions that cannot use scalar sources. The is_scalar flag is used during code generation to convert illegal <0,1,0> to <8,8,1>.
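The two regions mentioned above can be illustrated with a small sketch of how a <V,W,H> region (vertical stride, width, horizontal stride) maps an execution channel to a source element. The `Region` struct and `element_for_channel` function are names invented for this illustration and are not part of the brw code.

```cpp
#include <cassert>

// Register region <V,W,H>: vertical stride, width, and horizontal stride,
// all in units of elements. (Hypothetical type for illustration only.)
struct Region {
    unsigned vstride;
    unsigned width;
    unsigned hstride;
};

// Element offset, relative to the region origin, read by execution channel
// `chan`. Channels are grouped into rows of `width` elements; each new row
// advances by `vstride`, and each element within a row by `hstride`.
unsigned element_for_channel(Region r, unsigned chan)
{
    assert(r.width != 0);
    return (chan / r.width) * r.vstride + (chan % r.width) * r.hstride;
}
```

With `Region{0, 1, 0}` every channel reads element 0, so the region works at any execution width. With `Region{8, 8, 1}` channel N reads element N, which is fine for the eight channels of a SIMD8 instruction, but channels 8 through 15 of a SIMD16 instruction would read past the eight elements that the SIMD8 producer wrote.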

A lot of work remains to be done.

First, many more leaf operations need to be supported. The branch https://gitlab.freedesktop.org/idr/mesa/-/commits/wip/fs-scalar has some work in this area. I think I would tackle this by adding an assert(def.divergent) in the default case of the switch-statement in get_nir_def. That will highlight candidate intrinsics.
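As a rough sketch of that idea, using toy stand-ins rather than the real NIR or brw types: `ToyIntrinsic`, `ToyDef`, and `treat_as_scalar` are all hypothetical, and the set of handled cases is only an example.

```cpp
#include <cassert>

// Hypothetical stand-ins for NIR intrinsics and defs.
enum class ToyIntrinsic { load_const, load_ubo, not_yet_handled };

struct ToyDef {
    bool divergent;
};

// Returns true when the value should be produced as a SIMD8 scalar value.
bool treat_as_scalar(ToyIntrinsic op, const ToyDef &def)
{
    switch (op) {
    case ToyIntrinsic::load_const:
    case ToyIntrinsic::load_ubo:
        // Leaf operations that already have convergent handling.
        return !def.divergent;
    default:
        // Anything reaching this point has no convergent handling yet.
        // The assertion fires for a convergent def, flagging its
        // intrinsic as a candidate for support.
        assert(def.divergent);
        return false;
    }
}
```

Running a shader corpus with such an assertion enabled would stop at each unhandled intrinsic that produces a convergent value, which gives a worklist of candidates.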

I examined a few shaders that were hurt for register pressure. They generally fall into one of two categories. The first is nondeterminism in the register allocation algorithm. Slight changes in which value is allocated a register when there are multiple possible choices can have dramatic effects later on.

The other category is cases where a comparison is scalarized, but the result wants to be used as flags. This prevents cmod propagation from making progress. Consider a case like

mul(8)          g63<1>          g115<0,1,0>F    g13.4<0,1,0>F   { align1 WE_all 1Q compacted };
cmp.z.f0.0(8)   g4<1>F          g63<0,1,0>F     0F              { align1 WE_all 1Q compacted };
mov.nz.f0.0(8)  null<1>D        g4<0,1,0>D                      { align1 1Q };
(+f0.0) if(8)   JIP:  LABEL37         UIP:  LABEL36             { align1 1Q };

Without scalarization, both the mov and the cmp would have been eliminated by cmod propagation. With scalarization they remain, which costs extra instructions and a temporary register. This appears to be the source of many of the +110 spills and +111 fills in Wolfenstein Youngblood caused by "intel/brw: Treat load_*_block_intel as convergent." Quite a few of the added spills and fills in this shader and in Red Dead Redemption 2 shaders are resolved by "intel/brw: Don't generate scalar byte to float conversions on DG2+ in optimize_extract_to_float."

I experimented with not scalarizing a cmp that will be used as the condition for an if-statement. This helped some shaders and hurt others. The problem is that cases where cmod propagation would not have been able to make progress anyway end up with higher register usage.

I believe the correct solution is to modify cmod propagation to account for these kinds of cases. In the above example, if the mov is the only use of g4, then progress can be made.
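The "only use" condition could be checked along the following lines. This is a toy sketch over a made-up instruction representation (`ToyInst`, `count_uses`, and `is_only_use` are hypothetical names), not the cmod propagation pass itself, which would also have to account for things like control flow and intervening flag writes.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified instruction: one destination register number and
// up to two source register numbers. -1 means "null" or "unused".
struct ToyInst {
    int dst;
    int src[2];
};

// Count how many source operands in the program read register `reg`.
std::size_t count_uses(const std::vector<ToyInst> &prog, int reg)
{
    std::size_t uses = 0;
    for (const ToyInst &inst : prog) {
        for (int s : inst.src) {
            if (s == reg)
                uses++;
        }
    }
    return uses;
}

// True when the instruction at `idx` reads `reg` and no other source operand
// anywhere in the program does.
bool is_only_use(const std::vector<ToyInst> &prog, std::size_t idx, int reg)
{
    const ToyInst &inst = prog[idx];
    const bool reads_reg = inst.src[0] == reg || inst.src[1] == reg;
    return reads_reg && count_uses(prog, reg) == 1;
}
```

Modelling the example above as `{63, {115, 13}}`, `{4, {63, -1}}`, `{-1, {4, -1}}`, the mov at index 2 is the only reader of g4, so under this sketch the cmp result would be a candidate for folding into the flag write.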

Fossil-db changes across the whole series on Meteor Lake:

Totals:
Instrs: 151072561 -> 151527771 (+0.30%); split: -0.27%, +0.58%
Subgroup size: 7608080 -> 7608208 (+0.00%)
Cycle count: 17078265389 -> 16863904647 (-1.26%); split: -1.79%, +0.53%
Spill count: 78374 -> 75797 (-3.29%); split: -3.78%, +0.49%
Fill count: 148383 -> 143412 (-3.35%); split: -3.97%, +0.62%
Scratch Memory Size: 4001792 -> 3976192 (-0.64%); split: -1.28%, +0.64%
Max live registers: 31517006 -> 31225060 (-0.93%); split: -0.97%, +0.05%
Max dispatch width: 5535976 -> 5707448 (+3.10%); split: +3.77%, -0.67%

Totals from 408773 (64.95% of 629377) affected shaders:
Instrs: 119670637 -> 120125847 (+0.38%); split: -0.35%, +0.73%
Subgroup size: 5482736 -> 5482864 (+0.00%)
Cycle count: 16786333398 -> 16571972656 (-1.28%); split: -1.82%, +0.54%
Spill count: 72246 -> 69669 (-3.57%); split: -4.10%, +0.53%
Fill count: 136809 -> 131838 (-3.63%); split: -4.30%, +0.67%
Scratch Memory Size: 3464192 -> 3438592 (-0.74%); split: -1.48%, +0.74%
Max live registers: 22811356 -> 22519410 (-1.28%); split: -1.34%, +0.06%
Max dispatch width: 3488160 -> 3659632 (+4.92%); split: +5.98%, -1.06%

Shader-db changes across the whole series on Meteor Lake:

total instructions in shared programs: 19687396 -> 19698677 (0.06%)
instructions in affected programs: 6532124 -> 6543405 (0.17%)
helped: 4466
HURT: 19103
helped stats (abs) min: 1 max: 1867 x̄: 10.81 x̃: 3
helped stats (rel) min: 0.08% max: 28.76% x̄: 2.44% x̃: 1.57%
HURT stats (abs)   min: 1 max: 120 x̄: 3.12 x̃: 2
HURT stats (rel)   min: 0.05% max: 80.00% x̄: 2.46% x̃: 1.00%
95% mean confidence interval for instructions value: 0.11 0.85
95% mean confidence interval for instructions %-change: 1.45% 1.61%
Instructions are HURT.

total cycles in shared programs: 913130718 -> 912210107 (-0.10%)
cycles in affected programs: 704607014 -> 703686403 (-0.13%)
helped: 17025
HURT: 29607
helped stats (abs) min: 1 max: 260769 x̄: 410.11 x̃: 16
helped stats (rel) min: <.01% max: 65.15% x̄: 2.55% x̃: 0.67%
HURT stats (abs)   min: 1 max: 74205 x̄: 204.73 x̃: 7
HURT stats (rel)   min: <.01% max: 110.23% x̄: 2.61% x̃: 0.60%
95% mean confidence interval for cycles value: -44.63 5.15
95% mean confidence interval for cycles %-change: 0.66% 0.79%
Inconclusive result (value mean confidence interval includes 0).

total spills in shared programs: 4901 -> 4648 (-5.16%)
spills in affected programs: 302 -> 49 (-83.77%)
helped: 7
HURT: 1

total fills in shared programs: 6646 -> 5591 (-15.87%)
fills in affected programs: 1203 -> 148 (-87.70%)
helped: 8
HURT: 1

LOST:   370
GAINED: 936

The one shader hurt for spills and fills is from Synmark Gl43CSDof.

Edited by Ian Romanick
