freedreno/ir3: SSA-based register allocation (!9842) · Merge requests · Mesa / mesa

Connor Abbott requested to merge cwabbott0/mesa:review/ir3-ra-rewrite into main Mar 25, 2021

This has been my project for the last few months: a ground-up rewrite of ir3's register allocator.

Why do we need to rewrite the register allocator?

The current allocator uses the graph-coloring framework in util. The allocator itself isn't badly implemented, and it works well up to a certain point, for simple things, but one quickly comes up against several limitations that are inherent in its design:

It can't handle interference of sub-registers. Here's what this means by way of example. For this very common sequence:

vec4 bar = texture(...) + baz;

which typically expands to something like:

vec4 foo = texture(...);
bar.x = foo.x + baz.x;
bar.y = foo.y + baz.y;
bar.z = foo.z + baz.z;
bar.w = foo.w + baz.w;

foo and bar only partly interfere. That is, if foo is allocated to r1-r4, then bar can be allocated to r0-r3 or r1-r4, but not r2-r5 (since bar.x and foo.y interfere). There is absolutely no way to tell the current allocator about this; it's just impossible with its current API. ir3 works around this by doing allocation in two (actually three!) parts, first for vectors and then for scalars, which handles the case where bar is split up into individual components but not where it's used e.g. by a store instruction and needs to stay as a vector. This also significantly complicates freedreno's RA, and would make it much more difficult to implement spilling due to the split. Speaking of spilling...

Spilling with it sucks. You have to spill one thing at a time, and there is no support for partially spilling registers so it's an all-or-nothing affair. Thus, as soon as you've started spilling, you're in for a world of hurt, and there's not much you can do to help. Big shaders from test suites (looking at you, soft-fp64) take an enormous amounts of time in the spill-allocate loop, and the performance cliff from spilling is bigger than it ought to be.
It can't do live-range splitting and "smarter" coalescing. Oftentimes, a few copies can be enough to avoid spilling, but util/ra can't handle that currently. You have to aggressively coalesce (e.g. with nir_from_ssa) and then hope that you don't run out of registers.

Why SSA-based register allocation?

Due to all these problems, and especially the difficulty implementing spilling, the lack of which bites us all the time with CTS tests, I thought about how we'd want to rewrite freedreno's register allocator. There are known extensions to classic graph coloring that help alleviate most of these problems. See, for example, this paper written by the author of the graph-coloring allocator used by GCC. However, implementing them in mesa would basically mean rewriting the allocator anyway, and it would probably have to be tied much more tightly to the IR in addition to likely being more complicated than an SSA-based allocator. And we still wouldn't have SSA-based register allocation's "killer feature": the ability to know in advance precisely how many registers are needed, and why. This helps a lot for architectures with register-sharing, like AMD/GCN and Adreno, because the decision of what occupancy to target can be decoupled from the allocator itself. This approach was pioneered by ACO, which has turned out well, and in some respects the GCN and Adreno instruction sets are very similar as they are both scalar architectures (vectorization is mostly over pixels instead of color components) which dynamically partition the register file between waves.

How does this compare to ACO?

While it's also SSA-based, and somewhat inspired by experiences with ACO (thanks @daniel-schuermann for the discussions!), there are some pretty significant differences. This is mostly because of the design of the existing IR, and the need to avoid too serious regressions in moves from the current RA, instead of anything specific to the Adreno ISA. The main difference is the ability to track when multiple live values inhabit the same register. For example, in ir3 a vector is split into its components with multiple meta:split instructions, only the last one killing the vector:

vec4 foo = ...
foox = split foo.x
fooy = split foo.y
fooz = split foo.z
foow = split (kill)foo.w

(in ACO this would be a single instruction with 4 destinations). We need to keep track of the fact that foox is in the first component of foo until foo dies, and move around foox if foo needs to be moved around, both when calculating register pressure, when spilling, and in RA. We also do something similar for collect instructions, which had a pretty big impact on moves when I implemented it. As far as I know, no published tree-scan allocator has done something similar.

The implementation of this relies on an extension of the CSSA construction algorithm in Bossinot et. al. (used by e.g. nir_from_ssa) to coalesce splits and collects in addition to phis. The algorithm constructs "merge sets" which in a traditional graph-coloring allocator would be allocated to a single register. For example, it would merge foo with foox, fooy, etc. and replace references to foox with foo.x. (This roughly maps to ->cp.left and ->cp.right in the old allocator.) Traditionally, they are only used in the spilling phase of an SSA-based allocator. However we also use them to force foox to be in the same register as foo. However, crucially, we don't require every value in the merge set to use the same register -- only values using overlapping subregisters, and only when they are live at the same time. The coalescing algorithm decides which things to put in the same register ahead of time, so that we can still precisely calculate the required registers before RA.

What other changes to ir3 are there?

Changing the register allocator is like dropping in a new engine in an old car: you can't do it without changing anything about the surrounding IR. Probably the most invasive change is the addition of support for multiple destinations in an instruction. This is needed due to the phi node's trusty sidekick, the parallel copy, which simplifies several parts of register allocation by splitting (hehe) the tricky parts of splitting live ranges into a separate pass that lowers parallel copies after RA. Fortunately, we only produce parallel copies in RA and then immediately lower them, so most of the rest of ir3 can remain blissfully unaware for now, but we still need to do a flag-day to change the core ir3_instruction datastructure.

Our handling of outputs also needs to be reworked, and fixed inputs/outputs for tessellation need to be handled differently, due to the fact that moves can suddenly show up in places where they never were before.

To properly match phi node sources to predecessors we need to either take the NIR approach of associating sources with blocks, or keep a fixed array of predecessors; since we don't do much CFG manipulation, I chose the latter. There's also one NIR patch to allow always scalarizing phis, to simplify ir3 a bit.

Finally, due to the live-range splitting and the tricky parallel copy lowering, keeping the SSA information around after RA is a lot more difficult. There are several reworks to make post-RA passes not depend on that information and instead use the assigned physregs, most notably ir3_delay.

How far along is this allocator?

It's probably not ready to be merged yet:

~~I've got a report that 3dmark slingshot hangs~~ (should be fixed)
It needs more real-world benchmarking, I haven't done anything besides look at shader-db numbers yet.
I want to add some unit tests, especially for tricky things like the parallel copy lowering.
~~There's one more TODO around implementing (rptN) in the reworked delay code, although we haven't implemented (rptN) so it's not that important.~~

However it fully passes CI, I've gotten far enough implementing the "core" that further major reworks are unlikely to be necessary, and the shader-db stats are getting decent. Thus I think it's time to post this publicly and get some more eyes on it.

Spilling isn't implemented yet. There's a skeleton in ir3_spill.c but it's just used for calculating register pressure so far. However as spilling is one of the main reasons I wanted to start on this rework in the first place, it's definitely coming up after this lands.

How is the series organized?

First are some general ir3 fixes, and one for util/bitset, found along the way. These can hopefully be reviewed and merged earlier. Then there are a series of commits doing the reworks mentioned earlier, the biggie rewrite-it commit, and then a few commits updating the rest of ir3 now that the allocator has been switched.

Happy reviewing!

shader-db Results

This is using Rob's shader-db, compared to a baseline of some RA-independent fixes in the first few commits:

total instructions in shared programs: 1975576 -> 1641600 (-16.91%)
instructions in affected programs: 1922639 -> 1588663 (-17.37%)
helped: 9893
HURT: 908
helped stats (abs) min: 1 max: 2832 x̄: 35.75 x̃: 7
helped stats (rel) min: 0.17% max: 89.61% x̄: 16.29% x̃: 13.21%
HURT stats (abs)   min: 1 max: 480 x̄: 21.75 x̃: 2
HURT stats (rel)   min: 0.18% max: 120.00% x̄: 8.49% x̃: 5.88%
95% mean confidence interval for instructions value: -32.30 -29.54
95% mean confidence interval for instructions %-change: -14.47% -13.95%
Instructions are helped.

total nops in shared programs: 734866 -> 415696 (-43.43%)
nops in affected programs: 724635 -> 405465 (-44.05%)
helped: 9237
HURT: 839
helped stats (abs) min: 1 max: 2896 x̄: 36.26 x̃: 9
helped stats (rel) min: 0.62% max: 100.00% x̄: 60.07% x̃: 59.09%
HURT stats (abs)   min: 1 max: 415 x̄: 18.79 x̃: 2
HURT stats (rel)   min: 0.00% max: 847.62% x̄: 30.43% x̃: 11.54%
95% mean confidence interval for nops value: -33.01 -30.34
95% mean confidence interval for nops %-change: -53.40% -51.67%
Nops are helped.

total non-nops in shared programs: 1240710 -> 1225904 (-1.19%)
non-nops in affected programs: 825152 -> 810346 (-1.79%)
helped: 2334
HURT: 1976
helped stats (abs) min: 1 max: 224 x̄: 12.83 x̃: 3
helped stats (rel) min: 0.15% max: 87.50% x̄: 14.01% x̃: 8.33%
HURT stats (abs)   min: 1 max: 269 x̄: 7.66 x̃: 4
HURT stats (rel)   min: 0.09% max: 175.00% x̄: 3.82% x̃: 2.21%
95% mean confidence interval for non-nops value: -4.11 -2.76
95% mean confidence interval for non-nops %-change: -6.29% -5.39%
Non-nops are helped.

total mov in shared programs: 73273 -> 78532 (7.18%)
mov in affected programs: 48480 -> 53739 (10.85%)
helped: 2312
HURT: 1737
helped stats (abs) min: 1 max: 37 x̄: 3.28 x̃: 2
helped stats (rel) min: 1.67% max: 100.00% x̄: 46.64% x̃: 44.44%
HURT stats (abs)   min: 1 max: 79 x̄: 7.40 x̃: 4
HURT stats (rel)   min: 0.00% max: 2200.00% x̄: 92.95% x̃: 40.00%
95% mean confidence interval for mov value: 1.03 1.57
95% mean confidence interval for mov %-change: 9.18% 17.31%
Mov are HURT.

total cov in shared programs: 48557 -> 47291 (-2.61%)
cov in affected programs: 3214 -> 1948 (-39.39%)
helped: 362
HURT: 0
helped stats (abs) min: 1 max: 19 x̄: 3.50 x̃: 3
helped stats (rel) min: 0.98% max: 100.00% x̄: 56.70% x̃: 60.00%
95% mean confidence interval for cov value: -3.80 -3.19
95% mean confidence interval for cov %-change: -58.00% -55.40%
Cov are helped.

total dwords in shared programs: 2894168 -> 2828432 (-2.27%)
dwords in affected programs: 1567102 -> 1501366 (-4.19%)
helped: 2660
HURT: 912
helped stats (abs) min: 2 max: 480 x̄: 33.83 x̃: 24
helped stats (rel) min: 0.08% max: 83.33% x̄: 10.71% x̃: 5.88%
HURT stats (abs)   min: 2 max: 538 x̄: 26.60 x̃: 26
HURT stats (rel)   min: 0.10% max: 100.00% x̄: 7.31% x̃: 4.02%
95% mean confidence interval for dwords value: -20.21 -16.59
95% mean confidence interval for dwords %-change: -6.56% -5.66%
Dwords are helped.

total last-baryf in shared programs: 138439 -> 106752 (-22.89%)
last-baryf in affected programs: 134606 -> 102919 (-23.54%)
helped: 2117
HURT: 346
helped stats (abs) min: 1 max: 195 x̄: 19.03 x̃: 11
helped stats (rel) min: 0.44% max: 99.15% x̄: 56.19% x̃: 58.33%
HURT stats (abs)   min: 1 max: 435 x̄: 24.86 x̃: 13
HURT stats (rel)   min: 0.56% max: 7250.00% x̄: 132.81% x̃: 60.00%
95% mean confidence interval for last-baryf value: -14.01 -11.72
95% mean confidence interval for last-baryf %-change: -36.60% -22.69%
Last-baryf are helped.

total full in shared programs: 67882 -> 69033 (1.70%)
full in affected programs: 47472 -> 48623 (2.42%)
helped: 2910
HURT: 6360
helped stats (abs) min: 1 max: 30 x̄: 2.81 x̃: 1
helped stats (rel) min: 4.55% max: 84.62% x̄: 27.54% x̃: 22.22%
HURT stats (abs)   min: 1 max: 41 x̄: 1.46 x̃: 1
HURT stats (rel)   min: 5.00% max: 585.71% x̄: 55.05% x̃: 33.33%
95% mean confidence interval for full value: 0.06 0.19
95% mean confidence interval for full %-change: 28.11% 30.14%
Full are HURT.

total constlen in shared programs: 280208 -> 279668 (-0.19%)
constlen in affected programs: 16124 -> 15584 (-3.35%)
helped: 133
HURT: 0
helped stats (abs) min: 4 max: 8 x̄: 4.06 x̃: 4
helped stats (rel) min: 3.12% max: 33.33% x̄: 3.75% x̃: 3.23%
95% mean confidence interval for constlen value: -4.14 -3.98
95% mean confidence interval for constlen %-change: -4.39% -3.11%
Constlen are helped.

total cat0 in shared programs: 766540 -> 446436 (-41.76%)
cat0 in affected programs: 750094 -> 429990 (-42.68%)
helped: 9213
HURT: 849
helped stats (abs) min: 1 max: 2896 x̄: 36.46 x̃: 9
helped stats (rel) min: 0.58% max: 99.32% x̄: 49.47% x̃: 50.00%
HURT stats (abs)   min: 1 max: 415 x̄: 18.63 x̃: 2
HURT stats (rel)   min: 0.79% max: 809.09% x̄: 53.21% x̃: 20.00%
95% mean confidence interval for cat0 value: -33.16 -30.47
95% mean confidence interval for cat0 %-change: -41.65% -39.96%
Cat0 are helped.

total cat1 in shared programs: 121830 -> 125823 (3.28%)
cat1 in affected programs: 75959 -> 79952 (5.26%)
helped: 2350
HURT: 1705
helped stats (abs) min: 1 max: 37 x̄: 3.75 x̃: 3
helped stats (rel) min: 0.89% max: 100.00% x̄: 43.76% x̃: 34.78%
HURT stats (abs)   min: 1 max: 79 x̄: 7.51 x̃: 4
HURT stats (rel)   min: 0.00% max: 2000.00% x̄: 70.02% x̃: 28.57%
95% mean confidence interval for cat1 value: 0.71 1.26
95% mean confidence interval for cat1 %-change: 0.59% 7.58%
Cat1 are HURT.

total cat2 in shared programs: 742471 -> 726992 (-2.08%)
cat2 in affected programs: 123627 -> 108148 (-12.52%)
helped: 368
HURT: 172
helped stats (abs) min: 1 max: 119 x̄: 47.55 x̃: 54
helped stats (rel) min: 0.72% max: 92.25% x̄: 19.84% x̃: 19.58%
HURT stats (abs)   min: 1 max: 255 x̄: 11.75 x̃: 9
HURT stats (rel)   min: 0.45% max: 142.11% x̄: 8.48% x̃: 3.85%
95% mean confidence interval for cat2 value: -31.61 -25.72
95% mean confidence interval for cat2 %-change: -12.39% -9.25%
Cat2 are helped.

total cat3 in shared programs: 280645 -> 278934 (-0.61%)
cat3 in affected programs: 7978 -> 6267 (-21.45%)
helped: 340
HURT: 0
helped stats (abs) min: 3 max: 68 x̄: 5.03 x̃: 3
helped stats (rel) min: 10.34% max: 100.00% x̄: 21.92% x̃: 20.00%
95% mean confidence interval for cat3 value: -5.57 -4.49
95% mean confidence interval for cat3 %-change: -23.38% -20.47%
Cat3 are helped.

total cat4 in shared programs: 35105 -> 34767 (-0.96%)
cat4 in affected programs: 2607 -> 2269 (-12.97%)
helped: 275
HURT: 0
helped stats (abs) min: 1 max: 31 x̄: 1.23 x̃: 1
helped stats (rel) min: 6.25% max: 100.00% x̄: 13.67% x̃: 12.50%
95% mean confidence interval for cat4 value: -1.47 -0.98
95% mean confidence interval for cat4 %-change: -15.01% -12.34%
Cat4 are helped.

total cat5 in shared programs: 23545 -> 23541 (-0.02%)
cat5 in affected programs: 4 -> 0
helped: 4
HURT: 0
helped stats (abs) min: 1 max: 1 x̄: 1.00 x̃: 1
helped stats (rel) min: 100.00% max: 100.00% x̄: 100.00% x̃: 100.00%
95% mean confidence interval for cat5 value: -1.00 -1.00
95% mean confidence interval for cat5 %-change: -100.00% -100.00%
Cat5 are helped.

total cat6 in shared programs: 5229 -> 4896 (-6.37%)
cat6 in affected programs: 333 -> 0
helped: 333
HURT: 0
helped stats (abs) min: 1 max: 1 x̄: 1.00 x̃: 1
helped stats (rel) min: 100.00% max: 100.00% x̄: 100.00% x̃: 100.00%
95% mean confidence interval for cat6 value: -1.00 -1.00
95% mean confidence interval for cat6 %-change: -100.00% -100.00%
Cat6 are helped.

total sstall in shared programs: 170534 -> 148468 (-12.94%)
sstall in affected programs: 154566 -> 132500 (-14.28%)
helped: 2357
HURT: 1108
helped stats (abs) min: 1 max: 144 x̄: 13.81 x̃: 9
helped stats (rel) min: 0.52% max: 100.00% x̄: 38.41% x̃: 30.00%
HURT stats (abs)   min: 1 max: 283 x̄: 9.47 x̃: 6
HURT stats (rel)   min: 0.00% max: 1110.00% x̄: 47.21% x̃: 18.35%
95% mean confidence interval for sstall value: -6.97 -5.77
95% mean confidence interval for sstall %-change: -13.37% -8.71%
Sstall are helped.

total (ss) in shared programs: 40291 -> 40038 (-0.63%)
(ss) in affected programs: 33488 -> 33235 (-0.76%)
helped: 2429
HURT: 1183
helped stats (abs) min: 1 max: 419 x̄: 2.27 x̃: 1
helped stats (rel) min: 1.19% max: 100.00% x̄: 40.85% x̃: 33.33%
HURT stats (abs)   min: 1 max: 32 x̄: 4.45 x̃: 3
HURT stats (rel)   min: 0.00% max: 2300.00% x̄: 121.83% x̃: 38.46%
95% mean confidence interval for (ss) value: -0.34 0.20
95% mean confidence interval for (ss) %-change: 6.59% 18.26%
Inconclusive result (value mean confidence interval includes 0).

total (sy) in shared programs: 13545 -> 18060 (33.33%)
(sy) in affected programs: 7825 -> 12340 (57.70%)
helped: 841
HURT: 955
helped stats (abs) min: 1 max: 125 x̄: 1.55 x̃: 1
helped stats (rel) min: 2.50% max: 100.00% x̄: 60.59% x̃: 50.00%
HURT stats (abs)   min: 1 max: 55 x̄: 6.09 x̃: 2
HURT stats (rel)   min: 0.00% max: 2100.00% x̄: 217.64% x̃: 100.00%
95% mean confidence interval for (sy) value: 2.15 2.88
95% mean confidence interval for (sy) %-change: 76.12% 98.60%
(sy) are HURT.

ttotal waves in shared programs: 228654 -> 237898 (4.04%)
waves in affected programs: 21028 -> 30272 (43.96%)
helped: 56
HURT: 2279
helped stats (abs) min: 2 max: 4 x̄: 2.39 x̃: 2
helped stats (rel) min: 12.50% max: 66.67% x̄: 27.46% x̃: 25.00%
HURT stats (abs)   min: 2 max: 14 x̄: 4.11 x̃: 4
HURT stats (rel)   min: 14.29% max: 700.00% x̄: 76.21% x̃: 33.33%
95% mean confidence interval for waves value: 3.84 4.07
95% mean confidence interval for waves %-change: 68.74% 78.71%
Waves are HURT.

LOST:   0
GAINED: 7

Total CPU time (seconds): 222.73 -> 219 (-1.67%)

The increase in register usage may seem weird, but it's due to the new RA using the awareness of occupancy introduced in !9498 (merged) to use more registers when not detrimental to max_waves. The only serious regression is in moves, which unfortunately is kind of impossible to mitigate too much without ruining the central SSA allocator guarantee of known register usage, and in (sy) usage. The latter seems to be mostly from the new RA using less registers in shaders with long sequences of texture fetches resulting in a worse schedule, and not anything fundamentally wrong with it. If you give it more registers, it will happily match the existing RA. And it mostly seems to be happening in skia shaders, which might not be reflections of actual skia usage.

Edited May 10, 2021 by Connor Abbott

Admin message

freedreno/ir3: SSA-based register allocation