Kenneth Graunke requested to merge kwg/mesa:brw-cse-prop into main Apr 10, 2024

SSA Def-Analysis in brw:

This MR introduces SSA def analysis in the Intel "brw" compiler backend, and updates two optimization passes (global common subexpression elimination, copy propagation) to use the new infrastructure in order to demonstrate its utility. This is one small step toward incrementally taking more advantage of SSA in our backend compiler.

Although the brw backend has traditionally been a non-SSA backend, and goes "out-of-SSA" for anything involving phi-webs, a large percentage of the instructions in our shaders are in fact static single assignments, writing whole values into new temporary registers each time. We can opportunistically look for the SSA property in our virtual registers and track their definitions. For the moment, we do this via a new defs_analysis pass which optimizations can use if they wish. Def analysis requires the CFG dominator tree and a single linear walk over the instructions. Notably, it does not require liveness.
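To make the idea concrete, here is a minimal Python sketch of opportunistic def detection. It is not the code in this MR: the `Inst` record, `find_ssa_defs`, and `make_dominates` are hypothetical names, and the sketch assumes the instruction list is laid out so that a dominating block always appears before the blocks it dominates. A virtual register counts as SSA here if it is written exactly once, the write covers the whole register, and the defining block dominates every use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Inst:
    """Hypothetical stand-in for a backend instruction."""
    opcode: str
    dst: Optional[int]       # destination virtual register, or None
    srcs: List[int]          # source virtual registers
    block: int               # id of the CFG block containing the instruction
    full_write: bool = True  # False when only part of dst is written


def make_dominates(idom: Dict[int, int]) -> Callable[[int, int], bool]:
    """Build a dominance query from an immediate-dominator map.

    The entry block is assumed to map to itself.
    """
    def dominates(a: int, b: int) -> bool:
        while True:
            if a == b:
                return True
            if idom[b] == b:  # reached the entry block without meeting a
                return False
            b = idom[b]
    return dominates


def find_ssa_defs(insts: List[Inst],
                  dominates: Callable[[int, int], bool]) -> Dict[int, int]:
    """Single linear walk: map each SSA-looking register to its defining index."""
    first_def: Dict[int, int] = {}
    invalid = set()
    for ip, inst in enumerate(insts):
        for src in inst.srcs:
            d = first_def.get(src)
            # A use with no earlier def, or one not dominated by its def,
            # means the register cannot be treated as an SSA value.
            if d is None or not dominates(insts[d].block, inst.block):
                invalid.add(src)
        if inst.dst is not None:
            # A second write or a partial write disqualifies the register.
            if inst.dst in first_def or not inst.full_write:
                invalid.add(inst.dst)
            first_def.setdefault(inst.dst, ip)
    return {v: ip for v, ip in first_def.items() if v not in invalid}


# Diamond CFG: block 0 branches to blocks 1 and 2, which rejoin in block 3.
idom = {0: 0, 1: 0, 2: 0, 3: 0}
insts = [
    Inst("MOV", 0, [], block=0),
    Inst("MOV", 1, [], block=1),  # register 1 is written on both sides of
    Inst("MOV", 1, [], block=2),  # the branch (a phi-web), so it is not SSA
    Inst("ADD", 2, [0, 1], block=3),
    Inst("MOV", 3, [], block=3, full_write=False),  # partial writes
    Inst("MOV", 3, [], block=3, full_write=False),
]
ssa_defs = find_ssa_defs(insts, make_dominates(idom))
print(ssa_defs)  # {0: 0, 2: 3}
```

In this toy program two of the four registers qualify, which is the kind of ratio the coverage numbers in the next section report for real shaders.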

In the future, we may bake SSA definitions into the IR itself, so we can avoid the cost of running this analysis. However, as a first step, adding it as an analysis pass allows us to start using SSA information right away without needing to refactor the IR.

Partial-SSA Visibility So Far

To give an idea of how effective SSA-based optimizations might be, even now before we have phi nodes, here are the percentages of VGRFs (virtual registers) that are SSA. These are measured prior to LOAD_PAYLOAD lowering, using shader-db and fossil-db.

Shader Collection	Median	Average
shader-db (GLSL)	98.7%	71%
Rise of the Tomb Raider	94.1%	71.4%
Witcher 3	84.8%	66.3%
Borderlands 3	70%	64.3%
Cyberpunk 2077 (DX12)	57.6%	57.4%

While there's still a ways to go, being able to easily see where even 60% of our values come from is a huge improvement. Phi nodes will obviously be the heaviest lift here. However, there are other places (lower_simd_width zipping, some raytracing setup, etc.) that can be converted to emit code in a more SSA-friendly manner, and that will give us some additional coverage.

Preliminary Results

Compile time in most fossils decreases by around 6%. A few data points (compile-time reduction):

Fallout 4: 11%
Witcher 3: 4.7%
Borderlands 3: 7.34%
Wolfenstein Youngblood: 4.6%

fossil-db on Alchemist shows:

-8.97% in spills, -7.49% in fills overall
-0.53% in SEND instructions overall, -0.62% in affected shaders (due to global CSE on memory loads)
-1.40% in cycles overall
-0.06% in instructions overall
SIMD32: net loss (+3103, -5682)

Merge request organization

This MR is the combination of several subseries:

Bug fixes to avoid regressions from later optimization changes
[VEC]: Patches to emit more SSA-friendly IR. The bulk of these are switching to LOAD_PAYLOAD to write entire vectors in one go instead of emitting a series of MOVs that each set up one component of a large VGRF. Some are fancier, like abstracting our 0x76543210 invocation ID setup, or splitting geometry shader output live ranges. I found the majority of these by examining cases in fossil-db and shader-db that regressed when using only the new passes instead of the old ones.
[DEFS]: The new opportunistic SSA def analysis pass. This is the infrastructure for working with SSA definitions in the backend. It's pretty simple and self-contained.
[CSE]: A new Global Common Subexpression Elimination (CSE) pass using the new infrastructure. It fully replaces our old local CSE pass.
[PROP]: A new SSA-based copy propagation pass. It has a number of advantages, but isn't fully ready to replace the old pass yet.

I have labeled them to make life easier for reviewers.
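As a rough illustration of why the [VEC] patches matter, the Python sketch below compares two ways of building a four-component vector under a simplified "written once, written whole" test. The `Write` record and `ssa_vgrfs` helper are hypothetical and ignore dominance entirely; they only model how partial writes hide a value from def analysis.

```python
from collections import Counter
from typing import List, NamedTuple


class Write(NamedTuple):
    """Hypothetical record of one instruction's destination write."""
    opcode: str
    dst: str           # name of the virtual register being written
    regs_written: int  # how many components this instruction writes
    dst_size: int      # total size of the destination register


def ssa_vgrfs(writes: List[Write]) -> List[str]:
    """Registers written exactly once by a write that covers all of them."""
    count = Counter(w.dst for w in writes)
    return sorted({w.dst for w in writes
                   if count[w.dst] == 1 and w.regs_written == w.dst_size})


# Four MOVs that each fill one component of the same 4-wide register.
mov_series = [Write("MOV", "vec", 1, 4) for _ in range(4)]

# Four whole-register temporaries gathered by a single LOAD_PAYLOAD.
load_payload = [
    Write("MOV", "x", 1, 1),
    Write("MOV", "y", 1, 1),
    Write("MOV", "z", 1, 1),
    Write("MOV", "w", 1, 1),
    Write("LOAD_PAYLOAD", "vec", 4, 4),
]

print(ssa_vgrfs(mov_series))    # []
print(ssa_vgrfs(load_payload))  # ['vec', 'w', 'x', 'y', 'z']
```

With the MOV series, no register has a single defining instruction. With LOAD_PAYLOAD, the vector and all four temporaries do.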

Why is this all in one MR? The first set of patches, which make the IR more SSA-friendly, introduce some significant code quality regressions. I originally had some patches to update the existing copy propagation and register coalescing passes to mitigate a lot of this, and it definitely can be done. However, the interplay between the old local CSE pass, register coalescing, and old copy propagation, each trying to coalesce, propagate, or CSE some LOAD_PAYLOAD instructions but not others, is really quite fragile, mostly because CSE can cause infinite loops in the optimization pass. In the end, I decided to simply replace the problematic CSE pass. With the new passes in place, the code quality results are solid.
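For readers unfamiliar with global CSE over SSA values, here is a small Python sketch of one textbook formulation: walk the dominator tree and keep a table of expressions that is scoped to each subtree. This is an illustration of the general technique, not the pass in this MR. The tuple-based instruction format, `global_cse`, and `dom_children` are hypothetical, and the sketch assumes every destination is an SSA def and every opcode is free of side effects.

```python
from typing import Dict, List, Tuple

# (destination, opcode, sources): a hypothetical, side-effect-free instruction.
Inst = Tuple[str, str, Tuple[str, ...]]


def global_cse(blocks: Dict[int, List[Inst]],
               dom_children: Dict[int, List[int]],
               entry: int) -> Dict[str, str]:
    """Return {redundant destination: earlier equivalent destination}."""
    replaced: Dict[str, str] = {}

    def visit(block: int, inherited: Dict[tuple, str]) -> None:
        # Copy the table so entries added here are visible only in blocks
        # that this block dominates, never in sibling subtrees.
        available = dict(inherited)
        for dst, op, srcs in blocks[block]:
            # Rewrite sources first so chained duplicates are also found.
            srcs = tuple(replaced.get(s, s) for s in srcs)
            key = (op, srcs)
            if key in available:
                replaced[dst] = available[key]
            else:
                available[key] = dst
        for child in dom_children.get(block, []):
            visit(child, available)

    visit(entry, {})
    return replaced


# Diamond CFG: block 0 dominates blocks 1, 2, and 3.
blocks = {
    0: [("a", "mul", ("x", "y")), ("b", "add", ("a", "z"))],
    1: [("d", "mul", ("x", "y")), ("e", "add", ("d", "w"))],
    2: [("f", "add", ("a", "w"))],
    3: [("g", "add", ("a", "z")), ("h", "add", ("a", "w"))],
}
dom_children = {0: [1, 2, 3]}

result = global_cse(blocks, dom_children, entry=0)
print(result)  # {'d': 'a', 'g': 'b'}
```

The duplicates of values computed in block 0 are removed across block boundaries, which a purely local pass cannot do. The `add(a, w)` in blocks 2 and 3 is kept because the earlier computations of it sit in blocks that do not dominate them.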

Future Work

There is a whole lot of future work that could be done. A few ideas that I have in this area:

~~Update the builder to automatically allocate new temporary registers so that most code is automatically "SSA-friendly".~~
Make a SIMD_ZIP virtual opcode and lowering pass so that SIMD lowering is more SSA-friendly.
[done locally] Make the new copy propagation pass handle PACK and PACK_HALF (suggested by Ian).
Update more optimization passes to use the new infrastructure.
Make opt_combine_constants more SSA-friendly. Currently it is a huge source of partial writes.
Calculate minimum bounds for register demand and identify high-pressure regions in the IR, and use this in a separate pre-spilling pass.
Update IR to have SSA def destinations that cannot be written to, so that we can avoid the analysis cost.
Add lane-aligned vector phi-nodes and an out-of-SSA pass so that optimizations can see all values.

Obviously, this is going to take a lot of work, but we are excited to be back working on the compiler. We hope to continue landing improvements that provide immediate benefits for users, while also moving the compiler infrastructure closer to where it needs to be in the long term.

Thanks in advance to any and all reviewers.

Edited Jun 17, 2024 by Kenneth Graunke

intel/brw: Opportunistic SSA def analysis and updated global CSE, copy propagation optimization passes