nir,agx: rewrite AGX address mode handling

Alyssa Rosenzweig requested to merge alyssa/mesa:agx/address-mode-rewrite into main

AGX load/stores support a single family of addressing modes:

 64-bit base + sign/zero-extend(32-bit) << (format shift + optional shift)

This is a base-index addressing mode, where the index is at minimum in units of elements (never bytes, unless we're doing 8-bit load/stores). Both the base and the resulting address must be aligned to the format size; because the shift is mandatory, alignment of the base is equivalent to alignment of the final address, which lower_mem_access_bit_size takes care of anyhow.
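
To make the formula concrete, here is a minimal sketch of the mode in Python (illustration only, not driver code; the masks stand in for the fixed 32-/64-bit hardware widths):

    # Minimal sketch of the addressing mode above; illustration only, not
    # driver code. Masks model the fixed 32-/64-bit hardware widths.
    def agx_address(base, index, sign_extend, format_shift, extra_shift=0):
        index &= 0xffffffff                      # the index is a 32-bit value
        if sign_extend and (index & 0x80000000):
            index -= 1 << 32                     # sign-extend to 64 bits first
        # The shift happens after the extension, i.e. it is a 64-bit shift.
        return (base + (index << (format_shift + extra_shift))) & (2**64 - 1)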

The other key thing to note is that this is a 64-bit shift, after the sign- or zero-extension of the 32-bit index. That means that AGX does NOT implement

 64-bit base + sign/zero-extend(32-bit << shift)

This has sweeping implications.
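
A small worked example (illustrative values) shows where the two formulas diverge: once the scaled index no longer fits in 32 bits, the 32-bit shift wraps while the 64-bit shift does not.

    base, index, shift = 0x1000, 0x40000000, 2

    # What AGX implements: extend the 32-bit index, then shift in 64 bits.
    agx     = base + ((index & 0xffffffff) << shift)      # 0x100001000

    # What AGX does NOT implement: shift in 32 bits, then zero-extend.
    not_agx = base + ((index << shift) & 0xffffffff)      # 0x1000, the shift wrapped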

For addressing math from C-based languages (including OpenCL C), the AGX mode is more helpful, since we tend to get 64-bit shifts instead of 32-bit shifts. However, for addressing math coming from GLSL, the AGX mode is rather annoying: we know UBOs/SSBOs are at most 4GB, so nir_lower_io & friends use 32-bit byte indexing throughout. It's tricky to teach them to do otherwise, and it would not be optimal either, since 64-bit adds and shifts are usually much more expensive than 32-bit ones on AGX, except when fused into the load/store.

So we don't want 32-bit NIR, since then we can't use the hardware addressing mode at all. We also don't want 64-bit NIR, since then we get excessive 64-bit math from deep deref chains in complex struct/array cases. Instead, we want a middle ground: 32-bit operations that are guaranteed not to overflow 32 bits and can therefore be losslessly promoted to 64-bit.

We can make that no-overflow guarantee as a consequence of the maximum UBO/SSBO size, and indeed Mesa relies on this already all over the place. So, in this series, we convert Mesa to produce relaxed amul/aadd opcodes for addressing math. Then, we rewrite our address mode pattern matching to fuse AGX address modes, promoting 32-bit amul/aadd to 64-bit when necessary.
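
For illustration, the promotion argument boils down to the identity below (plain Python, with zext as a stand-in for u2u64; not NIR code):

    # Stand-in for u2u64: treat v as an unsigned 32-bit value.
    def zext(v):
        return v & 0xffffffff

    # An addressing calculation known to stay below 2^32, e.g. because
    # UBOs/SSBOs are at most 4GB: extending the 32-bit result equals doing
    # the math wide, so the 32-bit op can be promoted to 64-bit and fused.
    index, stride = 1000, 16
    assert index * stride < 1 << 32
    assert zext(index * stride) == zext(index) * zext(stride)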

The actual pattern matching is rewritten. The old code was brittle, handwritten nir_scalar chasing, based on a faulty model of the hardware (with the 32-bit shift). We delete it all; it's broken. In the new approach, we add NIR pseudo-opcodes for address math (ulea_agx/ilea_agx), which we pattern-match with NIR algebraic rules. Then the chasing required to fuse LEAs into load/stores is trivial because we never go deeper than one level. After fusing, we lower the leftover lea/amul/aadd opcodes and let regular nir_opt_algebraic take it from here.
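
For a flavour of what such a rule looks like, here is a hypothetical sketch in the nir_algebraic Python DSL; the opcode signature and the exact pattern are assumptions, not necessarily what the series implements:

    # Hypothetical rule: fuse "64-bit base + zero-extended 32-bit index times
    # a power-of-two stride" into a ulea_agx pseudo-op carrying the shift.
    # The opcode names come from the series, but this signature/pattern is a guess.
    fuse_lea = [
        (('iadd', 'base@64', ('amul', ('u2u64', 'index@32'), 8)),
         ('ulea_agx', 'base', 'index', 3)),
    ]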

We do need to be very careful around pass order to make sure things like load/store vectorization still happen; some passes are shuffled in this commit to make that work. We also need to clean up amul/aadd before fusing, since we specifically do not have nir_opt_algebraic do so: the entire point of the pseudo-opcodes is to make nir_opt_algebraic ignore them until we've had a chance to fuse. If we simply used the .nuw bit on iadd/imul, nir_opt_algebraic would "optimize" things and lose the bit, and then we would fail to fuse addressing modes, which is a much more expensive failure than anything nir_opt_algebraic can do for us. I don't know what the "optimal" pass order for AGX would look like at this point, but what we have here is good enough for now and is a net positive for shader-db.

That all ends up being much less code and much simpler code, while fixing the soundness holes in the old code and optimizing a significantly richer set of addressing calculations. Now we don't just optimize GL/VK modes, but CL ones too. This is crucial even for GL/VK performance, since we rely on CL via libagx even in graphics shaders.

Terraintessellation is up 10% to ~310 fps, which is quite nice.

The following stats are for the series as a whole, including this change + the libagx change + the NIR changes building up to it... but not including the SSBO vectorizer stats or the IC modelling fix. In other words, these are the stats for "rewriting address mode handling". This is on OpenGL, and since the old code was targeted at GL, anything that's not a loss is good enough - we need this for the soundness fix regardless.

total instructions in shared programs: 2751295 -> 2750352 (-0.03%)
instructions in affected programs: 373401 -> 372458 (-0.25%)
helped: 721
HURT: 65
Instructions are helped.

total alu in shared programs: 2274468 -> 2273810 (-0.03%)
alu in affected programs: 309155 -> 308497 (-0.21%)
helped: 719
HURT: 68
Alu are helped.

total fscib in shared programs: 2272785 -> 2272127 (-0.03%)
fscib in affected programs: 309155 -> 308497 (-0.21%)
helped: 719
HURT: 68
Fscib are helped.

total ic in shared programs: 612698 -> 602870 (-1.60%)
ic in affected programs: 73568 -> 63740 (-13.36%)
helped: 1124
HURT: 90
Ic are helped.

total bytes in shared programs: 21488668 -> 21476424 (-0.06%)
bytes in affected programs: 3059418 -> 3047174 (-0.40%)
helped: 759
HURT: 96
Bytes are helped.

total regs in shared programs: 865148 -> 865114 (<.01%)
regs in affected programs: 1603 -> 1569 (-2.12%)
helped: 10
HURT: 9
Inconclusive result (value mean confidence interval includes 0).

total uniforms in shared programs: 2120735 -> 2120792 (<.01%)
uniforms in affected programs: 22752 -> 22809 (0.25%)
helped: 76
HURT: 49
Inconclusive result (value mean confidence interval includes 0).

total threads in shared programs: 27613312 -> 27613504 (<.01%)
threads in affected programs: 1536 -> 1728 (12.50%)
helped: 3
HURT: 0