nir: allow custom lowering in nir_lower_phis_to_scalar (!20881) · Merge requests · Mesa / mesa

Pavel Ondračka requested to merge ondracka/mesa:nir_scalarize_phis_in_loops_only into main Jan 24, 2023

Allow to pass a custom function to override the logic behind which phis to scalarize (allow overriding the current logic in should_lower_phi) and pipe it also through ntt.

The intent of this is to allow r300 driver to unroll loops that come from nine/ttn without enabling scalarizing all phis in general (on r300 it would come with ~5% instruction number regression).

The problematic loops look like this:

impl main {
	block block_0:
	/* preds: */
	vec4 32 ssa_0 = load_const (0x00000000, 0x00000000, 0x00000000, 0x00000000) = (0.000000, 0.000000, 0.000000, 0.000000)
	vec1 32 ssa_1 = load_const (0x3f000000 = 0.500000)
	vec1 32 ssa_2 = load_const (0x3f800000 = 1.000000)
	/* succs: block_1 */
	loop {
		block block_1:
		/* preds: block_0 block_4 */
		vec4 32 ssa_3 = phi block_0: ssa_0, block_4: ssa_7
		vec1 32 ssa_4 = slt ssa_3.y, ssa_1
		vec1 32 ssa_9 = load_const (0x00000000 = 0.000000)
		vec1 32 ssa_10 = seq ssa_4, ssa_9
		/* succs: block_2 block_3 */
		if ssa_10 {
			block block_2:
			/* preds: block_1 */
			break
			/* succs: block_5 */
		} else {
			block block_3:
			/* preds: block_1 */
			/* succs: block_4 */
		}
		block block_4:
		/* preds: block_3 */
		vec1 32 ssa_6 = fadd ssa_3.y, ssa_2
		vec4 32 ssa_7 = vec4 ssa_3.x, ssa_6, ssa_3.z, ssa_3.w
		/* succs: block_1 */
	}
	block block_5:
	/* preds: block_2 */
	vec1 32 ssa_8 = load_const (0x00000000 = 0.000000)
	intrinsic store_output (ssa_3, ssa_8) (base=0, wrmask=xyzw /*15*/, component=0, src_type=float32 /*160*/, io location=2 slots=1 /*130*/, xfb() /*0*/, xfb2() /*0*/)	/* gl_FragColor */
	/* succs: block_6 */
	block block_6:
}

this comes from the tests/spec/glsl-1.10/execution/loops/glsl-fs-loop-vec4-counter.shader_test piglit which I crafted to reproduce this without nine. I haven't actually saw any GLSL shader to hit this, but any loop from nine will look similar (just due to how nine and ttn works).

I'm not sure if the approach in this MR is clean enough, @tarceri seemed to suggest in #7222 (closed) that scalarizing is the way to go so hopefully this could work and does not pollute the common code too much for this special case. I was also considering and trying adding some special case for this directly in loop unrolling, but that was way above my skill and the other option I could think of to scalarize everything and write a general phi vectorizing pass to clean after was also looking too scary (I was not able to find any phi vectorization).

I placed the should_lower_phi override function directly to r300, because while other ntt users could benefit in theory, I don't think any of i915, nvc0, or virgl actually works with nine so I think r300 is the only user (r600 scalarizes phis by default it seems and I assume whoever has vulkan just uses dxvk for D3D).

As always, I hardly know what I'm doing when it comes to NIR so any suggestions are welcomed.

BTW r300 driver really needs to unroll the nine loops for several reasons:

there is a lot of relative addressing in D3D9 using the loop adress register, the R500 hardware supports this just fine, but there is no way to pipe this through nir (or tgsi for that matter) and because we don't have normal address register in fragment shaders we have to lower to costly if ladders. This problem goes away when unrolled.
Also, while the indirect lowering works with OpenGL, it is somehow broken when it comes to the nine/ttn combo (see #6371), so it is not only slow but broken as well.
for vertex shaders, we have no jumps, so we run every loop fixed number of iterations (currently actually the maximum number of iterations the hardware can do with the later iterations doing nothing using the predicate logic), so not unrolling is also performance killer here.

nir: allow custom lowering in nir_lower_phis_to_scalar

Merge request reports