r300, nir: missing per-channel constant folding
Here is an r300 NIR snippet from the following Lightsmark shader: 13.shader_test
....
32 %15 = fsin %14
32 %16 = fcos %14
32x3 %17 = vec3 %15, %16, %0 (0x0)
32x4 %18 = @load_ubo_vec4 (%0 (0x0), %0 (0x0)) (access=none, base=0, component=0)
32x3 %19 = fmul %17, %18.www
32x3 %20 = vec3 %16, %15, %0 (0x0)
32x3 %21 = fmul %20, %18.xyz
32x4 %22 = @load_interpolated_input (%1, %0 (0x0)) (base=2, component=0, dest_type=float32, io location=VARYING_SLOT_VAR0 slots=1) // shadowCoord
32 %23 = frcp %22.w
32x3 %24 = ffma %22.xyz, %23.xxx, %19
32x4 %25 = (float32)tex %24 (backend1), 0 (texture), 0 (sampler)
32x3 %26 = fneg %19
32x3 %27 = ffma %22.xyz, %23.xxx, %26
32x4 %28 = (float32)tex %27 (backend1), 0 (texture), 0 (sampler)
32 %29 = fadd %25.z, %28.z
32x3 %30 = ffma %22.xyz, %23.xxx, %21
....
We construct two vectors with 0 in z (%17 and %20) and then multiply each of them (%19 and %21). We should be able to figure out that the z channel is just a*0 and instead construct a new vector later, using the extra channel only when we really need it.
This is what we end up with after translation to the backend IR:
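To illustrate the per-channel reasoning, here is a minimal toy model in Python. It is not Mesa or NIR code: the helper names (`is_zero`, `fmul_channels`, `live_channels`) are made up for this sketch, and a channel is modelled as either a symbolic SSA name or a known float constant. It also assumes that folding a*0 to 0 is acceptable, which strictly only holds when NaN, Inf and signed zero can be ignored.

```python
# Toy model, NOT Mesa code: a channel is a symbolic name (str)
# or a known float constant.

def is_zero(ch):
    """True if the channel is a known constant zero."""
    return isinstance(ch, float) and ch == 0.0

def fmul_channels(a, b):
    """Per-channel fmul. Assumes a*0 may be folded to 0 (inexact math)."""
    out = []
    for x, y in zip(a, b):
        if is_zero(x) or is_zero(y):
            out.append(0.0)              # known zero, no runtime mul needed
        elif isinstance(x, float) and isinstance(y, float):
            out.append(x * y)            # both constant: fold completely
        else:
            out.append(f"({x}*{y})")     # still needs a runtime multiply
    return out

def live_channels(vec):
    """Indices of channels that still need to be computed at runtime."""
    return [i for i, ch in enumerate(vec) if not isinstance(ch, float)]

# %17 = vec3 %15, %16, 0 ; %19 = fmul %17, %18.www
v17 = ["%15", "%16", 0.0]
v18_www = ["%18.w", "%18.w", "%18.w"]
v19 = fmul_channels(v17, v18_www)
```

In this model only channels 0 and 1 of `v19` remain live, so the multiply could be emitted as a two-channel operation and the zero added back only by the consumer that needs a third channel.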
const[3] FLT32 { 5.7000, 8.1000, 0.1592, 0.0000}
...
8: SIN temp[1].x, temp[0].xxxx;
9: COS temp[1].y, temp[0].xxxx;
10: MOV temp[1].z, const[3].wwww;
11: MUL temp[0].xyz, temp[1].xyzz, const[0].wwwx;
12: MOV temp[1].xy, temp[1].yxxx;
13: MOV temp[1].z, const[3].wwww;
14: MUL temp[1].xyz, temp[1].xyzz, const[0].xyzz;
The r300 backend is quite good at handling zeros, so we avoid a separate instruction for the second mov 0. For the first one, however, we can't merge it with the scalar math instructions, and the backend can't do per-channel copy propagation:
8: SIN temp[1].x, temp[0].x___;
9: COS temp[1].y, temp[0]._x__;
10: MOV temp[1].z, none.__0_;
11: MUL temp[0].xyz, temp[1].xyz_, const[0].www_;
12: MUL temp[1].xyz, temp[1].yx0_, const[0].xyz_;
We could probably teach the backend to copy propagate the zero and replace the temp[1].xyz_ swizzle with temp[1].xy0_, but IMO a better approach would be to solve this in NIR and also shrink the muls, which might also reduce register usage.
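As a rough illustration of the backend-side alternative, the swizzle rewrite can be sketched with a toy model in Python. This is not the r300 compiler's code: `propagate_zero` and `still_read` are hypothetical helpers, swizzles are plain strings as printed in the dump above, and the sketch assumes the rewrite is only applied to reads between the MOV and the next write to that channel.

```python
# Toy model, NOT r300 backend code: swizzles are 4-character strings
# using x/y/z/w, the inline constant '0', and '_' for unused slots.

def propagate_zero(swizzle, zero_channels):
    """Replace reads of known-zero channels with the inline constant '0'."""
    return "".join("0" if c in zero_channels else c for c in swizzle)

def still_read(channel, swizzles):
    """True if any source swizzle still reads the given channel."""
    return any(channel in s for s in swizzles)

# Reads of temp[1] after 'MOV temp[1].z, none.__0_' in the dump above.
reads_of_temp1 = ["xyz_", "yx0_"]

# temp[1].z is known to be zero, so reads of z become the constant 0.
rewritten = [propagate_zero(s, {"z"}) for s in reads_of_temp1]

# If nothing reads temp[1].z any more, the MOV is dead.
mov_needed = still_read("z", rewritten)
```

In this model the first MUL would read temp[1].xy0_, after which no source reads temp[1].z and the MOV could be dropped. Doing the equivalent in NIR would additionally let the multiplies shrink to two channels.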