r300, nir: missing per-channel constant folding

Here is an r300 NIR snippet from the following Lightsmark shader: 13.shader_test

    ....
    32    %15 = fsin %14
    32    %16 = fcos %14
    32x3  %17 = vec3 %15, %16, %0 (0x0)
    32x4  %18 = @load_ubo_vec4 (%0 (0x0), %0 (0x0)) (access=none, base=0, component=0)
    32x3  %19 = fmul %17, %18.www
    32x3  %20 = vec3 %16, %15, %0 (0x0)
    32x3  %21 = fmul %20, %18.xyz
    32x4  %22 = @load_interpolated_input (%1, %0 (0x0)) (base=2, component=0, dest_type=float32, io location=VARYING_SLOT_VAR0 slots=1)  // shadowCoord
    32    %23 = frcp %22.w
    32x3  %24 = ffma %22.xyz, %23.xxx, %19
    32x4  %25 = (float32)tex %24 (backend1), 0 (texture), 0 (sampler)
    32x3  %26 = fneg %19
    32x3  %27 = ffma %22.xyz, %23.xxx, %26
    32x4  %28 = (float32)tex %27 (backend1), 0 (texture), 0 (sampler)
    32    %29 = fadd %25.z, %28.z
    32x3  %30 = ffma %22.xyz, %23.xxx, %21
    ....

We construct two vectors with 0 in the z channel and then multiply each of them. We should be able to figure out that the z channel computes a*0, and instead construct the new vector later, using the extra channel only when we really need it.
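To make the idea concrete, here is a minimal sketch of per-channel folding on a toy representation. It is not Mesa code: the function name `fold_fmul_channels` and the encoding of channels (a float for a constant, a string for an unknown SSA value) are assumptions made for illustration. It also assumes the multiply is allowed to treat a*0 as 0, which is not strictly true under IEEE rules when a is NaN or Inf.

```python
# Toy model, not Mesa code: a channel is either a float constant or a
# string naming an unknown scalar value (for example "%15").

def fold_fmul_channels(a, b):
    """Multiply two vectors channel by channel, folding x*0 to 0.

    Returns (channels, live), where live lists the channel indices that
    still need a real multiply.
    """
    out = []
    live = []
    for i, (x, y) in enumerate(zip(a, b)):
        x_const = isinstance(x, float)
        y_const = isinstance(y, float)
        if (x_const and x == 0.0) or (y_const and y == 0.0):
            # a*0 -> 0; assumes inexact math is acceptable here.
            out.append(0.0)
        elif x_const and y_const:
            out.append(x * y)
        else:
            out.append(("fmul", x, y))
            live.append(i)
    return out, live


# %17 = vec3 %15, %16, 0  followed by  %19 = fmul %17, %18.www
vec17 = ["%15", "%16", 0.0]
ubo_www = ["%18.w", "%18.w", "%18.w"]
folded, live = fold_fmul_channels(vec17, ubo_www)
```

In this sketch only channels 0 and 1 remain live, so the fmul could be shrunk to two components and the constant 0 re-attached later, only where a consumer actually reads z.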

This is what we end up with after translation to the backend IR:

    const[3] FLT32 {    5.7000,     8.1000,     0.1592,     0.0000}
    ...
      8: SIN temp[1].x, temp[0].xxxx;
      9: COS temp[1].y, temp[0].xxxx;
     10: MOV temp[1].z, const[3].wwww;
     11: MUL temp[0].xyz, temp[1].xyzz, const[0].wwwx;
     12: MOV temp[1].xy, temp[1].yxxx;
     13: MOV temp[1].z, const[3].wwww;
     14: MUL temp[1].xyz, temp[1].xyzz, const[0].xyzz;

The r300 backend is quite good at handling zeros, so we avoid needing a separate instruction for the second mov 0. The first one, however, can't be merged with the scalar math instruction, and the backend can't do per-channel copy propagation:

      8: SIN temp[1].x, temp[0].x___;
      9: COS temp[1].y, temp[0]._x__;
     10: MOV temp[1].z, none.__0_;
     11: MUL temp[0].xyz, temp[1].xyz_, const[0].www_;
     12: MUL temp[1].xyz, temp[1].yx0_, const[0].xyz_;

We could probably teach the backend to copy propagate the zero and replace the temp[1].xyz_ swizzle with temp[1].xy0_, but IMO a better approach would be to solve this in NIR and also shrink the muls, which might also reduce register usage.
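For comparison, the backend-side alternative could look roughly like the sketch below. It is a toy model and not the r300 compiler's data structures: the tuple layout and the names `propagate_zero_movs` and `drop_dead_movs` are hypothetical. It handles straight-line code only and assumes temporaries are not live past the end of the list.

```python
# Toy model of a straight-line instruction list.
# Instruction: (opcode, dst_reg, dst_mask, [(src_reg, swizzle), ...])
# Swizzle characters: x y z w select a channel, '0' is constant zero,
# '_' marks an unused slot.

def propagate_zero_movs(insts):
    """Rewrite reads of channels known to hold 0 into the '0' swizzle."""
    known_zero = set()  # (reg, channel) pairs currently known to be 0
    out = []
    for op, dst, mask, srcs in insts:
        new_srcs = []
        for reg, swz in srcs:
            swz = "".join("0" if (reg, c) in known_zero else c for c in swz)
            new_srcs.append((reg, swz))
        # A write invalidates what we knew about the written channels.
        for c in mask:
            known_zero.discard((dst, c))
        if op == "MOV":
            src_swz = new_srcs[0][1]
            for c in mask:
                if src_swz["xyzw".index(c)] == "0":
                    known_zero.add((dst, c))
        out.append((op, dst, mask, new_srcs))
    return out


def drop_dead_movs(insts):
    """Remove MOVs whose written channels are never read afterwards.

    Conservative: a read after a later overwrite still keeps the MOV.
    """
    out = []
    for i, inst in enumerate(insts):
        op, dst, mask, _ = inst
        if op == "MOV":
            read_later = any(
                reg == dst and c in swz
                for _, _, _, later_srcs in insts[i + 1:]
                for reg, swz in later_srcs
                for c in mask
            )
            if not read_later:
                continue
        out.append(inst)
    return out


insts = [
    ("SIN", "temp[1]", "x", [("temp[0]", "x___")]),
    ("COS", "temp[1]", "y", [("temp[0]", "_x__")]),
    ("MOV", "temp[1]", "z", [("none", "__0_")]),
    ("MUL", "temp[0]", "xyz", [("temp[1]", "xyz_"), ("const[0]", "www_")]),
    ("MUL", "temp[1]", "xyz", [("temp[1]", "yx0_"), ("const[0]", "xyz_")]),
]
result = drop_dead_movs(propagate_zero_movs(insts))
```

On this input the first MUL ends up reading temp[1].xy0_ and the MOV disappears. This only removes the instruction, though; the MUL still writes three channels, which is why shrinking it in NIR looks like the more attractive fix.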
