r300: LRP present even with .lower_flrp32=true

I looked again at whether we could get rid of the deadcode and dataflow optimize passes, at least for vertex shaders, and we are quite close (numbers from RV530):

total instructions in shared programs: 104924 -> 106185 (1.20%)
instructions in affected programs: 47765 -> 49026 (2.64%)

Last time I checked, two months ago, the increase was close to 5%.

However, some of the regressed shaders still need the passes for optimizations like this:

CONST[9] = {    10.0000   -10.0000     0.5000     0.0000 }
....
Vertex Program: after 'emulate negative addressing'
# Radeon Compiler Program
  0: MAD temp[0], const[4], input[0].xxxx, const[7];
  1: MAD temp[0], const[5], input[0].yyyy, temp[0];
  2: MUL temp[1], const[1], temp[0].yyyy;
  3: MAD temp[1], const[0], temp[0].xxxx, temp[1];
  4: MAD temp[1], const[2], temp[0].zzzz, temp[1];
  5: MAD temp[2], const[3], temp[0].wwww, temp[1];
  6: SLT temp[0].x, const[9].xxxx, const[8].xxxx;
  7: ADD temp[1].x, const[8].xxxx, const[9].yyyy;
  8: MUL temp[1].x, temp[1].xxxx, const[9].zzzz;
  9: LRP output[1].xyz, temp[0].xxxx, temp[1].xxxx, const[9].wwww;
 10: MOV output[1].w, temp[0].1111;
 11: MOV output[0], temp[2];
 12: MOV output[2], temp[2];
Vertex Program: after 'native rewrite'
# Radeon Compiler Program
  0: MAD temp[0], const[4], input[0].xxxx, const[7];
  1: MAD temp[0], const[5], input[0].yyyy, temp[0];
  2: MUL temp[1], const[1], temp[0].yyyy;
  3: MAD temp[1], const[0], temp[0].xxxx, temp[1];
  4: MAD temp[1], const[2], temp[0].zzzz, temp[1];
  5: MAD temp[2], const[3], temp[0].wwww, temp[1];
  6: SLT temp[0].x, const[9].xxxx, const[8].xxxx;
  7: ADD temp[1].x, const[8].xxxx, const[9].yyyy;
  8: MUL temp[1].x, temp[1].xxxx, const[9].zzzz;
  9: MAD temp[3].xyz, -temp[0].xxxx, const[9].wwww, const[9].wwww;
 10: MAD output[1].xyz, temp[0].xxxx, temp[1].xxxx, temp[3];
 11: MOV output[1].w, temp[0].1111;
 12: MOV output[0], temp[2];
 13: MOV output[2], temp[2];
Vertex Program: after 'deadcode'
# Radeon Compiler Program
  0: MAD temp[0], const[4], input[0].xxxx, const[7];
  1: MAD temp[0], const[5], input[0].yyyy, temp[0];
  2: MUL temp[1], const[1], temp[0].yyyy;
  3: MAD temp[1], const[0], temp[0].xxxx, temp[1];
  4: MAD temp[1], const[2], temp[0].zzzz, temp[1];
  5: MAD temp[2], const[3], temp[0].wwww, temp[1];
  6: SLT temp[0].x, const[9].x___, const[8].x___;
  7: ADD temp[1].x, const[8].x___, const[9].y___;
  8: MUL temp[1].x, temp[1].x___, const[9].z___;
  9: MAD temp[3].xyz, -temp[0].xxx_, const[9].www_, const[9].www_;
 10: MAD output[1].xyz, temp[0].xxx_, temp[1].xxx_, temp[3].xyz_;
 11: MOV output[1].w, temp[0].___1;
 12: MOV output[0], temp[2];
 13: MOV output[2], temp[2];
Vertex Program: after 'dataflow optimize'
# Radeon Compiler Program
  0: MAD temp[0], const[4], input[0].xxxx, const[7];
  1: MAD temp[0], const[5], input[0].yyyy, temp[0];
  2: MUL temp[1], const[1], temp[0].yyyy;
  3: MAD temp[1], const[0], temp[0].xxxx, temp[1];
  4: MAD temp[1], const[2], temp[0].zzzz, temp[1];
  5: MAD temp[2], const[3], temp[0].wwww, temp[1];
  6: SLT temp[0].x, const[9].x___, const[8].x___;
  7: ADD temp[1].x, const[8].x___, const[9].y___;
  8: MUL temp[1].x, temp[1].x___, const[9].z___;
  9: MUL output[1].xyz, temp[0].xxx_, temp[1].xxx_;
 10: MOV output[1].w, none.___1;
 11: MOV output[0], temp[2];
 12: MOV output[2], temp[2];
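
To spell out the arithmetic behind the dump above, here is a minimal sketch in Python (not compiler code). It assumes LRP computes a*b + (1-a)*c, which is what the two MADs emitted by 'native rewrite' add up to. The function names and the `const8_x` parameter are made up for illustration. Since const[9].w is 0.0 and the SLT result is either 0.0 or 1.0, the first MAD is always zero and the second collapses to the single MUL that 'dataflow optimize' produces:

```python
# Values of CONST[9] taken from the dump: x, y, z, w.
CONST9 = (10.0, -10.0, 0.5, 0.0)


def lrp(a, b, c):
    """LRP as assumed here: a*b + (1 - a)*c."""
    return a * b + (1.0 - a) * c


def lrp_native(a, b, c):
    """The two-MAD sequence from the 'native rewrite' dump."""
    tmp = -a * c + c      # MAD temp[3].xyz, -a, c, c
    return a * b + tmp    # MAD output[1].xyz, a, b, temp[3]


def lrp_folded(a, b):
    """What remains after 'dataflow optimize' when c is the constant 0.0."""
    return a * b          # MUL output[1].xyz, a, b


def shader_tail(const8_x, variant):
    """Instructions 6-9 of the dump for one hypothetical value of const[8].x."""
    a = 1.0 if CONST9[0] < const8_x else 0.0   # SLT temp[0].x
    b = (const8_x + CONST9[1]) * CONST9[2]     # ADD, then MUL into temp[1].x
    c = CONST9[3]                              # const[9].w, i.e. 0.0
    if variant == "lrp":
        return lrp(a, b, c)
    if variant == "native":
        return lrp_native(a, b, c)
    return lrp_folded(a, b)


# All three forms agree for inputs on both sides of the SLT threshold.
for x in (-3.0, 5.0, 10.0, 20.0, 123.5):
    assert shader_tail(x, "lrp") == shader_tail(x, "native") == shader_tail(x, "folded")
```

So without the dataflow pass we pay one extra MAD for every LRP whose third operand is a known zero, which matches the instruction count difference between the 'native rewrite' and 'dataflow optimize' dumps.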

The question is why we get LRP at all. It should be lowered in NIR; as far as I can see, we set .lower_flrp32=true properly.

There are some other places where we still need the deadcode and dataflow optimize passes, for example when cleaning up after rc_copy_output, but that one should be easy to fix there.

Edited Mar 09, 2022 by Pavel Ondračka
