r300: LRP present even with .lower_flrp32=true
I was checking again whether we could get rid of the deadcode and dataflow optimize passes, at least for VS, and we are quite close (with RV530):
total instructions in shared programs: 104924 -> 106185 (1.20%)
instructions in affected programs: 47765 -> 49026 (2.64%)
Last time I checked, two months ago, the increase was close to 5%.
However, some of the regressed shaders still need these passes for optimizations like this:
CONST[9] = { 10.0000 -10.0000 0.5000 0.0000 }
....
Vertex Program: after 'emulate negative addressing'
# Radeon Compiler Program
0: MAD temp[0], const[4], input[0].xxxx, const[7];
1: MAD temp[0], const[5], input[0].yyyy, temp[0];
2: MUL temp[1], const[1], temp[0].yyyy;
3: MAD temp[1], const[0], temp[0].xxxx, temp[1];
4: MAD temp[1], const[2], temp[0].zzzz, temp[1];
5: MAD temp[2], const[3], temp[0].wwww, temp[1];
6: SLT temp[0].x, const[9].xxxx, const[8].xxxx;
7: ADD temp[1].x, const[8].xxxx, const[9].yyyy;
8: MUL temp[1].x, temp[1].xxxx, const[9].zzzz;
9: LRP output[1].xyz, temp[0].xxxx, temp[1].xxxx, const[9].wwww;
10: MOV output[1].w, temp[0].1111;
11: MOV output[0], temp[2];
12: MOV output[2], temp[2];
Vertex Program: after 'native rewrite'
# Radeon Compiler Program
0: MAD temp[0], const[4], input[0].xxxx, const[7];
1: MAD temp[0], const[5], input[0].yyyy, temp[0];
2: MUL temp[1], const[1], temp[0].yyyy;
3: MAD temp[1], const[0], temp[0].xxxx, temp[1];
4: MAD temp[1], const[2], temp[0].zzzz, temp[1];
5: MAD temp[2], const[3], temp[0].wwww, temp[1];
6: SLT temp[0].x, const[9].xxxx, const[8].xxxx;
7: ADD temp[1].x, const[8].xxxx, const[9].yyyy;
8: MUL temp[1].x, temp[1].xxxx, const[9].zzzz;
9: MAD temp[3].xyz, -temp[0].xxxx, const[9].wwww, const[9].wwww;
10: MAD output[1].xyz, temp[0].xxxx, temp[1].xxxx, temp[3];
11: MOV output[1].w, temp[0].1111;
12: MOV output[0], temp[2];
13: MOV output[2], temp[2];
Vertex Program: after 'deadcode'
# Radeon Compiler Program
0: MAD temp[0], const[4], input[0].xxxx, const[7];
1: MAD temp[0], const[5], input[0].yyyy, temp[0];
2: MUL temp[1], const[1], temp[0].yyyy;
3: MAD temp[1], const[0], temp[0].xxxx, temp[1];
4: MAD temp[1], const[2], temp[0].zzzz, temp[1];
5: MAD temp[2], const[3], temp[0].wwww, temp[1];
6: SLT temp[0].x, const[9].x___, const[8].x___;
7: ADD temp[1].x, const[8].x___, const[9].y___;
8: MUL temp[1].x, temp[1].x___, const[9].z___;
9: MAD temp[3].xyz, -temp[0].xxx_, const[9].www_, const[9].www_;
10: MAD output[1].xyz, temp[0].xxx_, temp[1].xxx_, temp[3].xyz_;
11: MOV output[1].w, temp[0].___1;
12: MOV output[0], temp[2];
13: MOV output[2], temp[2];
Vertex Program: after 'dataflow optimize'
# Radeon Compiler Program
0: MAD temp[0], const[4], input[0].xxxx, const[7];
1: MAD temp[0], const[5], input[0].yyyy, temp[0];
2: MUL temp[1], const[1], temp[0].yyyy;
3: MAD temp[1], const[0], temp[0].xxxx, temp[1];
4: MAD temp[1], const[2], temp[0].zzzz, temp[1];
5: MAD temp[2], const[3], temp[0].wwww, temp[1];
6: SLT temp[0].x, const[9].x___, const[8].x___;
7: ADD temp[1].x, const[8].x___, const[9].y___;
8: MUL temp[1].x, temp[1].x___, const[9].z___;
9: MUL output[1].xyz, temp[0].xxx_, temp[1].xxx_;
10: MOV output[1].w, none.___1;
11: MOV output[0], temp[2];
12: MOV output[2], temp[2];
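To spell out what happens in the dumps above: 'native rewrite' expands LRP into two MADs, and since const[9].w is 0.0 the first MAD always evaluates to zero, so the second one reduces to the plain MUL that survives 'dataflow optimize'. Below is a minimal Python sketch of that arithmetic on scalars. The helper names (mad, lrp, lrp_native) are made up for illustration and are not compiler functions; the LRP semantics are the ones implied by the expansion in the dump.

```python
# Scalar model of the instructions in the dump (illustrative names only).

def mad(a, b, c):
    """MAD dst, a, b, c  ->  a * b + c"""
    return a * b + c

def lrp(a, b, c):
    """LRP dst, a, b, c  ->  a * b + (1 - a) * c, as implied by the expansion."""
    return a * b + (1.0 - a) * c

def lrp_native(a, b, c):
    """The two-MAD sequence emitted by 'native rewrite' (instructions 9 and 10)."""
    tmp = mad(-a, c, c)      # MAD temp[3], -a, c, c   ->  c - a * c
    return mad(a, b, tmp)    # MAD output, a, b, temp[3]

CONST9_W = 0.0  # const[9].w from the CONST[9] line in the dump

# SLT produces 0.0 or 1.0; b stands in for temp[1].x.
for a in (0.0, 1.0):
    for b in (-3.5, 0.0, 2.25):
        # With c == 0.0 the first MAD is zero and the whole thing is just a * b.
        assert lrp_native(a, b, CONST9_W) == lrp(a, b, CONST9_W) == a * b
```

So without the two passes we pay one extra instruction in this shader for a term that is always zero.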
The question is why we get LRP at all. It should be lowered in NIR; AFAICS we set .lower_flrp32=true properly.
There are some other places where we still need deadcode and dataflow optimize, like when cleaning up after rc_copy_output,
but that one should be easy to fix there.
Edited by Pavel Ondračka