r300: make glamor work on R400 - part1 (compact constants)
The first three commits are from !28428 (merged)
This series is the first part of the work to fix glamor on R400 (specifically shaders/glamor/82.shader_test and shaders/glamor/88.shader_test). The biggest issue right now is fitting into the limit of 32 vec4 constant slots. To even hit that limit, though, one first needs to unroll the loops, which does not work right now. I'm contemplating a few options there, but for now I went ahead and improved the constant situation, which definitely needs a driver-side solution.
To reproduce the constant layout shown below, add a constant terminator to the loop so that NIR can figure out the unrolling. Original version:
for(i = 0; i < n_stop - 1; i++) {
    if(stop_len < stops[i])
        break;
}
New version:
for(i = 0; i < 17; i++) {
    if(stop_len < stops[i] || i >= n_stop - 1)
        break;
}
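A quick way to sanity-check the rewrite is to compare the exit index of both loop forms. The sketch below is plain C rather than GLSL, and assumes the intent is for both loops to exit at the same index, which requires the extra exit test to fire once i reaches n_stop - 1. The function names and the MAX_STOPS constant are made up for illustration; only the bound of 17 comes from the loop above.

```c
#include <assert.h>

#define MAX_STOPS 17 /* constant bound used in the rewritten loop */

/* Original form: the trip count depends on the uniform n_stop. */
int find_stop_original(const float *stops, int n_stop, float stop_len)
{
    int i;

    assert(n_stop - 1 <= MAX_STOPS);
    for (i = 0; i < n_stop - 1; i++) {
        if (stop_len < stops[i])
            break;
    }
    return i;
}

/* Bounded form: constant trip count, n_stop only appears in the break test.
 * stops must have MAX_STOPS readable entries, since stops[i] is read before
 * the index test. */
int find_stop_bounded(const float *stops, int n_stop, float stop_len)
{
    int i;

    assert(n_stop - 1 <= MAX_STOPS);
    for (i = 0; i < MAX_STOPS; i++) {
        if (stop_len < stops[i] || i >= n_stop - 1)
            break;
    }
    return i;
}
```

Both functions return the same index for any n_stop up to 18, so the bounded form only changes what the compiler can see, not the result.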
This is how the constants look for shaders/glamor/88.shader_test. We have 45 uniforms (a lot of them use just the x channel) and 11 immediates, 56 in total. The immediates are:
CONST[45] = { 0.000010 0.000000 1.000000 0.500000 }
CONST[46] = { 0.000000 1.000000 3.000000 2.000000 }
CONST[47] = { 16.000000 17.000000 15.000000 14.000000 }
CONST[48] = { 13.000000 12.000000 11.000000 10.000000 }
CONST[49] = { 9.000000 8.000000 7.000000 6.000000 }
CONST[50] = { 5.000000 4.000000 -1.000000 0.000001 }
CONST[51] = { 9.000000 4.000000 2.000000 1.000000 }
CONST[52] = { 3.000000 6.000000 5.000000 7.000000 }
CONST[53] = { 8.000000 13.000000 11.000000 10.000000 }
CONST[54] = { 12.000000 15.000000 14.000000 16.000000 }
CONST[55] = { 17.000000 4.000000 2.000000 1.000000 }
This series does it in two steps: first it places scalar uniforms (aka RC_CONSTANT_EXTERNAL) into free slots in other uniforms, and a later commit does the same for immediates (while checking for duplicates). To ease review, the commit that adds the groundwork is also split (a new remapping table to allow per-channel remapping, and updated emit paths to allow emitting uniforms one by one instead of as a full vec4). We only compact scalar uniforms, since that way we don't have to worry about valid/invalid swizzles.
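The idea of the first step can be sketched in standalone C. This is not the code from the series: the struct, the function names, and the two-pass greedy strategy are assumptions made for illustration. Vector uniforms keep their channels and get consecutive slots, then every uniform that only uses x is moved into the first free channel found, and a per-channel remap table records where each old channel ended up.

```c
#include <assert.h>

#define CHANNELS 4
#define MAX_CONSTS 256

/* Hypothetical remap entry: new location of one channel of an old constant. */
struct chan_loc {
    int slot; /* new vec4 slot, or -1 if the old channel is unused */
    int chan; /* new channel, 0..3 for x..w */
};

/* First free channel in a 4-bit occupancy mask, or -1 if the slot is full. */
static int first_free_channel(unsigned occupied)
{
    for (int c = 0; c < CHANNELS; c++)
        if (!(occupied & (1u << c)))
            return c;
    return -1;
}

/* usemask[i] has bit c set when channel c of old constant i is read.
 * Fills remap and returns the number of vec4 slots after compaction. */
int compact_scalar_uniforms(const unsigned *usemask, int count,
                            struct chan_loc remap[][CHANNELS])
{
    unsigned occupied[MAX_CONSTS] = {0};
    int new_count = 0;

    assert(count >= 0 && count <= MAX_CONSTS);

    for (int i = 0; i < count; i++)
        for (int c = 0; c < CHANNELS; c++)
            remap[i][c] = (struct chan_loc){-1, -1};

    /* Pass 1: constants used as vectors keep their channel positions. */
    for (int i = 0; i < count; i++) {
        if (usemask[i] == 0x0 || usemask[i] == 0x1)
            continue;
        for (int c = 0; c < CHANNELS; c++)
            if (usemask[i] & (1u << c))
                remap[i][c] = (struct chan_loc){new_count, c};
        occupied[new_count++] = usemask[i];
    }

    /* Pass 2: scalars (only x used) go into the first free channel. */
    for (int i = 0; i < count; i++) {
        int slot, chan = -1;

        if (usemask[i] != 0x1)
            continue;
        for (slot = 0; slot < new_count; slot++) {
            chan = first_free_channel(occupied[slot]);
            if (chan >= 0)
                break;
        }
        if (chan < 0) { /* nothing free, open a new slot */
            slot = new_count++;
            chan = 0;
        }
        occupied[slot] |= 1u << chan;
        remap[i][0] = (struct chan_loc){slot, chan};
    }
    return new_count;
}
```

Fed with the use masks of the first twelve uniforms of 88.shader_test (xyz for CONST[0] to CONST[2], xy for CONST[5] and CONST[7], x for the rest), this greedy order packs them into five slots, matching the first five rows of the compacted layout in this description. The real pass orders the remaining slots differently, so the sketch should not be read as a description of the final layout.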
This is how the layout looks afterwards (for uniforms, the layout shows which slot they resided in before the compaction):
CONST[0] = {CONST[0].x CONST[0].y CONST[0].z CONST[3].x }
CONST[1] = {CONST[1].x CONST[1].y CONST[1].z CONST[4].x }
CONST[2] = {CONST[2].x CONST[2].y CONST[2].z CONST[6].x }
CONST[3] = {CONST[5].x CONST[5].y CONST[8].x CONST[9].x }
CONST[4] = {CONST[7].x CONST[7].y CONST[10].x CONST[11].x }
CONST[5] = {CONST[27].x CONST[27].y CONST[27].z CONST[27].w }
CONST[6] = {CONST[28].x CONST[28].y CONST[28].z CONST[28].w }
CONST[7] = {CONST[29].x CONST[29].y CONST[29].z CONST[29].w }
CONST[8] = {CONST[30].x CONST[30].y CONST[30].z CONST[30].w }
CONST[9] = {CONST[31].x CONST[31].y CONST[31].z CONST[31].w }
CONST[10] = {CONST[32].x CONST[32].y CONST[32].z CONST[32].w }
CONST[11] = {CONST[33].x CONST[33].y CONST[33].z CONST[33].w }
CONST[12] = {CONST[34].x CONST[34].y CONST[34].z CONST[34].w }
CONST[13] = {CONST[35].x CONST[35].y CONST[35].z CONST[35].w }
CONST[14] = {CONST[36].x CONST[36].y CONST[36].z CONST[36].w }
CONST[15] = {CONST[37].x CONST[37].y CONST[37].z CONST[37].w }
CONST[16] = {CONST[38].x CONST[38].y CONST[38].z CONST[38].w }
CONST[17] = {CONST[39].x CONST[39].y CONST[39].z CONST[39].w }
CONST[18] = {CONST[40].x CONST[40].y CONST[40].z CONST[40].w }
CONST[19] = {CONST[41].x CONST[41].y CONST[41].z CONST[41].w }
CONST[20] = {CONST[42].x CONST[42].y CONST[42].z CONST[42].w }
CONST[21] = {CONST[43].x CONST[43].y CONST[43].z CONST[43].w }
CONST[22] = {CONST[44].x CONST[44].y CONST[44].z CONST[44].w }
CONST[23] = {CONST[12].x CONST[13].x CONST[14].x CONST[15].x }
CONST[24] = {CONST[16].x CONST[17].x CONST[18].x CONST[19].x }
CONST[25] = {CONST[20].x CONST[21].x CONST[22].x CONST[23].x }
CONST[26] = {CONST[24].x CONST[25].x CONST[26].x CONST[-1].u }
CONST[27] = { 0.000000 1.000000 3.000000 0.000010 }
CONST[28] = { 4.000000 5.000000 6.000000 7.000000 }
CONST[29] = { 8.000000 9.000000 10.000000 11.000000 }
CONST[30] = { 12.000000 13.000000 14.000000 15.000000 }
CONST[31] = { 9.000000 4.000000 2.000000 0.000001 }
CONST[32] = { 3.000000 6.000000 5.000000 7.000000 }
CONST[33] = { 8.000000 13.000000 11.000000 10.000000 }
CONST[34] = { 12.000000 15.000000 14.000000 16.000000 }
CONST[35] = { 17.000000 4.000000 2.000000 unused }
Unfortunately, we are still quite a lot over the limit, and I'm looking for ideas on how to fix this. The main reason is vectorization: we have a lot of duplicate immediates, but they are used as vectors, so we can't reuse them. My plan right now is to iteratively find the immediate that contains the most values that are either available as constant swizzles or already present in other immediates, scalarize any instruction that uses it as a vector, replace what we can with constant swizzles (because we can always do that for scalars), point the values we can reuse to their other locations, and put the rest in unused slots. This way we should be able to make this just work. But it will be more code, and this MR is large enough already.
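To get a feeling for how much duplication there is, the sketch below counts the distinct scalar values in the 11 original immediates of 88.shader_test (the values are copied from the CONST[45] to CONST[55] listing) and how many vec4 slots they would need if every use could be scalarized. It is a naive estimate only: the function names are made up, values are compared with plain float equality, and it ignores both the cost of scalarizing instructions and any values that constant swizzles could supply.

```c
#include <assert.h>
#include <stddef.h>

/* The 11 original immediates of shaders/glamor/88.shader_test, flattened. */
static const float glamor88_immediates[44] = {
    0.000010f, 0.0f, 1.0f, 0.5f,
    0.0f, 1.0f, 3.0f, 2.0f,
    16.0f, 17.0f, 15.0f, 14.0f,
    13.0f, 12.0f, 11.0f, 10.0f,
    9.0f, 8.0f, 7.0f, 6.0f,
    5.0f, 4.0f, -1.0f, 0.000001f,
    9.0f, 4.0f, 2.0f, 1.0f,
    3.0f, 6.0f, 5.0f, 7.0f,
    8.0f, 13.0f, 11.0f, 10.0f,
    12.0f, 15.0f, 14.0f, 16.0f,
    17.0f, 4.0f, 2.0f, 1.0f,
};

/* Number of distinct values in vals[0..n), using plain float equality. */
static size_t count_unique(const float *vals, size_t n)
{
    size_t unique = 0;

    assert(vals != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        size_t j;

        for (j = 0; j < i; j++)
            if (vals[j] == vals[i])
                break;
        if (j == i) /* no earlier occurrence of vals[i] */
            unique++;
    }
    return unique;
}

/* vec4 slots needed if each distinct value is stored exactly once. */
static size_t slots_if_scalarized(const float *vals, size_t n)
{
    return (count_unique(vals, n) + 3) / 4;
}
```

For this shader the 44 channels hold 22 distinct values, which would fit in 6 vec4 slots instead of 11 if every vector use could be rewritten.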
But I'm open to other suggestions... In theory we also have one free channel in the CONST[26].w slot, but it would be better if we could refrain from mixing uniforms and immediates together (that would need a really invasive rework of large parts of the compiler and the constant emitting paths...).
As a side note, if I go for the shader variants solution to the loop unrolling (making shader variants for different values of n_stop, thus turning it into an immediate in the shader), we handle the constants even better and are only one constant over the limit.
There are some instruction/cycle/register changes (sometimes we switch rgb/alpha slots and thus influence scheduling, etc.), but overall it's mostly even or slightly positive; see the individual commits.