Eliminate more redundant operations on vectored platforms
(NOTE: I only tagged TGSI because I think this enhancement would most likely help platforms using NIR-to-TGSI.)
While trying to investigate a solution for #6038, I noticed this NIR in the output of fs-temp-array-mat4-index-col-row-wr.shader_test on my R430.
vec3 32 ssa_10 = load_const (0x3f800000, 0x40000000, 0x40000000) = (1.000000, 2.000000, 2.000000)
vec3 1 ssa_11 = flt ssa_9.xxx, ssa_10
...
vec4 32 ssa_16 = load_const (0x40000000, 0x40000000, 0x40000000, 0x40000000) = (2.000000, 2.000000, 2.000000, 2.000000)
vec4 1 ssa_17 = flt ssa_9.xxxx, ssa_16
vec4 32 ssa_18 = bcsel ssa_17, ssa_1, ssa_8
...
vec4 32 ssa_22 = bcsel ssa_11.xxxx, ssa_1, ssa_12
...
vec4 32 ssa_28 = bcsel ssa_11.zzzz, ssa_22, ssa_18
Ideally, this should get reduced to
vec2 32 ssa_10 = load_const (0x3f800000, 0x40000000) = (1.000000, 2.000000)
vec2 1 ssa_11 = flt ssa_9.xx, ssa_10
...
vec4 32 ssa_22 = bcsel ssa_11.xxxx, ssa_1, ssa_12
...
vec4 32 ssa_28 = bcsel ssa_11.yyyy, ssa_22, ssa_8
I think this can be achieved without too much difficulty. It seems like adding a pass that tries to narrow vector operations would be the most important thing. That would perform a first reduction to
vec2 32 ssa_10 = load_const (0x3f800000, 0x40000000) = (1.000000, 2.000000)
vec2 1 ssa_11 = flt ssa_9.xx, ssa_10
...
vec1 32 ssa_16 = load_const (0x40000000) = (2.000000)
vec1 1 ssa_17 = flt ssa_9.x, ssa_16
vec4 32 ssa_18 = bcsel ssa_17.xxxx, ssa_1, ssa_8
...
vec4 32 ssa_22 = bcsel ssa_11.xxxx, ssa_1, ssa_12
...
vec4 32 ssa_28 = bcsel ssa_11.yyyy, ssa_22, ssa_18
A good first step toward this would probably be to just narrow constants. Then it should be easy to detect redundant channel operations in something like
vec2 32 ssa_10 = load_const (0x3f800000, 0x40000000) = (1.000000, 2.000000)
vec3 1 ssa_11 = flt ssa_9.xxx, ssa_10.xyy
...
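The constant-narrowing step above can be modeled outside of NIR. The following is only a minimal sketch, not Mesa code: `narrow_constant` and `swizzle_string` are hypothetical helpers, and constants are represented as plain lists of raw 32-bit patterns. Comparing bit patterns rather than float values is an assumption made here so that values such as 0.0 and -0.0 are not merged.

```python
SWIZZLE_NAMES = "xyzw"


def narrow_constant(channels):
    """Deduplicate the channels of a constant vector.

    Returns (unique_values, swizzle), where swizzle[i] is the index into
    unique_values that supplies original channel i.
    """
    unique = []
    swizzle = []
    for value in channels:
        if value not in unique:
            unique.append(value)
        swizzle.append(unique.index(value))
    return unique, swizzle


def swizzle_string(swizzle):
    """Render a list of channel indices as an xyzw swizzle."""
    return "".join(SWIZZLE_NAMES[i] for i in swizzle)


# The vec3 constant ssa_10 from the original shader.
values, swz = narrow_constant([0x3F800000, 0x40000000, 0x40000000])
print([hex(v) for v in values], swizzle_string(swz))
```

For the vec3 constant in the report this yields a two-channel constant and the swizzle `xyy`, which matches the `ssa_10.xyy` operand shown above. Once two channels of an operation read identical sources, the operation itself is a candidate for narrowing.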
Then, possibly with some enhancements, nir_opt_vectorize (and another run of the narrowing pass) could reduce it to
vec2 32 ssa_10 = load_const (0x3f800000, 0x40000000) = (1.000000, 2.000000)
vec2 1 ssa_11 = flt ssa_9.xx, ssa_10
...
vec4 32 ssa_18 = bcsel ssa_11.yyyy, ssa_1, ssa_8
...
vec4 32 ssa_22 = bcsel ssa_11.xxxx, ssa_1, ssa_12
...
vec4 32 ssa_28 = bcsel ssa_11.yyyy, ssa_22, ssa_18
Finally, an obvious algebraic optimization would take care of the rest.
# In the innermost bcsel, 'a' must be false.
(('bcsel', a, b, ('bcsel', c, ('bcsel', a, d, e), 'f')),
('bcsel', a, b, ('bcsel', c, e , 'f'))),
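As a sanity check on the rule above, the two sides can be compared by brute force over every combination of the two conditions. This is only a scalar model of `bcsel` written for illustration, not part of nir_opt_algebraic; `before`, `after`, and `rule_is_sound` are hypothetical names, and the model assumes both uses of `a` read the same channel.

```python
from itertools import product


def bcsel(cond, if_true, if_false):
    """Scalar model of NIR's bcsel: select if_true when cond holds."""
    return if_true if cond else if_false


def before(a, b, c, d, e, f):
    """Left-hand side of the rule."""
    return bcsel(a, b, bcsel(c, bcsel(a, d, e), f))


def after(a, b, c, d, e, f):
    """Right-hand side of the rule, with the innermost bcsel removed."""
    return bcsel(a, b, bcsel(c, e, f))


def rule_is_sound():
    """Check both sides agree for every combination of conditions a and c."""
    for a, c in product([False, True], repeat=2):
        # Distinct string tags stand in for the arbitrary data operands.
        if before(a, "b", c, "d", "e", "f") != after(a, "b", c, "d", "e", "f"):
            return False
    return True


print(rule_is_sound())
```

The check passes because the inner selects are only reached when `a` is false, at which point `bcsel(a, d, e)` can only produce `e`.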
It might also simplify things to have a pass that tries to detect scalar constants that are subsets of existing vector constants. That would allow an intermediate step that converts
vec2 32 ssa_10 = load_const (0x3f800000, 0x40000000) = (1.000000, 2.000000)
vec2 1 ssa_11 = flt ssa_9.xx, ssa_10
...
vec1 32 ssa_16 = load_const (0x40000000) = (2.000000)
vec1 1 ssa_17 = flt ssa_9.x, ssa_16
vec4 32 ssa_18 = bcsel ssa_17.xxxx, ssa_1, ssa_8
...
vec4 32 ssa_22 = bcsel ssa_11.xxxx, ssa_1, ssa_12
...
vec4 32 ssa_28 = bcsel ssa_11.yyyy, ssa_22, ssa_18
into
vec2 32 ssa_10 = load_const (0x3f800000, 0x40000000) = (1.000000, 2.000000)
vec2 1 ssa_11 = flt ssa_9.xx, ssa_10
...
vec1 1 ssa_17 = flt ssa_9.x, ssa_10.y
vec4 32 ssa_18 = bcsel ssa_17.xxxx, ssa_1, ssa_8
...
vec4 32 ssa_22 = bcsel ssa_11.xxxx, ssa_1, ssa_12
...
vec4 32 ssa_28 = bcsel ssa_11.yyyy, ssa_22, ssa_18
It's tempting to try a vector CSE pass here, but I suspect that would be more work. It might also have other benefits.
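The constant-subset detection described above amounts to finding a swizzle of an existing, larger constant that reproduces the smaller one. Here is a minimal sketch under the same assumptions as before (constants as lists of raw bit patterns); `find_swizzle` is a hypothetical helper, and it ignores bit-size and dominance checks that a real NIR pass would presumably need.

```python
def find_swizzle(small, large):
    """Return channel indices into `large` that reproduce `small`.

    Returns None when some channel of `small` does not appear in `large`.
    """
    swizzle = []
    for value in small:
        if value not in large:
            return None
        swizzle.append(large.index(value))
    return swizzle


# ssa_16 = (2.0) can be read from ssa_10 = (1.0, 2.0) as channel y.
print(find_swizzle([0x40000000], [0x3F800000, 0x40000000]))
```

In the example from the report, the scalar constant `ssa_16` maps to index 1 of `ssa_10`, which corresponds to rewriting the use as `ssa_10.y`.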
- Implement general pass to eliminate redundant channels from vector constants.
- Modify the previous pass to eliminate redundant channels from vector operations.
- Enhance either CSE or constant propagation, if necessary, to replace scalar or small vector constants with swizzled components of larger vector constants. This might already "just work."
- Enhance nir_opt_vectorize or CSE to replace scalar or small vector ALU operations with swizzled components of larger vector ALU operations.