Eliminate more redundant operations on vectored platforms

(NOTE: I only tagged TGSI because I think this enhancement would most likely help platforms using NIR-to-TGSI.)

While trying to investigate a solution for #6038, I noticed this NIR in the output of fs-temp-array-mat4-index-col-row-wr.shader_test on my R430.

    vec3 32 ssa_10 = load_const (0x3f800000, 0x40000000, 0x40000000) = (1.000000, 2.000000, 2.000000)
    vec3 1 ssa_11 = flt ssa_9.xxx, ssa_10
    ...
    vec4 32 ssa_16 = load_const (0x40000000, 0x40000000, 0x40000000, 0x40000000) = (2.000000, 2.000000, 2.000000, 2.000000)
    vec4 1 ssa_17 = flt ssa_9.xxxx, ssa_16
    vec4 32 ssa_18 = bcsel ssa_17, ssa_1, ssa_8
    ...
    vec4 32 ssa_22 = bcsel ssa_11.xxxx, ssa_1, ssa_12
    ...
    vec4 32 ssa_28 = bcsel ssa_11.zzzz, ssa_22, ssa_18

Ideally, this should get reduced to

    vec2 32 ssa_10 = load_const (0x3f800000, 0x40000000) = (1.000000, 2.000000)
    vec2 1 ssa_11 = flt ssa_9.xx, ssa_10
    ...
    vec4 32 ssa_22 = bcsel ssa_11.xxxx, ssa_1, ssa_12
    ...
    vec4 32 ssa_28 = bcsel ssa_11.yyyy, ssa_22, ssa_8

I think this can be achieved without too much difficulty. The most important piece seems to be a pass that tries to narrow vector operations. That would perform a first reduction to

    vec2 32 ssa_10 = load_const (0x3f800000, 0x40000000) = (1.000000, 2.000000)
    vec2 1 ssa_11 = flt ssa_9.xx, ssa_10
    ...
    vec1 32 ssa_16 = load_const (0x40000000) = (2.000000)
    vec1 1 ssa_17 = flt ssa_9.x, ssa_16
    vec4 32 ssa_18 = bcsel ssa_17.xxxx, ssa_1, ssa_8
    ...
    vec4 32 ssa_22 = bcsel ssa_11.xxxx, ssa_1, ssa_12
    ...
    vec4 32 ssa_28 = bcsel ssa_11.yyyy, ssa_22, ssa_18
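
The constant half of such a narrowing pass could look roughly like the Python sketch below. This is only a model of the idea, not Mesa code: `narrow_constant` is a hypothetical helper, constants are compared by their raw 32-bit patterns (so `0.0` and `-0.0` stay distinct), and at most four components are assumed.

```python
SWIZZLE = "xyzw"

def narrow_constant(values):
    """Drop duplicate channels from a constant vector.

    `values` holds the raw bit pattern of each channel. Returns
    (unique_values, swizzle) such that applying `swizzle` to the
    narrowed constant reproduces the original vector.
    """
    unique = []
    swizzle = []
    for v in values:
        if v not in unique:
            unique.append(v)          # first time this value is seen
        swizzle.append(SWIZZLE[unique.index(v)])
    return tuple(unique), "".join(swizzle)
```

For the `ssa_10` constant in the original shader, `narrow_constant((0x3f800000, 0x40000000, 0x40000000))` gives the two-channel constant `(0x3f800000, 0x40000000)` and the swizzle `xyy`. The all-2.0 `ssa_16` collapses to a single channel with swizzle `xxxx`.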

A good first step toward this would probably be to narrow only constants. Then it should be easy to detect redundant channel operations in something like

    vec2 32 ssa_10 = load_const (0x3f800000, 0x40000000) = (1.000000, 2.000000)
    vec3 1 ssa_11 = flt ssa_9.xxx, ssa_10.xyy
    ...
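
Detecting the redundant channels could be sketched as follows, again in Python as a model rather than as real NIR code. `narrow_alu_channels` is a hypothetical helper. It assumes a per-component ALU operation such as `flt` (not a dot product or other cross-channel operation), and it only compares the swizzles on each source: two result channels are redundant when every source reads the same component for both.

```python
SWIZZLE = "xyzw"

def narrow_alu_channels(src_swizzles):
    """Find duplicate result channels of a per-component ALU op.

    `src_swizzles` has one swizzle string per source, all of the same
    length. Returns (narrowed_swizzles, remap): the source swizzles of
    the narrowed instruction, and the swizzle users of the old result
    must apply to the narrowed result.
    """
    width = len(src_swizzles[0])
    seen = {}    # tuple of per-source components -> new channel index
    remap = []
    for ch in range(width):
        key = tuple(s[ch] for s in src_swizzles)
        if key not in seen:
            seen[key] = len(seen)
        remap.append(SWIZZLE[seen[key]])
    # dicts keep insertion order, so keys are listed in new-channel order
    narrowed = ["".join(key[i] for key in seen)
                for i in range(len(src_swizzles))]
    return narrowed, "".join(remap)
```

For `flt ssa_9.xxx, ssa_10.xyy` this returns the source swizzles `xx` and `xy` plus the remap `xyy`, meaning the instruction becomes a vec2 and a user that read `ssa_11.z` now reads `ssa_11.y`.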

Then, possibly with some enhancements, nir_opt_vectorize (and another run of the narrowing pass) could reduce it to

    vec2 32 ssa_10 = load_const (0x3f800000, 0x40000000) = (1.000000, 2.000000)
    vec2 1 ssa_11 = flt ssa_9.xx, ssa_10
    ...
    vec4 32 ssa_18 = bcsel ssa_11.yyyy, ssa_1, ssa_8
    ...
    vec4 32 ssa_22 = bcsel ssa_11.xxxx, ssa_1, ssa_12
    ...
    vec4 32 ssa_28 = bcsel ssa_11.yyyy, ssa_22, ssa_18

Finally, an obvious algebraic optimization would take care of the rest.

    # In the innermost bcsel, 'a' must be false.
    (('bcsel', a, b, ('bcsel', c, ('bcsel', a, d, e), 'f')),
     ('bcsel', a, b, ('bcsel', c,                 e , 'f'))),
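
As a sanity check on that rule, the rewrite can be verified by brute force over the two conditions. The sketch below models `bcsel` on scalars in plain Python (vectors would apply it per component). The helper names are made up for illustration.

```python
from itertools import product

def bcsel(cond, then_val, else_val):
    """Scalar model of NIR's bcsel: pick then_val if cond, else else_val."""
    return then_val if cond else else_val

def before(a, b, c, d, e, f):
    return bcsel(a, b, bcsel(c, bcsel(a, d, e), f))

def after(a, b, c, d, e, f):
    return bcsel(a, b, bcsel(c, e, f))

def rewrite_is_valid():
    """Check both forms agree for every combination of conditions."""
    # Distinct tokens for the values make any wrong selection visible.
    vals = dict(b="b", d="d", e="e", f="f")
    return all(before(a, c=c, **vals) == after(a, c=c, **vals)
               for a, c in product((False, True), repeat=2))
```

`rewrite_is_valid()` returns `True`: when `a` is true both forms select `b`, and when `a` is false the innermost `bcsel` can only produce `e`.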

It might also simplify things to have a pass that tries to detect scalar constants that are subsets of existing vector constants. That would allow an intermediate step that converts

    vec2 32 ssa_10 = load_const (0x3f800000, 0x40000000) = (1.000000, 2.000000)
    vec2 1 ssa_11 = flt ssa_9.xx, ssa_10
    ...
    vec1 32 ssa_16 = load_const (0x40000000) = (2.000000)
    vec1 1 ssa_17 = flt ssa_9.x, ssa_16
    vec4 32 ssa_18 = bcsel ssa_17.xxxx, ssa_1, ssa_8
    ...
    vec4 32 ssa_22 = bcsel ssa_11.xxxx, ssa_1, ssa_12
    ...
    vec4 32 ssa_28 = bcsel ssa_11.yyyy, ssa_22, ssa_18

into

    vec2 32 ssa_10 = load_const (0x3f800000, 0x40000000) = (1.000000, 2.000000)
    vec2 1 ssa_11 = flt ssa_9.xx, ssa_10
    ...
    vec1 1 ssa_17 = flt ssa_9.x, ssa_10.y
    vec4 32 ssa_18 = bcsel ssa_17.xxxx, ssa_1, ssa_8
    ...
    vec4 32 ssa_22 = bcsel ssa_11.xxxx, ssa_1, ssa_12
    ...
    vec4 32 ssa_28 = bcsel ssa_11.yyyy, ssa_22, ssa_18
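
The lookup behind that step might be as simple as the following Python sketch. `find_in_larger_constant` is a hypothetical name, constants are again raw bit patterns, and the larger constant is assumed to have at most four components. A real pass would also have to check that the larger constant dominates the use.

```python
SWIZZLE = "xyzw"

def find_in_larger_constant(small, large):
    """Return a swizzle that selects `small` out of `large`.

    Both arguments are tuples of raw channel bit patterns. Returns
    None when some channel of `small` does not occur in `large`.
    """
    swizzle = []
    for v in small:
        if v not in large:
            return None
        swizzle.append(SWIZZLE[large.index(v)])
    return "".join(swizzle)
```

With `small = (0x40000000,)` (the scalar `ssa_16`) and `large = (0x3f800000, 0x40000000)` (the vec2 `ssa_10`), the result is `y`, which is the `ssa_10.y` operand shown above.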

A vector CSE pass is tempting here. I suspect it would be more work, but it might also have other benefits.

- Implement a general pass to eliminate redundant channels from vector constants.
- Modify the previous pass to eliminate redundant channels from vector operations.
- Enhance either CSE or constant propagation, if necessary, to replace scalar or small vector constants with swizzled components of larger vector constants. This might already "just work."
- Enhance nir_opt_vectorize or CSE to replace scalar or small vector ALU operations with swizzled components of larger vector ALU operations.
