nir: Make algebraic process the program both ways.
The algebraic pass was exhibiting O(n^2) behavior in dEQP-GLES2.functional.uniform_api.random.3 and dEQP-GLES31.functional.ubo.random.all_per_block_buffers.20 (along with other code-generated tests, and likely real-world loop-unroll cases). In the process of using fmul(b2f(x), b2f(x)) -> b2f(iand(x, y)) to transform: result = b2f(a == b); result *= b2f(c == d); ... result *= b2f(z == w); -> temp = (a == b) temp = temp && (c == d) ... temp = temp && (z == w) result = b2f(temp); nir_opt_algebraic, proceeding bottom-to-top, would match and convert the top-most fmul(b2f(), b2f()) case each time, leaving the new b2f to be matched by the next fmul down on the next time algebraic got run by the optimization loop. Back in 2016 in 7be8d077 ("nir: Do opt_algebraic in reverse order."), Matt changed algebraic to go bottom-to-top so that we would match the biggest patterns first. This helped his cases, but I believe introduced this failure mode. Retain the bottom-to-top mode first, but if we made any changes through algebraic, then go back and do a top-to-bottom pass so we avoid the O(n^2) behavior. Reduces runtime (over this whole series, including automaton removal) of dEQP-GLES2.functional.uniform_api.random.3 from 4s to 1s and dEQP-GLES31.functional.ubo.random.all_per_block_buffers.20 from 13.4s to 4.06s on cheza. An x86_64 shader-db on freedreno had no statistically significant performance difference (n=3) between the redundant-movs change and this commit. This has a surprising amount of effect on compilation, though: total instructions in shared programs: 7821406 -> 7895951 (0.95%) total dwords in shared programs: 10893344 -> 10937696 (0.41%) total full in shared programs: 334605 -> 334538 (-0.02%) total constlen in shared programs: 2329266 -> 2329172 (<.01%) total (ss) in shared programs: 153631 -> 153584 (-0.03%) total (sy) in shared programs: 94667 -> 94668 (<.01%) total max_sun in shared programs: 1166142 -> 1168855 (0.23%)