radv: force unroll loops which access temporary arrays via induction
This has a very large impact on at least one X4 Foundations shader:
SGPRS: 40 -> 88 (120.00 %)
VGPRS: 168 -> 32 (-80.95 %)
Scratch size: 1012 -> 0 (-100.00 %) dwords per thread
Code Size: 15360 -> 4864 (-68.33 %) bytes
Max Waves: 1 -> 8 (700.00 %)
and has been reported to fix a large performance regression: https://github.com/daniel-schuermann/mesa/issues/120#issuecomment-523166738
Looking at force_unroll_array_accesses, this was already done, but only if max_trip_count matched the array's size. That prevents unrolling when the loop is split into multiple loops, each processing only a subset of the array (probably an attempt to convince the compiler to unroll the loop). Even if there was only a single loop, it wouldn't be unrolled because the (large) array was initialized in its body.
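The difference between the two conditions can be sketched in C as follows. This is a simplified model, not the actual Mesa code: `struct loop_info`, `should_force_unroll_old` and `should_force_unroll_new` are hypothetical names introduced only for illustration.

```c
#include <stdbool.h>

/* Hypothetical, simplified description of a loop. These are not
 * Mesa's real data structures. */
struct loop_info {
   unsigned max_trip_count;       /* upper bound on the iteration count */
   unsigned array_length;         /* length of the temporary array */
   bool indexed_by_induction_var; /* array index comes from the loop counter */
};

/* Previous condition as described above: force unrolling only when
 * the trip count matches the array's size. */
bool
should_force_unroll_old(const struct loop_info *loop)
{
   return loop->indexed_by_induction_var &&
          loop->max_trip_count == loop->array_length;
}

/* Assumed condition after this change: any access to a temporary
 * array via the induction variable is enough. */
bool
should_force_unroll_new(const struct loop_info *loop)
{
   return loop->indexed_by_induction_var;
}
```

For a 64-element array processed by two loops of 32 iterations each, the old condition rejects both loops because 32 != 64, while the new one accepts them.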
Totals from affected shaders:
SGPRS: 136 -> 96 (-29.41 %)
VGPRS: 88 -> 168 (90.91 %)
Spilled SGPRs: 0 -> 0 (0.00 %)
Spilled VGPRs: 0 -> 0 (0.00 %)
Private memory VGPRs: 0 -> 0 (0.00 %)
Scratch size: 0 -> 0 (0.00 %) dwords per thread
Code Size: 7100 -> 7968 (12.23 %) bytes
LDS: 0 -> 0 (0.00 %) blocks
Max Waves: 11 -> 6 (-45.45 %)
Wait states: 0 -> 0 (0.00 %)
This causes LLVM to use far more VGPRs in a couple of shaders, roughly halving their max_waves and slightly reducing performance in F1 2017 and Dota 2.
EDIT: I don't think I made this clear: the array was large enough to be lowered to scratch. Since glslang creates SPIR-V that initializes the array in each loop iteration, there are a large number of scratch stores per iteration.
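As a rough C analogue of the pattern described above (the real case is SPIR-V generated from a shader, and `ARRAY_LEN`, `sum_weights` and the array contents here are made up for illustration):

```c
#define ARRAY_LEN 8

/* Hypothetical analogue of the shader pattern: a temporary array that
 * is initialized inside the loop body and then indexed by the
 * induction variable. */
float
sum_weights(int count)
{
   float total = 0.0f;

   if (count > ARRAY_LEN)
      count = ARRAY_LEN;

   for (int i = 0; i < count; i++) {
      /* Re-initialized on every iteration, mirroring the stores that
       * the SPIR-V performs in each loop iteration. */
      const float weights[ARRAY_LEN] = {
         1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f
      };

      /* Indexed by the induction variable, so the index is only known
       * at compile time once the loop is unrolled. */
      total += weights[i];
   }

   return total;
}
```

A C compiler would normally hoist or fold such an array on its own, so this is only an analogy. The point is that after unrolling, every index becomes a constant, which presumably lets the array be removed instead of living in scratch.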