nir/loop_unroll: Alternate loop unrolling heuristic

Ian Romanick requested to merge idr/mesa:review/alt-loop-unroll into main

Assume the presence of an optimization pass that will create multiple copies of the loop body while reducing the loop count. That is, a transformation of

   for (unsigned i = 0; i < 64; i++)
      stuff();

into

   for (unsigned i = 0; i < 64; ) {
      stuff();
      i++;
      stuff();
      i++;
   }

If that optimization were applied, would the loop become unrollable by the previous heuristic? If so, just allow it to be unrolled now.
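
The check described above can be sketched roughly as follows. This is not the code from this MR: `old_heuristic_allows`, `alt_heuristic_allows`, the cost model, and every limit parameter are hypothetical names chosen for illustration, and the sketch assumes the previous heuristic is a simple cap on trip count plus a cap on total unrolled cost.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the previous heuristic: unroll only if the
 * trip count and the total unrolled cost both stay under fixed limits.
 * The real heuristic in nir_opt_loop_unroll may differ.
 */
static bool
old_heuristic_allows(unsigned trip_count, unsigned body_cost,
                     unsigned max_iterations, unsigned max_total_cost)
{
   return trip_count <= max_iterations &&
          trip_count * body_cost <= max_total_cost;
}

/* Sketch of the alternate heuristic: pretend the loop body had been
 * duplicated `copies` times (dividing the trip count by `copies`), and
 * ask whether the old heuristic would accept that hypothetical loop.
 * No transformation is performed here; this only answers the question.
 */
static bool
alt_heuristic_allows(unsigned trip_count, unsigned body_cost,
                     unsigned max_iterations, unsigned max_total_cost,
                     unsigned max_copies)
{
   if (old_heuristic_allows(trip_count, body_cost,
                            max_iterations, max_total_cost))
      return true;

   for (unsigned copies = 2; copies <= max_copies; copies++) {
      /* Only consider copy counts that divide the trip count evenly. */
      if (trip_count % copies != 0)
         continue;

      if (old_heuristic_allows(trip_count / copies, body_cost * copies,
                               max_iterations, max_total_cost))
         return true;
   }

   return false;
}
```

With made-up limits of 32 iterations and a total cost of 1024, a 64-iteration loop with a body cost of 4 is rejected by the old check alone but accepted once two copies of the body are assumed. Because the total cost is unchanged by duplicating the body, only loops that failed on trip count can newly qualify in this sketch.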

I am working on this optimization pass, so we're going to end up here sooner or later anyway. I hoped to have this ready before disappearing on vacation for most of April, but that did not work out.

This does help some additional loops unroll. In fact, it seems that the Vulkan CTS has a LOT of loops that get unrolled. So many loops, in fact, that performance is significantly impacted by this small change. On this subset

   ./deqp-vk --deqp-case=dEQP-VK.*spir* --deqp-log-images=disable \
       --deqp-log-shader-sources=disable

without MR !22299, the performance hit is on the order of +25%. The combined performance change across all of !22299 and this patch is +17.7% ± 0.08% (n = 5, pooled s = 0.323883), so that MR mitigates a good portion of the damage.

This commit should not affect the compile-time performance of real applications. Octopath Traveler (see below) had the most individual shaders affected by this commit, so I measured fossil-db time on just octopath_traveler.foz. Across just this commit, compile time changed by -0.24% ± 0.11% (n = 10, pooled s = 0.019), a slight improvement. octopath_traveler.foz takes less than 20 seconds to compile, so this change is trivial.

Tiger Lake, Ice Lake, and Skylake had similar results. (Tiger Lake shown)
total loops in shared programs: 5418 -> 5360 (-1.07%)
loops in affected programs: 97 -> 39 (-59.79%)
helped: 55 / HURT: 0

LOST:   40 / GAINED: 4

Broadwell and Haswell had similar results. (Broadwell shown)
total loops in shared programs: 5256 -> 5194 (-1.18%)
loops in affected programs: 101 -> 39 (-61.39%)
helped: 59 / HURT: 0

LOST:   36 / GAINED: 3

Ivy Bridge and Sandy Bridge had similar results. (Ivy Bridge shown)
total loops in shared programs: 3356 -> 3310 (-1.37%)
loops in affected programs: 46 -> 0
helped: 46
HURT: 0
helped stats (abs) min: 1 max: 1 x̄: 1.00 x̃: 1
helped stats (rel) min: 100.00% max: 100.00% x̄: 100.00% x̃: 100.00%
95% mean confidence interval for loops value: -1.00 -1.00
95% mean confidence interval for loops %-change: -100.00% -100.00%
Loops are helped.

LOST:   32
GAINED: 0

No changes on any earlier Intel platforms.

In fossil-db, two compute shaders in Shadow of the Tomb Raider, two compute shaders in Red Dead Redemption 2, two compute shaders in Assassin's Creed Odyssey, two compute shaders in Rise of the Tomb Raider, five fragment shaders in Octopath Traveler, one compute shader in Cyberpunk 2077, and one fragment shader in the UE4 shooter game demo were affected.

Tiger Lake, Ice Lake, and Skylake had similar results. (Tiger Lake shown)
Instructions in all programs: 180216204 -> 180220092 (+0.0%)
Instructions hurt: 16

SENDs in all programs: 8768683 -> 8769260 (+0.0%)
helped: 7 / HURT: 9

Loops in all programs: 52701 -> 52683 (-0.0%)
helped: 16

Cycles in all programs: 9254382663 -> 9254417429 (+0.0%)
helped: 8 / HURT: 8

Lost: 14
Edited by Ian Romanick
