nir/loop_unroll: Alternate loop unrolling heuristic
Assume the presence of an optimization pass that will create multiple copies of the loop body while reducing the loop count. That is, a transformation of
    for (unsigned i = 0; i < 64; i++)
        stuff();
into
    for (unsigned i = 0; i < 64; ) {
        stuff();
        i++;
        stuff();
        i++;
    }
If that optimization were applied, would the loop become unrollable by the previous heuristic? If so, just allow it to be unrolled now.
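As a rough illustration of the idea (not the code in this commit), the check can be sketched as follows. Everything here is an assumption made for the example: the names can_unroll and can_unroll_after_duplication, the copies parameter, and the simplified model of the previous heuristic as a trip-count limit plus a total-cost limit. The real heuristic in nir_opt_loop_unroll.c may weigh other factors.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the previous heuristic: a loop is unrolled
 * when both its trip count and its total estimated cost fit under fixed
 * limits.  The limits are parameters here; they are assumptions.
 */
static bool
can_unroll(unsigned trip_count, unsigned body_cost,
           unsigned max_iterations, unsigned cost_limit)
{
   return trip_count <= max_iterations &&
          body_cost * trip_count <= cost_limit;
}

/* Alternate check: pretend the loop body had been duplicated `copies`
 * times, so the trip count shrinks and the per-iteration cost grows by
 * the same factor, then ask the original question again.
 */
static bool
can_unroll_after_duplication(unsigned trip_count, unsigned body_cost,
                             unsigned copies,
                             unsigned max_iterations, unsigned cost_limit)
{
   /* Only model the simple case where the copies divide the trip count. */
   if (copies == 0 || trip_count % copies != 0)
      return false;

   return can_unroll(trip_count / copies, body_cost * copies,
                     max_iterations, cost_limit);
}
```

With illustrative limits of 32 iterations and a total cost of 256, a 64-iteration loop with a body cost of 2 is rejected by can_unroll, but accepted by can_unroll_after_duplication with copies = 2, because the hypothetical rewritten loop would run only 32 iterations.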
I am working on this optimization pass, so we're going to end up here sooner or later anyway. I hoped to have this ready before disappearing on vacation for most of April, but that did not work out.
This does help some additional loops unroll. In fact, it seems that the Vulkan CTS has a LOT of loops that get unrolled. So many loops, in fact, that performance is significantly impacted by this small change. On this subset
./deqp-vk --deqp-case=dEQP-VK.*spir* --deqp-log-images=disable \
--deqp-log-shader-sources=disable
without MR !22299, the performance hit is on the order of +25%. The combined performance change across all of !22299 and this patch is +17.7% ± 0.08% (n = 5, pooled s = 0.323883), so that MR helps mitigate a good portion of the damage.
This commit should not affect the compile-time performance of real applications. Octopath Traveler (see below) had the most individual shaders affected by this commit, so I measured fossil-db time on just octopath_traveler.foz. Across just this commit, run time improved slightly, by -0.24% ± 0.11% (n = 10, pooled s = 0.019). octopath_traveler.foz takes less than 20 seconds to compile, so this change is trivial.
Tiger Lake, Ice Lake, and Skylake had similar results. (Tiger Lake shown)
total loops in shared programs: 5418 -> 5360 (-1.07%)
loops in affected programs: 97 -> 39 (-59.79%)
helped: 55 / HURT: 0
LOST: 40 / GAINED: 4
Broadwell and Haswell had similar results. (Broadwell shown)
total loops in shared programs: 5256 -> 5194 (-1.18%)
loops in affected programs: 101 -> 39 (-61.39%)
helped: 59 / HURT: 0
LOST: 36 / GAINED: 3
Ivy Bridge and Sandy Bridge had similar results. (Ivy Bridge shown)
total loops in shared programs: 3356 -> 3310 (-1.37%)
loops in affected programs: 46 -> 0
helped: 46
HURT: 0
helped stats (abs) min: 1 max: 1 x̄: 1.00 x̃: 1
helped stats (rel) min: 100.00% max: 100.00% x̄: 100.00% x̃: 100.00%
95% mean confidence interval for loops value: -1.00 -1.00
95% mean confidence interval for loops %-change: -100.00% -100.00%
Loops are helped.
LOST: 32
GAINED: 0
No changes on any earlier Intel platforms.
In fossil-db, two compute shaders in Shadow of the Tomb Raider, two compute shaders in Red Dead Redemption 2, two compute shaders in Assassin's Creed Odyssey, two compute shaders in Rise of the Tomb Raider, five fragment shaders in Octopath Traveler, one compute shader in Cyberpunk 2077, and one fragment shader in the UE4 shooter game demo were affected.
Tiger Lake, Ice Lake, and Skylake had similar results. (Tiger Lake shown)
Instructions in all programs: 180216204 -> 180220092 (+0.0%)
Instructions hurt: 16
SENDs in all programs: 8768683 -> 8769260 (+0.0%)
helped: 7 / HURT: 9
Loops in all programs: 52701 -> 52683 (-0.0%)
helped: 16
Cycles in all programs: 9254382663 -> 9254417429 (+0.0%)
helped: 8 / HURT: 8
Lost: 14