Commit 21ffacff authored by Marcin Ślusarz, committed by Marge Bot

intel/compiler: remove branch weight heuristic

As a result of this patch, the compiler chooses SIMD32 shaders more
frequently.

The current logic is designed to avoid regressions from enabling SIMD32
at all costs, even though the cases where a regression can happen are
probably smaller draw calls (far away from the camera and thus smaller).

In Intel perf CI this patch improves FPS in:
- gfxbench5 alu2:      21.92% (gen9), 23.7%  (gen11)
- synmark OglShMapVsm:  3.26% (gen9),  4.52% (gen11)
- gfxbench5 car chase:  1.34% (gen9),  1.32% (gen11)
No observed regressions there.

In my testing, it also improves FPS in:
- The Talos Principle:   2.9% (gen9)

The other 16 games I tested had very minor changes in performance
(2/3 positive, but not significant enough to list here).

Note: this patch harms synmark OglDrvState (which is not in Intel perf
CI) by ~2.9%, but this benchmark renders multiple scenes from other
workloads (including OglShMapVsm, which is helped in standalone mode)
in tiny rectangles. Rendering at such a small size drastically changes
branching statistics, which favors smaller SIMD modes. I assume this
matters only in micro-benchmarks, as in real workloads more expensive
draw calls (with more uniform branching behavior) dominate.
Signed-off-by: Marcin Ślusarz <marcin.slusarz@intel.com>
Acked-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <!7137>
parent 06764e0e
@@ -1505,16 +1505,23 @@ namespace {
                             const backend_instruction *),
                          unsigned dispatch_width)
    {
-      /* XXX - Plumbing the trip counts from NIR loop analysis would allow us
-       *       to do a better job regarding the loop weights. And some branch
-       *       divergence analysis would allow us to do a better job with
-       *       branching weights.
+      /* XXX - Note that the previous version of this code used worst-case
+       *       scenario estimation of branching divergence for SIMD32 shaders,
+       *       but this heuristic was removed to improve performance in common
+       *       scenarios. Wider shader variants are less optimal when divergence
+       *       is high, e.g. when application renders complex scene on a small
+       *       surface. It is assumed that such renders are short, so their
+       *       time doesn't matter and when it comes to the overall performance,
+       *       they are dominated by more optimal larger renders.
+       *
+       *       It's possible that we could do better with divergence analysis
+       *       by isolating branches which are 100% uniform.
+       *
+       *       Plumbing the trip counts from NIR loop analysis would allow us
+       *       to do a better job regarding the loop weights.
        *
        *       In the meantime use values that roughly match the control flow
-       *       weights used elsewhere in the compiler back-end -- Main
-       *       difference is the worst-case scenario branch_weight used for
-       *       SIMD32 which accounts for the possibility of a dynamically
-       *       uniform branch becoming divergent in SIMD32.
+       *       weights used elsewhere in the compiler back-end.
        *
        *       Note that we provide slightly more pessimistic weights on
        *       Gen12+ for SIMD32, since the effective warp size on that
@@ -1523,7 +1530,6 @@ namespace {
        *       previous generations, giving narrower SIMD modes a performance
        *       advantage in several test-cases with non-uniform discard jumps.
        */
-      const float branch_weight = (dispatch_width > 16 ? 1.0 : 0.5);
       const float discard_weight = (dispatch_width > 16 || s->devinfo->gen < 12 ?
                                     1.0 : 0.5);
       const float loop_weight = 10;
@@ -1539,16 +1545,12 @@ namespace {
             issue_instruction(st, s->devinfo, inst);
-            if (inst->opcode == BRW_OPCODE_ENDIF)
-               st.weight /= branch_weight;
-            else if (inst->opcode == FS_OPCODE_PLACEHOLDER_HALT && discard_count)
+            if (inst->opcode == FS_OPCODE_PLACEHOLDER_HALT && discard_count)
                st.weight /= discard_weight;
             elapsed += (st.unit_ready[unit_fe] - clock0) * st.weight;
-            if (inst->opcode == BRW_OPCODE_IF)
-               st.weight *= branch_weight;
-            else if (inst->opcode == BRW_OPCODE_DO)
+            if (inst->opcode == BRW_OPCODE_DO)
                st.weight *= loop_weight;
             else if (inst->opcode == BRW_OPCODE_WHILE)
                st.weight /= loop_weight;
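To illustrate what removing `branch_weight` changes, here is a minimal, self-contained sketch of the weighted cycle accumulation shown in the hunk above. It is not Mesa code: `Op`, `Inst`, `estimate_cycles`, `old_branch_weight` and `sample_program` are hypothetical names, the per-instruction cycle counts are made up, and discard handling is left out. Only the ordering of the weight updates and the `dispatch_width > 16 ? 1.0 : 0.5` rule are taken from the diff.

```cpp
#include <vector>

// Hypothetical, simplified stand-ins for the back-end types (not Mesa's API).
enum class Op { ALU, IF, ENDIF, DO, WHILE };

struct Inst {
   Op op;
   unsigned cycles; // assumed issue cost of this instruction
};

// Branch weight rule from the removed line: SIMD32 was charged the full
// cost of branch bodies, narrower widths only half of it.
float old_branch_weight(unsigned dispatch_width)
{
   return dispatch_width > 16 ? 1.0f : 0.5f;
}

// Accumulate a weighted cycle estimate in the same order as the diff:
// ENDIF restores the weight before its own cost is added, while IF, DO
// and WHILE adjust the weight after their own cost is added.
float estimate_cycles(const std::vector<Inst> &prog,
                      float branch_weight, float loop_weight)
{
   float weight = 1.0f;
   float elapsed = 0.0f;

   for (const Inst &inst : prog) {
      if (inst.op == Op::ENDIF)
         weight /= branch_weight;

      elapsed += inst.cycles * weight;

      if (inst.op == Op::IF)
         weight *= branch_weight;
      else if (inst.op == Op::DO)
         weight *= loop_weight;
      else if (inst.op == Op::WHILE)
         weight /= loop_weight;
   }

   return elapsed;
}

// Made-up shader: 4 cycles of ALU work, then a 10-cycle branch body,
// then 2 more cycles of ALU work.
std::vector<Inst> sample_program()
{
   return {
      {Op::ALU, 4}, {Op::IF, 1}, {Op::ALU, 10}, {Op::ENDIF, 1}, {Op::ALU, 2},
   };
}
```

With these assumed costs, the old rule estimates 13 cycles for SIMD16 (the branch body counts as 5) and 18 cycles for SIMD32. Passing a branch weight of 1.0 for every width, which is what removing the heuristic amounts to in this sketch, gives 18 for both, so the estimate no longer penalizes SIMD32 for branches relative to narrower modes.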
  • mentioned in issue #3753 (closed)

  • mentioned in issue #3716 (closed)

  • This patch harms synmark OglDrvState by ~2.9%, but this benchmark renders multiple scenes from other workloads in tiny rectangles.

    DrvState viewports are indeed tiny, but it's also marginally CPU bound, and as a result it has high variance. Although I see the improvements in my data for the other tests, I don't see any regression in DrvState.

    However, I do see a 1-2% drop in SynMark PSBump8 and a potential 3% drop in Heaven. For now I have a Heaven data point from only a single device, so I'll collect more data from more devices to verify that it wasn't just bad luck (e.g. a GPU hang).

  • The Heaven drop is real, but it's not due to this Mesa change. I'll file a new bug on it.

  • mentioned in issue #3771
