intel/ir/gen12+: Work around FS performance regressions due to SIMD32 discard divergence.

Merged Francisco Jerez requested to merge currojerez/mesa:intel-gen12-simd32-perf-fix into master

This avoids some performance regressions on Gen12 platforms caused by SIMD32 fragment shaders reported in titles like Dota2, TF2, Xonotic, and GFXBench5 Car Chase and Aztec Ruins.

The most obvious pattern in the regressing shaders I identified among these workloads is that they all had non-uniform discard statements, which are handled rather optimistically by the current IR analysis pass: No penalty is currently applied to the SIMD32 variant of the shader in the form of differing branching weights like we do for other control flow instructions in order to account for the greater likelihood of divergence of a SIMD32 shader.

Simply changing that by giving the same treatment to discard statements as we give to other branching instructions seemed to hurt more than it helped on platforms earlier than Gen12, since it reversed most of the improvement obtained from SIMD32 fragment shaders in Manhattan for no measurable benefit in other workloads (Manhattan has a handful of shaders with statically non-uniform discard statements which actually perform better in SIMD32 mode due to their approximate dynamic uniformity). For that reason this change is applied to Gen12+ platforms only.

I've been running a number of tests trying to understand the difference in behavior between Gen12 and earlier platforms, and most of the evidence I've gathered seems to point at EU fusion being the culprit: Unlike previous generations, on Gen12 EUs are arranged in pairs which execute instructions in lockstep, giving an effective warp size of 64 threads in SIMD32 mode, which seems to increase the likelihood for control flow divergence in some of the affected shaders significantly.

Merge request reports