Enable SIMD32 fragment shaders in the Intel back-end (4)

Francisco Jerez requested to merge currojerez/mesa:intel-simd32-mr-4 into master

This is a follow-up to my previous MRs !3296 (closed), !3302 (closed) and !3545 (closed), which prepared the compiler back-end for SIMD32 fragment shaders. SIMD32 can provide a performance advantage under certain conditions. This fourth MR finally flips the switch after addressing the main remaining difficulty: determining whether SIMD32 helps more than it hurts. It does so by implementing a heuristic based on the analysis pass introduced by:

intel/ir: Import shader performance analysis pass.

Which attempts to model the performance of a shader based on static analysis. Among other interesting potential applications of the pass (described in the commit message) it allows the back-end to get a rough idea of the relative performance of the various SIMD variants of a shader, making the implementation of a SIMD32 heuristic straightforward (which is what the three commits after that one do).
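To make the idea concrete, here is a minimal sketch of how per-variant estimates could drive the choice of dispatch width. It is not the heuristic implemented by this series: `VariantEstimate`, `estimated_throughput` and `pick_dispatch_width` are hypothetical names, and the model (channels processed per modelled cycle) is an assumption made purely for illustration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical summary of what a static performance model might report
// for one compiled variant of a fragment shader (not a Mesa type).
struct VariantEstimate {
    unsigned dispatch_width;   // channels per thread, e.g. 8, 16 or 32
    double cycles_per_thread;  // modelled cycles needed to run one thread
};

// Modelled throughput in channels per cycle (assumed figure of merit).
inline double estimated_throughput(const VariantEstimate &v)
{
    assert(v.cycles_per_thread > 0.0);
    return v.dispatch_width / v.cycles_per_thread;
}

// Return the dispatch width of the variant with the highest modelled
// throughput.  A later variant replaces the current best only if it is
// strictly better, so ties go to the variant listed first.
inline unsigned pick_dispatch_width(const std::vector<VariantEstimate> &variants)
{
    assert(!variants.empty());
    const VariantEstimate *best = &variants.front();
    for (const VariantEstimate &v : variants) {
        if (estimated_throughput(v) > estimated_throughput(*best))
            best = &v;
    }
    return best->dispatch_width;
}
```

With variants listed from narrowest to widest, a SIMD32 variant is selected in this sketch only when its modelled cost per thread is less than twice that of the SIMD16 variant. Otherwise the wider dispatch does not pay for itself in the model.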

Commits between:

intel/ir: Use brw::performance object instead of CFG cycle counts for codegen stats.

and:

intel/ir: Remove scheduling-based cycle count estimates.

Move away from the current approach of providing scheduling-derived timings via debug output for tools like shader-db. The main reason is that the scheduler is blind to the effect of compiler passes run after it, like the TGL software scoreboard pass or the dependency control pass used on VEC4 platforms, even though both can have a significant impact on performance. Those commits address the problem and in addition clean up the redundant bits in the instruction scheduler and CFG data structure.
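The ordering problem can be shown with a toy model. The sketch below is purely illustrative and does not reflect how the TGL software scoreboard pass or the dependency control pass actually work: `Instr`, `estimate_cycles` and `insert_sync` are hypothetical, and the "late pass" simply inserts a fixed-cost synchronization instruction before every instruction flagged as needing one.

```cpp
#include <cassert>
#include <vector>

// Toy instruction: a modelled cost plus a flag saying whether a late
// pass must insert a synchronization instruction before it.
struct Instr {
    unsigned cycles;
    bool needs_sync;
};

// Sum of modelled costs; stands in for any static cycle estimate.
inline unsigned estimate_cycles(const std::vector<Instr> &prog)
{
    unsigned total = 0;
    for (const Instr &i : prog)
        total += i.cycles;
    return total;
}

// Hypothetical late pass: returns a copy of the program with a sync
// instruction of cost sync_cycles inserted before each flagged
// instruction.
inline std::vector<Instr> insert_sync(const std::vector<Instr> &prog,
                                      unsigned sync_cycles)
{
    std::vector<Instr> out;
    for (const Instr &i : prog) {
        if (i.needs_sync)
            out.push_back(Instr{sync_cycles, false});
        out.push_back(i);
    }
    return out;
}
```

An estimate taken at scheduling time corresponds to calling `estimate_cycles` on the input of `insert_sync`, whereas an analysis run on the final program sees its output. The difference between the two numbers is exactly the cost the earlier estimate cannot observe.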

Most of the other patches prepare for SIMD32 fragment shaders (the first few fix three different SIMD32 codegen issues that would otherwise lead to hangs on TGL).

Note that this series somewhat exacerbates a pre-existing synchronization issue in the Iris driver addressed by my previous !3875 (merged), so merging that MR first is recommended but not strictly necessary.

A handful of benchmarks improve by between 1% and 8% with this series. E.g. on the ICL platforms Felix Degrood and I have been running tests on, Manhattan improves by 3% to 7%, Aztec Ruins VK by ~2.3%, SynMark2 OglPSPom by 4% to 9%, and TRex by 1% to 3%. Results are in the same ballpark on the other platforms I've run benchmarks on. Additional test reports would be welcome, particularly if they uncover any regressions. The only expected performance penalty is the increased CPU time spent compiling an additional variant of each fragment shader, which shouldn't be appreciable with shader caching enabled.

Edited by Francisco Jerez
