intel/xe3+: Take advantage of Variable Register Thread on Xe3+.
This MR adds support for VRT, which is one of the major improvements introduced in the Xe3 ISA. In short, it allows EU threads to use a variable number of registers, which the driver can specify in a pipelined fashion for each shader stage. With this feature a thread can use up to 256 GRFs, at the cost of reducing the number of threads that can execute concurrently on the same EU. Conversely, fewer than 128 GRFs can be allocated per thread, which allows the EU to execute a larger number of threads concurrently.
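The trade-off between registers per thread and threads per EU can be illustrated with a toy model. This is only a sketch: the per-EU register budget and the allocation granularity below are hypothetical values chosen so the arithmetic is easy to follow, not figures taken from the hardware documentation or from this MR. Only the 256 GRF upper bound comes from the description above.

```python
# Toy occupancy model for a variable-register-thread scheme.
# ASSUMPTION: both constants below are hypothetical illustration values.
EU_GRF_BUDGET = 1024       # assumed number of GRFs shared by all threads of one EU
GRF_GRANULARITY = 32       # assumed allocation step for per-thread register blocks

# Upper bound per thread, as stated in the MR description.
MAX_GRF_PER_THREAD = 256


def round_up_grf(needed):
    """Round a shader's register demand up to the assumed allocation step."""
    if not 1 <= needed <= MAX_GRF_PER_THREAD:
        raise ValueError("register demand must be between 1 and 256 GRFs")
    # Ceiling division, then scale back to a multiple of the granularity.
    return -(-needed // GRF_GRANULARITY) * GRF_GRANULARITY


def threads_per_eu(needed):
    """Number of threads that fit in the assumed per-EU register budget."""
    return EU_GRF_BUDGET // round_up_grf(needed)


if __name__ == "__main__":
    for grf in (64, 96, 128, 192, 256):
        print(f"{grf:3d} GRFs/thread -> {threads_per_eu(grf)} threads/EU")
```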
This has two primary benefits. First, the amount of spilling is vastly reduced (spills drop by ~95% and fills by ~48% on shader-db; spills drop by ~86% and fills by ~76% on fossil-db), which is expected to reduce bandwidth consumption and improve run-time performance. Second, the ability to compile a SIMD32 variant of most shaders allows a more compile-time-efficient heuristic to be used for SIMD width selection:
The new "optimistic" SIMD selection logic implemented here for PS, CS, TASK and MESH shaders starts with the SIMD width that potentially gives the highest performance and only compiles additional, narrower variants if that fails (typically due to spilling). The old "pessimistic" logic did the opposite: it started with the narrowest SIMD width and compiled additional variants with increasing register pressure until one of them failed to compile. So in typical non-spilling cases, where we formerly compiled both SIMD16 and SIMD32 variants of the same shader, this change halves the number of backend compilations required to build it. With multi-polygon PS dispatch enabled (it is disabled by default right now) the effect is even more dramatic, since the number of compiler iterations is reduced to a fifth in the best-case scenario.
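The difference in compilation counts between the two strategies can be sketched as follows. This is a simplified model, not the driver code: `compiles_ok` is a hypothetical callback standing in for a backend compilation that succeeds without spilling, the function names are invented for illustration, and the fallback that accepts a spilling variant when every width fails is left out. The sketch also assumes the old logic ends up using the widest variant that compiled.

```python
def pessimistic_select(widths, compiles_ok):
    """Old-style selection: start narrow and go wider until a variant fails.

    Returns (chosen_width, number_of_backend_compilations).
    """
    best = None
    compilations = 0
    for width in sorted(widths):
        compilations += 1
        if not compiles_ok(width):
            break          # a wider variant failed, keep the last good one
        best = width
    return best, compilations


def optimistic_select(widths, compiles_ok):
    """New-style selection: start wide and only go narrower on failure.

    Returns (chosen_width, number_of_backend_compilations).
    """
    compilations = 0
    for width in sorted(widths, reverse=True):
        compilations += 1
        if compiles_ok(width):
            return width, compilations   # first success wins, stop compiling
    return None, compilations


if __name__ == "__main__":
    never_spills = lambda width: True
    print(pessimistic_select([16, 32], never_spills))  # two compilations
    print(optimistic_select([16, 32], never_spills))   # one compilation
```

In this model both strategies pick the same width in the non-spilling case, but the optimistic one gets there with a single compilation. With five candidate variants that all compile, the same counting gives five compilations versus one, which matches the best case described above for multi-polygon PS dispatch.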
/cc @cmarcelo