intel: Fix pixel-pipe performance bottlenecks on fused Gen12 parts. (!8749) · Merge requests · Mesa / mesa

Francisco Jerez requested to merge currojerez/mesa:gen12-pixel-pipe-balancing into master Jan 27, 2021

Fused Gen12 platforms currently use the pre-programmed pixel-pipe hashing tables, which are constructed on the assumption that all pixel pipes have the same processing power. However, this is not the case on fused configurations where one or more pixel pipes are missing subslices: the resulting load imbalance can cause a serious bottleneck on the pixel pipe with the lowest EU count. This is likely to limit performance on most TGL and DG1 parts with fewer than 96 EUs.
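
To make the size of that bottleneck concrete, here is a back-of-the-envelope sketch in C. It relies on an assumption that is not stated in this description: an unfused 96 EU part has three pixel pipes with two dual-subslices each, so an 80 EU part ends up with 2, 2 and 1 dual-subslices per pipe. The helper name and its input array are hypothetical and are not taken from the Mesa code.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of the remaining pixel throughput that stays usable when work is
 * split evenly across pipes. dss_per_pipe[i] is the number of dual-subslices
 * left in pipe i (hypothetical input, not a Mesa structure). */
static double
uniform_split_efficiency(const unsigned *dss_per_pipe, size_t n_pipes)
{
   assert(n_pipes > 0);

   unsigned total = 0;
   unsigned weakest = dss_per_pipe[0];

   for (size_t i = 0; i < n_pipes; i++) {
      assert(dss_per_pipe[i] > 0);
      total += dss_per_pipe[i];
      if (dss_per_pipe[i] < weakest)
         weakest = dss_per_pipe[i];
   }

   /* Every pipe receives 1/n of the work, so the frame is done only when the
    * weakest pipe is done: the GPU behaves as if all pipes were that small. */
   return (double)(n_pipes * weakest) / total;
}
```

Under those assumptions, calling the helper with `{2, 2, 1}` gives 0.6: an even split leaves roughly 40% of the remaining hardware idle while the weakest pipe catches up, so a perfectly balanced table could gain at most about 67% in purely pixel-bound workloads.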

This series addresses the problem by calculating a pixel-pipe hashing table that matches the balance of computational power available in each pixel pipe. The observed FPS improvement ranges between 10% and 63% for most non-trivial graphics workloads I've tried on an 80 EU TGL platform:

gputest/pixmark_piano:      XXX ±0.00% x12 -> XXX ±0.08% x15     d=62.89% ±0.10%       p=0.00%
gputest/pixmark_volplosion: XXX ±0.03% x12 -> XXX ±0.05% x15     d=61.51% ±0.06%       p=0.00%
unigine/valley:             XXX ±0.11% x12 -> XXX ±0.25% x15     d=26.72% ±0.25%       p=0.00%
gfxbench/gl_5_high:         XXX ±0.11% x12 -> XXX ±0.18% x15     d=24.70% ±0.19%       p=0.00%
unigine/heaven:             XXX ±0.05% x12 -> XXX ±0.18% x15     d=23.54% ±0.17%       p=0.00%
steam/csgo:                 XXX ±3.76% x12 -> XXX ±3.90% x15     d=22.75% ±4.36%       p=0.00%
gfxbench/gl_manhattan31:    XXX ±0.26% x12 -> XXX ±0.26% x15     d=22.43% ±0.29%       p=0.00%
gfxbench/gl_4:              XXX ±0.06% x12 -> XXX ±0.39% x15     d=20.92% ±0.35%       p=0.00%
warsow/benchsow:            XXX ±1.69% x12 -> XXX ±2.54% x15     d=19.15% ±2.53%       p=0.00%
gfxbench/gl_trex_off:       XXX ±0.09% x12 -> XXX ±0.30% x15     d=18.84% ±0.27%       p=0.00%

Note that, due to the large number of fusing configurations available and the sheer size of the hashing tables, this series avoids hard-coding them in the source code. Instead, they are computed during driver init based on the number of dual-subslices available in each pixel pipe.
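
As a rough illustration of what computing such a table at init time can look like, the sketch below fills a flat table so that each pipe receives entries in proportion to its dual-subslice count. It uses a generic smooth weighted round-robin scheme, which is a stand-in technique and not necessarily the formula used in this series. The function name, `MAX_PIPES` and the one-dimensional layout are all assumptions made for the example; a real table also has to account for the 2D screen-space pattern, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PIPES 8 /* hypothetical upper bound, chosen for the example */

/* Fill table[0..n_entries) with pipe indices so that pipe p appears in
 * proportion to weight[p], interleaved as evenly as possible. */
static void
fill_weighted_hash_table(uint8_t *table, size_t n_entries,
                         const unsigned *weight, size_t n_pipes)
{
   assert(n_pipes > 0 && n_pipes <= MAX_PIPES);

   long credit[MAX_PIPES] = {0};
   long total = 0;

   for (size_t p = 0; p < n_pipes; p++)
      total += weight[p];
   assert(total > 0);

   for (size_t i = 0; i < n_entries; i++) {
      size_t best = 0;

      /* Every pipe earns credit equal to its weight; the pipe with the most
       * accumulated credit gets this entry (ties go to the lowest index). */
      for (size_t p = 0; p < n_pipes; p++) {
         credit[p] += weight[p];
         if (credit[p] > credit[best])
            best = p;
      }

      table[i] = (uint8_t)best;
      credit[best] -= total; /* pay for the entry that was just assigned */
   }
}
```

With weights `{2, 2, 1}` the pattern repeats every five entries as 0, 1, 2, 0, 1, so the pipe with half the dual-subslices is handed half as many entries as each of the others.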

Benchmarking is encouraged on any TGL or DG1 system with fewer than 96 EUs (you can quickly check the EU count in /sys/kernel/debug/dri/0/i915_sseu_status). Benchmark results on ICL systems with fewer than 64 EUs would also be interesting, because this MR reworks the existing hash table code to re-use the same computation formula. The resulting tables should have equivalent performance, but they aren't numerically identical to the original ones, so it's worth checking that they don't lead to regressions.

Edited Feb 16, 2021 by Francisco Jerez
