intel: Pixel pipeline optimizations for XeHP hardware.

Merged Francisco Jerez requested to merge currojerez/mesa:intel-xehp-pixel-hash into main

This series is part of the XeHP enabling effort. It's not strictly required for functional correctness, but it significantly improves performance for most non-trivial workloads on all DG2 platforms it's been tested on so far, particularly on fused configurations. For example, on DG2-448 it gives us a 20%-40% performance improvement on most interesting workloads:

UnigineValley.ogl-g6                                                       +38.76%
GFXBench 5 Public Candidate.ogl-AztecRuins-high-offscreen-2160p            +36.87%
GFXBench 5 Public Candidate.ogl-CarChase-Off-38x21                         +32.07%
GFXBench 5 Public Candidate.ogl-Manhattan-Off-38x21                        +29.78%
Dota 2 (replay Jul 2020).ogl-g6                                            +29.20%
ShooterGame.vk-g6                                                          +27.54%
Shadow of Mordor.vk-g6                                                     +25.33%
Dota 2 (replay Jul 2020).vk-g6                                             +22.93%
Counter-Strike GO.ogl-g6                                                   +22.66%
Xonotic.38x21-Ultimate                                                     +14.12%
Team Fortress 2.ogl-g6                                                     +2.82%

The first three patches of this series extend our device info infrastructure to handle hardware configurations with multiple slices (which are far more common on XeHP hardware than they used to be), and configurations where some subslices are usable only for GPGPU or only for 3D (which had never been the case before). Note that the last of the device info patches depends on a kernel interface that hasn't been upstreamed yet, which is the main reason this MR is marked as draft. To avoid depending on the geometry topology kernel interface, which hasn't been upstreamed yet, this MR includes a hack among the device info patches that attempts to guess the subset of slices which are available for 3D. The proper solution will be submitted as a follow-up MR that depends on an unreleased kernel API (see !14143 (merged)). [Update: I've resubmitted the device info patches as MR !14436 (merged), since they're independent of the remaining changes and have already been reviewed.]
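To make the device info change concrete, here is a minimal sketch of how a topology with several slices and a separate "usable for 3D" mask could be queried. It is only an illustration: `struct topology`, `geometry_mask`, `slice_3d_subslices()` and `total_3d_subslices()` are hypothetical names invented for this example and are not the structures or helpers used by the patches in this series.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SLICES 8

/* Hypothetical topology description (an assumption for illustration,
 * not Mesa's actual device info layout). */
struct topology {
   uint8_t  slice_mask;                 /* bit s set: slice s is present */
   uint16_t subslice_mask[MAX_SLICES];  /* enabled subslices per slice */
   uint16_t geometry_mask[MAX_SLICES];  /* subslices usable for 3D work */
};

/* Count the set bits of a 16-bit mask. */
static unsigned
popcount16(uint16_t v)
{
   unsigned n = 0;

   while (v) {
      v &= (uint16_t)(v - 1);  /* clear the lowest set bit */
      n++;
   }

   return n;
}

/* Number of subslices in slice s that are both enabled and usable
 * for 3D; returns 0 if the slice itself is absent. */
static unsigned
slice_3d_subslices(const struct topology *t, unsigned s)
{
   assert(s < MAX_SLICES);

   if (!(t->slice_mask & (1u << s)))
      return 0;

   return popcount16(t->subslice_mask[s] & t->geometry_mask[s]);
}

/* Total number of 3D-capable subslices across all slices. */
static unsigned
total_3d_subslices(const struct topology *t)
{
   unsigned total = 0;

   for (unsigned s = 0; s < MAX_SLICES; s++)
      total += slice_3d_subslices(t, s);

   return total;
}
```

With two slices present, where slice 0 has four subslices enabled but only two of them usable for 3D and slice 1 has three subslices all usable for 3D, `total_3d_subslices()` returns 5 rather than the 7 a naive count of enabled subslices would give.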

The next few patches include some minor fixes like updating the PSD thread counts and dropping some unnecessary flushes from BLORP.

The rest of the series implements an algorithm used to calculate the pixel pipe hashing tables required to get optimal load balancing (which is what gives us most of the performance improvement), and deals with programming the slice hashing tables on XeHP hardware. The last patch tweaks the cross-slice hashing mode in order to get better L1/L2 cache utilization. That required some additional changes to the pixel pipe hashing tables in order to avoid bottlenecks in a single slice, and it gives us a measurable performance improvement on all configs tested so far, regardless of whether they suffer a pixel pipe imbalance with the default hashing behavior or not. On balanced configs with a power-of-two number of subslices, though, the overall improvement is expected to be smaller.
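As a rough illustration of why a hashing table helps on imbalanced configurations, the sketch below fills a table that maps pixel blocks to one of two pixel pipes in proportion to the number of subslices behind each pipe, so the pipe with fewer subslices receives proportionally less work. This is not the algorithm implemented in this series. It is a simple error-accumulation scheme, and the table dimensions and the names `fill_hash_table()` and `count_pipe()` are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_W 16
#define TABLE_H 16

/* Assign each table entry to pipe 0 or pipe 1 so that pipe 1 gets
 * about n1 / (n0 + n1) of the entries, interleaved as evenly as an
 * integer accumulator allows. n0 and n1 are the subslice counts
 * behind each pipe (hypothetical inputs). */
static void
fill_hash_table(uint8_t table[TABLE_H][TABLE_W], unsigned n0, unsigned n1)
{
   const unsigned total = n0 + n1;
   unsigned acc = 0;

   assert(total > 0);

   for (unsigned y = 0; y < TABLE_H; y++) {
      for (unsigned x = 0; x < TABLE_W; x++) {
         /* Accumulate pipe 1's share; emit a pipe 1 entry each time
          * the accumulator overflows the total. */
         acc += n1;
         if (acc >= total) {
            acc -= total;
            table[y][x] = 1;
         } else {
            table[y][x] = 0;
         }
      }
   }
}

/* Count how many table entries are assigned to the given pipe. */
static unsigned
count_pipe(uint8_t table[TABLE_H][TABLE_W], unsigned pipe)
{
   unsigned n = 0;

   for (unsigned y = 0; y < TABLE_H; y++) {
      for (unsigned x = 0; x < TABLE_W; x++)
         n += (table[y][x] == pipe);
   }

   return n;
}
```

For a 3:2 subslice split, `fill_hash_table(table, 3, 2)` assigns 154 of the 256 entries to pipe 0 and 102 to pipe 1, whereas an even split would leave the smaller pipe as the bottleneck. A real table also has to account for spatial locality and cross-slice hashing, which this sketch ignores.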

Edited by Francisco Jerez
