WIP: intel/fs: Reduce flag usage on Gen8+
This MR contains something of a grab-bag of attempts to make our flag handling better. It's not a strict improvement, and I wouldn't say every patch in here is ready to go, but I think there are some useful ideas in here, which is why I'm posting it as a WIP. The biggest impact comes from a patch which flips on CSEL for all 32-bit bcsel ops. That gets rid of a huge amount of our flag use and turns into a greater than 1% overall reduction in cycles in shader-db, at the cost of a slight increase in instruction count and some extra spills and fills:
total instructions in shared programs: 17078969 -> 17085094 (0.04%)
instructions in affected programs: 2311120 -> 2317245 (0.27%)
helped: 2139
HURT: 3105
helped stats (abs) min: 1 max: 65 x̄: 3.37 x̃: 1
helped stats (rel) min: 0.01% max: 14.29% x̄: 1.12% x̃: 0.58%
HURT stats (abs) min: 1 max: 154 x̄: 4.29 x̃: 1
HURT stats (rel) min: 0.04% max: 18.62% x̄: 1.71% x̃: 0.59%
95% mean confidence interval for instructions value: 0.88 1.45
95% mean confidence interval for instructions %-change: 0.48% 0.63%
Instructions are HURT.
total cycles in shared programs: 363122075 -> 359205666 (-1.08%)
cycles in affected programs: 233839748 -> 229923339 (-1.67%)
helped: 7504
HURT: 4026
helped stats (abs) min: 1 max: 101806 x̄: 542.82 x̃: 44
helped stats (rel) min: <.01% max: 68.38% x̄: 6.26% x̃: 2.46%
HURT stats (abs) min: 1 max: 6160 x̄: 38.97 x̃: 6
HURT stats (rel) min: <.01% max: 82.58% x̄: 1.77% x̃: 0.47%
95% mean confidence interval for cycles value: -413.67 -265.67
95% mean confidence interval for cycles %-change: -3.64% -3.28%
Cycles are helped.
total spills in shared programs: 10912 -> 11149 (2.17%)
spills in affected programs: 1634 -> 1871 (14.50%)
helped: 2
HURT: 32
total fills in shared programs: 12597 -> 12978 (3.02%)
fills in affected programs: 2917 -> 3298 (13.06%)
helped: 2
HURT: 32
LOST: 38
GAINED: 13
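
As a rough illustration of why switching bcsel to CSEL cuts flag traffic, here is a toy C model. Everything in it (`toy_state`, `toy_cmp_nz`, `toy_sel`, `toy_bcsel_cmp_sel`, `toy_bcsel_csel`) is hypothetical and is not code from the patches. It also assumes a plain integer compare against zero and ignores the operand-type details of the real CSEL instruction.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy model, not Mesa code: a single flag register plus a
 * counter of how many times it has been written. */
struct toy_state {
   bool flag;
   unsigned flag_writes;
};

/* CMP.nz: compare src against zero and write the result to the flag. */
static void
toy_cmp_nz(struct toy_state *s, uint32_t src)
{
   s->flag = (src != 0);
   s->flag_writes++;
}

/* Predicated SEL: read the flag to choose between the two sources. */
static uint32_t
toy_sel(const struct toy_state *s, uint32_t a, uint32_t b)
{
   return s->flag ? a : b;
}

/* bcsel emitted as CMP + SEL: one flag write and one flag read. */
static uint32_t
toy_bcsel_cmp_sel(struct toy_state *s, uint32_t cond, uint32_t a, uint32_t b)
{
   toy_cmp_nz(s, cond);
   return toy_sel(s, a, b);
}

/* bcsel emitted as one CSEL-like op: the condition is a third source that
 * is compared against zero directly, so the flag is never touched. */
static uint32_t
toy_bcsel_csel(struct toy_state *s, uint32_t cond, uint32_t a, uint32_t b)
{
   (void)s; /* intentionally unused: no flag access */
   return cond != 0 ? a : b;
}
```

Both lowerings return the same value for the same inputs, but only the CMP + SEL form bumps `flag_writes`. In this model that missing flag write is what leaves the scheduler with one less shared resource to order instructions around.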
There are a few other interesting patches in here:
- A change to nir_algebraic.c which lets us easily add arguments to algebraic passes. In particular, it's now easy to make intel-specific passes which take a `const struct gen_device_info *devinfo` parameter which can then be used to predicate optimizations and/or lowerings.
- An alternative patch to enable CSEL for a bunch of the obvious special cases that doesn't require !2680 or any other new framework.
- A patch to the scheduler which gets rid of WaW dependencies for flag writes which are never used. This allows the scheduler to move CMP instructions past each other even though they must always write the flag. The results from this were a bit of a wash, I'm afraid. Just letting SELs move around by emitting CSEL instead seems to get us most of the scheduling flexibility that's actually useful.
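
To make the scheduler idea in the last bullet concrete, here is a minimal C sketch of dropping WaW dependencies between flag writes that nobody reads. It is not the scheduler patch itself: `toy_inst`, `toy_flag_write_is_live` and `toy_count_flag_waw_deps` are made-up names, and the model assumes a single flag register, a single basic block, and a flag that is not live out of that block. It only counts WaW edges, so RaW and WaR ordering is out of scope.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy instruction, not the scheduler's real IR. */
struct toy_inst {
   bool writes_flag;
   bool reads_flag;
};

/* True if the flag value written by insts[i] is read before the next flag
 * write. An instruction that both reads and writes the flag counts as a
 * reader first. */
static bool
toy_flag_write_is_live(const struct toy_inst *insts, size_t n, size_t i)
{
   for (size_t j = i + 1; j < n; j++) {
      if (insts[j].reads_flag)
         return true;
      if (insts[j].writes_flag)
         return false;
   }
   return false; /* assumption: the flag is dead at the end of the block */
}

/* Count WaW dependencies between pairs of flag writers. With skip_dead set,
 * a pair where neither write is ever read gets no edge, so those two
 * instructions are free to move past each other. */
static size_t
toy_count_flag_waw_deps(const struct toy_inst *insts, size_t n, bool skip_dead)
{
   size_t deps = 0;

   for (size_t i = 0; i < n; i++) {
      if (!insts[i].writes_flag)
         continue;

      bool i_live = toy_flag_write_is_live(insts, n, i);

      for (size_t j = i + 1; j < n; j++) {
         if (!insts[j].writes_flag)
            continue;

         bool j_live = toy_flag_write_is_live(insts, n, j);

         if (skip_dead && !i_live && !j_live)
            continue;

         deps++;
      }
   }

   return deps;
}
```

For a block of three CMPs followed by one flag-reading SEL, this model counts three WaW edges without the change and two with it, since only the last CMP's flag write is live.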
As always, questions, comments, and other feedback welcome!