brw: Tune vectorizer settings, teach it about existing block load overfetching, add non-block overfetching
This MR brings significant improvements to brw's memory load handling via small changes to our NIR vectorizer settings, building upon @mareko's recent improvements to `nir_opt_load_store_vectorize` in !29398 (merged).
For convergent memory loads, brw uses block (LSC transpose) read messages. Instead of performing a vectorized load, which replicates the uniform data per SIMD lane, these read the uniform data as scalars, using one SIMD channel per value. One 32B register can then hold 8 32-bit scalar values (or, on Xe2, one 64B register can hold 16) instead of a single value, greatly reducing our register usage.
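For a concrete sense of the savings, here is a back-of-the-envelope sketch in plain C (illustrative only, not Mesa code; the SIMD16/vec4 numbers are just an example):

```c
/* Register math for the paragraph above (illustrative only, not Mesa code).
 * A convergent vec4 of dwords loaded per-lane in SIMD16 occupies
 * 4 * 16 * 4 = 256 bytes (8 32B GRFs), while the same data read as a block
 * (transpose) load is just 4 scalars = 16 bytes, half of one 32B GRF. */
#include <stdio.h>

int main(void)
{
   const unsigned simd_width = 16;  /* lanes per dispatch */
   const unsigned grf_bytes = 32;   /* 32B GRFs; 64B on Xe2 */
   const unsigned comps = 4;        /* a convergent vec4 of 32-bit values */

   unsigned per_lane_bytes = comps * simd_width * 4; /* one copy per lane */
   unsigned block_bytes = comps * 4;                 /* one scalar per value */

   printf("vectorized: %u bytes (%u GRFs), block: %u bytes\n",
          per_lane_bytes, per_lane_bytes / grf_bytes, block_bytes);
   return 0;
}
```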
Because it doesn't make much sense to load less than a single register, we always load blocks of 8/16/32/64 (the latter LSC-only) values. Even if the shader only requests v.x, v.xy, or v.xyz, we load 8 values, implicitly overfetching. However, we did this in the backend and never taught the NIR vectorizer about it: it would see (e.g.) a 32x3 load and not realize we were actually loading 32x8, so it would frequently miss opportunities to merge overlapping or adjacent loads.
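To illustrate the implicit overfetch, here is a standalone sketch (hypothetical helper, not the actual backend lowering) that rounds a requested load up to the block sizes we can emit:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not the actual backend lowering: round a requested
 * block-load size up to the sizes we can actually emit. Block reads come in
 * 8/16/32/64-dword flavors (64 being LSC-only). */
static unsigned
block_load_dwords(unsigned requested_dwords)
{
   static const unsigned sizes[] = { 8, 16, 32, 64 };
   for (unsigned i = 0; i < 4; i++) {
      if (requested_dwords <= sizes[i])
         return sizes[i];
   }
   assert(!"larger requests would be split into multiple block loads");
   return 64;
}

int main(void)
{
   /* A 32x3 load still fetches a full 8-dword block, so a vectorizer that
    * only sees "vec3" misses that the next 5 dwords come along for free. */
   printf("vec3 load fetches %u dwords\n", block_load_dwords(3));
   return 0;
}
```

Two adjacent vec3 loads each pull a full 8-dword block here, which is exactly the kind of overlap the vectorizer couldn't see.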
To remedy this, we:

- Increase the `max_hole` size in `nir_opt_load_store_vectorize` to allow holes of up to `7*4 = 28` bytes. Drivers can reject large holes in their "should we vectorize?" callback. In fact, all drivers currently reject any holes (though I believe AMD MRs exist to allow them), so this change on its own should have no practical effect.
- Stop considering unread components in the backend's push constant handling.
- Adjust `brw_nir_should_vectorize_mem` to allow holes, and to allow vectorization when the number of components read is a non-standard size (e.g. vec7). For block loads, we allow holes of `8 - low->num_components` components (recognizing that we're going to load at least 8 components anyway, so even if 7 of them are a hole, it's fine). For non-block loads, we allow small 4-byte holes, so that (vec3 + blank + vec3 + blank) can be merged into a vec8. See the sketch after this list.
- Nerf the overzealous robustness handling for 64-bit global memory loads. It turns out the handling isn't actually buying us any real robustness, and it prevents `vec8 + vec8 -> vec16` merging in a lot of cases.
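Here is a minimal sketch of the hole-acceptance rule from the `brw_nir_should_vectorize_mem` bullet above, written as a standalone helper with hypothetical names; the real callback operates on NIR intrinsics and checks more than just holes:

```c
#include <stdbool.h>

/* Hypothetical helper mirroring the rules described above; not the actual
 * brw_nir_should_vectorize_mem. `hole_bytes` is the gap between the end of
 * the low load and the start of the high load. */
bool
accept_hole(bool is_block_load, unsigned low_num_components,
            unsigned bit_size, unsigned hole_bytes)
{
   if (hole_bytes == 0)
      return true;

   if (is_block_load) {
      /* A block load fetches at least 8 components anyway, so a hole that
       * still fits inside that minimum block costs nothing extra. */
      if (low_num_components >= 8)
         return false;
      unsigned free_components = 8 - low_num_components;
      return hole_bytes <= free_components * (bit_size / 8);
   }

   /* Non-block loads: only allow small 4-byte holes, enough to merge
    * (vec3 + blank + vec3 + blank) into a vec8. */
   return hole_bytes <= 4;
}
```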
With this, we have the following fossil-db results (Lunarlake quoted, but Alchemist is similar):
Fossil/Game | Instructions | Sends | Spills | Fills |
---|---|---|---|---|
Borderlands 3 DX12 | -5.42% | -20.85% | +0.46% | +1.25% |
Cyberpunk 2077 | -1.06% | -13.82% | -1.72% | +1.07% |
Strange Brigade | -2.48% | -10.75% | n/a | n/a |
Red Dead Redemption 2 | -1.60% | -10.20% | -4.13% | -2.68% |
q2rtx-rt-pipeline | -3.44% | -1.94% | -47.06% | -68.66% |
parallel-rdp/subgroup | -3.20% | -12.80% | n/a | n/a |
— | — | — | — | — |
Affected | -0.33% | -2.30% | -0.14% | +1.08% |
Overall | -0.30% | -2.01% | -0.11% | +0.95% |
A previous version of this MR improved Cyberpunk 2077 trace replay performance on Arc A770 by 3.5%. Our performance lab is currently on the fritz but I'll hopefully be able to get additional numbers next week.
+@mareko due to NIR vectorizer core code changes
+@llandwerlin may want to look at the anv patch or compiler patches
+@idr @cmarcelo compiler patches!
+@tripzero @ccallawa performance improvement FYI