brw: Tune vectorizer settings, teach it about existing block load overfetching, add non-block overfetching

Kenneth Graunke requested to merge kwg/mesa:brw-vectorize into main

This MR brings significant improvements in brw's memory load handling via small changes to our NIR vectorizer settings, building upon @mareko's recent improvements to nir_opt_load_store_vectorize in !29398 (merged).

For convergent memory loads, brw uses block (LSC transpose) read messages. Instead of performing a vectorized load, which replicates the uniform data across every SIMD lane, these read the uniform data as scalars, using one SIMD channel per value. That lets one 32B register hold 8 32-bit scalar values (or 16 values in a 64B register on Xe2) instead of a single value, greatly reducing register usage.
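To put rough numbers on the savings, here is a minimal back-of-the-envelope sketch in plain C (not Mesa code; the constants assume a pre-Xe2 32-byte GRF, 32-bit values, and SIMD16 dispatch, and the variable names are purely illustrative):

```c
#include <stdio.h>

int main(void)
{
   const unsigned grf_bytes = 32;      /* 64 on Xe2 */
   const unsigned value_bytes = 4;     /* 32-bit scalar */
   const unsigned simd_width = 16;     /* SIMD16 dispatch */
   const unsigned vec_components = 4;  /* e.g. a vec4 UBO load */

   /* Vectorized load: each component is replicated across every lane. */
   unsigned vectorized_regs =
      vec_components * (simd_width * value_bytes) / grf_bytes;

   /* Block (transpose) load: one lane per scalar value, packed densely. */
   unsigned block_regs =
      (vec_components * value_bytes + grf_bytes - 1) / grf_bytes;

   printf("vectorized: %u GRFs, block: %u GRF(s)\n",
          vectorized_regs, block_regs);
   return 0;
}
```

For a SIMD16 vec4 load of uniform data, this prints 8 GRFs for the vectorized form versus a single GRF for the block read.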

Because it doesn't make much sense to load less than a single register, we always load blocks of 8/16/32/64 (the last LSC-only) values. Even if the shader only requests v.x, v.xy, or v.xyz, we load 8 values, implicitly overfetching. However, we did this in the backend and never taught the NIR vectorizer about it. The vectorizer would see, say, a 3-component 32-bit load and not realize we were actually loading 8 components, so it frequently missed opportunities to merge overlapping or adjacent loads.
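As a minimal sketch of that rounding (a hypothetical helper, not the actual backend code):

```c
#include <stdbool.h>

/* Illustrative only: round a requested component count up to the block
 * sizes the backend actually fetches.  Pre-LSC blocks are 8/16/32 dwords;
 * LSC adds 64.
 */
static unsigned
block_components_fetched(unsigned num_components, bool has_lsc)
{
   const unsigned max_block = has_lsc ? 64 : 32;
   unsigned block = 8;

   /* Loads wider than the maximum block would be split; not modeled here. */
   while (block < num_components && block < max_block)
      block *= 2;

   return block;
}
```

For example, block_components_fetched(3, false) returns 8: a vec3 request still fetches a full 8-dword block.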

To remedy this, we:

  • Increase the max_hole size in nir_opt_load_store_vectorize to allow holes of up to 7*4 = 28 bytes. Drivers can reject large holes in their "should we vectorize?" callback. In fact, all drivers currently reject any holes (though I believe AMD MRs exist to allow them), so this should have no practical effect on other drivers.
  • Stop considering unread components in the backend's push constant handling.
  • Adjust brw_nir_should_vectorize_mem to allow holes, and to allow vectorization when the number of components read is a non-standard size (e.g. vec7). For block loads, we allow holes of up to 8 - low->num_components components (i.e. recognize that we're going to load at least 8 components anyway, so even if 7 of them are a hole, that's fine). For non-block loads, we allow small 4-byte holes, enabling (vec3 + hole + vec3 + hole) to merge into a vec8; see the sketch after this list.
  • Nerf overzealous robustness handling for 64-bit global memory loads. It turns out the handling isn't actually buying us any real robustness, and it prevents vec8+vec8 -> vec16 merging in a lot of cases.
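Here is the sketch referenced in the third bullet: a standalone illustration of the hole-acceptance rules with hypothetical names, not the actual brw_nir_should_vectorize_mem callback or its signature.

```c
#include <stdbool.h>

/* Rough sketch of the hole policy described above.  'hole_components' is
 * the gap between the two loads being considered for merging, measured in
 * components of 'bit_size' bits.
 */
static bool
hole_is_acceptable(bool is_block_load, unsigned low_num_components,
                   unsigned hole_components, unsigned bit_size)
{
   if (is_block_load) {
      /* A block load fetches at least 8 components anyway, so a hole is
       * free as long as the low load plus the hole still fit in one block.
       */
      return low_num_components + hole_components <= 8;
   }

   /* Non-block loads: tolerate a single 4-byte hole, enough to merge
    * vec3 + hole + vec3 + hole into one vec8 load.
    */
   return hole_components * (bit_size / 8) <= 4;
}
```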

With this, we have the following fossil-db results (Lunarlake quoted, but Alchemist is similar):

| Fossil/Game            | Instructions | Sends   | Spills  | Fills   |
|------------------------|--------------|---------|---------|---------|
| Borderlands 3 DX12     | -5.42%       | -20.85% | +0.46%  | +1.25%  |
| Cyberpunk 2077         | -1.06%       | -13.82% | -1.72%  | +1.07%  |
| Strange Brigade        | -2.48%       | -10.75% | n/a     | n/a     |
| Red Dead Redemption 2  | -1.60%       | -10.20% | -4.13%  | -2.68%  |
| q2rtx-rt-pipeline      | -3.44%       | -1.94%  | -47.06% | -68.66% |
| parallel-rdp/subgroup  | -3.20%       | -12.80% | n/a     | n/a     |
| Affected               | -0.33%       | -2.30%  | -0.14%  | +1.08%  |
| Overall                | -0.30%       | -2.01%  | -0.11%  | +0.95%  |

A previous version of this MR improved Cyberpunk 2077 trace replay performance on Arc A770 by 3.5%. Our performance lab is currently on the fritz but I'll hopefully be able to get additional numbers next week.

+@mareko due to NIR vectorizer core code changes
+@llandwerlin may want to look at the anv patch or compiler patches
+@idr @cmarcelo compiler patches!
+@tripzero @ccallawa performance improvement FYI
