Draft: nir: Fuse ffma for (1 + x) * y

Alyssa Rosenzweig requested to merge alyssa/mesa:nir/ffma-1 into main

By the distributive law, we can rewrite (1 + x) * y as y + x * y, which can be fused into ffma(x, y, y) provided the original expressions are inexact.
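To illustrate why the rewrite is restricted to inexact expressions, here is a minimal sketch in plain Python. It is not Mesa code: the helper `fma` is a hypothetical emulation of a fused multiply-add built on `fractions.Fraction`, and it works in double precision rather than the 32-bit floats a shader would normally use. The two forms are algebraically equal but can round differently.

```python
from fractions import Fraction

def fma(a, b, c):
    """Hypothetical helper: compute a*b + c exactly, then round once to a float."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def original(x, y):
    # fadd followed by fmul: the sum and the product are each rounded
    return (1.0 + x) * y

def fused(x, y):
    # y + x*y as a single fused operation with one rounding
    return fma(x, y, y)

# For values that need no rounding, both forms agree exactly.
assert original(0.5, 4.0) == fused(0.5, 4.0) == 6.0

# Here 1 + 2**-53 rounds back to 1.0, so the original form drops x entirely,
# while the fused form keeps its contribution. The results differ in the last bit.
x, y = 2.0 ** -53, 3.0
print(original(x, y), fused(x, y))
```

Because the results can differ in the last bit, a rewrite like this is only safe where bit-exact results are not required.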

This pattern shows up with nir_lower_blend output, in the form of (1 - x) * y. Note we match the more general case of addition: the subtraction is just (1 + fneg(x)) * y, and the fneg(x) gets matched as the x without any special case. This lets us save an add for common blend modes.
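The way the subtraction falls out of the addition rule can be sketched with a toy rewriter over nested tuples. This is a standalone illustration, not the syntax or implementation used by Mesa's algebraic pass; the function name `fuse_one_plus_x_times_y` and the tuple encoding are assumptions made for the example, and the sketch ignores the exactness and use-count conditions.

```python
def is_one(e):
    """True for the floating-point constant 1.0."""
    return isinstance(e, float) and e == 1.0

def fuse_one_plus_x_times_y(expr):
    """Rewrite fmul(fadd(1.0, x), y) into ffma(x, y, y); return anything else unchanged."""
    if not (isinstance(expr, tuple) and expr[0] == 'fmul'):
        return expr
    _, a, b = expr
    # fmul and fadd are commutative, so try both operand orders of each.
    for add, y in ((a, b), (b, a)):
        if isinstance(add, tuple) and add[0] == 'fadd':
            _, p, q = add
            if is_one(p):
                return ('ffma', q, y, y)
            if is_one(q):
                return ('ffma', p, y, y)
    return expr

# (1 - x) * y, written the way a blend lowering might produce it: 1 + fneg(x).
blend = ('fmul', ('fadd', 1.0, ('fneg', 'x')), 'y')
rewritten = fuse_one_plus_x_times_y(blend)
print(rewritten)
```

The fneg subtree is simply bound to the x slot, so the rule needs no dedicated subtraction case.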

On AGX, fmul and ffma have the same performance, so this should be a strict win for cycle count as long as the (1 + x) is only used by fmul, so that the addition is eliminated. This should also be better for register pressure in common situations. We therefore only fuse when the (1 + x) is only used by fmul (and hence will be fused away); the optimization isn't a clear win otherwise. This parallels what we do for regular ffma matching.
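The use-count condition can be sketched as follows. This is a toy model under stated assumptions, not the check Mesa performs: instructions are hypothetical `(dest, op, srcs)` tuples, and `should_fuse` is an invented name for the example.

```python
def users(program, name):
    """All instructions that read the value called `name`."""
    return [ins for ins in program if name in ins[2]]

def should_fuse(program, add_name):
    """Fuse only if the (1 + x) result has users and every one of them is an fmul."""
    found = users(program, add_name)
    return bool(found) and all(op == 'fmul' for _, op, _ in found)

# The fadd feeds only an fmul: after fusing, the fadd is dead and
# two instructions (fadd, fmul) become one (ffma).
only_fmul = [
    ('t', 'fadd', (1.0, 'x')),
    ('r', 'fmul', ('t', 'y')),
]

# The fadd also feeds an fmax: it must stay alive, so fusing would turn
# fadd + fmul + fmax into fadd + ffma + fmax and save nothing.
shared = [
    ('t', 'fadd', (1.0, 'x')),
    ('r', 'fmul', ('t', 'y')),
    ('m', 'fmax', ('t', 'z')),
]

print(should_fuse(only_fmul, 't'), should_fuse(shared, 't'))
```

In the second program the instruction count is unchanged by fusing, which is the "not a clear win" case described above.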

shader-db results on AGX. Note that we don't run nir_lower_blend for shader-db (unlike fossil-db), so this may be underestimating the win.

   total instructions in shared programs: 1485324 -> 1482489 (-0.19%)
   instructions in affected programs: 238916 -> 236081 (-1.19%)

   total bytes in shared programs: 10182444 -> 10175610 (-0.07%)
   bytes in affected programs: 1918922 -> 1912088 (-0.36%)

   total halfregs in shared programs: 462616 -> 462317 (-0.06%)
   halfregs in affected programs: 3140 -> 2841 (-9.52%)

Results on Mali-G57, where an FMA.f32 is the same cost as a FADD.f32 or an FMUL.f32.

   total instructions in shared programs: 2686768 -> 2682860 (-0.15%)
   instructions in affected programs: 566082 -> 562174 (-0.69%)

   total cycles in shared programs: 140490.19 -> 140473.91 (-0.01%)
   cycles in affected programs: 2647.94 -> 2631.66 (-0.61%)

   total fma in shared programs: 22069.50 -> 22013.53 (-0.25%)
   fma in affected programs: 4134.59 -> 4078.62 (-1.35%)

   total cvt in shared programs: 14644.33 -> 14639.23 (-0.03%)
   cvt in affected programs: 1068.66 -> 1063.56 (-0.48%)

   total quadwords in shared programs: 1455320 -> 1453480 (-0.13%)
   quadwords in affected programs: 42704 -> 40864 (-4.31%)

   total threads in shared programs: 53538 -> 53532 (-0.01%)
   threads in affected programs: 12 -> 6 (-50.00%)

Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>

Edited by Alyssa Rosenzweig
