nir,radeonsi: move ffma fusing to late optimizations for better codegen
ffma increases register usage, as all ternary opcodes do: it needs 3 live registers when all of its sources are different, while fmul+fadd needs only 2 live registers in the best case. In exchange, it improves performance on hardware where ffma doubles ALU throughput.
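
To make the register counts concrete, here is a minimal sketch that counts peak live values for the two forms of a*b+c. It uses a hypothetical toy IR (`toy_instr`, `peak_live`, and the two example programs are illustrative names, not NIR or radeonsi code) and assumes straight-line SSA code where an instruction's result may reuse a source register that dies at that instruction.

```c
#include <assert.h>

/* Hypothetical toy IR, not NIR: each instruction defines one SSA value
 * `dst` and reads `nsrc` previously defined values. nsrc == 0 models a
 * load or any other definition without register sources. */
#define MAX_SRCS 3

struct toy_instr {
   int dst;
   int nsrc;
   int src[MAX_SRCS];
};

/* Peak number of simultaneously live values in a straight-line program.
 * A value counts as live after instruction `point` if it was defined at
 * or before `point` and is read by some later instruction. */
int peak_live(const struct toy_instr *prog, int n)
{
   int peak = 0;
   for (int point = 0; point < n; point++) {
      int live = 0;
      for (int d = 0; d <= point; d++) {
         int used_later = 0;
         for (int u = point + 1; u < n && !used_later; u++)
            for (int s = 0; s < prog[u].nsrc; s++)
               if (prog[u].src[s] == prog[d].dst)
                  used_later = 1;
         live += used_later;
      }
      if (live > peak)
         peak = live;
   }
   return peak;
}

/* Fused: a, b and c must all be live when ffma executes. */
const struct toy_instr fused_prog[] = {
   {0, 0, {0}},        /* a = load */
   {1, 0, {0}},        /* b = load */
   {2, 0, {0}},        /* c = load */
   {3, 3, {0, 1, 2}},  /* r = ffma a, b, c */
};

/* Unfused, best case: c is only produced after fmul has consumed a and b. */
const struct toy_instr unfused_prog[] = {
   {0, 0, {0}},        /* a = load */
   {1, 0, {0}},        /* b = load */
   {2, 2, {0, 1}},     /* t = fmul a, b */
   {3, 0, {0}},        /* c = load */
   {4, 2, {2, 3}},     /* r = fadd t, c */
};

void check_peaks(void)
{
   assert(peak_live(fused_prog, 4) == 3);
   assert(peak_live(unfused_prog, 5) == 2);
}
```

The unfused count of 2 depends on the best-case ordering: if c is already live before the fmul, both forms peak at 3 live values.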
Edited by Marek Olšák