Also add a radeonsi_force_use_fma32 option for it.
fma32 rounds only once, so it is accurate to 0.5 ULP. mad32 rounds twice (once after the multiply and once after the add), so it is accurate to 1 ULP. This accuracy difference sometimes makes the results differ in the last bit.
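
As a rough illustration of the last-bit difference, here is a minimal host-side sketch in Python. It does not run the GPU instructions: mad32() and fma32() below are hypothetical helpers that only emulate "round twice" and "round once" in float32, and the input values are chosen by hand so that the two paths disagree.

```python
import struct
from fractions import Fraction


def f32(x):
    """Round a Python float (binary64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mad32(a, b, c):
    # Emulated multiply-then-add: the product is rounded to float32,
    # then the sum is rounded to float32 again (two roundings).
    return f32(f32(a * b) + c)


def fma32(a, b, c):
    # Emulated fused multiply-add: a*b + c is formed exactly and
    # rounded to float32 at the end.
    # Assumption: float() adds an intermediate binary64 rounding, which
    # is harmless for the inputs below because the exact result is
    # representable in float32.
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return f32(float(exact))


# Hand-picked inputs: a*b = 1 + 2^-11 + 2^-24 lies exactly halfway
# between two float32 values, so rounding the product loses 2^-24.
a = b = f32(1.0 + 2.0 ** -12)
c = f32(2.0 ** -24)

r_mad = mad32(a, b, c)
r_fma = fma32(a, b, c)
ulp = 2.0 ** -23  # float32 spacing in [1, 2)

print("mad32:", r_mad.hex())
print("fma32:", r_fma.hex())
print("difference in ULP:", (r_fma - r_mad) / ulp)
```

With these inputs the exact result is representable in float32, and the emulated mad32 path ends up one ULP below it.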
Applications like META need the extra accuracy to display the correct result. Here is a comparison of the results from the two instructions.