@@ -710,7 +710,9 @@ Rather than having a dedicated unit for more complicated floating-point operatio

== Reciprocal (G71)

The reciprocal implementation uses an initial table lookup followed by a single step of https://en.wikipedia.org/wiki/Newton%27s_method#Multiplicative_inverses_of_numbers_and_power_series[Newton's method].

Given the input `a`, first, we do an argument reduction step, computing a reduced `a1` in the range [0.5, 1.0) and an initial approximation `x1` of `1/a1`. Computing `a1` is done with the `FRCP_FREXPM` instruction which is the same as the mantissa part of `frexp` in C. Except for special cases, it just sets the exponent to -1. `x1` is looked up in a table based on the top 17 bits of the mantissa of `a` (the mantissa of `a` is always the same as the mantissa of `a1`) using the `FRCP_TABLE` instruction. Next, we compute `x2 = x1 * (2 - a1 * x1)` and do the final exponent adjustment to compensate for the argument reduction. `x2` is actually computed as `x1 + x1 * (1 - a1 * x1)`, using two fused multiply-adds, to avoid rounding errors. The computation of `1 - a1 * x1` is particularly sensitive, since `a1 * x1` is very close to one by design. The second fused multiply-add and exponent adjustment are done in the same instruction, using a special `FMA_MSCALE` instruction which takes a fourth argument that is added to exponent of the result. There is an `FRCP_FREXPE` instruction which computes the exponent that `frexp` would have, and then negates it, to get the right bias for the final exponent adjustment. Finally, there is one final complication, which is that if `a` is infinity, then `a1` will be infinity and `x1` will be zero, and multiplying them will give NaN according to IEEE arithmetic, which will propagate to the final result instead of returning the appropriately-signed zero as expected. Also, internal details of the sign of `+/-NaN * +/-NaN` mean that the returned NaN will have the wrong sign when computing 1/NaN. To fix these issues with corner cases, ARM uses a special `rcp_mode` modifier which only exists on the aforementioned `FMA_MSCALE` instruction for the first FMA. The exponent bias is set to 0, so it is otherwise identical to a normal `FMA` instruction.

Given the input `a`, first, we do an argument reduction step, computing a reduced `a1` in the range [0.5, 1.0) and an initial approximation `x1` of `1/a1`. Computing `a1` is done with the `FRCP_FREXPM` instruction which is the same as the mantissa part of `frexp` in C. Except for special cases, it just sets the exponent to -1. `x1` is looked up in a table based on the top 17 bits of the mantissa of `a` (the mantissa of `a` is always the same as the mantissa of `a1`) using the `FRCP_TABLE` instruction. Next, we compute `x2 = x1 * (2 - a1 * x1)` and do the final exponent adjustment to compensate for the argument reduction. `x2` is actually computed as `x1 + x1 * (1 - a1 * x1)`, using two fused multiply-adds, to avoid rounding errors. The computation of `1 - a1 * x1` is particularly sensitive, since `a1 * x1` is very close to one by design. The second fused multiply-add and exponent adjustment are done in the same instruction, using a special `FMA_MSCALE` instruction which takes a fourth argument that is added to exponent of the result. There is an `FRCP_FREXPE` instruction which computes the exponent that `frexp` would have, and then negates it, to get the right bias for the final exponent adjustment.

There is one final complication, which is that if `a` is infinity, then `a1` will be infinity and `x1` will be zero, and multiplying them will give NaN according to IEEE arithmetic, which will propagate to the final result instead of returning the appropriately-signed zero as expected. Also, internal details of the sign of `+/-NaN * +/-NaN` mean that the returned NaN will have the wrong sign when computing 1/NaN. To fix these issues with corner cases, ARM uses a special `rcp_mode` modifier which only exists on the aforementioned `FMA_MSCALE` instruction for the first FMA. The exponent bias is set to 0, so it is otherwise identical to a normal `FMA` instruction.

All together, `x=1/a` is computed as follows:

...

...

@@ -729,11 +731,11 @@ On every Bifrost core released since G71, there is a new `FRCP_FAST` instruction

== Reciprocal square root (G71)

Similarly to reciprocal, we use a lookup table followed by one iteration of Newton's method. The computation of `a1` as well as the required fixup is computed by the `FRSQ_FREXPM` and `FRSQ_FREXPE` instructions, which are analogous to `FRCP_FREXPM` and `FRCP_FREXPE`, except that because we're taking the square root, only multiplying by four produces a power-of-two change in the output, so the input is only scaled to [0.25, 1) instead of [0.5, 1.0), and the exponent is divided by two when computing the fixup in `FRSQ_FREXPE`. The Newton's method fixup step is mathematically given by `x2 = .5 * x1 * (3 - a1 * x1 * x1)`. This is rearranged to give `x2 = x1 + x1 * 0.5 * (1 - a1 * x1 * x1)`, which is computed by squaring `x1` and then computing two fused multiply-adds. The multiplication by 0.5 is accomplished by passing -1 to the scale parameter of `FMA_MSCALE` for the first fused multiply-add. Both fused multiply-adds use the `rcp_mode` bit (not sure why the second one does?).

Similarly to reciprocal, we use a lookup table followed by one iteration of Newton's method. The computation of `a1` as well as the required fixup is computed by the `FSQRT_FREXPM` (shared with square root) and `FRSQ_FREXPE` instructions, which are analogous to `FRCP_FREXPM` and `FRCP_FREXPE`, except that because we're taking the square root, only multiplying by four produces a power-of-two change in the output, so the input is only scaled to [0.25, 1) instead of [0.5, 1.0), and the exponent is divided by two when computing the fixup in `FRSQ_FREXPE`. The Newton's method fixup step is mathematically given by `x2 = .5 * x1 * (3 - a1 * x1 * x1)`. This is rearranged to give `x2 = x1 + x1 * 0.5 * (1 - a1 * x1 * x1)`, which is computed by squaring `x1` and then computing two fused multiply-adds. The multiplication by 0.5 is accomplished by passing -1 to the scale parameter of `FMA_MSCALE` for the first fused multiply-add. Both fused multiply-adds use the `rcp_mode` bit (not sure why the second one does?).

[source]

----

a1 = FRSQ_FREXPM a

a1 = FSQRT_FREXPM a

x1 = FRSQ_TABLE a

t1 = FMA x1, x1, -0

t2 = FMA_MSCALE.rcp_mode a1, -t1, 1.0, -1

...

...

@@ -744,3 +746,32 @@ x = FMA_MSCALE.rcp_mode t2, x1, x1, e'

== Reciprocal square root (later versions)

Similar to `FRCP_FAST`, there is now a `FRSQ_FAST` instruction added on models after G71 which has the same restriction (it also takes 2 cycles).

== Square root (G71)

For the argument reduction, square root uses the same `FSQRT_FREXPM` instruction, but since the fixup exponent is the opposite sign of reciprocal square root, there is a `FSQRT_FREXPE` instruction that does the right thing for square roots. The initial estimate `x1` is computed by using `FRSQ_TABLE` to make an initial estimate `r1` of the reciprocal square root, and then computing `x1 = a1 * r1`. There is a subtlety here, since if `a` is either 0 or infinite, then `r1` will be the opposite, and we always want the result to be equal to `a` in that case. This is accomplished using a special `sqrt_mode` modifier on `FMA_MSCALE`, similar to `rcp_mode`.

Next, there is a Newton-Raphson step as before. The normal Newton-Raphson step for square root is `x2 = .5 * (x1 + a / x1)`. First, we rearrange this to `x2 = x1 + .5 * (a / x1 - x1)`, and then pull out a factor of `1/x1` from the error term to get `x2 = x1 + .5 / x1 * (a - x1 * x1)`. The only thing that can't be immediately computed is `1 / x1`, but `r1` is already a good enough approximation of that. So, the formula used is `x2 = x1 + r1 * (a - x1 * x1) * .5`. This doesn't correctly return -0 given -0, however, so the final formula is `x2 = x1 - abs(r1) * (x1 * x1 - a) * .5`. The `* .5` is folded into the first FMA, and the final exponent correction is folded into the second FMA, as usual. The second FMA must use `rcp_mode` to prevent another 0 times infinity problem, although this time, since `x1` is being added, it's ok to return 0 always.

[source]

----

a1 = FSQRT_FREXPM input

r1 = FRSQ_TABLE input

x1 = FMA_MSCALE.sqrt_mode a1, r1, -0, 0

e = FSQRT_FREXPE input

t2 = FMA_MSCALE x1, x1, -a1, -1

output = FMA_MSCALE.rcp_mode abs(r1), -t2, x1, e

----

== Square root (later)

There is no `FSQRT_FAST` instruction. Instead, we compute the reciprocal square root using the `FRSQ_FAST` instruction, and then multiply by the input to get the answer. We similarly have to be careful about 0, so we use the same `sqrt_mode` modifier. Apparently ARM also decomposes the number before handing it to `FRSQ_FAST` and applies the exponent fixup, although I'm not sure why this is necessary. The result is only 1 less instruction than before.