@@ -707,7 +707,7 @@ If the coordinates computed would be the same as the previous interpolation inst

Rather than having a dedicated unit for more complicated floating-point operations, Bifrost reuses the fixed-latency general-purpose FMA/ADD pipeline, meaning it uses regular fixed-latency instructions. Since most of these operations have a latency much longer than most FMA and ADD instructions, they have to be broken up into a number of simpler steps. This section explains each of the algorithms and special instructions used to implement each complex builtin function in GLSL.

== Reciprocal

== Reciprocal (G71)

The reciprocal implementation uses an initial table lookup followed by a single step of https://en.wikipedia.org/wiki/Newton%27s_method#Multiplicative_inverses_of_numbers_and_power_series[Newton's method].

Given the input `a`, first, we do an argument reduction step, computing a reduced `a1` in the range [0.5, 1.0) and an initial approximation `x1` of `1/a1`. Computing `a1` is done with the `FRCP_FREXPM` instruction which is the same as the mantissa part of `frexp` in C. Except for special cases, it just sets the exponent to -1. `x1` is looked up in a table based on the top 17 bits of the mantissa of `a` (the mantissa of `a` is always the same as the mantissa of `a1`) using the `FRCP_TABLE` instruction. Next, we compute `x2 = x1 * (2 - a1 * x1)` and do the final exponent adjustment to compensate for the argument reduction. `x2` is actually computed as `x1 + x1 * (1 - a1 * x1)`, using two fused multiply-adds, to avoid rounding errors. The computation of `1 - a1 * x1` is particularly sensitive, since `a1 * x1` is very close to one by design. The second fused multiply-add and exponent adjustment are done in the same instruction, using a special `FMA_MSCALE` instruction which takes a fourth argument that is added to exponent of the result. There is an `FRCP_FREXPE` instruction which computes the exponent that `frexp` would have, and then negates it, to get the right bias for the final exponent adjustment. Finally, there is one final complication, which is that if `a` is infinity, then `a1` will be infinity and `x1` will be zero, and multiplying them will give NaN according to IEEE arithmetic, which will propagate to the final result instead of returning the appropriately-signed zero as expected. Also, internal details of the sign of `+/-NaN * +/-NaN` mean that the returned NaN will have the wrong sign when computing 1/NaN. To fix these issues with corner cases, ARM uses a special `rcp_mode` modifier which only exists on the aforementioned `FMA_MSCALE` instruction for the first FMA. The exponent bias is set to 0, so it is otherwise identical to a normal `FMA` instruction.

...

...

@@ -716,9 +716,31 @@ All together, `x=1/a` is computed as follows:

[source]

----

a1 = rcp_frexpm(a)

x1 = rcp_table(a)

t1 = fma_rcp_mode(a1, -x1, 1.0)

e' = rcp_frexpe(a)

x = rcp_mscale(t1, x1, x1, e')

----

\ No newline at end of file

a1 = FRCP_FREXPM a

x1 = FRCP_TABLE a

t1 = FMA_MSCALE.rcp_mode a1, -x1, 1.0, 0

e' = FRCP_FREXPE a

x = FMA_MSCALE t1, x1, x1, e'

----

== Reciprocal (later versions)

On every Bifrost core released since G71, there is a new `FRCP_FAST` instruction that does all of the above in one instruction. Since there are two dependent fused multiply-adds, this can't feasibly be done in one cycle. Instead, even though `RCP_FAST` is encoded as an ADD instruction, it actually uses both FMA and ADD units, and the corresponding FMA instruction before it must be a nop.

== Reciprocal square root (G71)

Similarly to reciprocal, we use a lookup table followed by one iteration of Newton's method. The computation of `a1` as well as the required fixup is computed by the `FRSQ_FREXPM` and `FRSQ_FREXPE` instructions, which are analogous to `FRCP_FREXPM` and `FRCP_FREXPE`, except that because we're taking the square root, only multiplying by four produces a power-of-two change in the output, so the input is only scaled to [0.25, 1) instead of [0.5, 1.0), and the exponent is divided by two when computing the fixup in `FRSQ_FREXPE`. The Newton's method fixup step is mathematically given by `x2 = .5 * x1 * (3 - a1 * x1 * x1)`. This is rearranged to give `x2 = x1 + x1 * 0.5 * (1 - a1 * x1 * x1)`, which is computed by squaring `x1` and then computing two fused multiply-adds. The multiplication by 0.5 is accomplished by passing -1 to the scale parameter of `FMA_MSCALE` for the first fused multiply-add. Both fused multiply-adds use the `rcp_mode` bit (not sure why the second one does?).

[source]

----

a1 = FRSQ_FREXPM a

x1 = FRSQ_TABLE a

t1 = FMA x1, x1, -0

t2 = FMA_MSCALE.rcp_mode a1, -t1, 1.0, -1

e' = FRSQ_FREXPE a

x = FMA_MSCALE.rcp_mode t2, x1, x1, e'

----

== Reciprocal square root (later versions)

Similar to `FRCP_FAST`, there is now a `FRSQ_FAST` instruction added on models after G71 which has the same restriction (it also takes 2 cycles).