On every Bifrost core released since G71, there is a new `FRCP_FAST` instruction that does all of the above in one instruction. Since there are two dependent fused multiply-adds, this can't feasibly be done in one cycle. Instead, even though `RCP_FAST` is encoded as an ADD instruction, it actually uses both FMA and ADD units, and the corresponding FMA instruction before it must be a nop.

On every Bifrost core released after G71, there is a new `FRCP_FAST` instruction that does all of the above in one instruction. Since there are two dependent fused multiply-adds, this can't feasibly be done in one cycle. Instead, even though `RCP_FAST` is encoded as an ADD instruction, it actually uses both FMA and ADD units, and the corresponding FMA instruction before it must be a nop.

== Reciprocal square root (G71)

...

...

@@ -775,3 +775,44 @@ r1 = FRSQ_FAST a1

output = FMA_MSCALE.sqrt_mode a1, r1, -0, e

----

== Natural & Base-2 Log (G71)

As described in the patent, the argument reduction step for log depends on the fact that `log(a * b) = log(a) + log(b)`. First, we reduce the input `a` to the range [0.75, 1.5) and calculate the required fixup exponent using the `FLOG_FREPXE` instruction. This fixup exponent is converted to a floating-point number, and is then added to the result later (after multiplying by ln(2) for base-e log), using the fact that `log(m * 2^e) = e + log(m)`. We lookup an very coarse-grained reciprocal estimate `r1` to the reduced input `a1` in a table using the `FRCP_APPROX` instruction, and we also lookup `-log(r1)` using another table that stores for each input `x`, `log(FRCP_APPROX(x))`. Both these instructions take `a`, and do the reduction to [0.75, 1.5) implicitly. Since `log(a1) = log(a1 * r1) - log(r1)`, all we need to do to compute `log(a1)` is compute `log(a1 * r1)`. Since `a1 * r1` is close to 1 by design, we can do this with a polynomial approximation of `log(y + 1)` with only a few terms. The approximation chosen for base _e_ is `log(y + 1) = y * (1.0 + y * (a + y * b))` where the precise constants used are `b = 0x3eab3200 (0.33436)` and `a = 0xbf0003f0 (-0.500060)`, close to the Taylor series terms 1/3 and -1/2 but not exactly the same. The base 2 logarithm computes `y' = y * c` where `c` is a single-precision approximation to `1/ln(2)` and then `log2(y + 1) = y' + y * (c + y' * (y * a + b))` where `d = 0x32a57060 (1.29249 * 2^{-26})`. `d` holds the error due to approximating `1/ln(2)` as a single-precision floating point number, so that `c + d`, when evaluated at infinite precision, is a much better approximation to `1/ln(2)` than `c` by itself.

base 2 logarithm:

[source]

----

a1 = LOG_FREXPM a

r1 = FRCP_APPROX a

xt = FLOG2_TABLE a

e = FLOG_FREXPE a

e = I32_TO_F32 e

x1 = ADD.f32 T1, e, xt

y = FMA.f32 a1, r1, -1.0

y' = FMA.f32 y, 0x3fb8aa3b /* c */, -0

t1 = FMA.f32 y, 0x3eab3200 /* b */, 0xbf0003f0 /* a */

t2 = FMA.f32 y', t1, 0x32a57060 /* d */

x2 = FMA.f32 T0, y, t2, y'

output = ADD.f32 x1, x2

----

base e logarithm:

[source]

----

a1 = LOG_FREXPM a

r1 = FRCP_APPROX a

xt = FLOGE_TABLE a

e = FLOG_FREXPE a

e = I32_TO_F32 e

x1 = FMA.f32 T0, e, 0x3f317218 /* ln(2) */, xt

y = FMA.f32 a1, r1, -1.0

t1 = FMA.f32 y, 0x3eab3200 /* b */, 0xbf0003f0 /* a */