Commit 31e0febe authored by Connor Abbott's avatar Connor Abbott
Browse files

bifrost: add FRCP_FAST and reciprocal square root

parent 3fc6c781
......@@ -707,7 +707,7 @@ If the coordinates computed would be the same as the previous interpolation inst
Rather than having a dedicated unit for more complicated floating-point operations, Bifrost reuses the fixed-latency general-purpose FMA/ADD pipeline, meaning it uses regular fixed-latency instructions. Since most of these operations have a latency much longer than most FMA and ADD instructions, they have to be broken up into a number of simpler steps. This section explains each of the algorithms and special instructions used to implement each complex builtin function in GLSL.
== Reciprocal
== Reciprocal (G71)
The reciprocal implementation uses an initial table lookup followed by a single step of[Newton's method].
Given the input `a`, first, we do an argument reduction step, computing a reduced `a1` in the range [0.5, 1.0) and an initial approximation `x1` of `1/a1`. Computing `a1` is done with the `FRCP_FREXPM` instruction which is the same as the mantissa part of `frexp` in C. Except for special cases, it just sets the exponent to -1. `x1` is looked up in a table based on the top 17 bits of the mantissa of `a` (the mantissa of `a` is always the same as the mantissa of `a1`) using the `FRCP_TABLE` instruction. Next, we compute `x2 = x1 * (2 - a1 * x1)` and do the final exponent adjustment to compensate for the argument reduction. `x2` is actually computed as `x1 + x1 * (1 - a1 * x1)`, using two fused multiply-adds, to avoid rounding errors. The computation of `1 - a1 * x1` is particularly sensitive, since `a1 * x1` is very close to one by design. The second fused multiply-add and exponent adjustment are done in the same instruction, using a special `FMA_MSCALE` instruction which takes a fourth argument that is added to exponent of the result. There is an `FRCP_FREXPE` instruction which computes the exponent that `frexp` would have, and then negates it, to get the right bias for the final exponent adjustment. Finally, there is one final complication, which is that if `a` is infinity, then `a1` will be infinity and `x1` will be zero, and multiplying them will give NaN according to IEEE arithmetic, which will propagate to the final result instead of returning the appropriately-signed zero as expected. Also, internal details of the sign of `+/-NaN * +/-NaN` mean that the returned NaN will have the wrong sign when computing 1/NaN. To fix these issues with corner cases, ARM uses a special `rcp_mode` modifier which only exists on the aforementioned `FMA_MSCALE` instruction for the first FMA. The exponent bias is set to 0, so it is otherwise identical to a normal `FMA` instruction.
......@@ -716,9 +716,31 @@ All together, `x=1/a` is computed as follows:
a1 = rcp_frexpm(a)
x1 = rcp_table(a)
t1 = fma_rcp_mode(a1, -x1, 1.0)
e' = rcp_frexpe(a)
x = rcp_mscale(t1, x1, x1, e')
\ No newline at end of file
t1 = FMA_MSCALE.rcp_mode a1, -x1, 1.0, 0
x = FMA_MSCALE t1, x1, x1, e'
== Reciprocal (later versions)
On every Bifrost core released since G71, there is a new `FRCP_FAST` instruction that does all of the above in one instruction. Since there are two dependent fused multiply-adds, this can't feasibly be done in one cycle. Instead, even though `RCP_FAST` is encoded as an ADD instruction, it actually uses both FMA and ADD units, and the corresponding FMA instruction before it must be a nop.
== Reciprocal square root (G71)
Similarly to reciprocal, we use a lookup table followed by one iteration of Newton's method. The computation of `a1` as well as the required fixup is computed by the `FRSQ_FREXPM` and `FRSQ_FREXPE` instructions, which are analogous to `FRCP_FREXPM` and `FRCP_FREXPE`, except that because we're taking the square root, only multiplying by four produces a power-of-two change in the output, so the input is only scaled to [0.25, 1) instead of [0.5, 1.0), and the exponent is divided by two when computing the fixup in `FRSQ_FREXPE`. The Newton's method fixup step is mathematically given by `x2 = .5 * x1 * (3 - a1 * x1 * x1)`. This is rearranged to give `x2 = x1 + x1 * 0.5 * (1 - a1 * x1 * x1)`, which is computed by squaring `x1` and then computing two fused multiply-adds. The multiplication by 0.5 is accomplished by passing -1 to the scale parameter of `FMA_MSCALE` for the first fused multiply-add. Both fused multiply-adds use the `rcp_mode` bit (not sure why the second one does?).
t1 = FMA x1, x1, -0
t2 = FMA_MSCALE.rcp_mode a1, -t1, 1.0, -1
x = FMA_MSCALE.rcp_mode t2, x1, x1, e'
== Reciprocal square root (later versions)
Similar to `FRCP_FAST`, there is now a `FRSQ_FAST` instruction added on models after G71 which has the same restriction (it also takes 2 cycles).
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment