asahi: Fuse cmpsel instructions
AGX has two "compare-and-select" instructions, icmpsel
(for integers) and fcmpsel
(for floating-point). These take four sources (a, b, c, d
in the following example syntax) and specify a comparison (lt
for "less than" in the following examples). The syntax looks like
fcmpsel.lt dest, a, b, c, d
which evaluates as
dest = (a < b) ? c : d
(with floating-point comparison. For integer comparison, use icmpsel
).
Instruction selection should take advantage of this hardware instruction.
NIR does not have a 4 source conditional select. Instead, it has 2 source conditional instructions (flt
, ilt
for float/integer less-than in this example), which produce a boolean. It also has a 3 source select instruction (bcsel
) that takes in a boolean in its first source, and selects between the values in the second and third sources. So the NIR sequence
flt bool, a, b
bcsel dest, bool, c, d
is equivalent to the single AGX instruction
fcmpsel dest, a, b, c, d
Right now, our instruction selection will translate both NIR comparisons (flt, feq, fne..) and bcsel into cmpsel instructions. Booleans are represented as 16-bit integers, 0 for false and 1 for true. So that NIR sequence gets translated to the IR:
fcmpsel bool, a, b, 0, 1
icmpsel.ne dest, bool, 0, c, d
While this is functional, it can be optimized: the cmpsel instructions can be fused together, using the following rules:
icmpsel.ne(fcmpsel.*(a, b, 0, 1), 0, c, d) -> fcmpsel.*(a, b, c, d))
icmpsel.ne(icmpsel.*(a, b, 0, 1), 0, c, d) -> icmpsel.*(a, b, c, d))
These rules can be implemented in the AGX optimizer, reducing instruction count.
However, there is a challenge here. Suppose the two cmpsel instructions are far apart in the program. Which "direction" do we fuse the instruction?
We could try to fuse the second instruction into the first. This approach would replace the destination of the first with the destination of the second, using a backwards walk of the program, the way we fuse "saturates"/clamps into instructions. While this will not increase register pressure, it is only valid if c
and d
are already live at the time of the first cmpsel
instruction. So this approach is good when c
and d
are constants or uniforms (and therefore always live), but not good when c
and d
come from other sources.
The other approach is to fuse the first instruction into the second. This approach would do a forward walk of the program, in source order. When it finds a icmpsel.ne
instruction where the b
source is zero, it looks up the instruction that wrote a
earlier in the program. If that instruction selects between 0
and 1
, it knows it's dealing with a comparison that it can fuse, so it will replace the a
and b
of the icmpsel.ne
with the a
and b
of the first comparison, and it will replace the opcode (icmpsel/fcmpsel) and comparison (lt in our example) of the secondary with the first. This approach is always correct, but it can increase register pressure.
How can this increase pressure? Suppose that a
and b
are only used by this cmpsel, they are not used anywhere else. In other words, the sources of fcmpsel
are killed by the fcmpsel
. Then in the unfused code, the register demand after fcmpsel
decreases by one, because two sources are killed and only one destination (the boolean) is generated. So for any instruction in between the two cmpsels, the number of simultaneously live values (the register pressure) is smaller for the unfused code than for the fused code, where a
and b
must remain live all the way until the second cmpsel instruction. Increasing register pressure can hurt performance, so that means fusing cmpsel might hurt performance in certain shaders. What are some possible solutions?
- Ignore the pressure issues and always fuse greedily anyways.
- Fuse when
c
andd
are constants or uniforms, using the backwards pass. - Fuse when
a
orb
is a constant or uniform, using the forwards pass. In this case, register pressure is nominally unaffected, at least if the register source is 16-bit like the 16-bit boolean. - Fuse when the only users of the boolean are bcsel instructions, where all users get fused, ensuring that the boolean itself disappears (instead of having
a
,b
, andbool
are live simultaneously -- which is even worse!)
Each of these solutions are heuristics. They try to balance instruction count and register pressure in a simple way. None of them are optimal but they each have their benefits. In this task, you should...
-
Add unit tests for one or more of these strategies ( mesa/src/asahi/compiler/test/test-optimizer.cpp
) and implement them into the optimizer (mesa/src/asahi/compiler/agx_optimizer.c
). -
Compare the results with shader-db. Decide which strategies should be used in the production driver. -
Make a merge request to Mesa with your selected strategies and unit tests, with shader-db results in the commit message justifying your choice.