aco: insert less s_delay_alu
If the SIMD frontend already waits, we don't need to insert a delay to avoid stalling the ALUs. One common case where this helps is VALU -> SALU dependencies. The s_delay_alu is just unnecessary code size in this case.
The hardware also has a fast path for comparisons followed by v_cndmask, where there's effectively zero latency. Inserting an s_delay_alu breaks this fast path by forcing the v_cndmask to wait until the comparison completes.
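The two cases above can be summarized in a small sketch. This is not the actual ACO code: the `Kind` enum and `needs_delay_alu` are hypothetical names made up for illustration, `Compare` is assumed to be a VALU comparison, and treating every other dependency as needing a delay is a simplification.

```cpp
// Hypothetical, simplified instruction classes; these are not ACO's real types.
enum class Kind { SALU, VALU, Compare, VCndmask };

// Returns true if this sketch would insert an s_delay_alu before `consumer`,
// given that `consumer` depends on the result of `producer`.
bool needs_delay_alu(Kind producer, Kind consumer)
{
   // Assumption: comparisons and v_cndmask are treated as VALU instructions here.
   const bool producer_is_valu = producer != Kind::SALU;

   // VALU -> SALU: the SIMD frontend already waits, so a delay is only code size.
   if (producer_is_valu && consumer == Kind::SALU)
      return false;

   // Comparison -> v_cndmask: a delay would break the hardware fast path.
   if (producer == Kind::Compare && consumer == Kind::VCndmask)
      return false;

   // Simplification: every other dependency still gets a delay in this sketch.
   return true;
}
```

For example, `needs_delay_alu(Kind::Compare, Kind::VCndmask)` returns false, while a plain VALU -> VALU dependency still returns true.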
I validated most of this information with synthetic microbenchmarks (https://gitlab.freedesktop.org/DadSchoorse/bvhre/-/tree/forwarding).
RGP also shows these effects:
- SALU<->VALU latency is (partially) hidden with and without s_delay_alu
- v_cndmask_b32 has lower latency without s_delay_alu
Foz-DB Navi31:
Totals from 47215 (59.61% of 79206) affected shaders:
Instrs: 35363360 -> 35062463 (-0.85%); split: -0.85%, +0.00%
CodeSize: 186342228 -> 185073248 (-0.68%); split: -0.68%, +0.00%
Latency: 261725582 -> 261692233 (-0.01%); split: -0.02%, +0.00%
InvThroughput: 42382641 -> 42377295 (-0.01%); split: -0.01%, +0.00%