aco/insert_exec: re-use exec temporary more often instead of rematerializing it

Georg Lehmann requested to merge DadSchoorse/mesa:aco-less-saveexec into main

Based on: !31560 (merged)

This MR reworks part of the v_cmpx post-RA optimization which optimizes this:

s_mov_b64 exec, s[a:b]
.... // merge block content
v_cmp_lt_f32 vcc, v1, v2
s_and_saveexec_b64 s[c:d], vcc, exec
s_cbranch_execz invert

To this:

s_mov_b64 exec, s[a:b]
.... // merge block content
s_mov_b64 s[c:d], exec
v_cmpx_lt_f32 v1, v2
s_cbranch_execz invert

It also removes s_mov_b64 s[c:d], exec if a == c. But that part relies on RA internals, and with round robin it happens a lot less (at least in wave32; round robin is currently only used for single dword temps). Instead, we can avoid inserting s_and_saveexec pre-RA by using s_and and reusing the old temporary (which will later be allocated to s[a:b]) for the exec backup. This is also more effective, even with the old RA algorithm. The only cost is a slight increase in SGPR pressure (and thus a little bit of spilling). Spilling doesn't seem to be a big issue in practice though, so I haven't bothered with a heuristic (e.g. block length based) to avoid it.
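To illustrate why reusing the old temporary is safe, here is a toy Python model of the exec mask. It is not ACO code. The function names, the 64-lane integer mask representation, and the assumption that the merge block leaves exec unchanged are all illustrative choices made for this sketch.

```python
WAVE_SIZE = 64  # wave64 lane count, matching the s_*_b64 listings above


def v_cmp_lt(exec_mask, v1, v2):
    """Model of a per-lane v1 < v2 compare; inactive lanes yield 0."""
    result = 0
    for lane in range(WAVE_SIZE):
        if (exec_mask >> lane) & 1 and v1[lane] < v2[lane]:
            result |= 1 << lane
    return result


def saveexec_sequence(s_ab, v1, v2):
    """v_cmp + s_and_saveexec: needs a fresh SGPR pair s[c:d] for the backup."""
    exec_mask = s_ab                      # s_mov_b64 exec, s[a:b]
    # (merge block content, assumed here not to modify exec)
    vcc = v_cmp_lt(exec_mask, v1, v2)     # v_cmp_lt_f32 vcc, v1, v2
    s_cd = exec_mask                      # s_and_saveexec_b64 s[c:d], vcc, exec
    exec_mask = vcc & exec_mask
    return exec_mask, s_cd


def reuse_sequence(s_ab, v1, v2):
    """s_and on exec (v_cmpx after the post-RA pass); s[a:b] is the backup."""
    exec_mask = s_ab                      # s_mov_b64 exec, s[a:b]
    # (merge block content, assumed here not to modify exec)
    exec_mask = v_cmp_lt(exec_mask, v1, v2)  # v_cmpx_lt_f32 v1, v2
    return exec_mask, s_ab                # no copy: the old temporary stays live
```

Both functions return the same new exec mask and the same backup value. The second one just keeps s[a:b] live for longer, which is where the extra SGPR pressure comes from.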

This work also benefits loops with divergent breaks, because there we always inserted a real copy from exec, and there was no post-RA optimization to clean this up.

Edited by Georg Lehmann
