aco/insert_exec: re-use exec temporary more often instead of rematerializing it
Based on: !31560 (merged)
This MR reworks part of the v_cmpx post-RA optimization which optimizes this:
s_mov_b64 exec, s[a:b]
.... // merge block content
v_cmp_lt_f32 vcc, v1, v2
s_and_saveexec_b64 s[c:d], vcc, exec
s_cbranch_execz invert
To this:
s_mov_b64 exec, s[a:b]
.... // merge block content
s_mov_b64 s[c:d], exec
v_cmpx_lt_f32 v1, v2
s_cbranch_execz invert
It also removes s_mov_b64 s[c:d], exec if a == c. But that part relies on RA internals, and with round robin it happens a lot less (at least in wave32; round robin is currently only used for single-dword temps). Instead, we can avoid inserting s_and_saveexec pre-RA by using s_and and reusing the old temporary (which will later be allocated to s[a:b]) for the exec backup. This is also more effective, even with the old RA algorithm. The only cost is a slight increase in SGPR pressure (and thus a little bit of spilling). Spilling doesn't seem to be a big issue in practice though, so I haven't bothered with a heuristic (e.g. block-length based) to avoid it.
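To illustrate why reusing the old temporary is safe, here is a toy Python model of the exec mask. It is not ACO code: the function names and the wave32 integer masks are made up for illustration. It assumes s[a:b] is not overwritten in the merge block, which is exactly why the temporary has to stay live and SGPR pressure goes up slightly.

```python
# Toy model: an exec mask is a plain int with one bit per lane.
# Hypothetical helpers for illustration only, not part of ACO.

def old_sequence(saved_ab, vcc):
    """s_mov exec, s[a:b]; then s_and_saveexec s[c:d], vcc, exec."""
    exec_mask = saved_ab          # s_mov: restore exec from s[a:b]
    saved_cd = exec_mask          # s_and_saveexec: back up exec into s[c:d] ...
    exec_mask = vcc & exec_mask   # ... and narrow exec to the lanes set in vcc
    return exec_mask, saved_cd

def new_sequence(saved_ab, vcc):
    """s_mov exec, s[a:b]; then a plain s_and, with s[a:b] kept as the backup."""
    exec_mask = saved_ab          # s_mov: restore exec from s[a:b]
    exec_mask = vcc & exec_mask   # s_and: narrow exec, no separate backup copy
    return exec_mask, saved_ab    # s[a:b] still holds the pre-branch exec

if __name__ == "__main__":
    saved_ab = 0x0000FFFF         # lanes active before the branch
    vcc = 0x00FF00FF              # lanes where the comparison is true
    print(old_sequence(saved_ab, vcc) == new_sequence(saved_ab, vcc))
```

Both sequences leave the same exec mask and the same backup value. The only difference is that the new one needs no extra copy from exec, whichever registers RA ends up picking.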
This work also benefits loops with divergent breaks, because there we always inserted a real copy from exec, and there was no post-RA optimization to clean this up.