ir3: optimize subgroup operations using brcst.active
Follow the blob and optimize subgroup operations using brcst.active and getlast when supported.
The transformation consists of two parts. First, a NIR transform replaces subgroup operations with a sequence of new brcst_active_ir3 intrinsics followed by a new [type]_clusters_ir3 intrinsic (where type can be reduce, inclusive_scan, or exclusive_scan).
Second, the brcst_active_ir3 intrinsic is lowered directly to a brcst.active instruction, and the other intrinsics are lowered to a new macro (OPC_SCAN_CLUSTERS_MACRO), which is later emitted as a loop (using getlast/getone) that iterates over all clusters and produces the requested scan result.
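To make the cluster loop easier to picture, here is a minimal scalar model in C. It is only a sketch under assumptions, not the code the macro expands to: the function name, the cluster width of 8, the use of iadd as the operation, and the forward iteration order are all hypothetical, and the per-fiber input is assumed to already hold a cluster-local inclusive scan (the role the brcst_active_ir3 sequence plays above). The real loop selects clusters with getlast/getone rather than indexing an array.

```c
#include <stddef.h>
#include <stdint.h>

#define CLUSTER_SIZE 8 /* assumed cluster width for this model */

/* Hypothetical CPU model of the cluster loop for an iadd scan.
 * local_incl[i] holds the inclusive scan of fiber i within its own cluster.
 * exclusive may be NULL, mirroring the optional exclusive destination. */
static void
scan_clusters_model(const uint32_t *local_incl, size_t n, uint32_t *reduce,
                    uint32_t *inclusive, uint32_t *exclusive)
{
   uint32_t acc = 0; /* identity for iadd: total of all previous clusters */

   for (size_t base = 0; base < n; base += CLUSTER_SIZE) {
      size_t end = base + CLUSTER_SIZE < n ? base + CLUSTER_SIZE : n;

      for (size_t i = base; i < end; i++) {
         /* The inclusive result falls out of the loop naturally. */
         inclusive[i] = acc + local_incl[i];

         /* The exclusive result needs an extra, explicit computation. */
         if (exclusive)
            exclusive[i] = (i == base) ? acc : acc + local_incl[i - 1];
      }

      /* Carry this cluster's total into the next iteration. */
      acc += local_incl[end - 1];
   }

   /* After the last cluster, acc is the full reduction for every fiber. */
   for (size_t i = 0; i < n; i++)
      reduce[i] = acc;
}
```

In this model the exclusive result costs extra work per fiber, which is skipped entirely when no exclusive destination is passed.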
OPC_SCAN_CLUSTERS_MACRO has a number of optional arguments. First, since the exclusive scan result is not a natural by-product of the loop but has to be calculated explicitly, its destination is optional. This is necessary because adding it unconditionally would produce unused instructions that can no longer be DCE'd at this point. Second, when performing 32b MUL_U reductions (which expand to multiple instructions), an extra scratch register is necessary.
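As an illustration of why a multi-instruction multiply wants a scratch register, the following C sketch builds a 32-bit product from 16-bit partial products. The decomposition and the function name are assumptions made for illustration; the exact instruction sequence ir3 emits for a 32b MUL_U is not spelled out here.

```c
#include <stdint.h>

/* Hypothetical model: 32-bit multiply from 16-bit partial products.
 * The intermediate value must live somewhere between the steps, which is
 * the role a scratch register would play. */
static uint32_t
mul32_via_16bit_parts(uint32_t a, uint32_t b)
{
   uint32_t a_lo = a & 0xffff, a_hi = a >> 16;
   uint32_t b_lo = b & 0xffff, b_hi = b >> 16;

   uint32_t scratch = a_lo * b_lo;         /* step 1: low x low */
   scratch += (a_hi * b_lo) << 16;         /* step 2: add first cross term */
   return scratch + ((a_lo * b_hi) << 16); /* step 3: add second cross term */
}
```

The a_hi * b_hi term is dropped because it only affects bits above the low 32.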
Note on brcst.active classification for scheduling: it is currently classified as a texture instruction, which seems to be fine for scheduling purposes but feels wrong. I think it should be enough to classify it as a sy-producer, which might prevent some unnecessary ss-bits on consumers of brcst.active that we currently emit but the blob doesn't. This change seems quite involved, though, since many places seem to rely on is_tex.