ir3: optimize subgroup operations using brcst.active

Job Noorman requested to merge jnoorman/mesa:ir3-brcst into main

Follow the blob and optimize subgroup operations using brcst.active and getlast where supported.

The transformation consists of two parts. First, a NIR transform replaces subgroup operations with a sequence of new brcst_active_ir3 intrinsics followed by a new [type]_clusters_ir3 intrinsic (where type can be reduce, inclusive_scan, or exclusive_scan).

The brcst_active_ir3 intrinsic is lowered directly to a brcst.active instruction. The other intrinsics are lowered to a new macro (OPC_SCAN_CLUSTERS_MACRO), which is later emitted as a loop (using getlast/getone) that iterates over all clusters and produces the requested scan result.

OPC_SCAN_CLUSTERS_MACRO has a number of optional arguments. First, the destination for the exclusive scan result is optional, because that result is not a natural by-product of the loop and has to be calculated explicitly. Adding it unconditionally would produce unused instructions that can no longer be DCE'd at this point. Second, 32b MUL_U reductions (which expand to multiple instructions) need an extra scratch register.
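To make the shape of the cluster loop and the optional exclusive destination concrete, here is a rough scalar model in plain C. It is only a sketch under stated assumptions, not the code in this MR: the names cluster_local_scan and scan_clusters_model, the cluster width of 8, the use of addition as the operation, and the assumption that every fiber is active are all made up for illustration. It does not model the actual semantics of brcst.active, getlast, or getone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cluster width, chosen only for this illustration. */
#define CLUSTER_SIZE 8

/* Hypothetical stand-in for the per-cluster step: an inclusive add-scan
 * that restarts at every cluster boundary. */
static void
cluster_local_scan(const uint32_t *src, size_t n, uint32_t *partial)
{
   for (size_t i = 0; i < n; i++) {
      int first_in_cluster = (i % CLUSTER_SIZE) == 0;
      partial[i] = first_in_cluster ? src[i] : partial[i - 1] + src[i];
   }
}

/* Hypothetical model of the loop over clusters. `carry` accumulates the
 * total of all clusters visited so far. The inclusive result falls out of
 * the loop directly; the exclusive result needs an extra computation per
 * element, so both destinations may be NULL to skip that work.
 * Returns the reduction over all n elements. */
static uint32_t
scan_clusters_model(const uint32_t *partial, size_t n, uint32_t identity,
                    uint32_t *inclusive, uint32_t *exclusive)
{
   assert(partial != NULL);

   uint32_t carry = identity;
   for (size_t base = 0; base < n; base += CLUSTER_SIZE) {
      size_t end = base + CLUSTER_SIZE < n ? base + CLUSTER_SIZE : n;

      for (size_t i = base; i < end; i++) {
         if (inclusive)
            inclusive[i] = carry + partial[i];
         if (exclusive) {
            /* Explicit extra step: combine the carry with the previous
             * element's partial result (or the identity at a cluster start). */
            uint32_t prev = (i == base) ? identity : partial[i - 1];
            exclusive[i] = carry + prev;
         }
      }

      /* The last element of the cluster holds the cluster total. */
      carry = carry + partial[end - 1];
   }
   return carry;
}
```

Calling scan_clusters_model with a NULL exclusive pointer skips the extra per-element work entirely, which mirrors why the macro only takes an exclusive destination when the result is actually needed.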

Note on brcst.active classification for scheduling: it is currently classified as a texture instruction, which seems to be fine for scheduling purposes but feels wrong. I think it would be enough to classify it as a sy-producer. That might avoid some unnecessary ss-bits that we currently emit on consumers of brcst.active but that the blob doesn't. This change seems quite involved, though, since many places seem to rely on is_tex.

Edited by Job Noorman