WIP: nir: Lower cross-invocation TCS I/O

Timur Kristóf requested to merge Venemo/mesa:nir-tcs-io into main

This MR introduces a new NIR pass which lowers cross-invocation TCS input/output loads into a same-invocation load and a subgroup operation. The subgroup operation is chosen based on an options argument and on which operation best suits the patch size.

The following lowerings are done:

  • When the TCS output patch size is 1, then all invocations can load only their own outputs, so all loads are turned into same-invocation loads.
  • When the TCS output patch size is 2, we can use quad_swizzle_amd to lower loads with a constant vertex index.
  • When the TCS output patch size is 4, we can use quad_broadcast. An option is available to select whether dynamic quad broadcast is allowed or not.
  • As a fallback, it can use shuffle as well.
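
The selection described in the list above can be sketched roughly as follows. This is only an illustration, not the code of the pass: the enum, the options struct and the function name are hypothetical, and it assumes that a non-constant vertex index with patch size 4 falls back to shuffle when dynamic quad broadcast is not allowed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration; not the real API of the pass. */
enum tcs_xinvoc_lowering {
   LOWER_SAME_INVOCATION, /* patch size 1: every load reads its own output */
   LOWER_QUAD_SWIZZLE,    /* patch size 2, constant vertex index */
   LOWER_QUAD_BROADCAST,  /* patch size 4 */
   LOWER_SHUFFLE,         /* generic fallback */
};

struct tcs_xinvoc_options {
   /* Whether quad broadcast with a non-constant index may be emitted. */
   bool allow_dynamic_quad_broadcast;
};

static enum tcs_xinvoc_lowering
choose_lowering(unsigned out_patch_size, bool vertex_index_is_const,
                const struct tcs_xinvoc_options *opts)
{
   assert(out_patch_size > 0);

   if (out_patch_size == 1)
      return LOWER_SAME_INVOCATION;

   if (out_patch_size == 2 && vertex_index_is_const)
      return LOWER_QUAD_SWIZZLE;

   if (out_patch_size == 4 &&
       (vertex_index_is_const || opts->allow_dynamic_quad_broadcast))
      return LOWER_QUAD_BROADCAST;

   /* Assumption: everything else is handled by shuffle. */
   return LOWER_SHUFFLE;
}
```

Ordering the checks from the cheapest operation to the most general one keeps shuffle as a last resort.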

Moreover, when the input and output patch sizes are the same, we know for sure that the number of VS and TCS invocations is the same, so the same lowering is also applied to TCS input loads.

Notes about correctness:

  • The new lowering is only applicable when we know in advance that each TCS invocation that belongs to the same patch will be in the same subgroup. For example:
    • It works when the output patch size is 4 and subgroup size is 64.
    • It won't work when the output patch size is 3 and subgroup size is 64.
  • It is not valid in divergent control flow, so some enhancements are added to NIR in order to be able to filter out blocks that have divergent control flow.
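
A minimal sketch of the first condition, assuming that patches are packed back to back starting at invocation 0 of each subgroup, so that a patch stays inside one subgroup exactly when the subgroup size is a multiple of the output patch size. The helper name is hypothetical and the actual check in the pass may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: returns true when no patch can straddle two
 * subgroups, under the back-to-back packing assumption described above. */
static bool
patch_fits_in_subgroup(unsigned out_patch_size, unsigned subgroup_size)
{
   assert(out_patch_size > 0 && subgroup_size > 0);
   return subgroup_size % out_patch_size == 0;
}
```

With this check, patch size 4 and subgroup size 64 is accepted, while patch size 3 and subgroup size 64 is rejected, matching the two examples above.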

Resulting stats:

Totals from 165 (0.12% of 137887) affected shaders:
VGPRs: 5700 -> 6568 (+15.23%); split: -2.60%, +17.82%
CodeSize: 232664 -> 114684 (-50.71%)
MaxWaves: 2430 -> 2194 (-9.71%); split: +3.05%, -12.76%
Instrs: 35736 -> 17951 (-49.77%)
Cycles: 142944 -> 71804 (-49.77%)
VMEM: 47130 -> 47856 (+1.54%); split: +9.42%, -7.88%
SMEM: 12800 -> 7102 (-44.52%)
VClause: 6477 -> 2058 (-68.23%)
Copies: 1368 -> 1406 (+2.78%); split: -2.70%, +5.48%
PreVGPRs: 1315 -> 6303 (+379.32%)

Only DS3 seems to be affected; some tess control shaders go wild there. This is mostly what can be expected from such a change. Now that some I/O uses registers, VGPR usage increases drastically, but code size decreases massively thanks to fewer VMEM stores and several eliminated LDS stores and loads, so this is an overall win.

Edited by Timur Kristóf
