
Draft: nir,aco,radv: add a pass to optimize shuffles/booleans that only depend on subgroup id and constants

Georg Lehmann requested to merge DadSchoorse/mesa:nir_opt_const_shuffle into main

This pass uses constant folding to determine, for each invocation, which invocation a shuffle reads from. It then detects patterns in the result and uses a more specialized intrinsic where possible.
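The folding step can be pictured as evaluating the shuffle's source-lane expression once per invocation and then matching the resulting table against known shapes. The sketch below is a hypothetical Python model, not the Mesa pass: `fold_shuffle_sources`, `classify`, the 64-lane wave size, and the small pattern set (identity, broadcast, xor) are all illustrative assumptions.

```python
# Hypothetical model of the idea; names are illustrative, not Mesa's API.
WAVE_SIZE = 64


def fold_shuffle_sources(src_index, wave_size=WAVE_SIZE):
    """Constant-fold a shuffle's source-lane expression for every invocation.

    src_index stands in for a NIR expression that depends only on the
    subgroup invocation id and constants. Returns None if any invocation
    would read an out-of-range lane (the model then gives up).
    """
    table = [src_index(i) for i in range(wave_size)]
    if any(not 0 <= src < wave_size for src in table):
        return None
    return table


def classify(table):
    """Match the folded table against a few simple patterns."""
    if all(src == i for i, src in enumerate(table)):
        return ("identity",)            # the shuffle is a no-op
    if all(src == table[0] for src in table):
        return ("broadcast", table[0])  # every invocation reads one lane
    mask = table[0]                     # lane 0 reads lane 0 ^ mask
    if all(src == i ^ mask for i, src in enumerate(table)):
        return ("xor", mask)            # butterfly pattern
    return ("generic",)                 # keep the general shuffle
```

For example, `classify(fold_shuffle_sources(lambda i: i ^ 1))` yields `("xor", 1)`, while a source index such as `(i * 3) % 64` matches none of the shapes and stays a general shuffle.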

AMD hardware has many specialized shuffle instructions (ds_swizzle, dpp16, dpp8, v_permlane16_b32, v_permlanex16_b32) that are faster than a general shuffle.

The main motivation is the gdeflate decompression shader used by DirectStorage, which open-codes clustered inclusive scans and broadcasts.
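To see why such shaders benefit, consider how a clustered inclusive scan and a clustered broadcast look when written with plain shuffles. In the hypothetical model below (not the actual gdeflate shader), every `src_of_lane` lambda is a function of the invocation id `i` and constants only, which is exactly the property that lets a pass fold it into a per-lane table. The Hillis-Steele scan structure and the cluster-of-8 example are assumptions made for illustration.

```python
# Hypothetical model of open-coded subgroup operations built from shuffles.


def shuffle(values, src_of_lane):
    """Model of a subgroup shuffle: lane i reads values[src_of_lane(i)]."""
    return [values[src_of_lane(i)] for i in range(len(values))]


def clustered_inclusive_scan(values, cluster):
    """Hillis-Steele inclusive add-scan within clusters of `cluster` lanes."""
    v = list(values)
    d = 1
    while d < cluster:
        # Source lane depends only on i and the constants d and cluster.
        other = shuffle(v, lambda i, d=d: i - d if i % cluster >= d else i)
        v = [v[i] + other[i] if i % cluster >= d else v[i]
             for i in range(len(v))]
        d *= 2
    return v


def clustered_broadcast_last(values, cluster):
    """Every lane reads the last lane of its cluster: src = i | (cluster - 1)."""
    return shuffle(values, lambda i: i | (cluster - 1))
```

For `values = list(range(64))` and `cluster = 8`, lane 7 of the scan holds 0 + 1 + ... + 7 = 28, and the broadcast then copies each cluster's total to all of its lanes. Both source patterns repeat identically in every cluster, so they are candidates for the specialized instructions instead of a general shuffle.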

For booleans that only depend on subgroup id and constants, this pass creates inverse_ballot (from !25123 (merged)).
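The boolean case works the same way: evaluating the condition for every invocation produces a constant lane mask, and inverse_ballot turns that mask back into a per-invocation boolean (lane i gets bit i). The sketch below models this in Python; `fold_bool_to_mask`, the 64-lane wave, and the example condition are assumptions for illustration. Presumably the benefit on AMD is that the mask can be materialized as a uniform constant rather than computed per lane, but that is an inference, not something stated above.

```python
# Hypothetical model: fold a lane-id-only boolean into a constant lane mask.


def fold_bool_to_mask(cond, wave_size=64):
    """Evaluate cond(i) for every invocation and pack the results into a mask."""
    mask = 0
    for i in range(wave_size):
        if cond(i):
            mask |= 1 << i
    return mask


def inverse_ballot(mask, lane):
    """Each invocation reads its own bit of the (uniform) mask."""
    return bool((mask >> lane) & 1)
```

For instance, the condition `(i & 3) == 0` folds to `0x1111111111111111`, and `inverse_ballot` on that mask reproduces the original condition for every lane.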

Stats from shuffle optimization:

Foz-DB Navi21 (only the_last_of_us_part1 is affected): 
Totals from 8 (0.01% of 76572) affected shaders:
Instrs: 6492 -> 5700 (-12.20%)
CodeSize: 34024 -> 29760 (-12.53%)
Latency: 81559 -> 77871 (-4.52%)
InvThroughput: 10037 -> 9131 (-9.03%); split: -9.04%, +0.01%
SClause: 324 -> 325 (+0.31%)
Copies: 773 -> 776 (+0.39%)
PreSGPRs: 553 -> 549 (-0.72%)
PreVGPRs: 239 -> 237 (-0.84%)

Stats from boolean optimization:

Foz-DB Navi21:
Totals from 564 (0.74% of 76572) affected shaders:
MaxWaves: 13921 -> 13923 (+0.01%)
Instrs: 622888 -> 624533 (+0.26%); split: -0.05%, +0.31%
CodeSize: 3317976 -> 3289844 (-0.85%); split: -1.11%, +0.27%
VGPRs: 22328 -> 22272 (-0.25%); split: -0.32%, +0.07%
SpillSGPRs: 149 -> 164 (+10.07%)
Latency: 5896948 -> 5898656 (+0.03%); split: -0.03%, +0.06%
InvThroughput: 1074693 -> 1071459 (-0.30%); split: -0.34%, +0.04%
VClause: 20584 -> 20588 (+0.02%)
SClause: 23372 -> 23297 (-0.32%); split: -0.38%, +0.06%
Copies: 64628 -> 66460 (+2.83%); split: -0.00%, +2.84%
Branches: 21731 -> 21927 (+0.90%); split: -0.03%, +0.93%
PreSGPRs: 23082 -> 23111 (+0.13%); split: -0.03%, +0.16%
PreVGPRs: 18658 -> 18564 (-0.50%)
Edited by Georg Lehmann
