aco,nir: use ds_append and ds_consume
ds_append/consume are an optimized version of this:
uint popcount = subgroupAdd(1); // -1 for consume
uint res;
if (subgroupElect()) {
    res = atomicAdd(lds[offset], popcount);
}
res = subgroupBroadcastFirst(res);
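
For reference, the sequence above can be modelled on the CPU roughly as follows. This is only a sketch of the snippet's semantics, not of the hardware: the names `model_ds_append`, `model_ds_consume`, `active_lanes` and the `LDS_DWORDS` size are hypothetical, the exec mask is assumed to be 64 bits wide, and the returned value is assumed to be the pre-op value (as `atomicAdd` returns it).

```c
#include <assert.h>
#include <stdint.h>

#define LDS_DWORDS 16 /* hypothetical LDS size, for this model only */

/* Number of active lanes in a 64-bit exec mask (what subgroupAdd(1) yields). */
static uint32_t active_lanes(uint64_t exec)
{
    uint32_t n = 0;
    for (; exec; exec &= exec - 1) /* clear the lowest set bit each round */
        n++;
    return n;
}

/* Append: one add of the active-lane count; every lane sees the old value. */
static uint32_t model_ds_append(uint32_t *lds, uint32_t offset_dw, uint64_t exec)
{
    assert(offset_dw < LDS_DWORDS);
    uint32_t old = lds[offset_dw];
    lds[offset_dw] = old + active_lanes(exec);
    return old; /* broadcast to all lanes in the GLSL version */
}

/* Consume: same as append, but subtracts the active-lane count. */
static uint32_t model_ds_consume(uint32_t *lds, uint32_t offset_dw, uint64_t exec)
{
    assert(offset_dw < LDS_DWORDS);
    uint32_t old = lds[offset_dw];
    lds[offset_dw] = old - active_lanes(exec);
    return old;
}
```

With a counter of 10 and eight active lanes, the append model returns 10 to every lane and leaves 18 in memory; each lane would then add its own index among the active lanes to get a unique slot.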
Foz-DB Navi21:
Totals from 46 (0.06% of 79395) affected shaders:
Instrs: 85383 -> 84759 (-0.73%)
CodeSize: 449840 -> 447064 (-0.62%)
Latency: 570585 -> 566983 (-0.63%); split: -0.63%, +0.00%
InvThroughput: 133619 -> 132777 (-0.63%)
VClause: 1769 -> 1771 (+0.11%)
SClause: 2524 -> 2525 (+0.04%)
Copies: 6347 -> 6139 (-3.28%)
Branches: 4246 -> 4170 (-1.79%)
PreSGPRs: 2109 -> 2091 (-0.85%)
VALU: 50968 -> 50758 (-0.41%)
SALU: 14473 -> 14129 (-2.38%)
Draft because of the following open questions:
- Is m0 read? Some ISA docs say that m0+offset is used for LDS (clamped using m0[16:32]), but according to my testing GFX10/11 only use the offset.
- Is ds_append (which always has a return value) at least as fast as ds_add_u32 without a return value?
- ac/llvm support?