nir: Reorder f2f16(bcsel(x, .., ..))
Rewriting f2f16(bcsel(x, a, b)) as bcsel(x, f2f16(a), f2f16(b)) removes all
conversions from f2f16(bcsel(x, 0.0, f2f32(y))), a pattern I ran into.
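To illustrate why the conversions disappear, here is a minimal toy rewriter. It is not Mesa's nir_opt_algebraic machinery: the tuple encoding, the function names, and the two folding rules (f2f16 of the constant 0.0, and f2f16 of f2f32 of a 16-bit value) are assumptions made for this sketch.

```python
# Toy model only. Expressions are nested tuples (opcode, operand, ...).
# Leaves are ("const", value) or ("var", name, bit_size). Constant bit
# sizes are ignored here, unlike in real NIR.

def push_f2f16_into_bcsel(expr):
    """f2f16(bcsel(c, a, b)) -> bcsel(c, f2f16(a), f2f16(b))."""
    if expr[0] == "f2f16" and expr[1][0] == "bcsel":
        _, cond, a, b = expr[1]
        return ("bcsel", cond, ("f2f16", a), ("f2f16", b))
    return expr

def fold_f2f16(expr):
    """Assumed folds that become reachable after the reorder."""
    if expr[0] != "f2f16":
        return expr
    src = expr[1]
    if src[0] == "const" and src[1] == 0.0:
        return src  # 0.0 is exactly representable in 16 bits
    if src[0] == "f2f32" and src[1][0] == "var" and src[1][2] == 16:
        return src[1]  # widening then narrowing a 16-bit value
    return expr

def optimize(expr):
    """Optimize operands first, then apply the two rules at this node."""
    if expr[0] in ("const", "var"):
        return expr
    expr = (expr[0],) + tuple(optimize(op) for op in expr[1:])
    reordered = push_f2f16_into_bcsel(expr)
    if reordered != expr:
        return optimize(reordered)
    return fold_f2f16(expr)

x = ("var", "x", 1)
y = ("var", "y", 16)
example = ("f2f16", ("bcsel", x, ("const", 0.0), ("f2f32", y)))
print(optimize(example))
```

In this model the result is a plain bcsel over 16-bit operands, with no f2f16 or f2f32 left.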
It's a nice win on Android shaders on AGX, at least.
total instructions in shared programs: 1776725 -> 1773295 (-0.19%)
instructions in affected programs: 166080 -> 162650 (-2.07%)
helped: 590
HURT: 6
Instructions are helped.
total bytes in shared programs: 11716524 -> 11695360 (-0.18%)
bytes in affected programs: 1176948 -> 1155784 (-1.80%)
helped: 612
HURT: 9
Bytes are helped.
total halfregs in shared programs: 531032 -> 530844 (-0.04%)
halfregs in affected programs: 1686 -> 1498 (-11.15%)
helped: 43
HURT: 3
Halfregs are helped.
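For reading these reports: each percentage is the relative change, (after - before) / before. A small sketch that rechecks the AGX figures above; the helper name pct_change is hypothetical and not part of any report tool.

```python
def pct_change(before, after):
    """Relative change in percent; negative means the metric went down."""
    return (after - before) / before * 100.0

# Figures copied from the AGX report above.
print(round(pct_change(1776725, 1773295), 2))  # total instructions
print(round(pct_change(166080, 162650), 2))    # instructions, affected programs
print(round(pct_change(1686, 1498), 2))        # halfregs, affected programs
```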
Like AGX, ir3 knows how to fold conversions, although the results there are all over the place. I think this is a small win overall, but a lot of the metrics below are inconclusive.
total instructions in shared programs: 3117278 -> 3113284 (-0.13%)
instructions in affected programs: 181511 -> 177517 (-2.20%)
helped: 419
HURT: 115
Instructions are helped.
total nops in shared programs: 674281 -> 672865 (-0.21%)
nops in affected programs: 43853 -> 42437 (-3.23%)
helped: 326
HURT: 194
Nops are helped.
total non-nops in shared programs: 2442997 -> 2440419 (-0.11%)
non-nops in affected programs: 120787 -> 118209 (-2.13%)
helped: 384
HURT: 65
Non-nops are helped.
total mov in shared programs: 78630 -> 78617 (-0.02%)
mov in affected programs: 2459 -> 2446 (-0.53%)
helped: 69
HURT: 70
Inconclusive result (value mean confidence interval includes 0).
total cov in shared programs: 75347 -> 72678 (-3.54%)
cov in affected programs: 11944 -> 9275 (-22.35%)
helped: 385
HURT: 33
Cov are helped.
total dwords in shared programs: 6734360 -> 6734314 (<.01%)
dwords in affected programs: 224888 -> 224842 (-0.02%)
helped: 173
HURT: 109
Inconclusive result (value mean confidence interval includes 0).
total last-baryf in shared programs: 118389 -> 118386 (<.01%)
last-baryf in affected programs: 678 -> 675 (-0.44%)
helped: 12
HURT: 9
Inconclusive result (value mean confidence interval includes 0).
total full in shared programs: 193607 -> 193562 (-0.02%)
full in affected programs: 473 -> 428 (-9.51%)
helped: 41
HURT: 20
Full are helped.
total constlen in shared programs: 494256 -> 494452 (0.04%)
constlen in affected programs: 3520 -> 3716 (5.57%)
helped: 0
HURT: 49
Constlen are HURT.
total cat0 in shared programs: 743772 -> 742357 (-0.19%)
cat0 in affected programs: 47059 -> 45644 (-3.01%)
helped: 326
HURT: 194
Cat0 are helped.
total cat1 in shared programs: 154137 -> 151471 (-1.73%)
cat1 in affected programs: 16492 -> 13826 (-16.17%)
helped: 383
HURT: 61
Cat1 are helped.
total cat2 in shared programs: 1147730 -> 1147821 (<.01%)
cat2 in affected programs: 19780 -> 19871 (0.46%)
helped: 46
HURT: 62
Inconclusive result (%-change mean confidence interval includes 0).
total cat3 in shared programs: 944660 -> 944656 (<.01%)
cat3 in affected programs: 69 -> 65 (-5.80%)
helped: 3
HURT: 0
total sstall in shared programs: 237980 -> 237807 (-0.07%)
sstall in affected programs: 10800 -> 10627 (-1.60%)
helped: 138
HURT: 149
Inconclusive result (value mean confidence interval includes 0).
total (ss) in shared programs: 58074 -> 58043 (-0.05%)
(ss) in affected programs: 1886 -> 1855 (-1.64%)
helped: 92
HURT: 76
Inconclusive result (value mean confidence interval includes 0).
total systall in shared programs: 504493 -> 504998 (0.10%)
systall in affected programs: 19326 -> 19831 (2.61%)
helped: 121
HURT: 161
Inconclusive result (value mean confidence interval includes 0).
total (sy) in shared programs: 27389 -> 27393 (0.01%)
(sy) in affected programs: 131 -> 135 (3.05%)
helped: 17
HURT: 17
Inconclusive result (value mean confidence interval includes 0).
total waves in shared programs: 440360 -> 440362 (<.01%)
waves in affected programs: 440 -> 442 (0.45%)
helped: 22
HURT: 19
Inconclusive result (value mean confidence interval includes 0).
Results on Mali-G57 are not as good (since panfrost doesn't know how to fold f2f16 conversions into destinations, although IIRC the hardware can do it). Still looks like a win overall thanks to register pressure decreasing:
total instructions in shared programs: 2695860 -> 2691035 (-0.18%)
instructions in affected programs: 517574 -> 512749 (-0.93%)
helped: 810
HURT: 287
Instructions are helped.
total cycles in shared programs: 141283.03 -> 141286.30 (<.01%)
cycles in affected programs: 576.19 -> 579.45 (0.57%)
helped: 52
HURT: 124
Inconclusive result (value mean confidence interval includes 0).
total fma in shared programs: 22136.34 -> 22139.73 (0.02%)
fma in affected programs: 426.44 -> 429.83 (0.80%)
helped: 34
HURT: 106
Inconclusive result (%-change mean confidence interval includes 0).
total cvt in shared programs: 14694.14 -> 14615.30 (-0.54%)
cvt in affected programs: 4193.95 -> 4115.11 (-1.88%)
helped: 826
HURT: 281
Cvt are helped.
total sfu in shared programs: 8292.94 -> 8293.19 (<.01%)
sfu in affected programs: 5.50 -> 5.75 (4.55%)
helped: 2
HURT: 2
Inconclusive result (value mean confidence interval includes 0).
total quadwords in shared programs: 1459944 -> 1457440 (-0.17%)
quadwords in affected programs: 123904 -> 121400 (-2.02%)
helped: 382
HURT: 77
Quadwords are helped.
total threads in shared programs: 53606 -> 53611 (<.01%)
threads in affected programs: 8 -> 13 (62.50%)
helped: 6
HURT: 1
Threads are helped.
Finally, Mali-T860 (a vector processor) is also a wash:
total instructions in shared programs: 1512521 -> 1511913 (-0.04%)
instructions in affected programs: 250360 -> 249752 (-0.24%)
helped: 505
HURT: 270
Instructions are helped.
total bundles in shared programs: 644827 -> 644610 (-0.03%)
bundles in affected programs: 86000 -> 85783 (-0.25%)
helped: 465
HURT: 231
Inconclusive result (%-change mean confidence interval includes 0).
total quadwords in shared programs: 1128965 -> 1128389 (-0.05%)
quadwords in affected programs: 180606 -> 180030 (-0.32%)
helped: 541
HURT: 224
Quadwords are helped.
total registers in shared programs: 90718 -> 90682 (-0.04%)
registers in affected programs: 962 -> 926 (-3.74%)
helped: 68
HURT: 43
Inconclusive result (%-change mean confidence interval includes 0).
total threads in shared programs: 55657 -> 55679 (0.04%)
threads in affected programs: 37 -> 59 (59.46%)
helped: 21
HURT: 5
Threads are helped.
total spills in shared programs: 1434 -> 1435 (0.07%)
spills in affected programs: 1 -> 2 (100.00%)
helped: 0
HURT: 1
total fills in shared programs: 5222 -> 5236 (0.27%)
fills in affected programs: 4 -> 18 (350.00%)
helped: 0
HURT: 1