aco: Add a simple heuristic to decide early or late primitive export.

Late export is theoretically better if used with LATE_ALLOC,
but in practice, the early export has an advantage of
lower register usage, therefore more concurrent waves.

The idea of this commit is that "small" shaders benefit from early
primitive export more, due to being able to launch much more waves.

Let's consider a NIR shader "small" when it has only 1 block.
This yields both better performance, and better stats, than always
using late export.

Fossil DB on Sienna:

Totals from 12807 (8.76% of 146265) affected shaders:
VGPRs: 609128 -> 620216 (+1.82%); split: -0.01%, +1.83%
SpillSGPRs: 1458 -> 1538 (+5.49%)
CodeSize: 37028204 -> 37019320 (-0.02%); split: -0.17%, +0.14%
MaxWaves: 282902 -> 278516 (-1.55%)
Instrs: 7163142 -> 7162925 (-0.00%); split: -0.18%, +0.18%
VClause: 169285 -> 169547 (+0.15%); split: -1.15%, +1.30%
SClause: 267373 -> 267151 (-0.08%); split: -0.24%, +0.16%
Copies: 446442 -> 444567 (-0.42%); split: -2.68%, +2.26%
Branches: 156245 -> 156195 (-0.03%); split: -0.30%, +0.26%
PreSGPRs: 434701 -> 447396 (+2.92%)
PreVGPRs: 527783 -> 540527 (+2.41%)

Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
Part-of: <!10106>
36 jobs for !10106 with aco-ngg-tuning in 9 minutes and 58 seconds (queued for 3 seconds)
latest merge request