aco: Random small perf improvements + a stats improvement.
Just realized that using 64-bit shifts for parallel copies on RDNA3 is terrible, so I updated the stats to reflect this. (The instruction scaling to wave64 is somewhat of a guess: I didn't test all instruction classes, and I'm testing only on RDNA2 anyway.)