aco: scheduling improvements
As mentioned in my XDC talk, ACO doesn't really take RAR dependencies into account when scheduling. This series makes a few changes to the scheduler to improve the situation.
- restrict scheduling more depending on max_waves. This prevents the scheduler from losing too much parallelism
- better handling of RAR dependencies for VMEM instructions. This greatly improves the VMEM def-use distances, as VMEM instructions are executed in-order
- some minor changes.
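
To illustrate why the issue order of VMEM instructions matters for def-use distance, here is a toy model in Python. It is not ACO code and does not reproduce the heuristics in this series. It assumes that, because VMEM instructions execute in-order, waiting for a younger load's result also waits for every older load. The `Instr` type, the register names and both example programs are made up for the sketch.

```python
from typing import NamedTuple, Optional

class Instr(NamedTuple):
    name: str
    is_vmem: bool            # True for a (hypothetical) VMEM load
    dst: Optional[str]       # register written, if any
    srcs: tuple              # registers read

def first_use(program, idx):
    """Index of the first instruction that reads the result of program[idx]."""
    dst = program[idx].dst
    for j in range(idx + 1, len(program)):
        if dst in program[j].srcs:
            return j
    return len(program)  # never read: treat the end of the program as the use

def effective_distances(program):
    """Effective def-use distance of every VMEM load.

    Assumption of this toy model: results return in issue order, so a load
    must be complete as soon as it OR any younger load is first used.
    """
    vmem = [i for i, ins in enumerate(program) if ins.is_vmem]
    result = {}
    for k, i in enumerate(vmem):
        needed_at = min(first_use(program, j) for j in vmem[k:])
        result[program[i].name] = needed_at - i
    return result

def make_program(first_load, second_load):
    # Two independent loads, then ALU work; v1 is consumed before v0.
    return [
        first_load,
        second_load,
        Instr("use_b", False, "v2", ("v1",)),
        Instr("alu_0", False, "v3", ("v2",)),
        Instr("alu_1", False, "v4", ("v3",)),
        Instr("use_a", False, "v5", ("v0", "v4")),
    ]

load_a = Instr("load_a", True, "v0", ("v10",))
load_b = Instr("load_b", True, "v1", ("v11",))

# Issue order opposite to use order: the early use of v1 also stalls on load_a.
mismatched = effective_distances(make_program(load_a, load_b))
# Issue order matching use order: each load gets more slack before it is needed.
matched = effective_distances(make_program(load_b, load_a))

print(mismatched)  # {'load_a': 2, 'load_b': 1}
print(matched)     # {'load_b': 2, 'load_a': 4}
```

In this model the shortest effective distance grows from 1 to 2 instructions just by issuing the loads in the order their results are consumed, which is the kind of effect the in-order remark above refers to.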
A small increase in code size is mainly due to an increased number of waitcnt() instructions. As with almost all changes, the result is a bit mixed, and some games might experience a slight loss in performance, but overall I think this series is beneficial. This series also makes scheduling slightly faster than before.
Total shader stats changes:
- 57559 shaders in 28980 tests
- Totals:
- SGPRS: 2895271 -> 2969727 (2.57 %)
- VGPRS: 1981304 -> 1964604 (-0.84 %)
- Spilled SGPRs: 868 -> 868 (0.00 %)
- Spilled VGPRs: 0 -> 0 (0.00 %)
- Private memory VGPRs: 0 -> 0 (0.00 %)
- Scratch size: 10348 -> 10348 (0.00 %) dwords per thread
- Code Size: 114455544 -> 114584072 (0.11 %) bytes
- LDS: 933 -> 933 (0.00 %) blocks
- Max Waves: 378759 -> 382668 (1.03 %)
- Wait states: 0 -> 0 (0.00 %)
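
The percentages above are relative changes against the old totals. A minimal sketch for re-deriving them from the raw numbers, assuming rounding to two decimals (the helper name `percent_change` is made up here and is not part of any stats tooling):

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent, rounded to 2 decimals."""
    if before == 0:
        return 0.0  # matches how unchanged zero totals are shown as 0.00 %
    return round((after - before) / before * 100, 2)

totals = {
    "SGPRS": (2895271, 2969727),
    "VGPRS": (1981304, 1964604),
    "Code Size": (114455544, 114584072),
    "Max Waves": (378759, 382668),
}

for name, (before, after) in totals.items():
    print(f"{name}: {before} -> {after} ({percent_change(before, after):.2f} %)")
```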