WIP: aco: Add a post-RA scheduler to improve ALU scheduling
This MR adds a post-RA scheduler to ACO. The main goal is to improve ALU scheduling on Navi; it is not useful on older GPUs. I built this on top of some prior work by Daniel, which was removed shortly before ACO was merged. Back then, the post-RA scheduler didn't really do anything. I took the old code and extended it so that it actually works: it now builds a DAG, understands things like barriers, etc.
It works in the following way:
- The post-RA scheduler (PRS) is a list scheduler.
- Each basic block is processed independently of the others.
- Basic blocks are broken up into smaller units along scheduling barriers like control flow, `s_barrier`, and similar instructions.
- Within the smaller units, a DAG (directed acyclic graph) is built from the instructions based on the registers they read and write, and their memory semantics.
- Each instruction is assigned a priority which is roughly based on its latency.
- From the DAG, candidate instructions are selected based on priority and which of the available instructions can start first.
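To illustrate the steps above, here is a minimal, self-contained sketch of a list scheduler over one scheduling unit. It is not the code in this MR: `ToyInstr`, `overlaps` and `list_schedule` are hypothetical names, registers are plain integers, memory semantics are ignored, one instruction issues per cycle, and every dependency edge is charged the producer's full latency. The selection rule (earliest possible start first, priority as tie-breaker) is also an assumption made for the example.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical toy instruction; ACO's real IR is much richer.
struct ToyInstr {
   std::vector<int> reads;  // register ids read
   std::vector<int> writes; // register ids written
   int latency;             // cycles until the result is available
};

static bool overlaps(const std::vector<int>& a, const std::vector<int>& b)
{
   for (int x : a)
      if (std::find(b.begin(), b.end(), x) != b.end())
         return true;
   return false;
}

// Returns the new order as indices into `unit`.
std::vector<int> list_schedule(const std::vector<ToyInstr>& unit)
{
   const int n = static_cast<int>(unit.size());
   std::vector<std::vector<int>> succs(n);
   std::vector<int> num_preds(n, 0);

   // Build the DAG: edge i -> j (i before j) on RAW, WAR or WAW hazards.
   for (int j = 0; j < n; j++) {
      for (int i = 0; i < j; i++) {
         if (overlaps(unit[i].writes, unit[j].reads) ||
             overlaps(unit[i].reads, unit[j].writes) ||
             overlaps(unit[i].writes, unit[j].writes)) {
            succs[i].push_back(j);
            num_preds[j]++;
         }
      }
   }

   // Priority: latency-weighted longest path from the instruction to the end.
   std::vector<int> priority(n, 0);
   for (int i = n - 1; i >= 0; i--) {
      priority[i] = unit[i].latency;
      for (int s : succs[i])
         priority[i] = std::max(priority[i], unit[i].latency + priority[s]);
   }

   std::vector<int> ready_at(n, 0); // earliest cycle the operands are ready
   std::vector<bool> done(n, false);
   std::vector<int> order;
   int cycle = 0;

   while (static_cast<int>(order.size()) < n) {
      // Among instructions whose predecessors are all scheduled, pick the one
      // that can start first; break ties by higher priority.
      int best = -1;
      for (int i = 0; i < n; i++) {
         if (done[i] || num_preds[i] != 0)
            continue;
         if (best < 0) {
            best = i;
            continue;
         }
         const int start_i = std::max(cycle, ready_at[i]);
         const int start_b = std::max(cycle, ready_at[best]);
         if (start_i < start_b ||
             (start_i == start_b && priority[i] > priority[best]))
            best = i;
      }

      const int issue = std::max(cycle, ready_at[best]);
      done[best] = true;
      order.push_back(best);
      cycle = issue + 1;
      for (int s : succs[best]) {
         num_preds[s]--;
         ready_at[s] = std::max(ready_at[s], issue + unit[best].latency);
      }
   }
   return order;
}
```

For example, given a 4-cycle producer, its consumer, and an independent two-instruction chain (in that order), this sketch moves the independent chain between the producer and its consumer, yielding the order 0, 2, 3, 1.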
A few notes:
- Scheduling memory instructions is not the primary goal, so currently it treats VMEM instructions as scheduling barriers. There are some ideas to improve VMEM scheduling in the future, but this is a very problematic topic, and the pre-RA scheduler already does a good job.
- It will still try to schedule SMEM and DS instructions, when it sees benefit in doing so.
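As a rough illustration of how a block might be cut into scheduling units under these rules, the sketch below treats control flow, barriers and VMEM as unit boundaries, while SMEM and DS stay inside units and remain eligible for reordering. `ToyKind`, `is_scheduling_barrier` and `split_units` are hypothetical names for this example only and do not correspond to ACO's actual data structures.

```cpp
#include <utility>
#include <vector>

// Hypothetical instruction classes, for illustration only.
enum class ToyKind { VALU, SALU, SMEM, DS, VMEM, Branch, Barrier };

// Assumption for this sketch: VMEM is treated like a barrier, SMEM and DS are not.
bool is_scheduling_barrier(ToyKind k)
{
   return k == ToyKind::VMEM || k == ToyKind::Branch || k == ToyKind::Barrier;
}

// Returns half-open [begin, end) index ranges that may be reordered internally.
// Barrier instructions belong to no range and keep their position.
std::vector<std::pair<int, int>> split_units(const std::vector<ToyKind>& block)
{
   std::vector<std::pair<int, int>> units;
   const int n = static_cast<int>(block.size());
   int begin = 0;
   for (int i = 0; i <= n; i++) {
      if (i == n || is_scheduling_barrier(block[i])) {
         if (i > begin)
            units.emplace_back(begin, i);
         begin = i + 1;
      }
   }
   return units;
}
```

Each returned range would then be scheduled on its own, so nothing is ever moved across a VMEM instruction, a branch, or a barrier.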
There are also a couple of smaller fixes included here for other parts of ACO.
Edited by Timur Kristóf