lima ppir compiler instruction combining and optimization
[Warning: long post/brainstorm]
For the past few weeks I have been working on and off on lima ppir compiler optimization. One of the few items remaining in the original TODO list is:
PP compiler optimization, add post scheduler instruction combination which combine multi instructions into one instruction
While working on this, I have found other possible points of optimization and some challenges. For a while we have known that ppir needs some rework to allow for better instruction combining and reduce technical debt.
Since it is taking longer than expected to come up with a working solution by reasonable incremental steps, I'm filing this issue to document what I have experimented with so far as well as track progress and share some thoughts from the experiments. I welcome some feedback so I don't spend (more) time following the wrong idea.
- post scheduler instruction combine
I implemented a proof of concept of this as it was seemingly what ppir was originally designed for. One of the initial technical challenges here is that doing this breaks the current spilling code that inserts ops in existing instructions, in particular for varying loads. Since varying loads write to registers (not pipeline registers), when those are combined with other nodes that use the varying result, if the register holding the variable is spilled, the final result might be that a store instruction is placed at the end of the instruction after the varying usage, and a load inserted before this instruction (because of the register read) which doesn't make a lot of sense. To fix this, I implemented a full loop back from spilling to the start of scheduling. In this way, spilling only inserts new nodes, but the scheduler takes care of schduling the spill/fill nodes from the ground up, ensuring all is correct and without needing more complexity in the spilling path. This is not straightforward at all as the scheduler performs destructive changes to the ppir nodes, which need to be reverted in case of a 'un-schedule'. This seems to be a working solution, though.
Then finally the actual work is to implement something that will combine instructions taking care to not break any dependency. Constants are particularly annoying since they have already been placed and swizzled; technically they could be reordered again too.
Unfortunately, my initial benchmark with all that solution did not provide great results. There is some gain, but not a whole lot we can combine in the post-scheduler stage to make it much better. And in many cases the added register pressure adds even more spills to programs that already require spilling. Pretty much only varying loads to combine with their users. Probably not worth it for the whole algorithm run and added complexity, So I left that aside for now. I suppose things like deferring the const slot assignment to after this might be better but at this point I'm not sure if it is worth it to continue going this way.
WIP https://gitlab.freedesktop.org/enunes/mesa/-/tree/lima-ppir-combine-1
- start by merging the varyings only
One very low hanging fruit in ppir is to combine single-user varyings to their users in the same instruction, like we do for uniform loads. This saves up to one instruction per varying load, which is a few instructions per program. Not really ground breaking but if we delay a full rework for longer we might want to have it merged to provide a more fair comparison base against future more sophisticated algorithms (so that fancy algorithms have to do better than just save that varying load instruction).
WIP https://gitlab.freedesktop.org/enunes/mesa/-/tree/lima-ppir-combine-varying-1
- clone uniforms and varyings for each user node
The blob seems to do clone those for each user, not only for each block like we do. Since otherwise the uniform load might be empty, that might be helpful as it may reduce register pressure.
In the meantime, moving that to nir might also be helpful, as the way we currently have it in lima is a bit cumbersome (we clone those during the emit step of a nir node that uses the uniform result...).
It is also not a full rework that ppir needs but might be helpful to have, again, to provide a better comparison base against a more sophisticated algorithm.
WIP https://gitlab.freedesktop.org/enunes/mesa/-/tree/lima-ppir-nir-dupl-uniforms-1
- pre-scheduler instruction combining
While working on all these other things, I'm thinking that a better way might be to refactor the scheduler (in ppir currently the "node_to_instr" step) to generate instructions already combined and making better use of pipeline registers.
My current thought is to have an algorithm that bundles ppir nodes into groups and works on those groups. Nodes in the same group will be assigned to the same instruction. We may start making best case assumptions (everything possible can be pipelined) and do validation steps on them, and if any validation fails (e.g. group contains two nodes that require the same slot) we "demote" nodes until it validates completely. Uniforms may be demoted to a mov, alu nodes may be demoted from pipeline output to reg output. The advantage of working in these groups rather than currently on ppir_instr is that ppir_instr can only make decisions on where to put a node once and only by looking at each node at once. Working on node groups allows us to look at all the nodes in the group to decide which one to demote, which may lead to better decisions.
I'm working on prototyping this to run some benchmarks; if somebody has any feedback or has tried this before or has a better idea of ppir rework to improve combining, please let me know.
- other experiments
One other thing I have experimented with so far is to switch node_to_instr to a simpler forward list scheduler. This is far simpler and less cumbersome than the backwards scheduling but quickly falls apart as it is hard to schedule the pipeline dependencies forward. Scheduling each node would require checking whether the pipeline dependency will fit in the same instruction; which requires recursive checks as those dependencies might also have their own pipeline dependencies. So not as straightforward as it may seem. It might be possible to switch the scheduler to something like this rather than the current node_to_instr+scheduler combo if we have the node groups I mentioned, as all we would have to do is check whether the entire node group fits in the instruction.
Finally, another experiment was to simply allow node_to_instr to always try to place nodes in the same instruction first, instead of creating a new one. Unfortunately, this is also not very straightforward as node_to_instr doesn't seem to create instructions one at a time and just doing this blindly ends up breaking dependencies. Could be investigated more though if we prefer to keep node_to_instr.