Faster Mesa CI
A lot of people have been complaining about CI execution time lately. We should probably find a way to make them happy.
There are at least a couple of works in progress at the moment:
- @anholt has been working on switching our x86-64 build/test execution over from the shared runners to dedicated runners, which I believe are sponsored by Google (this might be semi-done?)
- @bnieuwenhuizen has written a parallelising dEQP executor, which should speed up execution of the hardware tests which run directly under GitLab runners rather than LAVA (the LAVA devices have their own parallelisation)
In the interests of measuring before we cut, I drew up some graphs, some of which have some value if you squint at them hard enough. I've attached the script I used for analysis below, and will provide the raw data if someone wants to do their own analysis. This dump only covers pipelines run on mesa/mesa and not MRs, mostly because I couldn't easily figure out how to get the results for both mesa/mesa and MRs together, but I could extend it to MRs if people would find it useful. [update: replaced the graphs with ones taking MRs into account as well.]
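For anyone who wants to reproduce the dump, something along these lines should work with python-gitlab; this is only a sketch of the kind of query involved (the pagination, field selection and output format here are my assumptions, not a copy of what I actually ran):

```python
# Sketch: collect per-job timing data for mesa/mesa pipelines via the GitLab API.
import json
import gitlab

gl = gitlab.Gitlab("https://gitlab.freedesktop.org")
project = gl.projects.get("mesa/mesa")

dump = []
for pipeline in project.pipelines.list(per_page=100, page=1):
    for job in pipeline.jobs.list(all=True):
        dump.append({
            "pipeline": pipeline.id,
            "stage": job.stage,
            "name": job.name,
            "status": job.status,
            "created_at": job.created_at,
            "started_at": job.started_at,    # may be None for jobs that never ran
            "finished_at": job.finished_at,
        })

with open("mesa-ci-dump.json", "w") as f:
    json.dump(dump, f)
```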
In all these graphs, each dot represents one stage of one pipeline, so each (successful) pipeline will have one dot for each of the stages. They're also all on a logarithmic scale. I did no filtering for unsuccessful jobs, because in theory we shouldn't have any unsuccessful jobs in master ... if we were to extend the graphing to MRs, I don't know how much noise we'd introduce by including unsuccessful jobs.
Here is the total end-to-end time taken for each stage of each pipeline, measured as the delta between the first point at which any job for that stage was queued (i.e. its dependencies were satisfied and it was ready for execution) and the last point at which any job of that stage completed:
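For clarity, the end-to-end figure is just the latest completion minus the earliest queue point across the jobs of a stage; a rough sketch of that computation, using assumed field names ("queued_at" standing for whenever the job became ready to run) rather than whatever the plotting script actually uses:

```python
from datetime import datetime

def parse(ts):
    # GitLab timestamps look like "2020-03-11T12:34:56.789Z"
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def end_to_end_by_stage(jobs):
    """End-to-end time per stage of one pipeline: last job completion minus
    the first point any job in the stage was ready to run. The keys
    'stage', 'queued_at' and 'finished_at' are assumed names."""
    first_queued, last_finished = {}, {}
    for job in jobs:
        stage = job["stage"]
        q, f = parse(job["queued_at"]), parse(job["finished_at"])
        first_queued[stage] = min(first_queued.get(stage, q), q)
        last_finished[stage] = max(last_finished.get(stage, f), f)
    return {s: (last_finished[s] - first_queued[s]).total_seconds()
            for s in first_queued}
```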
Here is the total time that each stage was blocked waiting for a runner to become available. Panfrost and Lima look pretty awesome here, but it's misleading. For both of these in the Collabora lab, we have a GitLab runner with pretty much infinite free slots which does nothing other than collect the job and queue it in LAVA, so the times shown do not reflect when the job actually started on the device - that instead gets hidden in execution time. Conversely, x86-64 test and freedreno do accurately reflect the amount of time spent queueing. BayLibre's LAVA devices run on a runner with a very small number of job slots, so that can introduce some element of waiting.
Another misleading part of this graph is that it is the cumulative time spent queueing, which penalises parallel runs: particularly x86-64 test and freedreno. For example, if the x86-64 softpipe test has 4 parallel stages which each spend 100 seconds waiting in the queue before launching simultaneously, this will be counted as 400 seconds waiting, even though there is only 100 seconds of wall-time impact.
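In code terms, the graph plots the first of the two numbers below rather than the second; a quick sketch of both, again with assumed field names:

```python
from datetime import datetime

def parse(ts):
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def stage_queue_times(jobs_in_stage):
    """Cumulative vs wall-clock queueing for one stage of one pipeline:
    four jobs each waiting 100s in parallel count as 400s cumulative,
    but only ~100s of wall-clock impact."""
    waits = [(parse(j["started_at"]) - parse(j["queued_at"])).total_seconds()
             for j in jobs_in_stage]
    cumulative = sum(waits)
    # Rough wall-clock impact: span from the first job becoming ready to the
    # last job actually starting; only an approximation when jobs become
    # ready at different times.
    wall_clock = (max(parse(j["started_at"]) for j in jobs_in_stage)
                  - min(parse(j["queued_at"]) for j in jobs_in_stage)).total_seconds()
    return cumulative, wall_clock
```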
And here's the total execution duration for each class in each pipeline (i.e. sum([finish - start for each job in class in pipeline])), which is at least pleasingly consistent. (Low outliers are almost certainly failed jobs. Panfrost/Lima fluctuations are, as above, very likely due to waiting for a hardware device to become available.)
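That figure is just the summed per-job runtime with queueing excluded; roughly, with the same assumed field names as above:

```python
from datetime import datetime

def parse(ts):
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def execution_duration(jobs_in_class):
    """Total execution duration for one class in one pipeline:
    sum(finish - start) over its jobs, ignoring queueing time."""
    return sum((parse(j["finished_at"]) - parse(j["started_at"])).total_seconds()
               for j in jobs_in_class)
```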
Here's the pile of Python/Matplotlib hacks I used to generate the graphs: mesa-ci-plot.py
My observations so far:
- the queues for the shared runners are absolutely stupid, and if the autoscaling runners bring that down to something reasonable then we'll have eliminated the vast majority of variability in the average pipeline (note that the x86-64 build execution durations are quite constant, but the end-to-end time bounces around quite wildly - from a floor of ~12min to a ceiling of 3 hours?!)
- the shared ARM runner has recently become oversubscribed, so we should have another one, which would double our capacity
- freedreno is very oversubscribed - either underprovisioned or suffering availability issues? The network also seems to be an interesting factor there
- Panfrost execution time is pleasingly constant; Lima is relatively constant but does have availability issues, which is hopefully just due to bedding in infrastructure
- the end-to-end time on x86-64 testing is surprisingly stable given the demand for the shared runners; I assume this is because we launch so many parallel jobs in a single MR, and the tests see a lot less queueing time since they can fire immediately after the primary Meson build job and don't have to wait for SCons etc.