ci: Should we reduce our target pipeline time?
Our current written policy for Mesa CI allows for quite long pipelines:
Additionally, the test farm needs to be able to provide a short enough turnaround time that we can get our MRs through marge-bot without the pipeline backing up. As a result, we require that the test farm be able to handle a whole pipeline’s worth of jobs in less than 15 minutes (to compare, the build stage is about 10 minutes). Given boot times and intermittent network delays, this generally means that the test runtime as reported by deqp-runner should be kept to 10 minutes.
It is a serious problem that this is not satisfied by one farm in pre-merge (#10273). Even once that is resolved, though, our pipelines will still be quite long: around 25 minutes optimistically (about 10 minutes of build plus up to 15 minutes of testing).
Looking at the well-behaved farms that follow our policy, though, they do so with considerable time to spare: in practice, they need about 10 minutes instead of 15. As the number of developers in the project grows, we should expect a growing number of merge requests waiting to merge, all serialized through the same pipeline, unless we adopt a subtree approach in Mesa. As such, unless we're ready to seriously talk about adopting subtrees to solve pipeline serialization, we need to keep decreasing pipeline time.
Indeed, as per the existing policy, we have this rule to avoid backing up the merge queue. Last week we had, IIRC, a dozen or so MRs in the queue at once, and not the layered Vulkan kind of dozen. But they were merging; that was a good week for Mesa CI...
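To put rough numbers on the queue argument, here is a back-of-the-envelope sketch. It assumes a strictly serialized queue (one pipeline at a time, no failures or retries) and uses only the figures quoted above: a 10 minute build stage, the current 15 minute test budget, and the roughly 10 minutes the well-behaved farms achieve in practice. The function names are made up for illustration and are not part of marge-bot or any Mesa tooling.

```python
# Rough model of a strictly serialized merge queue.
# Assumption: one pipeline runs at a time and every pipeline passes first try.

BUILD_MINUTES = 10  # build stage duration quoted in the policy text


def pipeline_minutes(test_budget_minutes, build_minutes=BUILD_MINUTES):
    """Wall-clock time of one pipeline: build stage plus test stage budget."""
    return build_minutes + test_budget_minutes


def queue_drain_minutes(queued_mrs, test_budget_minutes):
    """Time to merge everything currently queued, one pipeline after another."""
    return queued_mrs * pipeline_minutes(test_budget_minutes)


def max_merges_per_day(test_budget_minutes):
    """Upper bound on merges in 24 hours with back-to-back pipelines."""
    return (24 * 60) // pipeline_minutes(test_budget_minutes)


if __name__ == "__main__":
    for budget in (15, 10):
        print(
            f"test budget {budget} min: "
            f"pipeline {pipeline_minutes(budget)} min, "
            f"12 queued MRs drain in {queue_drain_minutes(12, budget)} min, "
            f"at most {max_merges_per_day(budget)} merges/day"
        )
```

Under these assumptions, a dozen queued MRs take 5 hours to drain with a 15 minute test budget and 4 hours with a 10 minute one. Real numbers will be worse, since flakes and retries are ignored here.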
I propose that we make this policy stricter, something like:
Additionally, the test farm needs to be able to provide a short enough turnaround time that we can get our MRs through marge-bot without the pipeline backing up. As a result, we require that the test farm be able to handle a whole pipeline’s worth of jobs in at most 10 minutes (to compare, the build stage is about 10 minutes). Given boot times and intermittent network delays, this generally means that the test runtime as reported by deqp-runner should be kept to 8 minutes.
I believe the Igalia and Valve farms already satisfy this stricter policy. Are there any issues foreseen with this? Together, the Igalia and Valve farms represent the full gamut of hard Mesa CI testing, from full runs of the VKCTS on desktop hardware to GL and VK testing on a hideously underpowered Arm board. Given that both of these farms manage this fine, I don't see why it wouldn't be possible on all of the farms. And given the number of pipelines we process, it's imperative that we take serious steps to reduce pipeline time.
(Note this does not necessarily require bumping fractions. In preparation for M1 CI, I've spent serious time profiling CTS runs and have cut GLES3.2 deqp-runner time in half. There is probably low-hanging fruit in other drivers.)