ci: priority queue - "taking too long" issue
Documenting what was discussed on IRC:
We see several "CI is taking too long" replies from Marge; you can check this in the following dashboard panel: https://ci-stats-grafana.freedesktop.org/d/Ae_TLIwVk/mesa-ci-quality-false-positives?orgId=1&viewPanel=15
It seems that sometimes this happens because the current gitlab policy is to run the oldest jobs first.
A few possible solutions were discussed:
- Implement a new policy in gitlab (to allow users to have different priorities)
- Implement two new endpoints in gitlab: one to get the queue, another one to pick a given job from the queue. This way the prioritization could live on the runner side, which would choose the jobs to pick.
- Soft-disable mechanism. We could have a daemon that checks whether the waiting queue of a given tag holds more than a certain number of jobs and, when it does, spawns more docker gitlab-runners for this tag that execute /bin/true (so the job wouldn't actually be executed, just skipped), and reports this somewhere (IRC?)
- Increase the overall timeout
- Adjust timeout of jobs
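To make the runner-side prioritization and the soft-disable threshold check more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration only: the `PendingJob` record, its `priority` field, and the `pick_next` and `overloaded_tags` helpers are hypothetical, and no such queue endpoint exists in gitlab today.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingJob:
    # Hypothetical shape of one entry returned by a "get the queue" endpoint.
    job_id: int
    tag: str
    created_at: float  # seconds since epoch
    priority: int      # assumption: higher means more urgent (e.g. a Marge pipeline)


def pick_next(queue, runner_tags):
    """Return the job a runner with these tags should pick, or None."""
    eligible = [job for job in queue if job.tag in runner_tags]
    if not eligible:
        return None
    # Highest priority first; among equal priorities, oldest first,
    # which matches the current "oldest jobs first" behaviour.
    return min(eligible, key=lambda job: (-job.priority, job.created_at))


def overloaded_tags(queue, threshold):
    """Return the tags whose waiting queue holds more than `threshold` jobs."""
    counts = Counter(job.tag for job in queue)
    return sorted(tag for tag, count in counts.items() if count > threshold)
```

A runner would call `pick_next(queue, {"some-tag"})` after fetching the queue, while the soft-disable daemon would poll `overloaded_tags(queue, threshold)` and react to each returned tag. Sorting by priority before age keeps the existing ordering as the tie-breaker, so jobs with equal priority behave as they do now.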
A few points: We need to check whether the bottleneck is in the runners, in the available DUTs, or in jobs failing and getting retried. It may be worth starting with an analysis of the main reasons pipelines time out.
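As a starting point for that analysis, the sketch below buckets timed-out pipelines by their most likely cause. It is only an illustration under stated assumptions: the `JobRecord` fields and the `main_cause` and `summarize` helpers are hypothetical, the input would have to be exported from wherever the job statistics are stored, and the "queue" bucket does not by itself distinguish a lack of runners from a lack of DUTs.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRecord:
    # Hypothetical per-job record for one timed-out pipeline.
    name: str
    queued_s: float   # seconds spent waiting for a runner or DUT
    running_s: float  # seconds spent executing
    attempts: int     # 1 means the job was never retried


def main_cause(jobs):
    """Classify one timed-out pipeline as 'retries', 'queue' or 'runtime'."""
    # Any retried job is treated as the dominant cause (a simplification).
    if any(job.attempts > 1 for job in jobs):
        return "retries"
    queued = sum(job.queued_s for job in jobs)
    running = sum(job.running_s for job in jobs)
    return "queue" if queued > running else "runtime"


def summarize(pipelines):
    """Count causes over an iterable of pipelines (each a list of JobRecord)."""
    return Counter(main_cause(jobs) for jobs in pipelines)
```

If most timed-out pipelines land in "queue", the prioritization ideas are worth pursuing; if "runtime" or "retries" dominate, adjusting timeouts or fixing flaky jobs would probably help more.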
Please let me know if I'm missing something, and share your feedback on this. Thanks