
ci: Rebalancing farms, removing bare-metal internal retries.

Emma Anholt requested to merge anholt/mesa:ci-updates into main

Rebalancing farms came from looking at some queue-time graphs from lava monitoring, plus the usual side-eye of cheza. The bare-metal timeouts are a bigger change:

ci: Stop doing internal retries in bare-metal.

We have job-level retry on failure now, and will continue to need it in
order to work around fd.o infrastructure flakes.  If we stop doing retry
inside the job, then we can crank down the gitlab-level timeouts on test
jobs to be closer to our CI guidelines and avoid blocking a runner for an
hour when things go wrong (for example, cheza #16 failing to boot in a
recognized way and continuously looping due to the intra-job retry).
Plus, the job logs will be more readable when you don't have two boots in
one job, and we'll get the flakes surfaced in our monitoring dashboards.

If internal retries were really doing useful work, we may see an increase
in flakes as a result of this.  I'm committing to turning off boards or
reducing coverage as necessary to handle this.
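
As a rough illustration of the two job-level knobs discussed above (GitLab's `retry` and `timeout` keywords), a test job could be configured along these lines. This is only a sketch: the job name, the script path, and the specific values are hypothetical and are not taken from Mesa's CI files.

```yaml
# Hypothetical bare-metal test job; name, script and values are illustrative.
example-baremetal-test:
  script:
    - ./run-tests.sh        # placeholder for the real test entry point
  # Let GitLab rerun the whole job on failure instead of rebooting the
  # board again inside the same job.
  retry:
    max: 2                  # GitLab allows at most 2 retries
    when:
      - script_failure
      - runner_system_failure
  # With no internal retry loop, the job-level timeout can sit close to
  # the expected runtime, so a board that never boots frees the runner
  # quickly.
  timeout: 20 minutes
```

With this shape, each boot attempt is its own job with its own log, and each failed attempt shows up as a separate job failure.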
