venus-lavapipe job decay with an infinite loop
Long story short: we think there is a problem with the GitLab runners used by the venus-lavapipe job. Non-exhaustive statistics suggest that when the runner is google-mesa-swrast-1 or google-mesa-swrast-3, the job always ends in a GitLab job timeout. When the runner is google-mesa-swrast-2, there is roughly a 50-50 chance of either a timeout or the job finishing in less than 15 minutes.
I've been experiencing problems with the venus-lavapipe job since the uprev on Oct 28th.
The day before, Oct 27th, the scheduled pipeline ran the uprev_mesa_in_virglrenderer job as part of its daily schedule. It generated a merge request with the uprev proposal, which was merged on Nov 7th.
But in the Oct 28th run, the venus-lavapipe job didn't succeed in the uprev check. The ci-uprev logs reported that the tool doesn't know (yet) how to manage the expectations of this job. The venus-lavapipe job reported, after ~13 minutes of execution, that some tests were slow, some were flaky, and some others failed.
So the tool had to be extended to manage this recently added job, and there is already a proposal for that in merge request !21 (merged).
But while working on this fix, another issue with this job was detected. As an example, a local execution of ci-uprev picked pipeline 729028 of mesa to check whether an uprev to mesa's revision e891e84f would be viable. It generated an uprev commit and triggered a pipeline to find out.
The venus-lavapipe job failed and didn't produce artifacts. Instead, the server stopped it because of the job timeout in GitLab. The tool already has logic for a job that finishes without artifacts: it retries the job, but the retry gave the same result. I then retried it manually a third time. Always a job timeout.
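The retry-on-missing-artifacts behaviour described above could look roughly like the following. This is only a minimal sketch, not the actual ci-uprev code: `trigger_job` is a hypothetical callable standing in for "run the job once and report its status and whether it left artifacts", and the default of two attempts is an assumption.

```python
def run_with_retries(trigger_job, max_attempts=2):
    """Run a job until it produces artifacts or the attempts are used up.

    trigger_job is a hypothetical callable returning (status, has_artifacts).
    Returns (final_status, statuses_seen); final_status is None when no
    attempt produced artifacts, e.g. because every run hit the job timeout.
    """
    statuses_seen = []
    for _ in range(max_attempts):
        status, has_artifacts = trigger_job()
        statuses_seen.append(status)
        if has_artifacts:
            # Results are available, so expectations can be evaluated.
            return status, statuses_seen
    return None, statuses_seen
```

With a job that times out on every attempt, this returns `None` together with the list of timeouts, which matches what was observed here.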
Investigating further, we found a problem with the image used by this job. The venus-lavapipe job extends .venus-lavapipe-test, which defines MESA_IMAGE_PATH: ${DEBIAN_X86_TEST_IMAGE_PATH}. This right-hand variable comes from the file ".gitlab-ci/image-tags.yml" in "mesa/mesa", and its value is "debian/x86_test-gl". Hence the proposal to change this definition to use the Vulkan image "debian/x86_test-vk". As this Vulkan image doesn't yet have wget installed (as the GL one has), another change in virglrenderer is needed.
But this didn't solve the problem of the venus-lavapipe jobs failing due to the job timeout in GitLab. We started to see a pattern in which this job always fails in the first run but sometimes works in the retry.
Comparing the logs of a job that completes (whether it succeeds, or fails and reports test failures and flakes) with those of a job that was terminated by the timeout, we see:
After a while, you can see a similar one, like this example:
The Remaining value decreases faster than the elapsed time (perhaps due to the 32 threads working on 500-test groups), and it gets very close to 0, as in the following two samples:
- Pass: 16939, ExpectedFail: 1, Warn: 1, Skip: 82821, Flake: 4, Duration: 5:50, Remaining: 14
- Pass: 17528, ExpectedFail: 1, Warn: 1, Skip: 85732, Flake: 4, Duration: 5:52, Remaining: 1
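Status lines of this shape are regular enough to be extracted from a job log automatically, which makes it easier to compare a completed job with a timed-out one. The following is a minimal sketch that assumes the lines look exactly like the two samples above; the function names are made up for illustration.

```python
import re

# Matches lines such as:
# Pass: 16939, ExpectedFail: 1, ..., Duration: 5:50, Remaining: 14
STATUS_RE = re.compile(
    r"Pass: (?P<passed>\d+),.*?"
    r"Duration: (?P<duration>[\d:]+), Remaining: (?P<remaining>[\d:]+)"
)


def status_lines(log_text):
    """Yield (passed, duration, remaining) for each status line in the log."""
    for match in STATUS_RE.finditer(log_text):
        yield int(match["passed"]), match["duration"], match["remaining"]


def last_status(log_text):
    """Return the last status tuple found in the log, or None if there is none."""
    last = None
    for last in status_lines(log_text):
        pass
    return last
```

Calling `last_status` on the raw log of a job gives the point at which the progress reports stopped, so two jobs can be compared by their last reported Duration and Remaining values.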
And if there is a line like:
Then the job finishes by reporting the results, which say whether the tests passed or failed.
But in jobs that will be terminated due to the timeout, there is a line like this:
That is the last line matching this pattern. After that, and this may be a false trail, the batches of log output often come with:
During this hunt for the root cause, a little tool was developed to do a dichotomous search (bisection) between two uprev candidates. For now it focuses on the current issue, but it may help others (including myself) with future problems.
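The idea behind such a dichotomous search can be sketched as follows. This is not the tool's actual code: `is_bad` is a hypothetical callable that would trigger the pipeline for one candidate and report whether the problem shows up, and the candidates are assumed to be ordered from oldest to newest, with the first known good and the last known bad.

```python
def first_bad(candidates, is_bad):
    """Bisect an ordered list of uprev candidates.

    Assumes candidates[0] is known good and candidates[-1] is known bad.
    is_bad is a hypothetical callable that checks a single candidate.
    Returns the earliest candidate for which the problem appears.
    """
    good, bad = 0, len(candidates) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(candidates[mid]):
            bad = mid   # the problem was introduced at mid or earlier
        else:
            good = mid  # the problem was introduced after mid
    return candidates[bad]
```

Each step halves the range, so the number of pipelines to run grows with the logarithm of the number of candidates. Since the timeout here seems to depend on which runner picks up the job, a single run per candidate may not be conclusive, and `is_bad` would probably need to repeat the job before deciding.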