Continuous integration comes out of Extreme Programming, which espouses the philosophy that 100% of the code must be tested in automation, and 100% of those tests must pass at all times. Rather than get Maoist about it, most organizations have many tests that fail. They generally fall into one of these categories:
tests that run on crappy hardware
tests that are not under the control of the engineering team
tests that are decent, but do not measure up to the Extreme Programming purity test.
A typical method for handling the resulting confusion is to dedicate employees to knowing everything about every test failure on every platform and product, constantly scanning test results to decide whether there is a regression. These people generally spend most of their time looking for a better job, unless they already have a job that they are plenty unqualified for.
Another method is to limit test analysis to changes in test status. Most CI and test-automation systems have reporting features that filter out failures which were already present in a previous run. This limits the pain of wading through piles of failures, but it also means the only real regressions you will find are the ones you caused yourself. It is very easy to miss bugs with this approach, and a project should expect a long bug-fixing effort before each release. Mesa developers at Intel would dutifully A/B test all of their patches against master, and still hit weeks of painful debugging when A/B testing a new release candidate against the previous quarterly release.
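To make the trade-off concrete, here is a minimal sketch of status-change filtering. The result format and names are made up for illustration; no particular CI tool's API is implied.

```python
# Minimal sketch of "report only status changes", assuming test results are
# plain dicts mapping test name -> "pass"/"fail". All names are hypothetical.

def new_failures(previous, current):
    """Tests failing now that did not fail in the baseline run."""
    return {name for name, status in current.items()
            if status == "fail" and previous.get(name) != "fail"}

previous_run = {"dEQP-GLES2.a": "pass", "dEQP-GLES2.b": "fail"}
current_run  = {"dEQP-GLES2.a": "fail", "dEQP-GLES2.b": "fail"}

print(new_failures(previous_run, current_run))  # {'dEQP-GLES2.a'}
# dEQP-GLES2.b is silently ignored -- exactly how regressions that predate
# the baseline slip through.
```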
Yet a third approach: generate skip lists for tests that are known failures. This technique is more reasonable than the first two; Google does it with dEQP and on the Chrome project. It can work if the test suite is stable, but it has the drawback that test coverage is lost for failing tests, even after the associated bug is fixed. Teams must periodically (e.g. after a release) regenerate their must-pass lists, or skip lists, or whatever convention they use.
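For comparison, a skip list filters tests out before the run even starts, which is why coverage stays lost after the bug is fixed. A tiny sketch, assuming a one-test-per-line file format with '#' comments (real formats vary):

```python
def load_skip_list(path):
    """Assumed format: one test name per line, '#' starts a comment."""
    with open(path) as f:
        return {line.strip() for line in f
                if line.strip() and not line.startswith("#")}

def runnable_tests(all_tests, skip_list):
    # Skipped tests never execute, so nobody notices when they start passing.
    return [t for t in all_tests if t not in skip_list]
```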
Mesa's CI attempts to improve on this approach by maintaining lists of failing tests and continuing to run them -- but NOT reporting them as failures. Post-processing of the test results converts "expected failures" into a skip status. New failing tests are added to piglit, dEQP, and vulkancts all the time, and those tests are written precisely because developers want to fix a bug -- and don't want it to break again. These test lists are generally stored in each CI project, e.g. deqp-test/skl.conf. The goal is to run all applicable tests and report a consistent 100% pass rate unless there is a new regression.
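A rough sketch of what that post-processing looks like, assuming raw results are a dict of test name to status and the expected-failure list has already been read from something like deqp-test/skl.conf (the exact file format is not shown here):

```python
def postprocess(results, expected_failures):
    """Reclassify raw results so known failures don't count against the run."""
    report = {}
    for name, status in results.items():
        if name in expected_failures:
            # The test still ran; a known failure is demoted to "skip" so the
            # run reports 100% pass, and an unexpected pass is flagged so the
            # conf file can be updated (the test got fixed).
            report[name] = "skip" if status == "fail" else "unexpected-pass"
        else:
            report[name] = status
    return report
```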
When a test changes status (pass-to-fail or vice-versa), someone must look at the change to determine:
is it a bug? If so, write a bug on fdo to track the issue. Then update the CI to expect this failure.
is it a fixed test? Great! update the CI to expect the test to pass.
is it a test change? By definition, this is not a regression. Update the CI to expect the failure. Consider contacting the developer if it looks fishy, but don't write a bug.
is the test flaky, or does it generate failures in other tests (e.g. a gpu hang)? Uh-oh. This test cannot be run in CI without generating noise. Disable the test. If a Mesa commit is generating gpu hangs, write a bug and contact the developer to let them know they should prepare to encounter lots of frustration from their colleagues.
All test configuration files are updated by scripts/update_conf.py. It parses all test results in the target build and re-executes the failing projects on every platform specified in the build spec's bisect_hardware. Each project accepts a --retest_path parameter and parses the results at that path so that it executes only the tests that failed. Limiting the test set to failures drastically reduces the time needed to update the CI config. The CI configuration files are modified according to the results, and a patch is mailed to the maintainer.
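In spirit, the flow looks something like the sketch below. The file formats, paths, and the run_failed_tests() stub are invented for illustration; the real script also handles the per-project --retest_path invocation, the bisect_hardware platforms, and mailing the patch.

```python
def read_results(path):
    """Assumed result format: one 'test-name status' pair per line."""
    results = {}
    with open(path) as f:
        for line in f:
            name, status = line.split()
            results[name] = status
    return results

def run_failed_tests(test_names):
    """Stand-in for re-executing only the failed tests on each platform."""
    return {name: "fail" for name in test_names}   # pretend they still fail

def update_conf(results_path, conf_path):
    failed = [t for t, s in read_results(results_path).items() if s == "fail"]
    retested = run_failed_tests(failed)             # small set, so this is fast
    with open(conf_path, "w") as conf:
        for name, status in sorted(retested.items()):
            if status == "fail":
                conf.write(name + "\n")             # record as expected failure
```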
Mesa test results change several times a day, because dozens of engineers are hard at work writing tests and fixing them. Unfortunately, this means that a developer testing a branch with a slightly older branch point will see failures for tests that were fixed after the branch was created. This noise is confusing and unacceptable for developers. To accommodate this use case, update_conf.py records, in the config files, the commit responsible for each test's current status. CI post-processing then filters out any test status attributed to a commit newer than the branch being tested.
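A sketch of that filter, assuming each conf entry carries the commit that update_conf.py attributed to it. The entry format is a guess; the git invocation is the standard ancestry check.

```python
import subprocess

def branch_contains(commit, branch_head, repo="."):
    """True if `commit` is already part of the branch being tested."""
    return subprocess.run(
        ["git", "-C", repo, "merge-base", "--is-ancestor", commit, branch_head],
        capture_output=True).returncode == 0

def applicable_entries(entries, branch_head, repo="."):
    """entries: iterable of (test_name, attributed_commit) pairs.

    Entries attributed to commits the branch does not contain are dropped,
    so an older branch point is not blamed for fixes that landed after it.
    """
    return [(test, commit) for test, commit in entries
            if branch_contains(commit, branch_head, repo)]
```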
This technique works well enough that even old branches, such as quarterly releases, can still be reliably tested for dot releases and show a 100% pass rate against master's test configurations.
Sometimes a regression occurs on a slow test target that runs only daily. In that case, the bad commit is not obvious. scripts/bisect_project.py iterates over the git history on all platforms and updates the config files in the same way as update_conf.py.
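Stripped to its essentials, the idea is roughly the sketch below; the is_good() callback stands in for building Mesa and running the suite at each commit, and a linear walk over the day's commits is shown where the real script can afford to be smarter.

```python
import subprocess

def commits_between(good_rev, bad_rev, repo="."):
    """Commits after good_rev up to and including bad_rev, oldest first."""
    out = subprocess.run(
        ["git", "-C", repo, "rev-list", "--reverse", f"{good_rev}..{bad_rev}"],
        capture_output=True, text=True, check=True).stdout
    return out.split()

def find_bad_commit(good_rev, bad_rev, is_good, repo="."):
    """is_good(commit) -> bool: build and test Mesa at that commit."""
    for commit in commits_between(good_rev, bad_rev, repo):
        if not is_good(commit):
            return commit        # first commit where the test regressed
    return None
```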
After a few years of running these automated processes, I can make the following observations:
This workflow reliably catches all bugs that can be identified by automated tests. Mesa is now so stable that Google is comfortable picking a random commit off of our tip and shipping it to millions of customers. They don't even tell us about it sometimes. Before this workflow, Google would not even ship our stable releases.
Team productivity is significantly enhanced by this system. Since CI was deployed, Intel's Mesa team has lost several senior engineers, but still managed to go from last place to its current position (Vulkan 1.1, GLES 3.2, GL 4.5, dEQP conformance).
The workflow is not free. Tracking Mesa for Intel currently takes more than 50% of one engineer's time, and tracking gets more expensive the faster the project moves. The most expensive tasks are deploying new test suites and deploying new hardware. Flaky tests and gpu hangs are the most annoying ongoing issues.
Demand for CI will cause it to expand until it consumes the resources allocated to it. The workflow was (nearly) free when Mesa CI was just running piglit on stable platforms. Because it was effective, developers and management requested expansion to many new platforms, suites, and support for external developers. It now has nearly 100 test systems and runs tens of millions of tests per day.
In spite of the CI's effectiveness, it is difficult to find competent engineers willing to devote their time to this function. In other organizations I've found that the team lead is the engineer who spends the most time on CI, because of their responsibility for the product.