ci: Move traces to nightly

Now that we have nightly runs set up, that's really where traces should live. They do provide some value and we should probably run them and keep an eye on them in case something breaks. However, they produce a constant stream of false positives (failures that don't correspond to a real bug) which cause compiler developers no end of headaches. As such, they shouldn't be part of the Marge flow.

What's wrong? I thought they were stable?!?

There are different kinds of stability. Most of the traces we run today are stable in the sense that the exact same trace run with the exact same parameters on the exact same hardware with the exact same driver build will produce the same results when run back-to-back-to-back. In that sense, they are typically quite stable.

However, traces are very sensitive to compiler changes. Even the slightest compiler change which causes things to CSE differently or causes a mul+add to fuse differently may result in a 1-bit change in three pixels of some trace on some hardware. If so much as one bit is different, the hash is different, and the trace job fails. This may sound pretty unlikely but when you multiply by the number of trace jobs in Mesa CI today and then multiply again by the number of hardware platforms, a little becomes a lot. Anyone who works on NIR is constantly fighting with this. Even a totally valid NIR change that passed CI on all the hardware 2 hours ago can fail to merge if Marge rebases it onto a different NIR change that it interacts with, because the interaction causes a trace change.
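To make the "one bit is different, the hash is different" point concrete, here's a minimal sketch. It is not the actual Mesa CI checksum code: SHA-256, the 64x64 RGBA frame, and the helper names are all stand-ins I made up for illustration. The only point is that any exact hash over the frame contents behaves this way.

```python
import hashlib

# Hypothetical tiny RGBA frame; real trace frames are far larger.
WIDTH, HEIGHT, CHANNELS = 64, 64, 4


def frame_hash(pixels: bytes) -> str:
    """Exact hash of the raw frame bytes (SHA-256 is an assumed stand-in)."""
    return hashlib.sha256(pixels).hexdigest()


def flip_low_bit(pixels: bytes, pixel_indices) -> bytes:
    """Flip the lowest bit of the first channel of each listed pixel."""
    out = bytearray(pixels)
    for idx in pixel_indices:
        out[idx * CHANNELS] ^= 0x01
    return bytes(out)


# A deterministic synthetic "reference" frame.
reference = bytes(
    (x * 7 + y * 13) % 256
    for y in range(HEIGHT)
    for x in range(WIDTH)
    for _ in range(CHANNELS)
)

# Simulate a compiler change that perturbs one bit in three pixels.
changed = flip_low_bit(reference, [10, 500, 4000])

differing_bytes = sum(a != b for a, b in zip(reference, changed))
hashes_match = frame_hash(reference) == frame_hash(changed)

print(f"{differing_bytes} of {len(reference)} bytes differ; hashes match: {hashes_match}")
```

Three bytes out of 16384 differ, each by a single bit, and the two hashes have nothing in common. A hash comparison can't tell "off by one LSB in three pixels" from "the whole frame is garbage".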

Unfortunately, these kinds of fails don't show up in our usual CI metrics. From the perspective of the CI infrastructure, any time a mesa change causes a 1-bit change in a trace, that's a valid test fail that caught a bug. The bug, however, was that the hash needed to be updated and that's not a real bug.

Compiler testing is hard. There are a lot of complex interactions. Those complex interactions get worse when you consider how sloppy GLSL is about floating-point precision and how we have to take advantage of that slop to get good run-time performance. There are a LOT of valid ways to compile a program and sometimes the results are a tiny bit different. That's just the way GLSL is. Our testing infrastructure needs to recognize that and we need to make sure that the tests we run are robust to tiny differences in output. Traces, as we run them today, aren't.
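For a rough idea of what "robust to tiny differences" could mean, here's a sketch of a tolerance-based comparison. This is not something Mesa CI does today as far as this MR describes, and the function name and both thresholds are hypothetical. Picking thresholds that catch real bugs without bringing the noise back is exactly the hard part.

```python
def frames_match(
    reference: bytes,
    candidate: bytes,
    max_delta: int = 1,           # assumed: largest per-channel difference tolerated
    max_fraction: float = 0.001,  # assumed: largest share of channel values allowed to differ
) -> bool:
    """Compare two raw frames, tolerating a few tiny per-channel differences."""
    if len(reference) != len(candidate):
        return False
    if not reference:
        return True

    differing = 0
    for a, b in zip(reference, candidate):
        delta = abs(a - b)
        if delta > max_delta:
            # A large difference in any channel is treated as a real change.
            return False
        if delta:
            differing += 1

    # Small differences are fine as long as there are only a few of them.
    return differing <= max_fraction * len(reference)
```

With these defaults, a 64x64 RGBA frame where three channel values are off by one still matches, while a single channel that is off by 40 does not.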

If we had a piglit or dEQP test which was this sensitive to minor changes, we would declare it a bad test and shut it off or label it a flake so it runs but doesn't get counted. Traces aren't really testing for correctness right now; they're testing for "does it do EXACTLY the same thing as last time", which isn't a fair test.

But can't you just update the hashes?

Yes, we can and that's what we've been doing. Yes, there are scripts to make this easier. However, all that misses the point. As I mentioned above, you can have a totally correct change which already passed CI and it can fail to merge because Marge rebased it on top of something else that it interacted with and now the trace is slightly different. This means that NIR MRs pretty regularly fail to merge for what are effectively random reasons outside of the MR author's control. That's a problem.

But traces provide additional code coverage!

Yes, they do. From an API PoV, that coverage has some value. I personally think they're quite over-hyped (I've only seen a handful of cases where automated trace testing has caught a bug that piglit and dEQP couldn't) but they do provide additional coverage. The problem is that they provide that coverage at an unreasonable cost to developers. From a compiler point of view, they mostly just provide CI noise. This is why I'm suggesting we move them to nightly, not eliminate them. They'll still be running and we can hand-verify them every so often and check in the new hashes. If they do actually find a bug, we can file the bug and fix it.

So our trace infrastructure is bad?

No! None of this means that the people who've been working on the CI infrastructure for traces have done a bad job. The runs are stable (with a fully fixed HW/SW configuration), the comparison interface is great, updating isn't that hard. Hell, it even links you directly to the artifacts directory to view the change which is amazing! It's very good for what it is. Unfortunately, even though they've done a fantastic job, the fundamental problems with traces remain.

Shouldn't we improve trace testing instead of disabling it?

Go for it! I'm happy to see someone figure out how to better do trace-based testing that doesn't have the noise that we have with the current hash-based approach. Until then, we should move them to nightly.

Edited Nov 02, 2023 by Faith Ekstrand
