RFC: Image storage of Mesa CI trace reference images
As part of Mesa CI we currently run jobs that replay traces and check their output frames using image checksums. When a trace image comparison fails, we make the produced image available in the artifacts of the respective GitLab job. In successful runs we currently don't store the images in the artifacts, to reduce storage requirements. However, to make the life of developers easier, we would also like to provide an easy way to access the reference/expected image and a diff. We want to provide this functionality in the tracie dashboard.
Initially, the per-device reference images were stored alongside the traces in the traces-db repository. This approach was abandoned during review of the traces CI proposal, as it required a two-step procedure whenever a Mesa change required an image update (the actual change in Mesa, and the update of the traces-db repo). Instead we adopted an approach based on image checksums.
Here are some ways forward:
1.0 Since successful runs produce the reference images, we could store them and make them available to tools or users in artifacts (like we do for the failed images). Note that this was initially implemented, but dropped during review due to storage considerations, although we left the ability to store images on demand if the TRACIE_STORE_IMAGES=1 env/CI variable is set when triggering a pipeline. We only care about the "official" reference images, so we can limit image storage to pipelines/jobs we run on Mesa master. Given that we retain artifacts for 4 weeks, a rough calculation gives: 350 Mesa master pushes/month * 20 traces * 400 KB per reference image = 2.8 GB, i.e. roughly 3 GB of extra storage required at the steady state (20 traces is an arbitrary number, but not unreasonable looking forward). The egress of these images is expected to be quite low. Is such a storage cost acceptable?
1.1 A potentially significant storage optimization over (1.0) is to stop storing reference images in all Mesa master pipelines and instead run a pipeline on Mesa master periodically (e.g., once a day) specifically to store such images in artifacts. Tools or users can search for the latest such pipeline and get a set of the reference images. Of course, there will be some delay between a change to the reference images and the next scheduled job that stores them, but depending on the use case the delay may be acceptable (and we can trigger one on demand if required).
1.2 We could say that if a tool or user requires reference images, they should trigger an appropriate pipeline on demand and wait for the results. Not very friendly to tools/dashboards, but still an option.
2.0 Use some external storage to automatically store the images. During all trace replay jobs, produced images are uploaded to the external storage keyed by their checksum (only if they are not already present there, of course). Tools or users can access the images keyed by checksum directly from the external storage. The storage requirements are significantly (2 orders of magnitude) lower compared to (1.0), and the images are always readily available (in contrast with 1.1 and 1.2). The downside is that, since it's desirable to protect write access to the image storage, we need to deploy some kind of key/token to the runners that run trace replay jobs. I have a prototype that uses a Google Cloud bucket for such an image storage.
3.0 Use some external storage to manually store the reference images keyed by their checksum. We could introduce checks in tracie to ensure reference images are available in that storage, to ensure people don't forget to add them when they update checksums in traces.yml. The downside is that this is cumbersome and may turn people off the traces testing effort.
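For reference, the rough storage estimate in (1.0) can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the function and constant names are made up for illustration, decimal units are assumed (1 GB = 10^6 KB), the 4-week retention is approximated as one month, and the figure of 30 pipelines per month for the once-a-day variant in (1.1) is my own assumption rather than a measured number.

```python
# Figures quoted in option (1.0); names are illustrative only.
PUSHES_PER_MONTH = 350   # Mesa master pushes per month
TRACES = 20              # arbitrary, forward-looking number of traces
IMAGE_SIZE_KB = 400      # approximate size of one reference image
RETENTION_MONTHS = 1     # artifacts kept for 4 weeks, approximated as one month


def steady_state_kb(pipelines_per_month, traces=TRACES,
                    image_kb=IMAGE_SIZE_KB, retention_months=RETENTION_MONTHS):
    """Storage held at steady state, in KB, when every pipeline stores all images."""
    return pipelines_per_month * retention_months * traces * image_kb


# Option (1.0): store images for every push to Mesa master.
every_push_kb = steady_state_kb(PUSHES_PER_MONTH)

# Option (1.1): store images once a day (assumed 30 pipelines per month).
daily_kb = steady_state_kb(30)

print(f"every push: {every_push_kb / 1e6:.2f} GB")
print(f"once a day: {daily_kb / 1e6:.2f} GB")
```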
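To make the checksum-keyed storage in (2.0) and the availability check in (3.0) more concrete, here is a minimal sketch. It is not the prototype mentioned above: a local directory stands in for the external storage, SHA-256 over the raw file bytes stands in for whatever checksum tracie actually computes, and store_image, missing_references and the ".png" naming are hypothetical helpers introduced only for this example.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def image_checksum(path):
    """Hex digest of the file's bytes (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def store_image(image_path, store_dir):
    """Copy the image into store_dir under its checksum, unless already present.

    Returns (key, uploaded), where uploaded is False if the image was skipped.
    """
    store_dir = Path(store_dir)
    store_dir.mkdir(parents=True, exist_ok=True)
    key = image_checksum(image_path)
    dest = store_dir / (key + ".png")
    if dest.exists():
        # Same checksum means same content, so there is nothing to upload.
        return key, False
    shutil.copyfile(image_path, dest)
    return key, True


def missing_references(expected_checksums, store_dir):
    """Option (3.0) style check: expected checksums that have no stored image."""
    store_dir = Path(store_dir)
    return [c for c in expected_checksums
            if not (store_dir / (c + ".png")).exists()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        image = Path(tmp) / "frame.png"
        image.write_bytes(b"placeholder bytes, not a real PNG")
        store = Path(tmp) / "store"
        print(store_image(image, store)[1])  # first call copies the file
        print(store_image(image, store)[1])  # second call finds it and skips
```

Because identical images map to the same key, repeated runs that produce unchanged output add nothing to the store, which is where the storage savings over (1.0) would come from. With a real bucket the "exists" test and the copy would become remote calls, and the write would need the key/token discussed in (2.0).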
Thoughts on these or other ideas?