Equinix Sunset, future of gitlab.fd.o
Equinix Metal is being retired, and we have to move out of their datacenters.
Foreword:
First, I'd like to thank Equinix Metal for the years of support they gave us. They were very kind and generous with us, and even if it's a shame we have to move out on short notice, all things come to an end.
Also, please keep the discussion here technical, and not fueled by any hate. Equinix doesn't owe us anything, and the fact that they sponsored us for that long shows how dedicated they are to open source.
Current deployment:
To be able to have a technical discussion, let's talk about what we are currently running. Also, please bear in mind that I'm not a professional sysadmin, and some of the choices I made might not be the best. For some of the rationale behind them, see #655 (closed)
So, in terms of applications, we are running for the gitlab cluster:
- a beefy db: our current disk is 433 GB and we are using 401 GB of it. Most of it is CI logs (which is a pity TBH).
- a smaller db for the registry: we are using only 1.1 GB
- 4 gitaly pods (229 GB + 27 GB + 470 GB + 197 GB) to host our git repos
- 8 webservice + a couple of pages pods (those are cheap IMO and bound to the performance of gitaly and the db)
- a few sidekiq workers (cheap also)
- some room for other workloads (marge-bot, indico, etc...) -> not really critical if they lag a bit
- a lot of S3 storage:
- main s3 storage (job logs, artifacts, uploads, pages, packages, lfs) -> currently 72.7 TB with EC:2+1 replication, so roughly 40 TB of data (guesstimate, see the conversion sketch below)
- registry s3 -> currently 23.8 TB (EC:2+1), roughly 14 TB of data
- backups s3 -> 6.386 TB of data
- opa s3 (private data protected by JWT) -> 2.278 TB
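As a quick back-of-the-envelope sketch (assuming the usual Ceph erasure-coding layout of k=2 data chunks + m=1 coding chunk, i.e. a 1.5x raw-to-usable overhead; the pool names and figures are just the ones above), the raw numbers convert roughly like this. The straight division gives an upper bound, hence the slightly more conservative guesstimates in the list:

```python
# Rough conversion from raw Ceph pool usage to usable data for an EC k=2, m=1 pool.
# Assumption: 2 data chunks + 1 coding chunk, i.e. a raw/usable overhead of (k+m)/k = 1.5.

def ec_usable(raw_tb: float, k: int = 2, m: int = 1) -> float:
    """Approximate usable data (TB) stored in `raw_tb` of raw EC pool usage."""
    return raw_tb * k / (k + m)

pools = {
    "main s3 (logs, artifacts, uploads, pages, packages, lfs)": 72.7,
    "registry s3": 23.8,
}

for name, raw in pools.items():
    print(f"{name}: {raw} TB raw -> at most ~{ec_usable(raw):.1f} TB of data")
# main s3 (...): 72.7 TB raw -> at most ~48.5 TB of data
# registry s3: 23.8 TB raw -> at most ~15.9 TB of data
```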
Anyway, we are running this on (see the specs):
3 * c3.medium.x86 for control plane:
1 x AMD EPYC 7402P 24 cores @ 2.80GHz, 64 GB RAM, 2 x 240 GB SSD + 2 x 480 GB SSD, 2 x 10 Gbps
3 * s3.xlarge.x86
2 x Intel Xeon Silver 4214 28 cores @ 2.20GHz, 192 GB RAM, 2 x 960 GB SSD, 2 x 240 GB NVMe (cache), 12 x 8 TB HDD, 2 x 10 Gbps
In terms of egress, over the past month we can see:
- gitlab.fd.o + registry.fd.o + pages webservice -> 50.08 TB
- s3.fd.o + indico -> 4.732 TB
Yes, I don't have a finer granularity, but we have a lot of logs available for anyone who wants to parse them.
Note that s3.fd.o is only the public-facing S3 service we export; the gitlab artifacts, uploads, and everything else gitlab stores in S3 are actually proxied through gitlab itself.
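For anyone who wants to extract finer granularity from those logs, a minimal sketch along these lines would do it. It assumes the ingress writes nginx combined-format access logs with the vhost as the first field; the regex will need adjusting to whatever log_format we actually use:

```python
import re
import sys
from collections import defaultdict

# Sum the bytes sent per vhost from nginx access logs read on stdin.
# Assumed line shape (combined format with the vhost prepended):
#   gitlab.freedesktop.org 1.2.3.4 - - [date] "GET /... HTTP/1.1" 200 12345 "ref" "UA"
LINE_RE = re.compile(r'^(?P<vhost>\S+) \S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (?P<bytes>\d+|-)')

egress = defaultdict(int)
for line in sys.stdin:
    m = LINE_RE.match(line)
    if m and m.group("bytes") != "-":
        egress[m.group("vhost")] += int(m.group("bytes"))

for vhost, total in sorted(egress.items(), key=lambda kv: -kv[1]):
    print(f"{vhost}: {total / 1e12:.2f} TB sent")
```

Something like `zcat access.log*.gz | python3 egress_by_vhost.py` (file names are hypothetical) would then give a per-service breakdown.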
gitlab runners
fd.o gives 5 shared runners for all the projects:
3 * m3.large.x86:
1 x AMD EPYC 7502P 32 cores @ 2.50GHz, 256 GB RAM, 2 x 240 GB SSD, 2 x 3.8 TB NVMe, 2 x 25 Gbps
2 * c3.large.arm64
1 x Ampere Altra 80 cores @ 3.00GHz, 256 GB RAM, 2 x 960 GB NVMe, 2 x 25 Gbps
Those are heavily used, but it's important to keep the bulk of them in the same datacenter as the gitlab cluster so they do not generate too much egress traffic.
Admin wishes
a managed db, with replication
I'm bad at administering dbs, even though I already know way more about them than I ever wished to.
The ideal solution should provide a beefy db, managed externally.
However, there's a CI split coming in the next few releases, and it requires temporarily doubling the disk usage of the db, so we need a solution that gives us ~900 GB of disk (roughly twice our current ~430 GB), at least while we do that split.
I believe that if we get a much more efficient db than the one we run today (i.e. not backed by ceph, but with direct SSD access plus db replication), we should be able to make gitlab a lot more responsive.
external S3 object storage:
Currently we are running Ceph on kubernetes. It's an interesting piece of technology but it has several drawbacks (personal opinion following):
- we need to host the disks ourselves (that was a constraint on Equinix)
- ceph can be tricky to deploy (and breaking it breaks the whole cluster)
- if disks are full or near full, everything shuts down to preserve the integrity of the cluster
- it's a pain for the admin
Having externally managed storage would make the admins happier, and probably make the system more robust when disks approach capacity.
It might also make gitlab faster because we won't have to proxy all the data: clients would pull the s3 objects directly from s3, bypassing gitlab.
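As an illustration of that direct-access model (a minimal sketch, not our actual configuration; the endpoint, bucket and key below are made up), the application hands the client a short-lived presigned URL and the bytes flow straight from the object store. gitlab exposes this choice through the `proxy_download` option of its object storage settings:

```python
import boto3

# Instead of streaming the object through the gitlab pods, hand out a short-lived
# presigned URL and let the client fetch the bytes directly from S3.
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example.org",  # hypothetical S3-compatible endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "gitlab-artifacts", "Key": "some/job/artifacts.zip"},  # made-up names
    ExpiresIn=600,  # valid for 10 minutes
)
print(url)  # the client downloads from this URL, bypassing gitlab entirely
```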
privately accessible kubernetes behind wireguard
Probably one of the best decisions I made: not making the kubernetes endpoint publicly available. I'm not sure we can keep that with a publicly hosted kubernetes offering.
enough room to separate the control plane and the worker nodes
Currently we squeeze everything onto 6 machines to keep high availability. With a different provider, we could have 3 slightly cheaper control plane machines that only do control plane work, plus good workers that don't have to care about disk management.
Cloudflare or Fastly to keep the AI-scrapers away
AI bots are a pain. We had a quick look at the logs today (not exactly knowing what we were doing, so if anybody wants to help, the door is wide open), and we saw quite a bunch of AI bots literally scraping our entire gitlab instance.
The problem is that they are pulling every available sha on every git repo, and that accounts for a fair amount of data (we saw them in the nginx logs, but did not have time to compute how much data was actually pulled, or to quantify the full range of bots).
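If anybody wants to help quantify that, a sketch like the one below (same caveat as the egress script above: it assumes the standard combined log format and the regex will need adjusting) aggregates requests and bytes per user-agent, which should make the big scrapers stand out:

```python
import re
import sys
from collections import defaultdict

# Aggregate nginx access-log traffic by user-agent, reading log lines from stdin.
# Assumes the combined format: ... "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(r'" (\d{3}) (\d+|-) "[^"]*" "(?P<ua>[^"]*)"\s*$')

stats = defaultdict(lambda: [0, 0])  # user-agent -> [requests, bytes sent]
for line in sys.stdin:
    m = LINE_RE.search(line)
    if not m:
        continue
    entry = stats[m.group("ua")]
    entry[0] += 1
    if m.group(2) != "-":
        entry[1] += int(m.group(2))

# Print the 30 heaviest user-agents by bytes sent.
for ua, (reqs, sent) in sorted(stats.items(), key=lambda kv: -kv[1][1])[:30]:
    print(f"{reqs:>10} reqs  {sent / 1e9:>8.1f} GB  {ua[:80]}")
```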
One solution could be to let others do the job for us: Cloudflare has an AI bot filtering system which would prevent them from reaching us, without requiring too much admin time on our side.
My personal current opinion
A few years ago, GCP had an open source program and gave us credits. Those credits suddenly disappeared, and we had to pay for the (expensive) service while working around the clock to migrate out of it.
Equinix gave us 5 years of free datacenter, but now we have to move out of their DC within 3 months, or we will have to pay for what we use (and it's expensive).
I personally think we'd be better off having fd.o pay for its own servers and having sponsors chip in. This way, when a sponsor goes away, it's technically much simpler to replace the money than to change datacenter.
Runners provided by the community, sponsored by companies?
We can also have companies/communities provide runners, but we should monitor how much egress this generates. If it's too much, we would have to be strict, host the runners ourselves in the datacenter, and have the companies pay for that instead.