# Migration out of Equinix Metal (tentative date: March 16, 2025)

As you know, we need to migrate the cluster to a new server. This is a tracking issue of what is happening and what needs to happen.
## Pre-migration

- deploy a new k8s cluster on Hetzner
- prepare the connection from Fastly (CDN) to Hetzner
- migrate the S3 data from Equinix to Hetzner S3 + Fastly S3
  - pages (pre-import)
  - lfs
  - packages
  - uploads
  - artifacts (pre-import, limited up to 2024-03-15 -> ~300GB)
  - s3.fd.o - static
  - s3.fd.o - dynamic
  - registry
- deploy a test db on Hetzner
  - deploy a PostgreSQL HA cluster with autobase (postgresql + pgbouncer + patroni + etcd + keepalived)
  - restore a gitlab backup
  - benchmarks
  - do a major upgrade test (15 -> 16) (we couldn't pause pgbouncer; once that was ignored, i.e. all db accesses stopped, it worked like a charm)
  - test backup/restore with what's provided by autobase
  - test split db ci/main (1 TB of disk is not enough on the leader, we need one extra server with 2 TB)
  - failover from one db to a replica (see the sketch below)
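For the record, the failover test was along these lines. This is a sketch only: the cluster name is whatever autobase configured, and the pgbouncer admin user/credentials are assumptions.

```sh
# Inspect the cluster members, then trigger a controlled switchover
# (patronictl prompts interactively for the candidate replica).
patronictl -c /etc/patroni/patroni.yml list
patronictl -c /etc/patroni/patroni.yml switchover

# Optionally hold client traffic during the switchover via the pgbouncer
# admin console (assumes an admin user named "pgbouncer"; this is the part
# that refused to pause during the 15 -> 16 upgrade test).
psql -h 172.30.1.254 -p 6432 -U pgbouncer pgbouncer -c 'PAUSE;'
psql -h 172.30.1.254 -p 6432 -U pgbouncer pgbouncer -c 'RESUME;'
```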
- deploy a test gitlab instance on the new cluster
- deploy a test indico instance on the new cluster
- deploy a test S3-proxy with OPA for replacing s3.freedesktop.org
  - s3-proxy deployed and can talk to Hetzner S3
  - add an Open Policy Agent sidecar
  - rewrite the policy to account for the new OPA 1.0 grammar
  - fix our policy to also verify the JWT token (this was previously done in istio)
  - test uploads
  - test downloads from private and privileged users (see the download checks sketched below)
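The download checks can mirror the upload test further down in this issue. A sketch, where the object paths are purely illustrative (`$JWT_TOKEN` is a file containing the token, as in the upload test):

```sh
# Public object: should be readable without credentials.
curl -sS -o /dev/null -w '%{http_code}\n' \
    https://s3-proxy.freedesktop.org/git-cache/some/big-project/big-project.tar.gz

# Privileged path: the OPA policy should reject a missing JWT...
curl -sS -o /dev/null -w '%{http_code}\n' \
    https://s3-proxy.freedesktop.org/private-bucket/some/object

# ...and accept a valid one.
curl -sS -o /dev/null -w '%{http_code}\n' \
    --header "Authorization: Bearer $(cat $JWT_TOKEN)" \
    https://s3-proxy.freedesktop.org/private-bucket/some/object
```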
- attempt a first rsync of the git data dirs (instead of using backup/restore; this might save us a bit of time, thanks @bibibubap for the suggestion; a sketch of the command is below)
  - gitaly-0 (resync: ~ 6 min)
  - gitaly-1 (resync: ~ 45 min)
  - gitaly-2 (resync: ~ 41 min)
  - gitaly-3 (resync: ~ 116 min)
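The resyncs were plain rsync runs between the old and new gitaly PVCs, in the same spirit as the Indico commands further down; the IP, pod path and PVC names below are placeholders:

```sh
# Incremental resync of one gitaly storage; rerunning only transfers the
# delta, which is why the follow-up passes are minutes rather than hours.
EQUINIX_IP="1.2.3.4"
EQUINIX_POD=/var/lib/kubelet/pods/__RANDOM_NUMBERS__/volumes/kubernetes.io~csi
EQUINIX_GITALY_PVC=pvc-sth-sth
HETZNER_GITALY_PVC=pvc-sth-sth-else
rsync -e 'ssh -p 2222' -rvA --progress \
    root@$EQUINIX_IP:${EQUINIX_POD}/${EQUINIX_GITALY_PVC}/mount/ \
    ${HETZNER_GITALY_PVC}/mount
```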
- make the equinix cluster upload the backups to hetzner
- prepare the maintenance page
- deploy the logging system (loki)
  - loki + loki-stack deployed (grafana)
  - check retention policies
- registry????
- prepare new runners
  - bare metal (KVM, restricted): 3 AX162 -> $663 / month
  - fleeting:
    - https://gitlab.com/hetznercloud/fleeting-plugin-hetzner
    - 5 CCX33 (8 CPUs, dedicated, 1 job each)
    - 6 CAX41 (16 CPUs, Arm, 2 jobs each)
## Actual migration

### GitLab

- ✅ set up a site tracker website
- ✅ shutdown of gitlab @ Equinix
  - ✅ shut down db accesses
  - ✅ scale down webservice pods
  - ✅ scale down sidekiq pods
    - ✅ `kubectl -n gitlab scale deployment --replicas 0 --selector 'app in (webservice, sidekiq, kas, gitlab-exporter)'`
  - ✅ shut down gitlab pages, registry and shell: `kubectl -n gitlab scale deployment --replicas 0 --selector 'app in (registry, gitlab-pages, gitlab-shell)'`
  - ✅ shut down cron jobs (sketch below)
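One way to shut the cron jobs down without deleting them is to suspend them; this sketch assumes they all live in the `gitlab` namespace (resume by patching `suspend` back to `false`):

```sh
# Suspend every CronJob in the gitlab namespace instead of deleting them.
for cj in $(kubectl -n gitlab get cronjobs -o name); do
    kubectl -n gitlab patch "$cj" -p '{"spec":{"suspend":true}}'
done
```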
- ✅ back up everything (and push the backups to Hetzner S3) with the gitlab backup tool
- ✅ back up kubernetes secrets
- ✅ shut down db accesses on Hetzner
- ✅ reinstall db servers
- ✅ restore gitaly on Hetzner
  - ✅ rsync gitaly-0
    - ✅ `chown --recursive 1000:1000 @snippets/ +gitaly @hashed @pools`
  - ✅ rsync gitaly-1
    - ✅ `chown --recursive 1000:1000 @snippets/ +gitaly @hashed @pools`
  - ✅ rsync gitaly-2
    - ✅ `chown --recursive 1000:1000 @snippets/ +gitaly @hashed @pools`
  - ✅ rsync gitaly-3
    - ✅ `chown --recursive 1000:1000 @snippets/ +gitaly @hashed @pools`
- ✅ database work (done in two passes, to avoid storing the database twice in the backups)
  - ✅ restore the latest db dump
    - ❌ `time pg_dump -U postgres -h localhost -p 5433 -Fc gitlab_production | pg_restore -U postgres -h 172.30.1.254 -p 6432 -c -C -d gitlab_production` -> took way too long, aborted
    - ✅ `time gunzip --stdout db/database.sql.gz | PGPASSWORD=$(yq -r '.postgresql.authentication.superuser.password' /etc/patroni/patroni.yml) psql -U postgres -h 172.30.1.254 -p 6432 gitlab_production`
- ✅ deduplicate:
  - ❌ 🤦 ensure enough disk space (leader on extra node with raid 0)
  - ✅ ensure enough disk space (leader on extra node with raid 0)
  - ✅ ensure the gitlab deployment on Hetzner uses only one db
  - ✅ `time gitlab-rake gitlab:db:decomposition:migrate`
  - ✅ change the helm deployment to use multiple databases
  - ✅ `time gitlab-rake gitlab:db:lock_writes`
  - ✅ `time gitlab-rake gitlab:db:truncate_legacy_tables:main`
  - ✅ `time gitlab-rake gitlab:db:truncate_legacy_tables:ci`
- ✅ back up the split db with `pg_dump`
- ✅ reinit the db cluster with pgbackrest (sketch below)
- ✅ `pg_restore` the split databases
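The pgbackrest part is roughly the stock workflow; the stanza name below is a placeholder for whatever autobase configured:

```sh
# Re-create the stanza against the freshly split cluster, sanity-check it,
# then take a new full backup so replicas can be re-initialized from it.
pgbackrest --stanza=postgresql-cluster stanza-create
pgbackrest --stanza=postgresql-cluster check
pgbackrest --stanza=postgresql-cluster --type=full backup
```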
- ⏳ sync remaining S3
- ✅ re-enable sidekiq + webservice on Hetzner
- ⏳ connect Fastly (CDN) to Hetzner
- re-enable everything
- fix the DNS
- redeploy the backup jobs

### registry
- put the registry in maintenance mode (no deletes)
- shut down db access on equinix
- back up the db on equinix
- restore the db on hetzner
- make the data accessible?
  - pull the TBs of images into Hetzner -OR-
  - pull the TBs of images into Fastly -OR-
  - connect the 2 clusters with kilo while we wait on Fastly
### s3.freedesktop.org

- ✅ update the ingress of the s3-proxy deployment on Hetzner
- ✅ update DNS
- ✅ upload tests with `curl -v --header "Authorization: Bearer $(cat $JWT_TOKEN)" -X PUT --form file=@big-project.tar.gz https://s3-proxy.freedesktop.org/git-cache/some/big-project/`
- ✅ amend `ci-fairy s3cp` for the new upload command: `ci-fairy s3cp` is deprecated in favor of the simpler curl command above. See mesa/mesa!34120 (merged)
### bots

### ci-runners

- ✅ fleeting
- dedicated
  - request HW from Hetzner based on the tests above (`kvm`, privileged and others)
  - configure the new runners

### ci-stats
### Indico

- ✅ sync the customization data pvc
- ✅ sync the archive data pvc
- ✅ dump the old db on Equinix
- ✅ restore the new db
- ✅ fix the DNS entry
- ✅ ensure everything works as expected
The PVC syncs (the IP, pod path and PVC names are redacted placeholders):

```sh
# archive-indico-web-0
EQUINIX_IP="1.2.3.4"
EQUINIX_POD=/var/lib/kubelet/pods/__RANDOM_NUMBERS__/volumes/kubernetes.io~csi
EQUINIX_ARCHIVE_PVC=pvc-sth-sth
HETZNER_ARCHIVE_PVC=pvc-sth-sth-else
rsync -e 'ssh -p 2222' -rvA --progress \
    root@$EQUINIX_IP:${EQUINIX_POD}/${EQUINIX_ARCHIVE_PVC}/mount/ \
    ${HETZNER_ARCHIVE_PVC}/mount

# customization-indico-web-0
EQUINIX_CUSTOMIZATION_PVC=pvc-sth-sth
HETZNER_CUSTOMIZATION_PVC=pvc-sth-sth-else
rsync -e 'ssh -p 2222' -rvA --progress \
    root@$EQUINIX_IP:${EQUINIX_POD}/${EQUINIX_CUSTOMIZATION_PVC}/mount/ \
    ${HETZNER_CUSTOMIZATION_PVC}/mount
```

The db dump/restore:

```sh
### old server:
# drop all access to the db
kubectl -n indico scale sts --replicas 0 indico-web
kubectl -n indico exec -ti indico-db-postgresql-0 -- bash
# then, inside the pod:
PGPASSWORD=$POSTGRES_PASSWORD pg_dump -U postgres \
    -Fc indico > /bitnami/postgresql/indico_dump.sql
# back on the host:
kubectl -n indico cp indico-db-postgresql-0:/bitnami/postgresql/indico_dump.sql \
    /tmp/indico_dump.sql

### new server:
# drop all access to the db
kubectl -n indico scale sts --replicas 0 indico-web
kubectl -n indico cp /tmp/indico_dump.sql \
    indico-db-postgresql-0:/bitnami/postgresql/indico_dump.sql
kubectl -n indico exec -ti indico-db-postgresql-0 -- bash
# then, inside the pod:
PGPASSWORD=$POSTGRES_PASSWORD pg_restore -U postgres \
    -d postgres -c -C /bitnami/postgresql/indico_dump.sql
# back on the host:
kubectl -n indico scale sts --replicas 1 indico-web
```
## Details

### Hetzner cloud

#### deploy a new k8s cluster on Hetzner

Make use of hetzner-k3s. This takes care of most of the k3s setup, so that's one less thing to worry about. A sketch of the invocation is below the topology.

Topology (subject to change):

- control plane nodes: 3 CCX23
- workers:
  - 3 CCX63 for the databases (960GB of local disks) -> 3 AX41 with an extra 1TB NVMe disk (much cheaper)
  - 4 CCX43 for gitaly (+ 512GB of separate storage each)
  - some more for the rest of the workload?
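A minimal sketch of the hetzner-k3s flow, using the instance types from the topology above. The config schema here is an assumption to verify against the README of the hetzner-k3s version we pin; token, k3s version and locations are placeholders:

```sh
# Sketch only: field names follow the hetzner-k3s README and may differ
# between releases.
cat > cluster_config.yaml <<'EOF'
hetzner_token: "<token>"
cluster_name: fdo
kubeconfig_path: ./kubeconfig
k3s_version: v1.30.3+k3s1
masters_pool:
  instance_type: ccx23
  instance_count: 3
  location: fsn1
worker_node_pools:
  - name: gitaly
    instance_type: ccx43
    instance_count: 4
    location: fsn1
EOF
hetzner-k3s create --config cluster_config.yaml
```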
#### Notes

Question: some of those configs have more traffic than others, should we have better control plane nodes (and colocate the db on those)? Colocation is bad, and the load balancer will absorb the incoming traffic, so no (maybe we can even downscale and use 3 CCX13 instead).

Our current cluster uses a private LAN for kubernetes, with the control plane also on the private LAN. External access for admins is handled through Wireguard, thanks to kilo. We need to check if we can replicate that with hetzner-k3s, i.e. not exposing the :6443 k8s API to the world.

We can expose the API only on the internal private network through the config file, but when updating the cluster we need to allow the current IP of the person deploying it. So that means:

- add the IP in the config file
- deploy with hetzner-k3s
- remove the IP from the firewall afterwards (I personally have a fixed IPv4, but removing it might be a better rule for other admins)

We don't need to set up a dual-stack cluster. The load balancer is dual-stack and that is enough to provide an external IPv6 address to the cluster.

We can connect dedicated HW and cloud VMs with vswitch (or kilo).
### prepare the connection from Fastly (CDN) to Hetzner

- each service (gitlab, pages, registry, grafana, influxdb2, indico) will need a dedicated DNS entry: `gitlab.hetzner-lb.k8s.freedesktop.org`, `pages.hetzner-lb.k8s.freedesktop.org`, etc.
- each of those services needs a let's encrypt certificate so we can have TLS between fastly and the service
- no need to keep individual TLS certificates in the cluster if we expose the service through fastly (though `gitlab-direct` might be interesting)
- in fastly, we should separate the services by "domain"
  - each domain connects to the k8s cluster service by its DNS entry, with TLS
  - fastly will handle the per-domain TLS termination; it supports wildcards once we switch to a paid plan
- we need to set up the websocket `/-/cable` on fastly

Question: do we need static objects external storage if we also use fastly in front of gitlab itself? (I would say no.)
### migrate the S3 data from Equinix to Hetzner S3 + Fastly S3

- backups:
  - configure the new S3 object store to be Hetzner, and they will flow naturally over there
- registry:
  - manually rsync all of the S3 bucket to Fastly S3: `rclone sync --progress`
- artifacts, general S3:
  - manually rsync `fdo-gitlab-lfs`, `fdo-gitlab-packages`, `fdo-gitlab-pages`, `fdo-gitlab-uploads` to Hetzner S3: `rclone sync --progress`
  - manually rsync the ~ 1 year of `fdo-gitlab-artifacts` to Hetzner S3: `rclone sync --progress --max-age 1y`
- OPA:
  - manually rsync the buckets to Hetzner (or Fastly?), and ensure they are not world-writable like today (because we have an external OPA to enforce policies)

A sketch of the rclone invocation is below.
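This assumes two rclone remotes are already configured in rclone.conf; the remote names `equinix` and `hetzner` are illustrative:

```sh
# Copy one bucket from the old S3 to the new one; rerunning is incremental,
# so a final pass during the downtime window only moves the delta.
rclone sync --progress equinix:fdo-gitlab-lfs hetzner:fdo-gitlab-lfs

# Artifacts: only the last year, as decided above.
rclone sync --progress --max-age 1y \
    equinix:fdo-gitlab-artifacts hetzner:fdo-gitlab-artifacts
```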
### deploy a test gitlab instance on the new cluster

- adapt https://gitlab.freedesktop.org/freedesktop/helm-gitlab-deployment for the new cluster: `helmfile -e Hetzner -l chart=gitlab -i apply`
- restore `gitaly-0`: it's 691 GB raw, but we deduplicate the object storage, so maybe it will not go smoothly and hand-holding will be required (or live migration)
### deploy a test indico instance on the new cluster

- adapt https://gitlab.freedesktop.org/mupuf/indico-k8s for the new cluster
- copy the db to the new cluster (with dump/restore)
- copy the PVCs to the new cluster (`archive-indico-web-0`, `customization-indico-web-0`)
### deploy a test S3-proxy with OPA for replacing s3.freedesktop.org

TBD, but we need S3-proxy with an OPA sidecar so we can keep s3.freedesktop.org running.
### shutdown of gitlab @ Equinix

- disable external ingress
- put a banner in place of gitlab.freedesktop.org with updated information -> can be done easily with Fastly
### back up everything (and push the backups on Hetzner S3) with gitlab backup tool

- the backup tool should already be configured to push to Hetzner, so it's a matter of running one full backup once external users are no longer accessing the instance (see the sketch below)
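With the official GitLab Helm chart this is typically run from the toolbox pod; a sketch, assuming the chart's standard `backup-utility` and a deployment named `gitlab-toolbox` (the name depends on the release):

```sh
# Run a full backup from the toolbox pod; it uploads to the configured
# object store, which should already point at Hetzner S3.
kubectl -n gitlab exec -it deploy/gitlab-toolbox -- backup-utility
```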
### back up kubernetes secrets

- `kubectl get -A secrets -o yaml > secrets.yaml`
- a little bit of cleanup if required (sketch below)
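The cleanup can be as simple as stripping the runtime metadata so the dump applies cleanly on the new cluster; a sketch using yq (v4 syntax):

```sh
# Drop fields the new cluster will regenerate anyway.
yq -i 'del(.items[].metadata.resourceVersion)
     | del(.items[].metadata.uid)
     | del(.items[].metadata.creationTimestamp)
     | del(.items[].metadata.managedFields)' secrets.yaml
```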
### restore everything on Hetzner

- using the backup tool for gitlab
  - be careful with git deduplication -> this might require some hand-holding
- `kubectl apply -f secrets.yaml`

### deduplicate db (ci+rest)
### connect Fastly (CDN) to Hetzner

- update DNS entries
- tests (a few smoke checks are sketched below):
  - pages hosts
  - gitlab working
  - registry working

### re-enable everything

- disable the front status page and enable gitlab
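A few quick smoke tests once the DNS points at Fastly; the hostnames are the ones used elsewhere in this issue (adjust the list as needed):

```sh
# Check the DNS cut-over, then that each host answers over TLS.
dig +short gitlab.freedesktop.org
for host in gitlab.freedesktop.org s3-proxy.freedesktop.org; do
    curl -sS -o /dev/null -w "$host -> %{http_code}\n" "https://$host/"
done
```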
Cc: @daniels, @whot, @colinmarc, @TomGudman