[fdo] GitLab downtime: Sunday 11th Nov, 10am UTC

Tue Nov 6 14:27:02 UTC 2018

Hi,
The GitLab migration continues to go well, with all
non-kernel/archive projects now having migrated over! \o/

With this use has come a lot more load, and the time has come for us
to move to more powerful machines. This requires a non-trivial update
on the Google Cloud side, which will take a little more downtime than
we're used to.

Whilst this is happening, we'll take the opportunity to configure the
Kubernetes cluster with slightly more aggressive access control - no
functional change whatsoever, but it allows us to be more clever about
access control as we bring more services online - and also to move to
online-resizeable volumes. Whilst we still have quite a bit of
headroom in terms of disk space, moving to resizeable volumes is a
reasonably heavyweight operation, so we might as well do it now whilst
it's non-critical, and we're down to upgrade the cluster anyway.

I'd expect about half an hour of downtime for this migration, mostly
dominated by moving to the resizeable disks.

Thanks for your patience with our ongoing works. I expect most of our
remaining outages to be quite short (~5min) for upgrades, apart from
two larger outages looming towards the end of the year.

Firstly, we have nginx acting as a reverse proxy for inbound web
traffic. In our current setup, all traffic arrives from Google's load
balancer with a source IP of 127.0.0.1, and it's not been clear to me
how to fix it. I've filed
https://gitlab.freedesktop.org/freedesktop/freedesktop/issues/50 to
cover this, which currently involves playing with the new (and very
different!) nginx-ingress controller in a scratch cluster to see if we
can port this over. This would probably represent ~15min of downtime.
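
For anyone wondering why the 127.0.0.1 source address matters, here is a
minimal Python sketch of the usual way a reverse proxy recovers the real
client address when every connection appears to come from a load
balancer. It assumes the balancer passes the original address along in
an X-Forwarded-For header, which is an assumption and not something
confirmed for this setup. The function name client_ip and the list of
trusted networks are hypothetical, chosen only for illustration.

```python
import ipaddress


def client_ip(remote_addr, forwarded_for, trusted_networks):
    """Best-effort guess at the real client address (illustrative only).

    remote_addr:      address of the direct peer, e.g. "127.0.0.1"
    forwarded_for:    raw X-Forwarded-For header value, or None
    trusted_networks: CIDR strings for proxies we trust (hypothetical list)
    """
    trusted = [ipaddress.ip_network(net) for net in trusted_networks]

    def is_trusted(addr):
        ip = ipaddress.ip_address(addr)
        return any(ip in net for net in trusted)

    # If the direct peer is not one of our proxies, anything it put in
    # the header could be forged, so use the peer address itself.
    if not is_trusted(remote_addr):
        return remote_addr

    hops = [h.strip() for h in (forwarded_for or "").split(",") if h.strip()]

    # Walk from the hop nearest to us back towards the client and stop
    # at the first address that is not a trusted proxy.
    for hop in reversed(hops):
        if not is_trusted(hop):
            return hop

    # Header missing, or every hop was a trusted proxy.
    return remote_addr
```

With only 127.0.0.1 visible and no usable header, the function can do
no better than return 127.0.0.1, which is the situation described
above: per-client logging and rate limiting have nothing to key on.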

Secondly, we're still looking at moving to GitLab's 'cloud native'
Kubernetes setup. In the current 'omnibus' setup, all of GitLab's
services (web request handling, Git repo handling, queued jobs,
serving Pages domains, SSH, etc) run within a single Docker
container. This container is huge and also runs some very expensive
setup jobs as it starts, which is why every config change requires a
minimum of five minutes' downtime. It also means that we can't scale
horizontally and run services on different machines, but instead need
more and bigger individual machines.

In the cloud-native architecture, every service is run inside a
different container, which gives us horizontal scaling as well as the
ability to make rapid configuration changes, and more
expressive/direct configuration as well. This would be quite a large
infrastructural change, and downtime would be on the order of a couple
of hours as we would need to do a full backup and restore, which takes
a surprisingly long time.

My ideal time to do the cloud-native migration would be around
Christmas when traffic is lowest, but upstream still don't support
GitLab Pages in their cloud-native charts, so it's not clear when we
would be able to do this.

Cheers,
Daniel