[fdo] Postmortem: July 17th GitLab outage

Daniel Stone daniel at fooishbar.org
Sun Jul 29 13:41:22 UTC 2018


Hi,
On Tuesday July 17th, we had a full GitLab outage from 14:00 to around
18:45 UTC, whilst attempting to upgrade the underlying storage. This
was a semi-planned outage, which we'd hoped would last for
approximately 30 minutes.

During the outage, the GitLab web UI and API, as well as HTTPS git
clones through https://gitlab.freedesktop.org, were completely
unavailable, giving connection timeout errors. anongit and cgit
remained completely functional. There was no data loss.

The outage was 'semi-planned' in that it was only announced a couple
of hours in advance. It was also scheduled at one of the worst
possible times: whilst all of Europe and the American east coast are
active, the west coast is beginning to come online, and some of Asia
is still online.

Most of our outages happen early in the European morning, when usage
is lightest (only eastern Europe and Asia are online), and last for
only around five minutes.


Background
----------------

gitlab.freedesktop.org runs on the Google Cloud Platform, using the
Google Kubernetes Engine and Helm charts[0]. The cluster currently
runs Kubernetes 1.10.x. The service itself runs in a single Kubernetes
Pod, using the latest published GitLab CE image from gitlab.org
(11.1.2 at the time of writing; 11.0.4 at the time of the outage).

Some GitLab data is stored in Google Cloud Storage buckets, including
CI job artifacts and traces, file uploads, Git LFS data, and backups.
The Git repositories themselves are stored inside a master data
partition which is exposed to the container as a local filesystem
through a Kubernetes PersistentVolumeClaim; other data is stored in
PostgreSQL, whose data directory is likewise backed by a Kubernetes
PVC.
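
For reference, the shape of that storage can be inspected with
kubectl; the namespace and object names below are illustrative
placeholders rather than our real ones:

    # List the claims backing GitLab's storage
    kubectl get pvc --namespace gitlab

    # Show which GCE persistent disk backs each bound PersistentVolume
    kubectl get pv -o custom-columns=NAME:.metadata.name,DISK:.spec.gcePersistentDisk.pdName,SIZE:.spec.capacity.storage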

Repository storage is (currently) just regular Git repositories.
Forking a repository simply calls 'git clone' with no sharing of
objects; storage is not deduplicated across forks.

Kubernetes persistent volumes are not currently resizeable. There is
alpha support in 1.10.x, scheduled to become generally available
shortly.
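
For the record, once volume expansion can actually be enabled
(expected with Kubernetes 1.11 on GKE, as discussed towards the end of
this mail), the resize path should look roughly like the sketch below;
the object names are hypothetical, not our real ones:

    # The StorageClass must permit expansion (allowVolumeExpansion is a
    # top-level StorageClass field)
    kubectl patch storageclass standard -p '{"allowVolumeExpansion": true}'

    # Then the claim's requested size can simply be bumped; the underlying
    # GCE disk is grown to match (a pod restart may be needed before the
    # filesystem picks up the extra space)
    kubectl patch pvc gitlab-data \
        -p '{"spec": {"resources": {"requests": {"storage": "500Gi"}}}}'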

Backups are executed from a daily cron job, set to run at 5am UTC:
this executes a Rails Rake task inside the master GitLab pod. The
backups cover all data, _except_ that which is already stored in GCS.
For legacy reasons, backups are made by first capturing all the data
to be backed up into a local directory; an uncompressed tarball of
that directory is then created alongside it and uploaded to storage.
This means the directory used for backups needs free space equal to a
bit over twice the size of the final backup.
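
In practice the cron job boils down to something like the following
(the pod name here is a placeholder); everything the task produces
lands in that local backup directory before being tarred and uploaded:

    # Run GitLab's built-in backup task inside the main pod
    kubectl exec gitlab-0 -- gitlab-rake gitlab:backup:create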


Events
------------

Shortly after 9am UTC, it became clear that the disk space situation
on our GitLab cluster was critical. Some API requests were failing due
to a shortage of disk space. A quick investigation showed this was due
to a large number of recently-created forks of Mesa in particular,
each of which requires ~2.1GB of storage.
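
(For the curious, this sort of thing is quick to check from the
outside; the pod name below is a placeholder and the paths assume the
standard omnibus layout:)

    # How full is the main data volume?
    kubectl exec gitlab-0 -- df -h /var/opt/gitlab

    # Which namespaces are eating the space? (largest first)
    kubectl exec gitlab-0 -- \
        du -sh /var/opt/gitlab/git-data/repositories/* | sort -rh | head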

The backup cron job had also started failing for the same reason. This
made resolution critical: not only were we without fresh backups, but
we were only one new Mesa or kernel fork away from completely
exhausting our disk space, potentially exposing users to visible
failures and data loss: e.g. trying to file a new issue and having it
fail, or being unable to push to their repositories.

Before 10am UTC, it was announced that we would need urgent downtime
later in the day, in order to expand our disk volumes. At this point,
I spent a good deal of time researching Kubernetes persistent-volume
resizing (something I'd previously looked into in anticipation of this
situation arising), and doing trial migrations on a scratch Kubernetes
cluster.

At 1pm UTC, I announced that there would be a half-hour outage window
in order to do this, from 2-2:30pm UTC.

At 2pm UTC, I announced the window had begun, and started the yak-shave.

Firstly, I modified the firewall rules to drop all inbound traffic to
GitLab other than that from the IP I was working from, so others did
not see any transient failures but instead just a connection timeout.
It also ensured backup integrity: that we would be able to snapshot
the data at a given point without worrying about losing changes made
after that point.
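
In GCE terms this is just a pair of firewall rules; the sketch below
is illustrative only, with placeholder rule names, network and admin
address:

    # Allow HTTPS from the admin workstation only...
    gcloud compute firewall-rules create gitlab-admin-only \
        --network=default --priority=100 \
        --allow=tcp:443 --source-ranges=203.0.113.7/32

    # ...and deny it from everyone else (higher number = lower priority)
    gcloud compute firewall-rules create gitlab-deny-others \
        --network=default --priority=200 \
        --action=deny --rules=tcp:443 --source-ranges=0.0.0.0/0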

I took a manual backup of all the GitLab data (minus what was on GCS):
this consisted of letting the usual backup Rake task run to the point
of collecting all the data, but stopping it before it ran tar, as
creating the tarball would have exceeded the available disk space and
killed the cluster. Instead, I ran 'tar' myself with its output
streamed over SSH to an encrypted partition on a local secure machine.
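
Conceptually the manual copy was along these lines (pod name, paths
and destination host are placeholders, not a transcript of the real
commands):

    # Stream the collected backup data straight off the cluster, over SSH,
    # onto an encrypted partition, without ever writing a tarball cluster-side
    kubectl exec gitlab-0 -- tar -C /var/opt/gitlab/backups -cf - . \
        | ssh admin@securebox 'cat > /mnt/encrypted/gitlab-backup.tar'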

Next, I took snapshots of all the data disks. Each Kubernetes
PersistentVolume is backed by a Google Compute Engine disk, which can
be manually snapshotted and tagged.
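
Snapshotting is a one-liner per disk with gcloud; the disk, zone and
snapshot names below are placeholders:

    # Snapshot a data disk and sanity-check the result
    gcloud compute disks snapshot gitlab-repositories \
        --zone=us-central1-a \
        --snapshot-names=gitlab-repositories-pre-resize
    gcloud compute snapshots describe gitlab-repositories-pre-resize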

Both of these steps took much longer than planned. The backup task
took far longer than it had historically; in hindsight, it should have
been clear that, with a huge increase in backup size being one of our
problems, both generating and copying the backups would take far
longer than they previously had. At this point, I announced an
extension of the outage window to 2-3pm UTC.

Snapshotting the disks also took far longer than hoped. I was working
through Google's Cloud Console web UI, which can be infuriatingly
slow: after a (not overly quick) page load, there is a good few
seconds' wait whilst the data loads asynchronously and then populates
page content. Working through to determine which disk was which,
snapshotting each one and tagging those snapshots, took some time.
This was compounded by my unfamiliarity with the disk-snapshot UI, and
by an abundance of caution: I checked multiple times that we did in
fact have snapshots of all the relevant volumes.

After this was done, I upgraded the Kubernetes cluster to 1.10.x, and
attempted to resize the persistent volumes, which immediately failed.
It became clear at this point that I had missed two crucial details.
Firstly, it was not possible to make a static-sized disk resizeable on
the fly: it would require destroying and then recreating the
PersistentVolumes, then restoring the data onto those, either by
restoring a backup image or by simply copying the old content to the
new volumes. Secondly, it became clear that Google Kubernetes Engine
did _not_ in fact support resizing disks at all, as it was an alpha
feature which could not be enabled.

At this point I made sure the old persistent volumes would, in fact,
persist after they had been orphaned. This gave us three copies of the
data: in a local backup, in GCE disk snapshots, and in retention of
the GCE disks themselves. I then spent some time figuring out how to
pause service availability, so we could make the new disks available
to be used by the cluster, without actually starting the services with
a clean slate. This took a surprising amount of time, and was somewhat
fiddly. During this time I also hit a new failure mode in how we run
Helm: if some resources were unavailable (due to a typo), it would
block indefinitely waiting for them to become available (which would
never happen) rather than failing immediately.
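
Two things worth writing down for next time, shown here with
placeholder names and assuming our Helm 2 setup: workloads can be
paused without touching their volume claims, and Helm's wait can be
bounded so that a typo fails fast instead of hanging forever:

    # Stop the services but keep the PVCs (and hence the data) in place
    kubectl scale deployment gitlab --replicas=0

    # Give up after five minutes rather than blocking indefinitely on a
    # resource that will never appear (Helm 2 takes the timeout in seconds)
    helm upgrade gitlab ./gitlab-chart --wait --timeout 300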

I had already started the process of copying the backups back towards
the cluster, so that we had the option of restoring from backup if
that turned out to be the best route. However, at this point I started
having serious degradation of my network connection: not only did my
upload speed vary wildly (by a factor of 100), but due to local issues
the
workstation I was using spent some time refusing to route HTTPS
traffic to the Kubernetes control API. Much time was spent debugging
and resolving this.

The preferred option was to restore the previous snapshots into new
disks: this meant we did not have to block on the backup upload and
could be sure that we had exactly the same content as previously. I
started pursuing this option: once I had ensured that the new
Kubernetes persistent volumes had created new GCE disks, I attempted
to restore the disk snapshots into them.

At this point, I discovered the difference between GCE disks, disk
images, and disk snapshots. It is not possible to directly restore a
snapshot into a live disk: you must mount both the target disk and the
source snapshot in a new GCE VM, boot the VM, and copy between the
two. I did this with new GCE disks, and attempted to use those disks
as backing storage for new Kubernetes PersistentVolumes that we could
reuse directly. More time was lost due to the Helm failures above.
When I eventually got there, I discovered that creating a new
Kubernetes PV/PVC from a GCE disk obliterates all the content on that
disk, so that avenue was useless.

Quite some hours into the outage, I decided to take a fifteen-minute
break, go for a walk outside and try to reason about what was going on
and what we should do next.

Coming back, I pursued the last good option: stop the Kubernetes
services completely, attach both the new enlarged PVs and disks
restored from the old snapshots to a new ephemeral GCE VM, copy
directly between them, stop the GCE VM, and restart Kubernetes. This
mostly succeeded, except for subvolumes. Kubernetes mounts a
subdirectory of the disk as the root of the PostgreSQL data volume,
whereas mounting the raw disk exposes the top of the disk as the root.
The copy didn't preserve this subdirectory structure, so though the
Git data was accessible, PostgreSQL was seeing an empty database.

To avoid any desynchronisation, I destroyed all the resources again,
created completely new and empty volumes, created a new GCE VM, and
re-did the copy with the correct directory structure. Coming back up,
I manually verified that the list of repositories and users was the
same as previously, worked through parts of the UI (e.g. could I still
search for issues after not restoring the Redis cache?) and a few
typical workflows. I had also started a full backup task in the
background to ensure that our backup cron job would succeed in the
morning without needing another outage.
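
In outline, the restore path that finally worked looked something like
the following; every name, zone and device here is a placeholder
sketch rather than a transcript:

    # Materialise the old data from its snapshot into a fresh disk
    gcloud compute disks create gitlab-data-old \
        --zone=us-central1-a --source-snapshot=gitlab-data-pre-resize

    # Attach both the old and the new (enlarged) disks to a throwaway VM
    gcloud compute instances create copy-vm --zone=us-central1-a
    gcloud compute instances attach-disk copy-vm --zone=us-central1-a \
        --disk=gitlab-data-old --device-name=old
    gcloud compute instances attach-disk copy-vm --zone=us-central1-a \
        --disk=gitlab-data-new --device-name=new

    # Then, on the VM: mount both and copy, preserving the exact directory
    # layout Kubernetes expects (the step that bit us the first time round)
    #   mount /dev/disk/by-id/google-old /mnt/old
    #   mount /dev/disk/by-id/google-new /mnt/new
    #   rsync -aHAX /mnt/old/ /mnt/new/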

Once this was done, around 18:45 UTC, I restored public access and
announced the end of the outage.

A couple of days later, I spent some time cleaning up all the
ephemeral resources created during the outage (persistent volumes,
disks, disk snapshots, VMs, etc).


What went badly
-----------------------

Many things.

The first we knew that things were wrong was when people mentioned
the failure on IRC. Setting up a monitoring system (probably based on
Prometheus/Grafana, as this is recommended by upstream, integrates
well with the services, and has a good feature set) to capture key
metrics like disk usage and API error rate, and to alert via email/IRC
when these hit error thresholds, is a high-priority but also
time-consuming task. Doing it myself requires learning a fair few new
things, as well as downtime (see below) whilst deploying it. So far I
have not had a large enough block of time.
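
As a concrete example of what we're after, the rule below is the sort
of Prometheus alert that would have flagged this well in advance; the
metric names, mountpoint and threshold are illustrative, and it's
wrapped in a shell heredoc purely to keep the example self-contained:

    cat > gitlab-disk-alerts.yml <<'EOF'
    groups:
    - name: gitlab-disk
      rules:
      - alert: GitLabDataDiskFilling
        expr: >
          node_filesystem_avail_bytes{mountpoint="/var/opt/gitlab"}
          / node_filesystem_size_bytes{mountpoint="/var/opt/gitlab"} < 0.10
        for: 15m
        annotations:
          summary: "GitLab data volume has less than 10% free space"
    EOF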

There is also a single point of failure: I am the only administrator
who actively works on GitLab. Though Tollef and Keith have access to
the Google organisation and could in principle step in, they don't
have the time. If I were not available, they would first have to go
through the process of setting up their accounts to control Kubernetes
and familiarising themselves with our deployment. This is obviously
bad, especially as I am relatively new to administering Kubernetes (as
seen from the failures in the timeline).

The length of the backup task completely blew out our outage window.
It should've been obvious that backups would take longer than they had
previously; even if not, we could've run a test task to measure this
before we announced an outage window which could never have been met.

My internet connection choosing that exact afternoon to be extremely
unreliable was quite unhelpful. If any of this was planned I would've
been somewhere with a much faster and more stable connection, but
unfortunately I didn't have the choice.

Though I'd tested some of these changes throughout the morning, I
hadn't tested the exact changes we ended up making. I'm not sure how
it would even be possible to test some of them (e.g. how do I, with a
scratch cluster, test that persistent volumes created with a
several-versions-old Kubernetes will upgrade correctly?), but I could
certainly have made the tests a little more thorough and realistic,
particularly for things like the disk snapshots.


What went well
---------------------

No data loss: we had backups, and lots of them (at least three copies
at every point from when destructive operations began). There was no
danger at any point of data being lost for good, though restoring from
some of those copies would have meant unacceptably long downtime.

Data being on GCS: having much of our data in cloud storage long
delayed the point at which we needed to expand the local disks.


What we can do in future
----------------------------------

Monitoring, logging, and alerts:
https://gitlab.freedesktop.org/freedesktop/freedesktop/issues/8 is the
task I filed long ago to get this enabled. If anyone reading this
is familiar with the landscape and can help, I would be immensely
grateful for it.

Better communication: due to its last-minute nature, the outage was
only announced on IRC and not to the lists; I also didn't inform the
lists when the outage was dragging on, as I was so focused on what I
needed to do to bring it back up. This was obviously a bad idea. We
should look into how we can communicate these kinds of things a lot
better.

Cloud Console web UI: the web UI is borderline unusable for
interactive operation, due to long page-load times and the amount of
inter-page navigation required to achieve anything. It would be better
to get more familiar with driving the command-line clients (gcloud,
kubectl) in anger, and to use those where possible.

More admins: having myself as the single GitLab admin is not in any
way sustainable, and we need to enlarge our group of admins. Having
someone actually familiar with Kubernetes deployments in particular
would be a massive advantage. I'm learning on the spot from online
documentation, which, given the speed at which Kubernetes development
moves, is often either useless or misleading.

Move away from Omnibus container: currently, as said, every service
behind gitlab.fd.o is deployed into a single Kubernetes pod, with
PostgreSQL and Redis in linked containers. The GitLab container is
called the 'omnibus' container, combining the web/API hosts,
background job processors, Git repository service, SSH access, Pages
server, etc. The container is a huge download, and on start it runs
Chef which takes approximately 2-3 minutes to configure the filesystem
before even thinking about starting GitLab. Total minimum downtime for
every change is 4-5 minutes, and every change makes the whole of
GitLab unavailable: this makes us really reluctant to change
configuration unless necessary, giving us less experience with
Kubernetes than we might otherwise have. GitLab upstream is working on
a 'cloud native' deployment, splitting each service into its own
container, which does _not_ spend minutes running Omnibus
configuration at startup and which can be configured independently
without impacting other services. Actually making this move will
require multiple hours of
downtime, which will need to be communicated long in advance.

Move storage to resizeable volumes or GCS: at the next point where we
exhaust our disk space, we're going to need to go through this again.
Moving more of our storage to cloud storage, where we can, means that
we don't have to worry about it. When Kubernetes 1.11.x becomes
available through GCP, we can also recreate the disks as resizeable
volumes, which can be grown on demand, avoiding this whole process. At
least we've learned quite a bit from doing it this time?


Cheers,
Daniel

[0]: The Helm charts and configuration are available at
https://gitlab.freedesktop.org/freedesktop/ with the exception of a
repository containing various secrets (SSH keys, API keys, passwords,
etc) which are grafted on to the existing chart.

