[fdo] Postmortem: July 17th GitLab outage

Marek Olšák maraeo at gmail.com
Tue Jul 31 21:54:31 UTC 2018


We can also save space by using the main repo for private branches,
e.g. my branches would be in refs/heads/mareko/*.
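
For example, roughly (the exact ref layout and any server-side
permission rules would still need to be worked out; branch names here
are just examples):

    # push a work-in-progress branch into a per-developer namespace
    git push origin my-feature:refs/heads/mareko/my-feature

    # fetch someone's namespaced branches into remote-tracking refs
    git fetch origin 'refs/heads/mareko/*:refs/remotes/origin/mareko/*'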

Marek

On Sun, Jul 29, 2018 at 9:41 AM, Daniel Stone <daniel at fooishbar.org> wrote:
> Hi,
> On Tuesday July 17th, we had a full GitLab outage from 14:00 until
> around 18:45 UTC, whilst attempting to upgrade the underlying storage.
> This was a semi-planned outage which we'd hoped would last for
> approximately 30 minutes.
>
> During the outage, the GitLab web UI and API, as well as HTTPS git
> clones through https://gitlab.freedesktop.org, were completely
> unavailable, giving connection timeout errors. anongit and cgit
> remained completely functional. There was no data loss.
>
> The outage was 'semi-planned' in that it was only announced a couple
> of hours in advance. It was also scheduled at one of the worst
> possible times: all of Europe and the American east coast were active,
> the west coast was beginning to come online, and some of Asia was
> still online.
>
> Most of our outages happen early in the European morning, when usage
> is at its lightest (only eastern Europe and Asia online), and they
> last for only approximately five minutes.
>
>
> Background
> ----------------
>
> gitlab.freedesktop.org runs on the Google Cloud Platform, using the
> Google Kubernetes Engine and Helm charts[0]. The cluster currently
> runs Kubernetes 1.10.x. The service itself runs in a single Kubernetes
> Pod, using the latest published GitLab CE image from gitlab.org (at
> time of writing this is 11.1.2, however at the time it was 11.0.4).
>
> Some GitLab data is stored in Google Cloud Storage buckets, including
> CI job artifacts and traces, file uploads, Git LFS data, and backups.
> The Git repositories themselves are stored on a master data partition
> which is exposed to the container as a local filesystem through a
> Kubernetes PersistentVolumeClaim; other data is stored in PostgreSQL,
> which is likewise backed by a Kubernetes PVC.
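>
> For anyone unfamiliar with the terminology: a PersistentVolumeClaim is
> essentially a size request which Kubernetes satisfies by provisioning
> a GCE disk behind the scenes, along the lines of this sketch (the name
> and size are illustrative, not our actual configuration):
>
>     kubectl apply -f - <<'EOF'
>     apiVersion: v1
>     kind: PersistentVolumeClaim
>     metadata:
>       name: repo-data
>     spec:
>       accessModes: ["ReadWriteOnce"]
>       resources:
>         requests:
>           storage: 200Gi
>     EOF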
>
> Repository storage is (currently) just regular Git repositories.
> Forking a repository simply calls 'git clone' with no sharing of
> objects; storage is not deduplicated across forks.
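>
> (For illustration only - this is not something our deployment does
> today - plain git can share objects between a fork and its parent via
> '--reference', with the usual caveat that the referenced repository
> must never prune objects the fork still needs. With hypothetical
> on-disk paths:
>
>     git clone --bare https://gitlab.freedesktop.org/mesa/mesa.git \
>         /srv/git/mesa.git
>     git clone --bare --reference /srv/git/mesa.git \
>         https://gitlab.freedesktop.org/mesa/mesa.git \
>         /srv/git/forks/someone/mesa.git
>
> The fork's object store then starts out nearly empty.)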
>
> Kubernetes persistent volumes are not currently resizeable. There is
> alpha support for resizing in 1.10.x, scheduled to become generally
> available shortly.
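>
> For the record, once resizing is supported the flow should be roughly
> the following (a sketch, with made-up StorageClass and PVC names):
>
>     # allow volumes from this storage class to be expanded
>     kubectl patch storageclass standard \
>         -p '{"allowVolumeExpansion": true}'
>
>     # then simply bump the requested size on the claim
>     kubectl patch pvc gitlab-data \
>         -p '{"spec": {"resources": {"requests": {"storage": "500Gi"}}}}'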
>
> Backups are executed from a daily cronjob, set to run at 5am UTC,
> which executes a Rails Rake task inside the master GitLab pod. The
> backups cover all data _except_ that which is already stored in GCS.
> For legacy reasons, backups are made by first copying all the data to
> be backed up into a local directory; an uncompressed tarball is then
> created in that same directory and uploaded to storage. This means
> the directory used for backups must have free space equal to a bit
> over twice the size of the final backup.
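>
> In other words, the nightly job amounts to roughly the following (the
> staging path is illustrative rather than the exact one we use):
>
>     # stage repositories, uploads, a database dump, etc., then tar them
>     gitlab-rake gitlab:backup:create
>
> Because the staged copy and the tarball built from it briefly coexist
> under the same directory (e.g. /var/opt/gitlab/backups), the free
> space needed is roughly twice the final backup size.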
>
>
> Events
> ------------
>
> Shortly after 9am UTC, it became clear that the disk space situation
> on our GitLab cluster was critical. Some API requests were failing due
> to a shortage of disk space. A quick investigation showed this was
> largely due to a large number of recently-created forks of Mesa, each
> of which requires ~2.1GB of storage.
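>
> The investigation itself was nothing fancier than the usual disk-usage
> checks, along these lines (paths are illustrative):
>
>     df -h /var/opt/gitlab
>     du -sh /var/opt/gitlab/git-data/repositories/* | sort -h | tail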
>
> The backup cron job had also started failing for the same reason. This
> made resolution critical: not only did we not have backups, but we
> were only one new Mesa or kernel fork away from exhausting our disk
> space entirely, potentially exposing users to data loss: e.g. filing a
> new issue and having it fail, or being unable to push to their
> repositories.
>
> Before 10am UTC, it was announced that we would need urgent downtime
> later in the day in order to expand our disk volumes. At this point I
> spent a good deal of time researching Kubernetes persistent-volume
> resizing (something I'd previously looked into in anticipation of this
> situation arising), and doing trial migrations on a scratch Kubernetes
> cluster.
>
> At 1pm UTC, I announced that there would be a half-hour outage window
> in which to do this, from 2-2:30pm UTC.
>
> At 2pm UTC, I announced the window had begun, and started the yak-shave.
>
> Firstly, I modified the firewall rules to drop all inbound traffic to
> GitLab other than that from the IP I was working from, so that others
> would not see transient failures, just a connection timeout. This also
> ensured backup integrity: we could snapshot the data at a given point
> without worrying about losing changes made after that point.
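>
> In gcloud terms this is roughly a one-liner (the rule name and address
> here are made up):
>
>     # restrict the HTTPS ingress rule to the admin workstation only
>     gcloud compute firewall-rules update gitlab-https \
>         --source-ranges=203.0.113.5/32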
>
> I then took a manual backup of all the GitLab data (minus what was on
> GCS): this consisted of letting the usual backup Rake task run to the
> point of collecting all the data, but stopping it before it ran tar,
> as running tar would have exhausted the disk space and killed the
> cluster. Instead, I ran 'tar' myself with its output streamed over SSH
> to an encrypted partition on a local secure machine.
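>
> That is, the classic streaming pattern, with hypothetical paths and
> hostnames:
>
>     # stream the tarball straight over SSH instead of writing it locally
>     tar -C /var/opt/gitlab/backups -cf - . \
>         | ssh backup-host 'cat > /mnt/encrypted/gitlab-backup.tar'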
>
> Next, I took snapshots of all the data disks. Each Kubernetes
> PersistentVolume is backed by a Google Compute Engine disk, which can
> be manually snapshotted and tagged.
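>
> The per-disk operation is straightforward (the disk, zone and snapshot
> names here are made up):
>
>     gcloud compute disks snapshot gitlab-repositories-disk \
>         --zone=us-central1-a \
>         --snapshot-names=gitlab-repositories-pre-resize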
>
> Both of these steps took much longer than planned. The backup task
> took much longer than it had historically - in hindsight, given that
> one of our problems was a huge increase in backup size, it should have
> been clear that both generating and copying the backups would take far
> longer than before. At this point, I announced an extension of the
> outage window, to 2-3pm UTC.
>
> Snapshotting the disks also took far longer than hoped. I was working
> through Google's Cloud Console web UI, which can be infuriatingly
> slow: after a (not overly quick) page load, there is a good few
> seconds' wait whilst the data loads asynchronously and then populates
> the page content. Working out which disk was which, taking snapshots
> of their contents, and tagging those snapshots took some time. This
> was compounded by my unfamiliarity with the disk snapshot UI, and by
> an abundance of caution: I checked multiple times that we did in fact
> have snapshots of all the relevant volumes.
>
> After this was done, I upgraded the Kubernetes cluster to 1.10.x and
> attempted to resize the persistent volumes, which immediately failed.
> It became clear at this point that I had missed two crucial details.
> Firstly, it was not possible to make a statically-sized disk
> resizeable on the fly: it would require destroying and then recreating
> the PersistentVolumes, then restoring the data onto them, either by
> restoring a backup image or by simply copying the old content to the
> new volumes. Secondly, Google Kubernetes Engine did _not_ in fact
> provide support for resizing disks, as it was an alpha feature which
> could not be enabled.
>
> At this point I made sure the old persistent volumes would, in fact,
> persist after they had been orphaned. This gave us three copies of the
> data: the local backup, the GCE disk snapshots, and the retained GCE
> disks themselves. I then spent some time figuring out how to pause
> service availability, so that we could make the new disks available to
> the cluster without actually starting the services with a clean slate.
> This took a surprising amount of time and was somewhat fiddly. During
> this time I also hit a new failure mode of how we run Helm: if some
> resources were unavailable (due to a typo), it would block
> indefinitely waiting for them to become available (which they never
> would) rather than failing immediately.
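>
> One way to make sure orphaned volumes stick around is to flip the
> reclaim policy on each PersistentVolume; a sketch, with a made-up PV
> name:
>
>     kubectl patch pv pvc-1234-repositories \
>         -p '{"spec": {"persistentVolumeReclaimPolicy": "Retain"}}'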
>
> I had already started the process of copying the backups back towards
> the cluster, so that we had the option of restoring from backup if
> that turned out to be the best approach. However, at this point my
> network connection began degrading seriously: not only did my upload
> speed vary wildly (by a factor of 100), but due to local issues the
> workstation I was using spent some time refusing to route HTTPS
> traffic to the Kubernetes control API. Much time was spent debugging
> and resolving this.
>
> The preferred option was to restore the previous snapshots into new
> disks: this meant we did not have to block on the backup upload and
> could be sure that we had exactly the same content as previously. I
> started pursuing this option: once I had ensured that the new
> Kubernetes persistent volumes had created new GCE disks, I attempted
> to restore the disk snapshots into them.
>
> At this point, I discovered the difference between GCE disks, disk
> images, and disk snapshots. It is not possible to restore a snapshot
> directly into an existing disk: you have to create a disk from the
> snapshot, attach both it and the target disk to a new GCE VM, boot the
> VM, and copy between the two. I did this with new GCE disks, and
> attempted to use those disks as backing storage for new Kubernetes
> PersistentVolumes that we could use directly. More time was lost to
> the Helm failures described above. When we eventually got there, I
> discovered that creating a new Kubernetes PV/PVC from a GCE disk
> obliterates all the content on that disk, so that avenue was useless.
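>
> For reference, the thing which _is_ possible is creating a brand new
> disk from a snapshot (illustrative names again):
>
>     gcloud compute disks create gitlab-repos-restore \
>         --source-snapshot=gitlab-repositories-pre-resize \
>         --zone=us-central1-a --size=500GB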
>
> Quite some hours into the outage, I decided to take a fifteen-minute
> break, go for a walk outside and try to reason about what was going on
> and what we should do next.
>
> Coming back, I pursued the last good option: stop the Kubernetes
> services completely, attach both the new enlarged PV disks and the old
> data disks to a new ephemeral GCE VM, copy directly between them, stop
> the GCE VM, and restart Kubernetes. This mostly succeeded, except for
> subvolumes: Kubernetes exposes a subdirectory of the disk as the root
> mountpoint for the database data volume, whereas mounting the raw disk
> exposes the top of the disk as the root mountpoint. The copy didn't
> preserve that subdirectory correctly, so although the Git data was
> accessible, the database appeared empty.
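>
> In shell terms the successful path looked roughly like this (the
> instance, disk and device names are illustrative):
>
>     # throwaway VM with the old and new data disks attached
>     gcloud compute instances create copy-vm --zone=us-central1-a
>     gcloud compute instances attach-disk copy-vm \
>         --disk=gitlab-data-old --zone=us-central1-a
>     gcloud compute instances attach-disk copy-vm \
>         --disk=gitlab-data-new --zone=us-central1-a
>
>     # on the VM: mount both disks and copy, preserving ownership
>     sudo mkdir -p /mnt/old /mnt/new
>     sudo mount /dev/sdb /mnt/old && sudo mount /dev/sdc /mnt/new
>     sudo rsync -aHAX /mnt/old/ /mnt/new/
>
>     # afterwards, tear the VM down and hand the disks back to Kubernetes
>     gcloud compute instances delete copy-vm --zone=us-central1-a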
>
> To avoid any desynchronisation, I destroyed all the resources again,
> created completely new and empty volumes, created a new GCE VM, and
> re-did the copy with the correct directory structure. Coming back up,
> I manually verified that the list of repositories and users was the
> same as previously, worked through parts of the UI (e.g. could I still
> search for issues after not restoring the Redis cache?) and a few
> typical workflows. I had also started a full backup task in the
> background to ensure that our backup cron job would succeed in the
> morning without needing another outage.
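>
> The verification was manual; for example, something along these lines
> (the paths and pod name are illustrative) to compare the on-disk
> repository list against a listing taken before the outage:
>
>     kubectl exec gitlab-pod -- \
>         ls /var/opt/gitlab/git-data/repositories \
>         | diff - repos-before-outage.txt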
>
> Once this was done, around 18:45 UTC, I restored public access and
> announced the end of the outage.
>
> A couple of days later, I spent some time cleaning up all the
> ephemeral resources created during the outage (persistent volumes,
> disks, disk snapshots, VMs, etc).
>
>
> What went badly
> -----------------------
>
> Many things.
>
> The first we knew that something was wrong was when people mentioned
> the failures on IRC. Setting up a system (probably based on
> Prometheus/Grafana, as this is recommended by upstream, integrates
> well with the services, and has a good feature set) to capture key
> metrics like disk usage and API error rate, and to alert via email/IRC
> when these hit error thresholds, is a high-priority but also
> time-consuming task. Doing it myself requires learning a fair few new
> things, plus downtime (see below) whilst deploying it. So far I have
> not had a large enough block of time.
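>
> As a flavour of what the alerting side might look like (a sketch
> assuming node_exporter metrics and a standard Prometheus rule file;
> the mountpoint and threshold are made up):
>
>     cat > disk-alerts.yml <<'EOF'
>     groups:
>     - name: disk
>       rules:
>       - alert: GitLabDiskAlmostFull
>         expr: >
>           node_filesystem_avail_bytes{mountpoint="/var/opt/gitlab"}
>           / node_filesystem_size_bytes{mountpoint="/var/opt/gitlab"}
>           < 0.10
>         for: 15m
>     EOF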
>
> There is also a single point of failure: I am the only administrator
> who works on GitLab. Though Tollef and Keith have access to the Google
> organisation and could administer it, they don't have the time. If I
> were not available, they would first have to go through the process of
> setting up their accounts to control Kubernetes and familiarising
> themselves with our deployment. This is obviously bad, especially as I
> am relatively new to administering Kubernetes (as can be seen from the
> failures in the timeline).
>
> The length of the backup task completely blew out our outage window.
> It should've been obvious that backups would take longer than they had
> previously; even if not, we could've run a test task to measure this
> before we announced an outage window which could never have been met.
>
> My internet connection choosing that exact afternoon to be extremely
> unreliable was quite unhelpful. If any of this was planned I would've
> been somewhere with a much faster and more stable connection, but
> unfortunately I didn't have the choice.
>
> Though I'd tested some of these changes throughout the morning, I
> hadn't tested the exact changes I ended up making. I'm not sure how it
> is even possible to test some of them (e.g. how do I, with a scratch
> cluster, test that persistent volumes created with a
> several-versions-old Kubernetes will upgrade ... ?), but I certainly
> could've made the tests a little more thorough and realistic,
> particularly for things like the disk snapshots.
>
>
> What went well
> ---------------------
>
> No data loss: we had backups, and lots of them (at least three at
> every point from when destructive operations began). There was no
> danger at any point of data being lost for good, though some of those
> options required unacceptably long downtime when used as a source.
>
> Data being on GCS: having much of our data in cloud storage long
> delayed the point at which we ran out of local disk space and needed
> to do any of this.
>
>
> What we can do in future
> ----------------------------------
>
> Monitoring, logging, and alerts:
> https://gitlab.freedesktop.org/freedesktop/freedesktop/issues/8 is the
> task I've long ago filed to get this enabled. If anyone reading this
> is familiar with the landscape and can help, I would be immensely
> grateful for it.
>
> Better communication: due to its last-minute nature, the outage was
> only announced on IRC and not to the lists; I also didn't inform the
> lists when the outage was dragging on, as I was so focused on what I
> needed to do to bring it back up. This was obviously a bad idea. We
> should look into how we can communicate these kinds of things a lot
> better.
>
> Cloud Console web UI: the web UI is borderline unusable for
> interactive operation, due to long page-load times and the amount of
> inter-page navigation required to achieve anything. It would be better
> to get more familiar with driving the command-line clients in anger,
> and to use those where possible.
>
> More admins: having myself as the single GitLab admin is not in any
> way sustainable, and we need to enlarge our group of admins. Having
> someone actually familiar with Kubernetes deployments in particular
> would be a massive advantage. I'm learning on the spot from online
> documentation, which, given the speed at which Kubernetes development
> moves, is often either useless or misleading.
>
> Move away from the Omnibus container: currently, as described above,
> every service behind gitlab.fd.o is deployed into a single Kubernetes
> pod, with PostgreSQL and Redis in linked containers. The GitLab
> container is the 'omnibus' container, combining the web/API hosts,
> background job processors, Git repository service, SSH access, Pages
> server, etc. The container is a huge download, and on start it runs
> Chef, which takes approximately 2-3 minutes to configure the
> filesystem before even thinking about starting GitLab. The minimum
> total downtime for every change is therefore 4-5 minutes, and every
> change makes the whole of GitLab unavailable: this makes us really
> reluctant to change configuration unless necessary, giving us less
> experience with Kubernetes than we might otherwise have. GitLab
> upstream is working on a 'cloud native' deployment, splitting each
> service into its own container which does _not_ spend minutes running
> Omnibus at startup, and which can be configured independently without
> impacting other services. Actually making this move will require
> multiple hours of downtime, which will need to be communicated long in
> advance.
>
> Move storage to resizeable volumes or GCS: at the next point where we
> exhaust our disk space, we're going to need to go through this again.
> Moving more of our storage to cloud storage, where we can, means that
> we don't have to worry about it. When Kubernetes 1.11.x becomes
> available through GCP, we can also recreate the disks as resizeable
> volumes, which will grow on demand, avoiding this whole process. At
> least we've learned quite a bit from doing it this time?
>
>
> Cheers,
> Daniel
>
> [0]: The Helm charts and configuration are available at
> https://gitlab.freedesktop.org/freedesktop/ with the exception of a
> repository containing various secrets (SSH keys, API keys, passwords,
> etc) which are grafted on to the existing chart.
> _______________________________________________
> freedesktop mailing list
> freedesktop at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/freedesktop

