The night everything went wrong
It was approaching midnight UTC on January 31, 2017. A GitLab.com site reliability engineer, identified in the company's own postmortem as YP, was in the middle of a routine maintenance task: clearing replication lag on a secondary PostgreSQL database node. The secondary had fallen hours behind the primary, and the fix was straightforward: remove the stale data directory on the replica and let it re-sync from the primary.
Tired after a long shift that had already included dealing with a spam attack on GitLab.com, YP ran the delete command. Except the terminal he was working in wasn't connected to the replica. It was connected to db1.cluster.gitlab.com — the primary production database. The command was:
rm -rf /var/opt/gitlab/postgresql/data
By the time he realized the mistake and hit Ctrl+C, roughly 300 GB of PostgreSQL data had been deleted. The primary database, the one backing every GitLab.com user account, project, issue, merge request, comment, and CI pipeline, was gone. (The Git repositories themselves lived on separate file servers and were not affected.)
Five backup systems. Zero successful restores.
What happened next was the part that turned a bad incident into a legendary one. GitLab had backup and recovery systems in place. Multiple systems. All of them failed for different reasons, in a chain of failures that reads like a disaster movie script:
1. Regular database backups
GitLab ran nightly pg_dump backups. The problem: the dumps had been failing silently, producing files only a few bytes in size, because the pg_dump binary being invoked didn't match the PostgreSQL server version, and nobody had noticed, because nobody was regularly testing restores. The backup job ran on schedule; its output was unusable.
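A dump that is only a few bytes long is exactly the kind of failure a blunt size check would have caught. Here is a minimal sketch of such a guard around a nightly job; the paths, database name, size floor, and alert address are all assumptions, not GitLab's actual setup:

#!/usr/bin/env bash
# Hypothetical nightly backup wrapper; every name and path below is an assumption
set -euo pipefail

DUMP="/var/backups/postgres/gitlabhq_production_$(date +%F).sql.gz"
pg_dump --format=plain gitlabhq_production | gzip > "$DUMP"

# Belt and braces: even if every command exits 0, a production dump
# should never be only a few bytes in size
MIN_BYTES=$((100 * 1024 * 1024))   # assumed floor of 100 MB
if [ "$(stat -c %s "$DUMP")" -lt "$MIN_BYTES" ]; then
    echo "Backup $DUMP is suspiciously small" | mail -s "pg_dump sanity check failed" oncall@example.com
    exit 1
fi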
2. LVM snapshots
The database server used LVM (Logical Volume Manager) for disk management, and LVM supports point-in-time snapshots. But snapshots were only being taken once every 24 hours, were used mainly for copying production data into staging, and were not part of any documented, tested recovery procedure for db1.
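For context, taking an LVM snapshot is a single command and cheap enough to run far more often than once a day. A sketch, with the volume group and logical volume names as assumptions:

# Hypothetical example: point-in-time snapshot of the PostgreSQL data volume
lvcreate --snapshot --size 20G --name pgdata-snap-$(date +%Y%m%d) /dev/data-vg/pgdata
# The snapshot consumes space only as the origin volume changes; it still needs
# rotation, monitoring, and above all periodic restore tests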
3. Azure disk snapshots
GitLab.com ran on Azure, and Azure supports managed disk snapshots at the VM level. Those snapshots were enabled for the NFS file servers, but they had never been enabled for the database servers, so there was no disk-level copy of db1 to roll back to.
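Enabling the equivalent protection is also a one-liner per disk. A sketch using the Azure CLI, where the resource group and disk name are assumptions:

# Hypothetical example: snapshot a managed data disk with the Azure CLI
az snapshot create \
    --resource-group prod-db-rg \
    --name db1-data-snap-$(date +%Y%m%d) \
    --source db1-data-disk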
4. S3 backups
A separate backup pipeline was configured to stream data to S3. Investigation revealed this pipeline had also been broken and was not actually uploading data. Again, no one had verified it end-to-end.
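Verifying a pipeline like this end-to-end can start with something as blunt as confirming the bucket actually contains a recent object. A sketch with the AWS CLI, bucket name and prefix assumed:

# Hypothetical check: list the newest object under the (assumed) backup prefix
aws s3 ls s3://example-db-backups/postgres/ --recursive | sort | tail -n 1
# If this prints nothing, or the newest key is days old, the pipeline is broken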
5. The staging database copy
The team found one survivor: the staging environment's copy of the database, which had been built from an LVM snapshot YP happened to take manually about six hours before the incident while loading production data into staging. It was approximately six hours old. It was the only option. GitLab chose to restore from it.
We restored from a staging snapshot that was roughly 6 hours old. Approximately 5,000 projects, 700 users, 4,500 comments and 700 webhooks created in those 6 hours were permanently lost.
The live-stream that shocked the industry
While the engineering team worked through the night to restore the database, GitLab made an extraordinary decision: they live-streamed the entire recovery attempt on YouTube. Engineers narrated what they were trying, what was failing, and what the current status was — in real time, to anyone who wanted to watch.
This was radical transparency at a scale the industry had never seen. Most companies in that situation would have issued a brief status page update every hour and said nothing until service was restored. GitLab let the world watch the sausage being made, including all the moments where a recovery method was tried, failed, and abandoned.
The recovery took approximately 18 hours from the moment the data was deleted to GitLab.com being fully restored. About six hours' worth of user data, everything created between the staging snapshot and the incident, was permanently gone.
The postmortem that became a template
GitLab published their full postmortem publicly on their blog. It listed every backup method, why each one failed, and the corrective actions they committed to. The document was not written by a PR team. It was written by the engineers involved, in plain language, with a detailed timeline and no corporate hedging.
The corrective actions they committed to included:
- Run pg_dump backups to S3 every 24 hours and test restores every week
- Enable and verify LVM snapshots on all production database hosts
- Re-enable and verify Azure disk snapshots
- Set up a secondary replica in a separate geographic region
- Add a runbook for every critical maintenance operation, with a second engineer required to verify the target host before any destructive command is run
- Introduce a backup monitoring dashboard with explicit alerts when a backup job fails or hasn't run recently (a minimal version of this alert is sketched below)
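That alert can be as small as a cron-driven script that complains when nothing recent exists in the backup directory; the directory, age threshold, and alert address below are assumptions:

#!/usr/bin/env bash
# Hypothetical freshness check: alert if no backup newer than 24 hours exists
set -euo pipefail

BACKUP_DIR=/var/backups/postgres
if [ -z "$(find "$BACKUP_DIR" -name '*.sql.gz' -mmin -1440 -print -quit)" ]; then
    echo "No backup newer than 24h in $BACKUP_DIR" | mail -s "Backup freshness alert" oncall@example.com
    exit 1
fi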
The list is notable not because it's surprising, but because every item on it represents something that was already understood to be a best practice — and hadn't been done. This is the uncomfortable truth the postmortem made undeniable: GitLab didn't fail because they lacked knowledge. They failed because knowledge and implementation are not the same thing.
What every SysOps engineer must take from this
The GitLab incident is examined in cloud certification courses for a reason. It illustrates, with brutal clarity, the difference between having a backup strategy and having a tested backup strategy.
The 3-2-1 rule exists for exactly this reason
The gold standard for backups is three copies of data, on two different media types, with one copy off-site. On paper, GitLab met parts of this. The problem is the rule implicitly assumes all three copies are verified to be restorable. A backup that has never been restored is a hypothesis, not a backup.
Human error is a systems problem, not a people problem
YP made a mistake. But he made it in a system where a single tired engineer could delete a production database with one command and no confirmation prompt. Modern runbooks for destructive operations on production systems require a second pair of eyes — not because engineers are careless, but because fatigue and context-switching are real and documented causes of accidents. The solution is procedural and technical, not individual.
Never run destructive commands on production without verifying the host
The immediate technical lesson is simple: before running any command that deletes or overwrites data, confirm twice which host you are connected to. Many teams now use shell prompt customization or /etc/motd banners to make the environment visually unmistakable when you connect to production.
# A simple safeguard: put this in your .bashrc on production hosts
export PS1="\[\e[1;31m\][PRODUCTION: \h]\[\e[0m\] \w \$ "
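The prompt makes the environment obvious at a glance; a second, purely hypothetical safeguard is a small wrapper that forces the operator to retype the hostname before anything destructive runs:

# Hypothetical helper for runbooks: the operator must retype the hostname
# to prove they know which machine they are on
confirm_host() {
    echo "You are connected to: $(hostname -f)"
    read -r -p "Type the hostname to continue: " typed
    [ "$typed" = "$(hostname -f)" ] || { echo "Hostname mismatch, aborting." >&2; return 1; }
}
# Usage: confirm_host && rm -rf /var/opt/gitlab/postgresql/data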
Test your backups. Then test them again.
Backup testing should be a scheduled, monitored, mandatory task — not something that happens when someone remembers to do it. The correct answer to "when was the last time we restored from backup in a staging environment?" should never be "I'm not sure."
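What "scheduled, monitored, mandatory" can look like at its most basic: a weekly cron job on a staging host that restores the newest dump into a scratch database and checks that real rows came back. The paths, database name, and table name here are assumptions:

#!/usr/bin/env bash
# Hypothetical weekly restore test, e.g. run from cron on a staging host
set -euo pipefail

LATEST=$(ls -1t /var/backups/postgres/*.sql.gz | head -n 1)

dropdb --if-exists restore_test
createdb restore_test
gunzip -c "$LATEST" | psql --quiet -v ON_ERROR_STOP=1 restore_test

# A restore only counts if the data is actually there
COUNT=$(psql -t -A -d restore_test -c "SELECT count(*) FROM projects;")
[ "$COUNT" -gt 0 ] || { echo "Restore test FAILED: projects table is empty" >&2; exit 1; }
echo "Restore test passed: $COUNT projects restored from $LATEST"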
How this shows up in your certification exam
The GitLab incident maps directly to exam objectives across several certifications. If you're studying for AWS SysOps Administrator (SOA-C02), Azure Administrator (AZ-104), Azure Solutions Architect (AZ-305), or CompTIA Security+, expect scenarios that test whether you understand:
- RTO and RPO: GitLab's effective RPO was 6 hours (the age of the last good backup). Their RTO was 18 hours. Exam questions frequently ask you to design architectures that meet specific RTO/RPO targets — and the answer always involves tested, automated, geographically redundant backups.
- Backup verification: AWS Backup, Azure Backup Center, and similar services include restore testing features. Knowing that a backup job completing successfully is not the same as the backup being restorable is a common exam distinction.
- Snapshot strategies: Azure disk snapshots, AWS EBS snapshots, RDS automated backups — all of these have retention policies, and all of them must be tested. Disabling snapshots for cost reasons without a compensating control is the kind of decision the GitLab incident warns against.
- Change management for destructive operations: CompTIA Security+ covers change management processes specifically because human error during maintenance is one of the leading causes of availability incidents.
One last thought
The engineer who ran the command didn't lose his job. GitLab was explicit about this. They understood that blaming an individual for a systemic failure creates a culture where people hide mistakes rather than report them immediately, which makes the next incident worse. YP hit Ctrl+C the moment he realized the error and immediately told his team. That response, in the 20 seconds after the mistake, is a large part of the reason the recovery went as well as it did.
The real lesson isn't about rm -rf. It's about the gap between the systems you think you have and the systems you actually have. GitLab closed that gap the hard way. Your job as an infrastructure professional — and as someone studying for a cloud or operations certification — is to close it before the incident, not after.