The night everything went wrong

At around 17:20 UTC on 31 January 2017, GitLab.com was under attack. Spammers were hammering the database, creating thousands of fake accounts and flooding tables with garbage data, causing severe replication lag between the primary database server (db1.cluster.gitlab.com) and its replica (db2.cluster.gitlab.com). An on-call SRE began working to clear the lag by removing the bloated data directory on the replica and resynchronising it cleanly from the primary.

The SRE was exhausted. They had multiple terminal windows open. They had been working the incident for hours. They typed the rm -rf /var/opt/gitlab/postgresql/data command — and hit enter on the wrong window. The one connected to db1. The primary.

Within seconds, the command was chewing through the live production PostgreSQL data directory. The engineer noticed almost immediately and killed the process, but of roughly 310 GB of data, only about 4.5 GB remained. GitLab.com went offline. The postmortem document, which GitLab published openly, records the engineer’s own comment, written in the live doc while it was happening: “I just deleted the wrong database. I need to fix this.”

“Accidentally removed db1 data dir instead of db2.” — from GitLab’s live public postmortem document, written in real time

Five backup systems. Zero working recoveries.

Here is the part that turned a bad afternoon into a legendary cautionary tale. GitLab had not one but five different mechanisms that should have provided a recovery path. Not a single one could restore the database to within a few minutes of the incident.

| Mechanism | What it was | Status | Why it failed |
|---|---|---|---|
| Automated S3 backups | Nightly pg_dump to object storage | FAILED | Had not run successfully in 6 months due to a misconfiguration. Nobody had noticed. |
| Azure disk snapshots | VM-level block snapshots | FAILED | Had been disabled to cut costs several months prior. Not re-enabled. |
| Logical database backups | pg_dump on the replica | FAILED | The pg_dump job was silently failing — the process ran but produced empty or corrupt output. No alert fired. |
| Database replica (db2) | Streaming replication copy | FAILED | Was hours behind due to the replication lag that triggered the incident in the first place. Also had its data directory partially wiped during the cleanup attempt. |
| LVM snapshot | Manual snapshot taken earlier | PARTIAL | The SRE had taken a snapshot 6 hours before the incident as a precaution. This was the only viable recovery path — but it was 6 hours stale. |

The team restored from the LVM snapshot. GitLab.com came back online approximately 18 hours after it went down. The restored database was current as of 17:20 UTC on 31 January — everything written between that timestamp and the deletion was gone for good. About 5,000 projects, 700 new user accounts, and an unknown number of comments, issues, and merge requests were affected. The Git repositories themselves, stored on disk rather than in PostgreSQL, were untouched.

The most transparent postmortem in tech history

What made this incident famous was not just the scale of the failure. It was what GitLab did next. While the recovery was still in progress, GitLab opened a public Google Doc titled “GitLab.com database incident” and invited anyone on the internet to watch in real time. Engineers wrote notes as they worked. Mistakes were documented as they happened. The whole disaster played out in front of tens of thousands of readers.

GitLab also posted a full public postmortem on their blog within 24 hours, disclosing every failure in detail: which backups were broken, why, for how long, and who was responsible for each one. They created a public issue tracker item to fix every gap they had identified. They gave the SRE involved their full support. They did not fire anyone.

The tech industry took notice. The GitLab postmortem became required reading for anyone who runs production systems — not because the mistakes were exotic, but because they were embarrassingly ordinary. Every team has a backup that hasn’t been tested. Every team has a cost-cutting decision that disabled a safety net. Most teams just haven’t been in the position of discovering both facts at the same time.

The disaster recovery lessons for your certification

1. An untested backup is not a backup

The single most important lesson from GitLab 2017 is that backups only count if you regularly restore from them in a real environment. The S3 backups had been broken for six months. The logical dumps were producing corrupt output. Nobody knew — because nobody had done a restore test. The fix is simple: automate a regular restore test and alert on failure. AWS Backup offers scheduled restore testing, Azure Backup center gives you centralised monitoring of backup jobs, and with Velero (for Kubernetes) you can script scheduled restores into a scratch cluster. If you’re studying SOA-C02 or AZ-305, expect questions about validating backup integrity, not just creating backups.
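A restore test can be a small scheduled job. The sketch below is illustrative only: the dump path, the `projects` table, and the 1,000-row floor are assumptions, not GitLab’s actual setup, and it presumes a scratch PostgreSQL instance you are free to create databases on.

```shell
# row_count_ok: sanity floor on the restored data. A dump that restores
# cleanly but contains almost nothing is still a failed backup.
row_count_ok() { [ "${1:-0}" -ge 1000 ]; }

# restore_test: restore the dump into a throwaway database, spot-check a
# critical table, then drop the scratch database whatever happened.
restore_test() {
  local dump="$1" scratch="restore_test_$$" rows status
  createdb "$scratch" &&
    pg_restore --no-owner --dbname="$scratch" "$dump" &&
    rows=$(psql -At -d "$scratch" -c 'SELECT count(*) FROM projects')
  status=$?
  dropdb --if-exists "$scratch"   # always clean up the scratch database
  [ "$status" -eq 0 ] && row_count_ok "$rows"
}

# Example cron entry (paths and alerting command hypothetical):
#   0 6 * * * restore_test /backups/latest.dump || alert-oncall "restore test failed"
```

The point is the alert on the last line: a restore test that fails silently recreates the exact problem it was built to catch.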

2. Silent failures are the most dangerous kind

The pg_dump job was completing. The process exited 0. It just wasn’t producing valid output. This is a silent failure pattern that defeats monitoring if you only check whether a job ran rather than whether it produced useful output. The fix: always verify the artifact, not just the process. For database backups that means checking the file size, running a pg_restore --list or equivalent, and alerting if output is suspiciously small or invalid.
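The artifact checks described above can be sketched as a few shell functions. The 1 MiB size floor is an illustrative threshold, not a recommendation; `pg_restore --list` is the real flag for reading a custom-format archive’s table of contents.

```shell
# Verify the artifact, not the process: a job that exits 0 proves nothing.
MIN_DUMP_BYTES=${MIN_DUMP_BYTES:-1048576}   # 1 MiB floor; tune per database

# check_size: fail if the dump is missing, unreadable, or suspiciously small.
check_size() {
  [ -r "$1" ] || { echo "ALERT: $1 missing or unreadable" >&2; return 1; }
  local size
  size=$(wc -c < "$1")
  [ "$size" -ge "$MIN_DUMP_BYTES" ] ||
    { echo "ALERT: $1 is only $size bytes" >&2; return 1; }
}

# check_listable: a valid custom-format dump has a readable table of contents.
check_listable() { pg_restore --list "$1" >/dev/null 2>&1; }

# verify_dump: run both checks; wire the non-zero exit into your alerting.
verify_dump() { check_size "$1" && check_listable "$1" && echo "OK: $1"; }
```

Run `verify_dump` immediately after every backup job and page on failure — that alone would have surfaced GitLab’s empty dumps months early.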

3. RTO and RPO are commitments, not hopes

Recovery Time Objective (RTO) — how long you can afford to be down — and Recovery Point Objective (RPO) — how much data loss you can tolerate — are central concepts in cloud certification exams. GitLab’s implicit RPO was “near zero”. Their actual RPO on the day was 6 hours, because that was the age of the only working backup. These two numbers must match; if they don’t, you have a gap that only shows up during an incident. AWS offers RDS automated backups with point-in-time restore to within about five minutes, and Multi-AZ deployments that fail over automatically with near-zero data loss. Azure has geo-replicated SQL databases and Site Recovery for cross-region failover. Know what RTO and RPO your architecture actually delivers, not what you wish it delivered.
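Your effective RPO is measurable: it is the age of the newest restore point right now. For RDS, `describe-db-instances` exposes a `LatestRestorableTime` field you can compare against the clock. A minimal sketch (the `prod-db` identifier is hypothetical):

```shell
# rpo_gap_seconds: seconds of potential data loss at this moment, i.e. the
# gap between "now" and the newest point in time you could restore to.
# Second argument lets you pin "now" for testing; defaults to current UTC.
rpo_gap_seconds() {
  local restorable="$1" now="${2:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
  echo $(( $(date -ud "$now" +%s) - $(date -ud "$restorable" +%s) ))
}

# Example against a live instance (identifier hypothetical):
#   latest=$(aws rds describe-db-instances \
#     --db-instance-identifier prod-db \
#     --query 'DBInstances[0].LatestRestorableTime' --output text)
#   rpo_gap_seconds "$latest"   # alert if this exceeds your stated RPO
```

Had GitLab run this comparison on 31 January, it would have reported a gap of roughly six hours, not the near-zero figure everyone assumed.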

4. Multi-terminal fatigue is a real operational hazard

The SRE was in the right directory on db2 in one window and the wrong one on db1 in another. Typing in the wrong terminal while fatigued is not carelessness — it is an ergonomic failure mode that happens to everyone who manages multiple servers simultaneously. Modern mitigations include: colour-coded or clearly labelled terminal titles, PS1 prompts that display the hostname prominently, infrastructure access tools (AWS SSM Session Manager, Azure Bastion) that clearly identify the target, and — most importantly — requiring a separate confirmation prompt for destructive operations on production hosts.
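Two of those mitigations fit in a few lines of `~/.bashrc`. The hostname patterns below are placeholders; the idea is that production shells look different and destructive commands demand a typed hostname, so a wrong-terminal mistake fails safe.

```shell
# Hypothetical ~/.bashrc fragment: make production visually unmistakable.
case "$(hostname)" in
  *prod*|db1*)  PS1='\[\e[41;97m\] PROD \h \[\e[0m\] \w\$ ' ;;  # red banner
  *)            PS1='[\h] \w\$ ' ;;
esac

# confirm: require the operator to type the current hostname before a
# destructive command runs, so autopilot muscle memory is interrupted.
confirm() {
  local target
  read -rp "About to run '$*' on $(hostname). Type the hostname to proceed: " target
  [ "$target" = "$(hostname)" ] && "$@"
}

# Usage:
#   confirm rm -rf /var/opt/gitlab/postgresql/data
```

Typing the hostname, rather than `y`, is the design choice that matters: it forces the operator to read which machine the shell is actually on.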

5. Cost-cutting on safety infrastructure is a liability, not a saving

The Azure disk snapshots had been disabled to reduce costs. Whoever made that decision almost certainly weighed it as a reasonable trade-off at the time. The true cost became visible only on 31 January 2017, when those snapshots could have been the difference between a 30-minute recovery and an 18-hour outage plus six hours of permanent data loss. Cloud certification exams frame this as a risk management question: what is the cost of a backup strategy versus the cost of a recovery without it? The answer is almost always that the backup is cheaper.

Why this story matters for your certification exam

The GitLab 2017 incident maps directly to exam objectives across multiple certifications: backup and restore strategy, RTO/RPO analysis, monitoring and alerting, and business continuity design all appear in SOA-C02 and AZ-305 alike.

If you see a scenario question that gives you a choice between “configure a backup policy” and “configure a backup policy and schedule automated restore tests”, the answer is always the second option. GitLab is why.

What GitLab did after

The real ending to this story is a good one. Within weeks of the incident, GitLab had created — and begun closing — public issues for every gap they had identified: repairing the broken S3 backups, re-enabling disk snapshots, adding monitoring and alerting for backup jobs, and instituting regular restore testing.

GitLab’s transparency turned a catastrophic incident into a trust-building moment. Their user retention after the incident was remarkably high, specifically because users felt the company had been honest with them. That’s not just a feel-good footnote — it is a practical argument for writing postmortems that name the real failures instead of softening them.

The SRE who ran the command continued working at GitLab.

Want to test your disaster recovery knowledge? Try our free AWS SysOps Administrator SOA-C02 practice quiz or the AZ-305 Solutions Architect Expert quiz — both cover RTO/RPO, backup validation, and business continuity scenarios in depth.