What actually happened
Facebook's autonomous system, AS32934, runs a global backbone that interconnects its data centers. On the day of the incident, an engineer ran a routine command intended to assess the available capacity of the global backbone. According to Facebook's own post-mortem, the command had a bug: instead of merely auditing capacity, it took down every connection in the backbone, effectively disconnecting Facebook's data centers from one another.
An audit tool was supposed to catch exactly this kind of risky change. Unfortunately, the audit tool also had a bug, so it didn’t stop the push. The configuration change went out, the backbone routers started withdrawing BGP advertisements, and within a couple of minutes Facebook had cleanly removed itself from the internet’s routing table.
The DNS chain reaction
This is the part that turns a network outage into a global outage. Facebook’s authoritative DNS servers (the ones that answer queries for facebook.com, instagram.com, whatsapp.net) live inside Facebook’s network. Those DNS servers have a built-in safety check: if they ever lose their connection to Facebook’s backbone, they automatically withdraw their own BGP route advertisements, on the assumption that something is broken and they shouldn’t be answering queries with stale data.
So when the backbone went down, the DNS servers correctly detected the failure and pulled themselves off the internet. Now there were no nameservers anywhere on the public internet that could resolve facebook.com. Recursive resolvers around the world (Google’s 8.8.8.8, Cloudflare’s 1.1.1.1, every ISP’s caching resolver) started returning SERVFAIL.
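The safety check described above can be sketched as a toy model. Everything here is illustrative (the class, the half-the-datacenters threshold, and the example answer IP are my assumptions, not Facebook's actual logic), but it captures the design: withdraw your own prefix rather than answer with stale data, and let resolvers return SERVFAIL.

```python
# Toy model of an authoritative nameserver that withdraws its own anycast
# prefix when it loses sight of the backbone. All names, thresholds, and
# addresses are hypothetical illustrations, not Facebook's real logic.

class AnycastNameserver:
    def __init__(self, prefix: str):
        self.prefix = prefix
        self.advertised = True  # our own BGP advertisement

    def backbone_healthy(self, reachable_datacenters: int, total: int) -> bool:
        # Hypothetical threshold: backbone counts as down if we can
        # reach fewer than half of the data centers.
        return reachable_datacenters >= total / 2

    def health_check(self, reachable_datacenters: int, total: int) -> None:
        if self.backbone_healthy(reachable_datacenters, total):
            self.advertised = True   # (re-)announce the prefix
        else:
            self.advertised = False  # withdraw: better silent than stale

    def resolve(self, name: str) -> str:
        # Resolvers can only reach us while our prefix is advertised.
        if not self.advertised:
            return "SERVFAIL"        # what 8.8.8.8 and 1.1.1.1 returned
        return "157.240.0.35"        # example answer, not a real record

ns = AnycastNameserver("129.134.30.0/24")
ns.health_check(reachable_datacenters=0, total=12)  # backbone gone
print(ns.resolve("facebook.com"))                   # SERVFAIL
```

The uncomfortable design insight: the check worked exactly as intended. It just assumed the backbone would never disappear everywhere at once.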
Worse, the billions of mobile clients running the Facebook and WhatsApp apps started retrying aggressively. The retry storm pushed DNS query volume from the public internet to roughly 30× normal levels, briefly stressing other services that share infrastructure with Facebook’s edge.
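The retry storm is a client-design lesson in its own right. Here is a small sketch (numbers illustrative, not from the incident) contrasting a fixed-interval retry loop with exponential backoff plus jitter — the standard mitigation for exactly this thundering-herd pattern.

```python
# Illustrative comparison of retry strategies. A fleet retrying on a fixed
# short timer multiplies query load during an outage; exponential backoff
# with "full jitter" spreads the same retries out over time.

import random

def naive_retry_delays(attempts: int) -> list:
    # Retry every second, forever: the pattern that amplifies load.
    return [1.0] * attempts

def backoff_delays(attempts: int, base: float = 1.0, cap: float = 60.0) -> list:
    # Full-jitter exponential backoff: before retry i, sleep a random
    # amount between 0 and min(cap, base * 2**i) seconds.
    return [random.uniform(0, min(cap, base * 2 ** i)) for i in range(attempts)]

# Over 8 retries, naive clients pack 8 queries into 8 seconds of spacing;
# backed-off clients spread them over up to ~3 minutes.
print(sum(naive_retry_delays(8)))                 # 8.0
print(sum(min(60.0, 2.0 ** i) for i in range(8)))  # 183.0 (worst-case spread)
```

Multiply that difference by a few billion installed apps and you get the 30× figure.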
The takeaway: BGP withdrew the routes, but it was DNS that turned a backbone problem into a brand-extinction event. When your customers can’t resolve your name, you don’t exist.
Why it took six hours to fix
Here is where the story turns from a networking problem into a Hollywood script. With the backbone down:
- Remote management was gone. Engineers couldn’t SSH into the routers because the routes to those routers were the things that had just been withdrawn.
- Internal tools were down. Every internal Facebook system — chat, ticketing, on-call rotation, even the corporate VPN — ran on the same infrastructure. Engineers couldn’t coordinate over the tools they normally used.
- Physical access was blocked. Facebook’s data centers use electronic badge readers to control physical access. The badge system depends on the network. So when engineers physically arrived at the data centers, the doors wouldn’t open. There are reports of staff having to use angle grinders on server cages.
- The out-of-band network was undertested. A proper out-of-band (OOB) management network is supposed to be the lifeboat for exactly this scenario, but Facebook’s OOB capacity was limited and access procedures had not been recently rehearsed.
The actual recovery required an engineer with physical access, a console cable, and the credentials to manually restore the BGP advertisements router by router. Once the backbone came back up, the DNS servers re-advertised their prefixes, recursive resolvers started getting answers again, and Facebook slowly returned to the internet over the next hour. Total estimated revenue impact: north of $60 million for one day, plus a 5% drop in Meta stock.
The networking lessons every CCNA candidate should know
1. BGP is the glue that holds the internet together — and it’s fragile
Border Gateway Protocol is a path-vector routing protocol that exchanges reachability information between autonomous systems. It is one of only a handful of protocols that the entire global internet depends on. There is no built-in “undo”: when you withdraw a prefix, every BGP speaker in the world updates its routing table within seconds. CCNA covers eBGP basics; CCNP and CCIE go deep into route selection, communities, and policy. If you take one thing from this story: BGP changes need staging, simulation, and a rollback plan.
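To make "withdraw a prefix" concrete, here is a minimal routing-table sketch using Python's standard `ipaddress` module. It is a simplification (real routers hold millions of routes and run best-path selection across many attributes), but it shows the core mechanic: longest-prefix match, and what happens when no prefix matches at all.

```python
# Minimal sketch of what a route withdrawal means for forwarding.
# Lookup is longest-prefix match; when every covering prefix has been
# withdrawn, there is simply no entry left and traffic is dropped.

import ipaddress

class RoutingTable:
    def __init__(self):
        self.routes = {}  # ip_network -> next hop

    def advertise(self, prefix: str, next_hop: str):
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix: str):
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address: str):
        ip = ipaddress.ip_address(address)
        matches = [p for p in self.routes if ip in p]
        if not matches:
            return None  # no route: destination unreachable
        best = max(matches, key=lambda p: p.prefixlen)  # longest prefix wins
        return self.routes[best]

rib = RoutingTable()
rib.advertise("129.134.0.0/16", "peer-of-AS32934")  # a real Facebook prefix
print(rib.lookup("129.134.30.12"))                  # peer-of-AS32934
rib.withdraw("129.134.0.0/16")                      # the outage, in one line
print(rib.lookup("129.134.30.12"))                  # None
```

That final `None` is the whole incident in miniature: no error, no fallback, just an absence where a route used to be.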
2. DNS depends on reachability, and reachability depends on routing
Authoritative nameservers must be reachable, which means they must be advertised by something. If you host your own DNS inside your own AS, design it so a routing failure cannot take your nameservers offline. Use an external secondary (Route 53, NS1, Dyn) or distribute your authoritative DNS across multiple ASes. CCNA covers DNS lookups, recursion vs iteration, and record types — but the operational lesson is architectural: don’t put all your DNS eggs in one autonomous system.
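You can turn that architectural rule into an automated design check. This sketch uses toy data: the four `*.ns.facebook.com` hostnames and AS32934 are real, but the secondary host is hypothetical, and in practice you would resolve each NS record and look its address up in RIR/WHOIS data rather than hard-code ASNs.

```python
# Design check sketch: flag any zone whose authoritative nameservers all
# sit inside a single autonomous system. ASN data here is hand-entered
# toy input; a real tool would derive it from NS lookups + RIR data.

def single_as_dns(nameservers: dict) -> bool:
    """True if every authoritative NS lives in the same AS."""
    return len(set(nameservers.values())) == 1

facebook_2021 = {          # all four NS hosts inside AS32934
    "a.ns.facebook.com": 32934,
    "b.ns.facebook.com": 32934,
    "c.ns.facebook.com": 32934,
    "d.ns.facebook.com": 32934,
}
with_secondary = {         # same zone plus a hypothetical external secondary
    "a.ns.facebook.com": 32934,
    "b.ns.facebook.com": 32934,
    "ns1.example-secondary.net": 16509,  # e.g. a Route 53-style provider
}
print(single_as_dns(facebook_2021))   # True  -> routing failure kills DNS
print(single_as_dns(with_secondary))  # False -> zone survives an AS outage
```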
3. Out-of-band management is not optional
Every production network needs an OOB path that does not depend on the production data plane. Cellular modems, dedicated leased lines, or a separate physical network — whatever it takes, you need a way to reach a console port when the main network is on fire. Test it quarterly. The Facebook outage is the textbook example of what happens when you don’t.
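The "test it quarterly" advice can be as simple as a reachability sweep over the OOB path. Here is a minimal sketch; the hostnames and ports are hypothetical placeholders, and a real drill would also verify login and console access, not just that the TCP port answers.

```python
# Sketch of a quarterly OOB drill: attempt a TCP connection to each
# console server over the out-of-band path and alarm on anything
# unreachable. Hostnames and ports below are hypothetical.

import socket

def oob_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # DNS failure, refused, timed out, unreachable
        return False

console_servers = [
    ("oob-console.dc1.example.net", 22),  # hypothetical console server
    ("oob-console.dc2.example.net", 22),
]

for host, port in console_servers:
    status = "OK" if oob_reachable(host, port) else "ALARM: unreachable"
    print(f"{host}:{port} {status}")
```

The point of the drill isn't the script — it's forcing the team to rehearse the access path (credentials, jump hosts, cellular backup) before the day it's the only path left.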
4. Change management is a control, not a hindrance
The proximate cause of the outage was a buggy command. The root cause was a chain of missing controls: no canary deployment for the BGP change, an audit tool that itself had a bug, and no automated rollback on loss of reachability. Modern network automation (Ansible, NetBox, BGP route reflector simulation, dry-run modes) exists to catch exactly this. Use it.
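One of those missing controls — a pre-push impact check — can be sketched in a few lines. The logic and thresholds here are my assumptions, not Facebook's tooling: simulate the change's effect on the advertised-prefix set and refuse anything that would withdraw more than a small fraction at once.

```python
# Sketch of a hypothetical pre-flight control: diff the prefix set a
# change would leave advertised against what is advertised now, and
# refuse changes that withdraw too much at once. Threshold is arbitrary.

def predicted_withdrawals(current: set, proposed: set) -> set:
    return current - proposed

def safe_to_push(current: set, proposed: set, max_fraction: float = 0.05) -> bool:
    lost = predicted_withdrawals(current, proposed)
    return len(lost) <= max_fraction * len(current)

current = {f"129.134.{i}.0/24" for i in range(100)}
bad_change = set()                            # the buggy command: withdraw everything
small_change = current - {"129.134.0.0/24"}   # a routine one-prefix change

print(safe_to_push(current, small_change))  # True: 1% of routes affected
print(safe_to_push(current, bad_change))    # False: would blackhole the AS
```

A check like this is exactly what the broken audit tool was supposed to be; the lesson is that the check itself needs testing too.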
5. Single points of failure hide in unexpected places
Facebook’s badge readers, internal chat, ticketing, on-call paging, VPN, and DNS were all directly or indirectly dependent on the same backbone. None of them looked like single points of failure on their own. The lesson is to map your dependencies: when you lose Service X, what else stops working? If the answer is “everything,” you have a problem.
Why this story still matters for your certification
If you’re studying for CCNA 200-301, CCNP Enterprise, AWS Advanced Networking (ANS-C01), or really any networking-adjacent cert, the Facebook 2021 outage is a goldmine of exam-relevant concepts in one self-contained story. Expect to see questions like:
- “Which protocol is used by ISPs to exchange reachability information between autonomous systems?” — BGP.
- “What happens to traffic destined for a prefix that has been withdrawn from BGP?” — it is dropped at the first router with no matching route; and if the withdrawn prefix hosted a zone’s authoritative nameservers, DNS lookups for that zone fail too.
- “Why is out-of-band management important in a production network?” — it provides a recovery path independent of the production data plane.
- “What does a DNS resolver return when it cannot reach an authoritative nameserver?” — SERVFAIL.
- “Which design principle would have mitigated the 2021 Facebook outage?” — secondary DNS in a separate AS, canary BGP deployments, or independent OOB management.
One last thought
The most uncomfortable part of the story isn’t the technical mistake. Engineers run buggy commands every day; that’s why we have controls. The uncomfortable part is how many independent safety nets failed in the same direction: the audit tool, the rollback automation, the OOB network, the badge reader fallback, the DNS architecture. Each of them, on its own, looked like a reasonable engineering decision. Combined, they produced a six-hour, sixty-million-dollar outage.
Whenever you’re studying a routing protocol, a DNS concept, or a high-availability pattern, try to remember October 4, 2021. Networks fail in stories, not in single commands.
Want to study BGP, DNS and routing for the CCNA? Try our free CCNA 200-301 practice quiz — 110 scenario-based questions, no signup required.