The four-minute window
Security researchers at GitGuardian have documented a grim statistic: when an AWS access key is pushed to a public GitHub repository, the average time before it is detected and exploited by an automated scanner is less than four minutes. Not hours. Not days. Minutes.
The bots are always watching. They continuously scan GitHub’s public event stream — every push, every commit, every gist — using pattern matching to extract strings that look like AWS credentials: an access key ID beginning with AKIA, paired with a 40-character secret key nearby. The moment they find a hit, they authenticate against the AWS API, enumerate the key’s permissions, and start launching instances.
Their preferred workload is cryptocurrency mining. A leaked key with ec2:RunInstances permission and no Service Control Policy to cap instance types is effectively a blank check for free compute. The attackers spin up p3.16xlarge GPU instances — $24.48 per hour each — across every region they can reach. They don’t need the operation to last long. A single overnight run across a dozen regions clears five figures before sunrise.
“I woke up to 47 unread AWS billing alert emails. Except I had never set up billing alerts. The emails were from AWS itself, telling me my estimated monthly bill had exceeded $50,000 — then $60,000 — then $70,000.” — anonymous developer, r/aws
How it keeps happening
The leaked-key scenario is the most dramatic, but it’s far from the only way developers hand AWS an open invoice. The patterns that produce four- and five-figure surprise bills repeat across companies of every size.
The forgotten weekend instance
A machine learning engineer spins up a p3.2xlarge instance on Friday afternoon to train a model. The job finishes in three hours. She closes her laptop and leaves for the weekend — the instance keeps running. By Monday morning, 60 hours of GPU compute have accumulated at $3.06 per hour: $183.60 for a single instance. Multiply that by a team of ten engineers with the same habit and you’ve explained a $20,000 monthly overage before a single line of production code runs.
The solution is trivial in retrospect: the Instance Scheduler on AWS solution, a CloudWatch alarm that stops the instance after two hours of near-zero network I/O, or even a simple Terraform rule that enforces a StopTime tag on all dev instances. Almost nobody sets these up until after the first painful bill.
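As a concrete illustration of the alarm approach, here is a minimal boto3 sketch that stops an instance after roughly two hours of near-zero outbound traffic. The region, instance ID, and idle threshold are placeholder values you would tune for your own workloads.

```python
import boto3

# Minimal sketch: stop a dev instance after ~2 hours of near-zero network output.
# The region, instance ID, and threshold below are placeholders.
REGION = "us-east-1"
INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical dev instance

cloudwatch = boto3.client("cloudwatch", region_name=REGION)

cloudwatch.put_metric_alarm(
    AlarmName=f"idle-stop-{INSTANCE_ID}",
    Namespace="AWS/EC2",
    MetricName="NetworkOut",
    Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
    Statistic="Average",
    Period=3600,                 # evaluate hourly
    EvaluationPeriods=2,         # two consecutive idle hours
    Threshold=50_000,            # ~50 KB/hour of output counts as "idle"
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",
    # Built-in CloudWatch alarm action that stops the instance.
    AlarmActions=[f"arn:aws:automate:{REGION}:ec2:stop"],
)
```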
NAT Gateway: the silent meter
NAT Gateway charges are one of the most common AWS bill surprises for teams migrating workloads to VPCs. The service costs $0.045 per hour to run, which sounds reasonable. What catches teams off guard is the data processing charge: $0.045 per GB of traffic that flows through it — in addition to any standard data transfer fees.
A microservices architecture where every service calls an external API, routes through a NAT Gateway, and pulls container images from Docker Hub on every deployment can easily push hundreds of gigabytes per day through a single NAT Gateway. Teams have reported monthly NAT Gateway charges exceeding $8,000 for workloads they assumed were “basically free” because the EC2 instances themselves were small.
The fix is architectural: use VPC endpoints for AWS services (S3, DynamoDB, ECR) so that traffic stays on the AWS private network and bypasses the NAT Gateway entirely. It’s a one-afternoon refactor that can eliminate thousands of dollars a month.
The “deploy to all regions” mistake
CloudFormation StackSets, Terraform for_each over a list of regions, a misconfigured CI/CD pipeline that iterates region codes — there are many ways to accidentally deploy infrastructure to all 30+ AWS regions simultaneously. Each region gets its own NAT Gateways, load balancers, RDS instances, and Elastic IPs. The bill for a full multi-region deployment of a non-trivial architecture can exceed $50,000 per month.
One particularly memorable incident involved a developer who intended to deploy a proof-of-concept application to us-east-1. A copy-paste error in a Terraform variable left the region list set to a previous colleague’s test value of all available regions. The infrastructure deployed successfully. Every status check was green. The billing dashboard was not.
S3 request pricing: a trap for the unwary
S3 storage is famously cheap — $0.023 per GB per month for standard storage. What surprises teams is that AWS also charges per API request: $0.005 per 1,000 PUT/COPY/POST/LIST requests and $0.0004 per 1,000 GET requests. Those numbers sound tiny until you do the math on a poorly designed application.
An application that lists the contents of an S3 bucket on every page load, on a site with 100,000 daily visitors, is making 100,000 LIST requests per day. At $0.005 per thousand, that’s $0.50 per day — fine. But LIST requests on a bucket with millions of objects are paginated and require multiple API calls per listing. An application doing deep listings can rack up millions of requests daily and generate hundreds of dollars in request charges on top of the storage bill.
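To see why deep listings multiply, here is a minimal boto3 sketch that counts how many paginated LIST calls a single "full listing" actually makes; the bucket name is a placeholder.

```python
import boto3

# Each list_objects_v2 call returns at most 1,000 keys, so one deep listing of
# a large bucket fans out into thousands of billable LIST requests.
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

api_calls = 0
objects = 0
for page in paginator.paginate(Bucket="my-analytics-bucket"):  # hypothetical bucket
    api_calls += 1
    objects += page.get("KeyCount", 0)

# At $0.005 per 1,000 LIST requests, 5 million objects is ~5,000 calls, ~$0.025
# per full listing -- multiplied by every page load that triggers it.
print(f"{objects} objects required {api_calls} LIST requests")
```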
The pattern compounds when combined with S3 event notifications triggering Lambda functions that in turn write back to S3, triggering more notifications in an accidental feedback loop. Teams have reported S3 bills climbing from $30/month to $3,000/month overnight when a new feature introduced this kind of recursive trigger.
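The usual guard is to write the Lambda’s output under a prefix that the bucket notification filter excludes, and to skip any event that already points at that prefix. A minimal sketch, with a hypothetical output prefix and a placeholder processing step:

```python
import urllib.parse

import boto3

s3 = boto3.client("s3")
OUTPUT_PREFIX = "processed/"  # hypothetical prefix, excluded from the bucket's notification filter

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Guard: never process our own output, even if the notification filter
        # is misconfigured -- this breaks the feedback loop.
        if key.startswith(OUTPUT_PREFIX):
            continue

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        s3.put_object(
            Bucket=bucket,
            Key=f"{OUTPUT_PREFIX}{key}",
            Body=body.upper(),  # placeholder "processing" step
        )
```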
Bedrock and SageMaker: AI cost surprises
The most recent category of cloud cost horror belongs to AI services. A startup experimenting with Amazon Bedrock left a batch inference job running against Claude 3 Opus without a token budget cap. The job was processing a dataset that turned out to be 40× larger than estimated. By the time the team noticed, $72,000 in Bedrock token charges had accumulated — in a single billing period, against a company with a monthly AWS budget of $4,000.
SageMaker real-time endpoints are another trap: a provisioned endpoint bills by the hour whether it receives a single request or a million. A developer who spins up a SageMaker endpoint for a demo, presents it, and then forgets to tear it down will find a multi-hundred-dollar charge on next month’s bill. Unlike Lambda, a provisioned real-time endpoint never scales to zero.
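A cheap safety net is a periodic script that flags endpoints nobody has touched recently. A minimal boto3 sketch, assuming a seven-day cutoff and manual confirmation before deletion:

```python
from datetime import datetime, timedelta, timezone

import boto3

# Minimal sketch: flag (and optionally delete) SageMaker endpoints that haven't
# been modified in a week -- a provisioned endpoint bills hourly even at zero traffic.
sm = boto3.client("sagemaker")
cutoff = datetime.now(timezone.utc) - timedelta(days=7)

stale = sm.list_endpoints(LastModifiedTimeBefore=cutoff)["Endpoints"]
for endpoint in stale:
    name = endpoint["EndpointName"]
    print(f"Stale endpoint still billing hourly: {name}")
    # Uncomment once you've confirmed nothing depends on it:
    # sm.delete_endpoint(EndpointName=name)
```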
The five defenses that actually work
After reading enough of these stories, the preventive measures become obvious — which makes it all the more remarkable how rarely they’re in place when the incident happens.
1. AWS Budgets with hard action alerts
Create an AWS Budget for every account, with alerts at 50%, 80%, and 100% of the monthly limit. Go further: configure a Budget Action that automatically attaches an IAM policy denying ec2:RunInstances when spend crosses 100%. This won’t stop every runaway cost, but it will stop the most expensive ones. It takes ten minutes to set up and has zero ongoing cost.
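A minimal boto3 sketch of the alert side, assuming a hypothetical $1,000 monthly limit, account ID, and notification address; the Budget Action that attaches the deny policy is usually easier to wire up in the console or your IaC tool.

```python
import boto3

# Minimal sketch: a $1,000/month cost budget with alerts at 50%, 80%, and 100%.
# The account ID and email address are placeholders.
budgets = boto3.client("budgets")
ACCOUNT_ID = "123456789012"
ALERT_EMAIL = "oncall@example.com"

budgets.create_budget(
    AccountId=ACCOUNT_ID,
    Budget={
        "BudgetName": "monthly-cost-guardrail",
        "BudgetType": "COST",
        "TimeUnit": "MONTHLY",
        "BudgetLimit": {"Amount": "1000", "Unit": "USD"},
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": pct,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [{"SubscriptionType": "EMAIL", "Address": ALERT_EMAIL}],
        }
        for pct in (50, 80, 100)
    ],
)
```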
2. Never put credentials in code — ever
Use IAM roles for EC2, ECS, Lambda, and every other AWS compute service. Use AWS Secrets Manager or SSM Parameter Store for third-party credentials. Add a pre-commit hook (tools like git-secrets or detect-secrets) that scans every commit for credential patterns before they ever leave your machine. If you must use access keys for CI/CD, use short-lived credentials via sts:AssumeRole and rotate them automatically.
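For illustration only, this toy scanner shows the kind of patterns those tools match: an access key ID beginning with AKIA (or ASIA for temporary credentials) and a 40-character secret nearby. Use git-secrets or detect-secrets in practice rather than rolling your own.

```python
import re
import sys

# Toy illustration of what git-secrets / detect-secrets look for.
# The secret-key pattern is deliberately loose and will produce false positives.
ACCESS_KEY_ID = re.compile(r"\b(AKIA|ASIA)[0-9A-Z]{16}\b")
SECRET_KEY = re.compile(r"\b[A-Za-z0-9/+=]{40}\b")

def scan(path: str) -> bool:
    found = False
    with open(path, errors="ignore") as f:
        for lineno, line in enumerate(f, 1):
            if ACCESS_KEY_ID.search(line) or SECRET_KEY.search(line):
                print(f"{path}:{lineno}: possible AWS credential")
                found = True
    return found

if __name__ == "__main__":
    results = [scan(path) for path in sys.argv[1:]]
    sys.exit(1 if any(results) else 0)
```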
3. Service Control Policies at the organization level
Use AWS Organizations SCPs to deny expensive instance types in development accounts. A policy that denies ec2:RunInstances for GPU and accelerator families such as p3.*, p4d.*, g4dn.*, and trn1.* in non-production accounts means that even if a developer’s key is leaked, the attacker cannot spin up GPU mining rigs. SCPs apply even to the root user of member accounts, making them the strongest guardrail available.
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyGpuInstanceLaunch",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringLike": {
          "ec2:InstanceType": ["p3.*", "p4d.*", "g4dn.*", "trn1.*"]
        }
      }
    }
  ]
}
```
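If you manage the organization programmatically, a minimal boto3 sketch for registering the policy above and attaching it to a development OU might look like this; the file name and OU ID are placeholders.

```python
import boto3

# Minimal sketch: register the SCP above and attach it to a development OU.
org = boto3.client("organizations")

with open("deny-gpu-instances.json") as f:  # hypothetical file containing the policy above
    policy_document = f.read()

policy = org.create_policy(
    Name="DenyGpuInstanceLaunch",
    Description="Block GPU/accelerator instance launches in dev accounts",
    Type="SERVICE_CONTROL_POLICY",
    Content=policy_document,
)

org.attach_policy(
    PolicyId=policy["Policy"]["PolicySummary"]["Id"],
    TargetId="ou-abcd-12345678",  # hypothetical development OU
)
```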
4. VPC endpoints for AWS services
Create Gateway endpoints for S3 and DynamoDB (free) and Interface endpoints for ECR, Secrets Manager, and any other AWS service your workloads call frequently. This keeps the traffic on the AWS private network, eliminates NAT Gateway data processing charges for it, and reduces latency as a bonus. Interface endpoints do carry a small hourly charge per Availability Zone, but for high-volume traffic that is far cheaper than NAT Gateway data processing.
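A minimal boto3 sketch for the S3 Gateway endpoint, with placeholder VPC and route table IDs:

```python
import boto3

# Minimal sketch: a free Gateway endpoint so S3 traffic from private subnets
# bypasses the NAT Gateway. The VPC and route table IDs are placeholders.
ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],  # route tables used by the private subnets
)
```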
5. AWS Cost Anomaly Detection
AWS Cost Anomaly Detection uses machine learning to identify unusual spend patterns and can alert you within hours of a runaway cost event, rather than at the end of the billing cycle. Monitors can be scoped by service, linked account, or cost category, alerts fire when an anomaly’s estimated cost impact exceeds a dollar or percentage threshold you define, and the service costs nothing beyond the SNS notifications.
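A minimal boto3 sketch that creates a per-service monitor and an immediate email subscription; the address and the $100 impact threshold are placeholders, and newer API versions also accept a ThresholdExpression in place of the flat Threshold field.

```python
import boto3

# Minimal sketch: a per-service anomaly monitor with immediate email alerts for
# anomalies whose estimated cost impact exceeds $100. The email is a placeholder.
ce = boto3.client("ce")

monitor_arn = ce.create_anomaly_monitor(
    AnomalyMonitor={
        "MonitorName": "per-service-spend",
        "MonitorType": "DIMENSIONAL",
        "MonitorDimension": "SERVICE",
    }
)["MonitorArn"]

ce.create_anomaly_subscription(
    AnomalySubscription={
        "SubscriptionName": "spend-anomaly-alerts",
        "MonitorArnList": [monitor_arn],
        "Subscribers": [{"Type": "EMAIL", "Address": "oncall@example.com"}],
        "Frequency": "IMMEDIATE",
        "Threshold": 100.0,  # alert when the anomaly's total cost impact exceeds $100
    }
)
```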
Why this matters for your AWS certification
Cost optimization is a first-class domain on every AWS certification exam. The AWS Well-Architected Framework’s Cost Optimization pillar is explicit: you are expected to know how to right-size instances, use billing alerts, implement tagging strategies, and choose the correct pricing model (On-Demand vs. Reserved vs. Spot) for a given workload.
For the AWS Cloud Practitioner (CLF-C02), expect questions about AWS Budgets, Cost Explorer, the AWS Free Tier limits, and the shared responsibility model as it applies to cost control. The exam loves asking which tool sends an alert when you approach a threshold (Budgets) versus which tool shows you historical spend (Cost Explorer).
For the AWS SysOps Administrator (SOA-C02), cost optimization is a full exam domain (12% of the exam). You need to know how to configure Cost Anomaly Detection, how Reserved Instance coverage reports work, how Compute Optimizer produces right-sizing recommendations, and how to use AWS Config rules to detect untagged or oversized resources automatically.
For the AWS Solutions Architect Associate (SAA-C03), the exam frequently presents cost-optimization scenarios in the context of architecture decisions: when to use S3 Intelligent-Tiering versus lifecycle policies, when a NAT Gateway is unnecessary because a VPC endpoint exists, and how to architect a system that stays within a defined monthly budget using Auto Scaling with a maximum capacity cap.
- “Which service sends an alert when your AWS spend reaches a defined threshold?” — AWS Budgets (not Cost Explorer, which is read-only reporting).
- “A company wants to prevent EC2 instances in development accounts from being larger than t3.medium. What is the most effective approach?” — Service Control Policy applied at the OU level.
- “An application in a private subnet makes frequent API calls to S3. Which change reduces data transfer costs?” — Create a Gateway VPC endpoint for S3.
The uncomfortable truth
Every engineer who reads one of these horror stories thinks the same thing: “That would never happen to me.” And then, six months later, they get the email.
Cloud billing horror is not an intelligence problem. The developers who leak keys to GitHub are not careless or incompetent — they are often experienced engineers who were moving fast, who knew the risk in the abstract, and who simply didn’t have the right guardrails in place before something went wrong. The solution is not to be more careful. The solution is to build systems where a moment of inattention cannot cost you $72,000.
Set up the budget alerts today. Add the pre-commit hook today. Create the SCP today. All three together take less than an hour. The story of the $72,000 overnight bill is compelling precisely because it is so preventable — and yet it keeps happening, to smart people, at well-funded companies, over and over again.