Reinforce CloudWatch alarming, Systems Manager patching, and DR strategies while commuting or working out. New episodes covering SOA-C02 operations topics drop weekly.
About the exam
Why earn the AWS SysOps Administrator certification?
SOA-C02 is the AWS certification for cloud operators — professionals responsible for deploying, managing, and monitoring production AWS workloads. It tests real-world operational skills, not just architectural knowledge.
- Validates hands-on operational skills: CloudWatch, SSM, Config, GuardDuty, VPC troubleshooting
- Proves ability to implement high-availability architectures: Multi-AZ, Auto Scaling, Route 53 failover
- Demonstrates automation expertise: CloudFormation, Systems Manager, EC2 Image Builder
- Opens cloud operations, DevOps, and SRE roles — median salary $120–$150k for AWS-certified ops engineers
- Completes the AWS Associate trilogy alongside SAA-C03 and DVA-C02
- Unique feature: SOA-C02 includes an optional exam lab section testing hands-on AWS console skills
Exam blueprint
SOA-C02 exam domains
Six domains spanning the full operations lifecycle. Monitoring (20%) is the heaviest domain, with Networking tied for second at 18%; make CloudWatch and VPC your strongest areas.
Course content
7 modules · ~35 hours
Each module maps to one or more exam domains. Work through them in order or focus on your weak areas using the practice test to guide you.
Monitoring, Logging & Remediation · 3 lessons
The heaviest domain at 20%. Master CloudWatch alarms (standard, composite, M-of-N evaluation), CloudWatch Logs metric filters and Insights queries, AWS Config managed rules and auto-remediation, EventBridge rules for event-driven operations, CloudTrail for audit and integrity validation, Systems Manager OpsCenter for operational incident management, and VPC Flow Logs for network analysis. Understand when to use CloudWatch vs Config vs CloudTrail vs GuardDuty for different monitoring scenarios.
CloudWatch alarms are the SysOps engineer's primary signal. The exam tests both alarm mechanics (M-of-N evaluation, missing-data handling) and composition (combining multiple alarms into one composite signal).
- Standard alarms: threshold on a metric over an evaluation period. Three states: OK, ALARM, INSUFFICIENT_DATA. Actions on state transitions — typically SNS notify or Auto Scaling action.
- M-of-N evaluation: "alarm when 3 out of the last 5 data points exceed threshold". Reduces flapping vs simple "any breach fires". The exam-canonical setting for production CPU/memory alarms.
- Missing data treatment: Missing / NotBreaching / Breaching / Ignore. Default Missing is fine for most cases; NotBreaching when no data legitimately means OK; Breaching for tight SLAs that need to alert on metric-pipeline failures.
- Composite alarms: combine multiple alarms with AND/OR/NOT logic. Use for "alarm when CPU is high AND memory is high" (single noise-free incident) or "alarm when 5xx errors AND health check failing" (correlated signal).
- Anomaly detection alarms: ML-trained baseline per metric. Alarm fires when current data deviates from the predicted band. Useful for traffic / latency where static thresholds don't fit.
- Metric math: compose expressions over multiple metrics, e.g., (m1 / m2) * 100 for error rate percentage. Alarms can evaluate metric-math expressions directly.
A web tier needs to alarm on "error rate > 1% sustained for 5 minutes" without false positives during deploy noise. Standard alarm on a single metric won't work — need metric math m1 = sum(5xx), m2 = sum(requests), expression = m1/m2*100 with M-of-N: alarm if > 1% on 3 of 5 data points. Missing data = NotBreaching (low-traffic periods don't fire). Action: SNS topic notifying the on-call team via PagerDuty.
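The error-rate scenario above can be sketched in plain Python. This is a simplified model of M-of-N evaluation and missing-data treatment, not CloudWatch's exact algorithm; the function names `error_rate` and `evaluate_alarm` are hypothetical, and the "ignore" treatment (which keeps the previous state in CloudWatch) is not modelled.

```python
from typing import Optional, Sequence


def error_rate(errors_5xx: float, requests: float) -> Optional[float]:
    """Metric math (m1 / m2) * 100. No requests means no datapoint (None)."""
    if requests == 0:
        return None
    return errors_5xx / requests * 100


def evaluate_alarm(datapoints: Sequence[Optional[float]], threshold: float,
                   m: int, n: int, treat_missing: str = "missing") -> str:
    """Return OK, ALARM or INSUFFICIENT_DATA for the last n datapoints.

    None marks a missing datapoint. Assumed simplification: "missing" just
    skips the point, "notBreaching" counts it as healthy, "breaching" counts
    it as a breach.
    """
    window = list(datapoints)[-n:]
    breaching = 0
    evaluated = 0
    for value in window:
        if value is None:
            if treat_missing == "breaching":
                breaching += 1
                evaluated += 1
            elif treat_missing == "notBreaching":
                evaluated += 1
            continue
        evaluated += 1
        if value > threshold:
            breaching += 1
    if breaching >= m:
        return "ALARM"
    if evaluated == 0:
        return "INSUFFICIENT_DATA"
    return "OK"
```

With five one-minute samples where three exceed 1% and one has no traffic, a 3-of-5 alarm with notBreaching fires; a single noisy datapoint does not.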
CloudWatch Logs ingests text from EC2, Lambda, container, and on-prem sources. Metric filters convert log patterns into CloudWatch metrics; Logs Insights queries log data ad-hoc. SOA-C02 asks both.
- Log groups + log streams: a log group is a namespace with retention + permissions; a stream is a single source (one EC2 instance, one Lambda invocation). Retention: 1 day to forever, per group.
- Metric filters: pattern on log events that increments a CloudWatch metric. Pattern syntax supports JSON ({ $.user = "alice" }), regex, or simple terms. Common pattern: count "ERROR" log lines → alarm on rate.
- Subscription filters: stream log events to Kinesis Data Streams, Lambda, or OpenSearch. Use for cross-account log aggregation or for ML/SIEM ingestion.
- Logs Insights: ad-hoc KQL-like query language. Example: fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 100. Aggregations: stats count() by bin(5m).
- Pricing: per-GB ingested + per-GB stored per month. Drop noisy logs at source (CloudWatch Agent filter) or sample with subscription filter → Kinesis Firehose with transformation.
- Live Tail: real-time tail of one or more log groups for quick debugging. Sessions are time-limited (they time out after three hours) and billed per minute of use.
An application logs to CloudWatch Logs. Need: (1) alarm when "OutOfMemoryError" appears more than 5 times in 5 minutes. Solution: metric filter on the log group pattern OutOfMemoryError incrementing a custom metric app/OOMErrors. Alarm on the metric with sum > 5 over 5-minute period, M-of-N 1-of-1. (2) Ad-hoc investigation: Logs Insights query filter @message like /OutOfMemoryError/ | stats count() by bin(5m) to see the time distribution.
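To make the metric-filter idea concrete, here is a minimal local sketch of what the filter plus alarm compute: count matching log lines per 5-minute bin, then flag bins whose sum exceeds the threshold. It is an offline illustration only; `count_matches_per_bin` and `bins_in_alarm` are hypothetical helper names, and events are assumed to be (epoch_seconds, message) tuples.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_matches_per_bin(events: Iterable[Tuple[int, str]], term: str,
                          bin_seconds: int = 300) -> Counter:
    """Count log messages containing `term`, grouped into fixed time bins.

    Mirrors a simple-term metric filter feeding a Sum statistic.
    """
    counts: Counter = Counter()
    for epoch_seconds, message in events:
        if term in message:
            bin_start = epoch_seconds - epoch_seconds % bin_seconds
            counts[bin_start] += 1
    return counts


def bins_in_alarm(counts: Counter, threshold: int = 5) -> List[int]:
    """Return the bin start times whose count is strictly above the threshold."""
    return sorted(start for start, count in counts.items() if count > threshold)
```

Six OutOfMemoryError lines inside one 5-minute bin put that bin in alarm; five would not, because the comparison is "more than 5".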
CloudTrail records WHO did WHAT WHEN. Config records WHAT THE STATE WAS at a point in time. EventBridge reacts to events as they happen. Together they give you control-plane audit + drift detection + automation. The exam contrasts them constantly.
- CloudTrail: records management-plane API calls (CreateBucket, DeleteVolume) and optionally data-plane events (S3 object reads, Lambda invocations). Typical baseline: one multi-Region trail per account (or an organization trail); trails created in the console are multi-Region by default. Integrity validation via signed digest files. Forensic source-of-truth for "who deleted this?".
- AWS Config: continuously records resource configuration state. Rules evaluate config against desired state (managed rules from AWS, custom rules in Lambda). Non-compliance triggers EventBridge events. Optional auto-remediation via SSM Automation.
- EventBridge: the event bus. AWS service events (EC2 state changes, CloudTrail events, GuardDuty findings) land on the default bus; custom and partner events on additional buses. Rules match events and route to targets (Lambda, Step Functions, SNS, SQS, ECS task, …).
- EventBridge Scheduler: cron-style scheduled triggers. Replaces CloudWatch Events scheduled rules with a richer expression language and one-time schedules.
- OpsCenter (Systems Manager): aggregates operational issues (alarms, findings, custom events) into "OpsItems" workflow. Useful for incident management overlay; less common in greenfield.
- Use-pattern triad: CloudTrail for forensics, Config for drift/compliance, EventBridge for automation. They overlap but solve different problems — exam scenarios usually have one clear best fit.
Requirement: any S3 bucket created without server-side encryption must be auto-remediated within 5 minutes. Design: AWS Config managed rule s3-bucket-server-side-encryption-enabled. Non-compliance → EventBridge rule on Config compliance change event → SSM Automation document that enables default encryption on the bucket. CloudTrail records the creation event as the auditable source-of-truth for who originally created the misconfigured bucket.
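The EventBridge rule in this design boils down to an event pattern. The sketch below shows one plausible pattern for the Config compliance-change event plus a tiny matcher that covers only a subset of EventBridge semantics (exact-value lists and nested objects). The matcher `matches` is hypothetical, and the event field names reflect my understanding of the Config event shape, so verify them against a real event before relying on them.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Simplified EventBridge-style matching.

    A list in the pattern matches if the event value equals any element;
    a dict in the pattern recurses into the event's nested object.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Pattern for "the encryption rule just went NON_COMPLIANT".
NON_COMPLIANT_BUCKET = {
    "source": ["aws.config"],
    "detail-type": ["Config Rules Compliance Change"],
    "detail": {
        "configRuleName": ["s3-bucket-server-side-encryption-enabled"],
        "newEvaluationResult": {"complianceType": ["NON_COMPLIANT"]},
    },
}
```

A rule with this pattern would target the SSM Automation runbook; compliant evaluations and events from other rules fall through without triggering it.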
Reliability & Business Continuity · 3 lessons
Build systems that survive failures. Covers EC2 Auto Scaling (target tracking, step scaling, lifecycle hooks for zero-downtime deployments), Route 53 routing policies and health checks (especially Failover routing), RDS Multi-AZ failover behavior and replica promotion, AWS Backup for cross-region backups, S3 Cross-Region Replication and versioning, Aurora Global Database, and DR strategies (backup/restore vs pilot light vs warm standby vs active-active) with their respective RTO/RPO tradeoffs.
Auto Scaling is the AWS native way to match capacity to demand. The exam tests scaling policy choice, instance termination order, and lifecycle hooks for zero-downtime deploys.
- Scaling policies: Target tracking (set "CPU = 50%", AWS adjusts) — simplest and recommended default. Step scaling (CloudWatch alarm-driven, +N instances per breach severity) — used for non-linear scale needs. Scheduled (calendar trigger). Predictive (ML-based forecast for recurring patterns).
- Cooldown: the wait between scale actions, prevents flapping. Default 300 seconds. Tune with the application's warm-up time — too short = oscillation, too long = lag.
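- Termination policies: control which instance dies when scaling in. The Default policy first picks the AZ with the most instances (AZ balance is always preserved first), then prefers instances that least fit the Spot/On-Demand allocation strategy, then the oldest launch template or launch configuration, then the instance closest to its next billing hour. Override per ASG with policies such as OldestInstance or OldestLaunchTemplate.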
- Lifecycle hooks: pause instances during launch (Pending:Wait) or termination (Terminating:Wait) so SSM Run Command / Lambda can run pre-attach or pre-detach scripts. Critical for graceful drain on terminate.
- Warm pools: pre-initialised stopped instances kept ready for near-instant launch. Slashes the "minutes to first request" time when scaling responds to a sudden spike.
- Instance refresh: rolling replacement of every instance in an ASG with the new launch template version. Configurable min healthy + warm-up time. Used for AMI updates without downtime.
An e-commerce ASG runs 4-20 instances. Spike-response too slow because instances take 5 minutes to boot + warm cache. Fix: enable a warm pool of 5 pre-initialised instances kept in the Stopped state. Add a launch lifecycle hook (Pending:Wait) so an SSM Automation runs cache-prewarming before the instance attaches to the LB target group. Target tracking policy CPU=60% with a 60s instance warm-up (target tracking uses warm-up time rather than the simple-scaling cooldown). Result: spike response drops from 6 min to ~30s.
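As a rough intuition for target tracking, capacity scales in proportion to how far the metric is from its target, clamped to the group's bounds. The sketch below is an approximation for study purposes: AWS does not publish the exact algorithm, real scale-in is more conservative, and `target_tracking_desired` is a hypothetical name.

```python
import math


def target_tracking_desired(current_capacity: int, metric_value: float,
                            target_value: float, min_size: int, max_size: int) -> int:
    """Proportional estimate of the desired capacity for a target tracking policy.

    Assumes the metric scales inversely with instance count (e.g. average CPU).
    """
    desired = math.ceil(current_capacity * metric_value / target_value)
    return max(min_size, min(max_size, desired))
```

For the 4-20 instance group with a 60% CPU target, 4 instances at 90% CPU suggest 6 instances, and an extreme spike is capped at the maximum of 20.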
Route 53 routing policies + health checks plus database failover features (RDS Multi-AZ, Aurora Global Database) are the building blocks of regional and global failover. SOA-C02 tests when each routing policy fires and how RDS failover actually behaves.
- Route 53 routing policies: Simple, Weighted (% split), Latency (region with lowest latency to user), Failover (primary + secondary with health check), Geolocation (per country/continent), Geoproximity, Multi-value answer, IP-based.
- Health checks: endpoint, calculated (combine multiple), CloudWatch-alarm. Failover policy needs a health check on the primary record (or an alias with Evaluate Target Health); a health check on the secondary is optional but recommended.
- RDS Multi-AZ (single-region HA): synchronous standby in another AZ. Failover triggers on instance failure or AZ outage, followed by an automatic DNS flip in 60-120 seconds. In a Multi-AZ DB instance deployment the standby is NOT readable; it is purely an HA feature.
- RDS read replicas: async replicas. Same-region (read scale) or cross-region (DR + lower-latency reads). Promote a replica to primary for manual failover or to break replication for a separate DB.
- Aurora Multi-AZ: different from RDS Multi-AZ — Aurora has one writer + multiple readers (1-15), all in the cluster volume across 3 AZs. Failover promotes a reader (~30 seconds). Aurora is preferred for new builds.
- Aurora Global Database: cross-region replication with sub-second RPO, < 1 minute promote-time for cross-region failover. Up to 5 secondary regions. Right for global active-active reads.
A global SaaS needs < 5 min RTO for primary-region database failure. Design: Aurora Global Database primary in us-east-1, secondary in eu-west-1 (sub-second RPO). On region failure: promote the secondary, either with the managed cross-Region failover or by detaching it from the global cluster and promoting it manually. Route 53 with Failover routing: primary record pointing at the us-east-1 Aurora writer endpoint with a health check; secondary record pointing at eu-west-1. DNS TTL 60s. Total RTO: ~2 minutes promote + DNS propagation.
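The failover decision and the RTO arithmetic can be sketched as two small functions. Both are illustrative assumptions: `resolve_failover` models only the primary/secondary choice (including Route 53's documented fallback of answering with the primary when nothing is healthy), and `estimate_rto_seconds` simply adds detection time, promotion time, and DNS TTL as a pessimistic upper bound.

```python
def resolve_failover(primary_healthy: bool, secondary_healthy: bool,
                     primary: str, secondary: str) -> str:
    """Pick the record a Failover routing policy would answer with."""
    if primary_healthy:
        return primary
    if secondary_healthy:
        return secondary
    # With no healthy record, Route 53 falls back to the primary.
    return primary


def estimate_rto_seconds(check_interval: int, failure_threshold: int,
                         promote_seconds: int, dns_ttl: int) -> int:
    """Additive worst case: detect the failure, promote the DB, wait out the TTL."""
    detection = check_interval * failure_threshold
    return detection + promote_seconds + dns_ttl
```

With a 30-second health check interval, a failure threshold of 3, a 120-second promotion, and a 60-second TTL, the estimate is 270 seconds, inside the 5-minute RTO.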
The DR strategy choice is always a cost / RTO / RPO trade-off. AWS Backup centralises the policies; S3 CRR + lifecycle covers object data. SOA-C02 tests the strategy ladder and the AWS-specific primitives that implement each step.
- DR strategy ladder: Backup & Restore (cheapest, RTO hours), Pilot Light (DB warm, app servers stopped, RTO tens of minutes), Warm Standby (scaled-down full stack, RTO minutes), Active/Active multi-region (full prod both sides, RTO < 1 minute, expensive).
- AWS Backup: central backup management across EC2/EBS/RDS/DynamoDB/S3/EFS/FSx/Storage Gateway. Backup plans = schedule + retention + lifecycle to Glacier. Cross-region copy for DR. Backup vaults with vault lock for immutability.
- S3 Cross-Region Replication (CRR): async object-level replication to another region. Configurable per prefix or tag. Requires versioning enabled on both source and dest. Use for cross-region DR of S3 data or compliance copies.
- S3 Same-Region Replication (SRR): same as CRR but within one region — useful for cross-account replication or aggregation.
- EBS snapshots: incremental block-level backups in S3. Cross-region copy is a separate API. Fast Snapshot Restore (FSR) pre-warms snapshots to eliminate the first-read latency penalty.
- S3 Versioning + Object Lock: versioning preserves every version (ransomware recovery). Object Lock adds WORM immutability (compliance retention). Both layered for max protection.
A regulated workload requires RTO 30 minutes / RPO 5 minutes cross-region. Design: warm standby pattern. Primary in us-east-1; DR in us-west-2 with scaled-down 25% capacity. Data: Aurora Global Database for DB (Lesson 2.2). S3 CRR on the user-uploads bucket (most objects replicate within seconds; Replication Time Control adds a 15-minute SLA). AWS Backup daily backups of EC2 + EFS, cross-region copy to us-west-2, retention 35 days with vault lock for compliance. Route 53 Failover routing flips the DNS in 30 seconds on health-check failure.
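Choosing a rung on the DR ladder is a "cheapest option that still meets the RTO" decision. The sketch below encodes that rule; the RTO figures in `STRATEGIES` are illustrative assumptions chosen to fit the ladder's rough ordering, not AWS-published numbers, and `cheapest_strategy` is a hypothetical helper.

```python
from typing import Optional

# (name, assumed achievable RTO in minutes, relative cost rank: 1 = cheapest)
STRATEGIES = [
    ("Backup & Restore", 24 * 60, 1),
    ("Pilot Light", 60, 2),
    ("Warm Standby", 15, 3),
    ("Active/Active", 1, 4),
]


def cheapest_strategy(required_rto_minutes: float) -> Optional[str]:
    """Return the lowest-cost strategy whose assumed RTO meets the requirement."""
    candidates = [s for s in STRATEGIES if s[1] <= required_rto_minutes]
    if not candidates:
        return None  # nothing on the ladder is fast enough
    return min(candidates, key=lambda s: s[2])[0]
```

Under these assumed figures a 30-minute RTO lands on Warm Standby, matching the design above, while a 4-hour RTO would be satisfied by Pilot Light.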
Deployment, Provisioning & Automation · 3 lessons
Automate everything a SysOps engineer manages. CloudFormation: drift detection (what changed outside the stack?), Change Sets (preview before applying), StackSets for multi-account/region deployments with automatic deployment for new accounts. Systems Manager: Run Command for ad-hoc execution, State Manager for configuration compliance, Patch Manager with baselines and maintenance windows, Automation documents for multi-step runbooks, and Session Manager for bastion-free access. EC2 Image Builder for golden AMI pipelines. Elastic Beanstalk deployment policies (All at Once, Rolling, Rolling with Additional Batch, Immutable).
CloudFormation is the AWS-native IaC substrate. SOA-C02 focuses on operational concerns — detecting manual changes (drift), previewing updates (Change Sets), and deploying at scale (StackSets).
- Drift detection: compares stack template's expected state to actual deployed state, lists properties that differ. Run on demand from the console / API. Use to find resources someone modified outside the stack.
- Change Sets: preview an update before executing — added/modified/removed resources, replacement vs in-place. Best practice for production stack updates: always create + review before Execute.
- StackSets: deploy ONE template to many accounts AND/OR many regions simultaneously. Target by AWS Organizations OU. Auto-deployment applies the stack to new accounts that join the OU.
- Stack policy: JSON that allows/denies updates on specific resources. Prevents accidental replacement of stateful resources (databases) during a CloudFormation update.
- DeletionPolicy / UpdateReplacePolicy: per-resource lifecycle attributes. Retain keeps the resource on stack delete; Snapshot creates a final snapshot first. Critical for databases and S3 buckets with data.
- Stack failure behaviour: rollback on failure (default), or DisableRollback for forensic debugging. Failed-create stacks must be DeleteStack'd before retry.
Operations team adds a manual security group rule on a CloudFormation-managed VPC to unblock a customer. Six months later, the stack is updated to add a new subnet, and the manual rule disappears, causing a production outage. Fix going forward: schedule weekly drift detection on critical stacks; alert via EventBridge to SNS when drift detected. Every prod stack update goes through a Change Set reviewed in the PR. Multi-account governance: a StackSet deploys the org-wide IAM password policy + Config rules to every account in the OU, auto-deployed to new accounts.
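Conceptually, drift detection is a property-by-property diff between what the template declares and what is deployed. The sketch below illustrates that idea on plain dictionaries; it is not the CloudFormation API, `detect_drift` is a hypothetical helper, and the property values are made up for the example.

```python
def detect_drift(expected: dict, actual: dict) -> dict:
    """Return properties whose live value differs from the template value.

    Only properties declared in the template are compared, which is the
    assumption made here about how drift is scoped.
    """
    drift = {}
    for name, want in expected.items():
        have = actual.get(name)
        if have != want:
            drift[name] = {"expected": want, "actual": have}
    return drift


# Template-declared security group vs. the live one with a manual extra rule.
template_sg = {
    "GroupDescription": "web tier",
    "SecurityGroupIngress": [("tcp", 443, "0.0.0.0/0")],
}
live_sg = {
    "GroupDescription": "web tier",
    "SecurityGroupIngress": [("tcp", 443, "0.0.0.0/0"), ("tcp", 8443, "198.51.100.0/24")],
}
```

Running the diff on the two dictionaries reports only the ingress rules as drifted, which is the signal a weekly check would have surfaced before the stack update removed the manual rule.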
Systems Manager (SSM) is the SysOps swiss army knife — run commands, enforce config, patch instances, store secrets, access without SSH. SOA-C02 tests which SSM capability fits which operational need.
- Run Command: ad-hoc command execution across many managed instances. Target by tag, AZ, resource group. Audit logged. Replaces SSH-and-execute-a-script workflows.
- Session Manager: browser- or CLI-based shell access to instances WITHOUT SSH/RDP/Bastion. Requires the SSM Agent + IAM permissions. Audit logged. The modern replacement for jump hosts.
- State Manager: declarative config compliance — "instance X must always have application Y at version Z". Periodic enforcement, drift remediation. Use for fleet-wide standard config.
- Patch Manager: OS patch baselines (which CVEs / packages auto-approved) + maintenance windows (when to apply). Cross-instance compliance reports. Replaces hand-rolled cron-based patching.
- Automation documents: multi-step runbooks (e.g., "snapshot the disk, then upgrade, then test"). Trigger manually, on schedule, or via EventBridge for incident auto-response.
- Parameter Store + Secrets Manager: both store config + secrets. Parameter Store is cheaper (free up to 4 KB) + integrates with KMS for SecureString. Secrets Manager adds automatic rotation. Pick by whether rotation matters.
An ops team has 200 EC2 instances and needs: (1) standard config drift remediation, (2) monthly OS patching during a 4-hour window, (3) browser shell access without SSH. Setup: SSM Agent on every instance, with an instance profile role that carries the AmazonSSMManagedInstanceCore managed policy. State Manager association applies a CloudWatch Agent config document daily. Patch Manager with a custom patch baseline (approve security patches automatically after 3 days), maintenance window first Sunday of the month 02:00-06:00 UTC. Session Manager for admin access; the public SSH security group rule is deleted.
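The two scheduling rules in this setup (auto-approve after 3 days, patch on the first Sunday) are easy to check with date arithmetic. This is a standalone sketch, not Patch Manager's implementation; `first_sunday` and `is_auto_approved` are hypothetical helpers.

```python
import calendar
from datetime import date, timedelta


def first_sunday(year: int, month: int) -> date:
    """Return the first Sunday of the given month (the maintenance-window day)."""
    for day in range(1, 8):
        candidate = date(year, month, day)
        if candidate.weekday() == calendar.SUNDAY:
            return candidate
    raise AssertionError("every month has a Sunday in its first seven days")


def is_auto_approved(release_date: date, today: date, approve_after_days: int = 3) -> bool:
    """A patch is approved once the configured delay since release has elapsed."""
    return today >= release_date + timedelta(days=approve_after_days)
```

A patch released on a Saturday becomes eligible the following Tuesday, so it would be picked up at the next first-Sunday window rather than the one the day after release.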
EC2 Image Builder governs the golden-AMI pipeline. Elastic Beanstalk's deployment policies are the SOA-C02 canonical example of progressive deployment strategies.
- EC2 Image Builder: declarative pipelines for AMI building. Components (install / validate / test), recipes (combine components + parent image), pipelines (run on schedule or trigger). Replaces hand-built golden AMIs and Packer for AWS-only orgs.
- Inspector v2 integration: Image Builder pipelines can run Amazon Inspector against the built AMI for CVE scanning — block deploy if criticals found.
- Beanstalk deployment policies: All at once (fast, downtime risk), Rolling (batch by batch, in-place), Rolling with additional batch (one extra batch — no capacity dip), Immutable (provision NEW ASG, swap when healthy — safest, slowest), Traffic Splitting (canary).
- Beanstalk environment types: single-instance (cheap dev), load-balanced + auto-scaled (production). Worker tier for SQS-backed background jobs.
- Beanstalk configuration sources: .ebextensions YAML files in source bundle (most precise control); environment variables; saved configurations (reusable). Layered with later-applied winning.
- Beanstalk + RDS gotcha: if you let Beanstalk create an RDS in the environment, the DB lives and dies with the environment. Better: create RDS separately and supply the connection string via env vars / Secrets Manager.
A team uses Beanstalk for a customer-facing API. Need: zero-downtime deploys, fast rollback. Choice: Immutable deployment policy. Beanstalk provisions a new ASG, runs health checks, then swaps. If the new instances fail health checks, Beanstalk terminates the new ASG, so a failed deploy rolls back quickly without touching the old instances. Rolling back after a completed deploy means redeploying the previous version, which provisions yet another ASG. Slower than Rolling but safer. AMI built via EC2 Image Builder with Inspector v2 scanning on every pipeline run; the pipeline fails if CVE findings exceed the severity threshold.
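One way to compare the deployment policies is the lowest in-service capacity each allows mid-deploy. The sketch below is a simplified model that ignores health-check timing and assumes every batch comes up cleanly; `min_capacity_during_deploy` is a hypothetical helper, while the policy strings follow the Beanstalk DeploymentPolicy option values.

```python
def min_capacity_during_deploy(policy: str, instances: int, batch_size: int) -> int:
    """Lowest number of instances still serving traffic during a deployment."""
    if policy == "AllAtOnce":
        return 0  # every instance is updated at the same time
    if policy == "Rolling":
        return max(instances - batch_size, 0)  # one batch is out of service
    if policy in ("RollingWithAdditionalBatch", "Immutable", "TrafficSplitting"):
        return instances  # extra capacity is launched before anything is removed
    raise ValueError(f"unknown deployment policy: {policy}")
```

For an 8-instance environment with a batch size of 2, Rolling dips to 6 instances while Immutable keeps all 8 serving, which is the capacity argument behind the choice above.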
Security & Compliance · 3 lessons
Security is 16% of SOA-C02 but underpins every other domain. Master GuardDuty (threat detection findings + EventBridge-based automated remediation), AWS Inspector v2 (CVE scanning, network reachability), AWS Security Hub (aggregated multi-account security posture), Amazon Macie (PII and sensitive data discovery in S3), IAM Access Analyzer (external access findings), KMS automatic key rotation, CloudTrail log file integrity validation, AWS Organizations SCPs for preventive controls, WAF rate-based rules and geographic restrictions, and S3 Block Public Access at account level.
Three security services that often appear together — GuardDuty for runtime threat detection, Inspector for vulnerability scanning, Security Hub for aggregated posture. SOA-C02 asks which one finds what.
- GuardDuty: ML + threat-intelligence-based detection of anomalous behaviour from VPC Flow Logs, DNS logs, CloudTrail. Findings: cryptocurrency mining, port scanning, exfiltration, IAM key compromise. GuardDuty is a regional service: enable it in every Region you use and aggregate findings centrally.
- Inspector v2: CVE scanning for EC2 + ECR + Lambda. Continuous (re-scan on package change) vs scheduled. Network reachability for EC2 ("which of my instances are reachable from the internet on port 22?"). Integrated with Security Hub findings.
- Security Hub: aggregation layer — collects findings from GuardDuty, Inspector, Macie, Config, third-party. Single dashboard, single API. Runs compliance standards (AWS Foundational Security Best Practices, CIS, PCI DSS) as checks against your environment.
- IAM Access Analyzer: flags resources accessible from OUTSIDE your trust zone (account / org). Catches "S3 bucket policy accidentally allows another account" without manual policy review.
- Macie: ML-driven discovery of PII / PHI / financial data in S3. Sample-based by default; full scan on demand. Outputs findings into Security Hub.
- Cross-account aggregation: all three (GuardDuty / Inspector / Security Hub) support delegated administrator in Organizations — one security account aggregates findings from every member account.
An org with 30 accounts: designate a security tooling account. Delegate GuardDuty + Inspector + Security Hub + Macie admin to it. Auto-enable each on all member accounts via Organizations. Security Hub aggregates findings; AWS Foundational Security Best Practices + CIS Foundations Benchmark standards enabled. Critical findings route via EventBridge to PagerDuty. IAM Access Analyzer scans every account; flagged cross-account access triggers a Slack notification with the principal + resource.
Encryption keys (KMS), web-app protection (WAF), and the per-bucket / per-account S3 defenses are perennial SOA-C02 topics. Get the defaults wrong and you ship publicly-readable buckets or encrypt with the wrong key custody.
- KMS key types: AWS-managed (free, in your account but AWS controls), Customer-Managed (you control rotation, key policy, audit), AWS-owned (in AWS-internal account, free, no audit). Per-service default; you flip to CMK when key custody matters.
- KMS rotation: automatic annual rotation for symmetric CMKs (free, transparent). Asymmetric and HMAC keys require manual rotation. Rotation creates new backing material; older material kept to decrypt old data.
- KMS key policy + grants: key policy is required (no IAM-only access). Grants are short-lived programmatic permissions used by AWS services (e.g., EBS at volume create). Auditable in CloudTrail.
- WAF: deploys on CloudFront, ALB, API Gateway, AppSync, App Runner. Web ACL contains rules (managed, custom, rate-based). Rate-based rule limits requests per source IP — essential bot defense.
- WAF managed rule groups: AWS-managed (Common Rule Set / Known Bad Inputs / SQL DB / Linux / etc.), Marketplace, custom. Enable AWS Core Rule Set as baseline.
- S3 Block Public Access: account-level setting + bucket-level setting. When ON, NEW or CHANGED public ACLs/policies are denied regardless of intent. SOA-C02 expects you to ENABLE this in every account and use SCPs to stop member accounts from turning it off.
A SaaS platform with a static-site bucket and a private customer-data bucket in the same account. Setup: S3 Block Public Access ON at account level (catches accidents). Because S3 applies the most restrictive combination of account-level and bucket-level settings, the site bucket cannot simply opt out; it is served through CloudFront with Origin Access Control and a least-privilege bucket policy that allows only the distribution. Customer-data bucket encrypted with a Customer-Managed KMS key (annual rotation ON), Inspector v2 scanning the EC2 fleet, WAF on the CloudFront distribution in front of the API with a rate-based rule (2000 req/5min per IP) + AWS Core Rule Set.
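The rate-based rule in this setup behaves like a per-IP sliding-window counter. The sketch below models that idea locally; it is not how AWS WAF is implemented (WAF evaluates rates periodically rather than per request), and the `RateBasedRule` class is a hypothetical illustration.

```python
from collections import defaultdict, deque


class RateBasedRule:
    """Allow a request unless the source IP exceeded `limit` hits in the window."""

    def __init__(self, limit: int = 2000, window_seconds: int = 300) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the sliding window.
        while hits and hits[0] <= now - self.window_seconds:
            hits.popleft()
        hits.append(now)
        return len(hits) <= self.limit
```

Each IP is tracked independently, so one noisy client is throttled without affecting others, and it is allowed again once its old requests age out of the 5-minute window.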
AWS Organizations governs multi-account structure. Service Control Policies (SCPs) cap what any role in a member account can do. SOA-C02 expects you to design preventive controls via SCPs and detective via Config / Security Hub.
- Organizational Units (OU): nested grouping of accounts. Common structure: Root → Security / Infrastructure / Sandbox / Workloads (with sub-OUs for prod / non-prod / business unit).
- SCPs: JSON policies attached at root / OU / account level. CAP what IAM principals can do — they're a ceiling, not a grant. Best practice: deny-list rather than allow-list (allow-list breaks on every new AWS service).
- Common SCP patterns: deny disable-CloudTrail, deny disable-GuardDuty / Config, deny IAM changes outside the management account, deny regions outside the approved list, deny S3 BPA changes, deny KMS key deletion.
- SCPs don't apply to: the management account or service-linked roles. They DO apply to every other principal in member accounts, including the root user. Always provision break-glass access in the management account, which is exempt from SCPs.
- AWS Control Tower: opinionated landing zone — pre-built OU structure, mandatory + strongly-recommended guardrails (preventive = SCP, detective = Config rule), account factory. The fastest way to stand up a multi-account org securely.
- Centralized billing + consolidated discounts: Organizations gives one bill across all accounts. Reserved Instances + Savings Plans purchased in one account apply across the org by default.
An org with 50 accounts: deploy AWS Control Tower for the landing zone. OU structure: Security (audit + log archive accounts), Infrastructure (shared networking + identity), Sandbox, Workloads/Prod, Workloads/NonProd. Mandatory guardrails enabled on all OUs. Strongly-recommended guardrails on Workloads/Prod: deny S3 BPA changes, deny disable CloudTrail / GuardDuty, deny actions outside us-east-1 and eu-west-1. SCPs cannot cap spend directly, so the Sandbox OU pairs a restrictive SCP with a billing alarm + auto-isolation Lambda.
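The "ceiling, not a grant" behaviour of SCPs can be sketched as a three-step check: any SCP deny wins, every level of the hierarchy must allow the action, and the IAM policy must still grant it. This is a simplified model (no conditions, resources, or permission boundaries), and `is_allowed` is a hypothetical helper using shell-style wildcards to stand in for IAM action matching.

```python
from fnmatch import fnmatchcase
from typing import List, Sequence


def _matches(action: str, patterns: Sequence[str]) -> bool:
    """Case-insensitive wildcard match of an action against a list of patterns."""
    return any(fnmatchcase(action.lower(), p.lower()) for p in patterns)


def is_allowed(action: str, iam_allow: Sequence[str],
               scp_allow_chain: List[Sequence[str]],
               scp_deny_chain: List[Sequence[str]]) -> bool:
    """Evaluate an action for a member-account principal.

    Each chain holds one list per level (root, OU, account).
    """
    if any(_matches(action, denies) for denies in scp_deny_chain):
        return False  # an explicit deny anywhere in the hierarchy wins
    if not all(_matches(action, allows) for allows in scp_allow_chain):
        return False  # every level must allow the action
    return _matches(action, iam_allow)  # SCPs never grant; IAM must allow too
```

With a deny-list strategy each level allows "*" and adds targeted denies, so cloudtrail:StopLogging is blocked even for an administrator, while an action the IAM policy never granted stays denied despite the permissive SCPs.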
Test your knowledge on Domains 1–4 before moving to networking and cost.
Networking & Content Delivery · 3 lessons
Networking is tied for the second-heaviest domain at 18%. VPC fundamentals: subnets, route tables, Internet Gateway, NAT Gateway vs NAT Instance (HA differences), Security Groups (stateful) vs NACLs (stateless, rule ordering). VPC connectivity: Peering (missing route table entries are the #1 failure cause), Transit Gateway for hub-and-spoke replacing N×(N-1)/2 peering connections, VPC Endpoints (interface vs gateway types), and AWS PrivateLink for cross-account service exposure. Hybrid: Site-to-Site VPN dual tunnels, Direct Connect + VPN failover with BGP. CloudFront: cache behaviors, TTL settings, Origin Access Control for S3, Origin Shield for reducing origin load. Route 53 routing policies and health checks.
VPC questions are the most-tested category on SOA-C02. The exam consistently asks about the SG-vs-NACL distinction, NAT Gateway HA, and the route-table interactions that determine which packets go where.
- Subnet types: public (route to IGW), private (no IGW route, optional NAT for egress), VPN-only (route to VPN GW). A subnet lives in exactly one AZ, so HA designs need at least one subnet per AZ.
- NAT Gateway vs NAT Instance: NAT Gateway is managed and redundant within its own AZ (but not across AZs), no admin needed, ~$0.045/hr + bandwidth. NAT Instance is self-managed EC2: legacy, avoid. For multi-AZ HA, deploy ONE NAT Gateway per AZ.
- Security Groups (stateful): implicit deny, explicit allow. Return traffic always allowed. Reference other SGs as source/destination (e.g., "allow from sg-webtier"). Apply at the ENI level.
- NACLs (stateless): rule-number-ordered list with allow + deny; the lowest-numbered matching rule wins. Apply at subnet level. Return traffic is NOT automatic: allow the ephemeral port range (1024-65535) in the return direction (outbound for replies to inbound clients, inbound for replies to outbound connections). Rule numbers in increments of 100 leave room for inserts.
- SG vs NACL pick: SG always (instance-level intent). NACL as defensive secondary layer (subnet-level deny — e.g., "block this attacker IP"). Don't try to do all your filtering at NACL.
- VPC Flow Logs: capture ENI traffic to CloudWatch Logs / S3. Format includes ACCEPT/REJECT — REJECTs are typically SG/NACL drops. Use to diagnose "why can't A reach B" without packet capture.
A multi-AZ VPC needs HA NAT for private-subnet egress: deploy 3 NAT Gateways, one per AZ; each AZ's private subnet route table points 0.0.0.0/0 at its own AZ's NAT Gateway. Cross-AZ NAT routing would work but adds per-GB cost — keep traffic in-AZ. SGs control instance access; NACLs as a defense-in-depth layer with a specific blocked-IP list at subnet level. VPC Flow Logs at REJECT level for forensic visibility.
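The blocked-IP pattern relies on NACL rule ordering: rules are checked from the lowest number up and the first match decides. The sketch below models that for inbound traffic only; it is a study aid rather than a full NACL model (no protocols, no outbound side), and `evaluate_nacl` plus the sample rules are hypothetical.

```python
import ipaddress
from typing import List, Tuple

# (rule number, source CIDR, first port, last port, action)
Rule = Tuple[int, str, int, int, str]


def evaluate_nacl(rules: List[Rule], source_ip: str, port: int) -> str:
    """Return the action of the lowest-numbered matching rule, else deny."""
    address = ipaddress.ip_address(source_ip)
    for _number, cidr, port_from, port_to, action in sorted(rules):
        if address in ipaddress.ip_network(cidr) and port_from <= port <= port_to:
            return action
    return "deny"  # the implicit final "*" rule


INBOUND_RULES: List[Rule] = [
    (100, "0.0.0.0/0", 443, 443, "allow"),
    (90, "203.0.113.7/32", 0, 65535, "deny"),  # attacker IP, numbered below the allow
]
```

Because rule 90 is evaluated before rule 100, the attacker IP is dropped even on port 443, something a security group alone cannot express since it has no deny rules.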
Beyond a single VPC, the connectivity options each have different scaling and cost trade-offs. SOA-C02 tests when you should reach for each.
- VPC Peering: 1:1 connection between two VPCs (same or different account / region). Non-transitive — A↔B + B↔C does NOT give A↔C. Manual route table entries on BOTH sides required (the most-common SOA failure cause).
- Transit Gateway (TGW): hub-and-spoke for VPCs + VPN + Direct Connect. Replaces N×(N-1)/2 peerings with one TGW per region. Route tables enable selective transitivity. Pricier per-GB than peering but scales.
- VPC Endpoints — Gateway: S3 and DynamoDB only. Route-table entries direct subnet traffic to the AWS service over the AWS backbone — no internet, no NAT cost. Free.
- VPC Endpoints — Interface (PrivateLink): ENI in your subnet with a private IP that resolves to an AWS service or your own service. Per-hour cost + per-GB. Use for cross-account / cross-VPC service exposure WITHOUT routing tables touching the consumer's networking.
- PrivateLink as service-provider: expose a service (NLB-fronted) to other AWS accounts via PrivateLink — they create an interface endpoint targeting your service. Cleanly separates consumer/provider VPCs.
- Decision pattern: 1-2 VPCs to interconnect → peering. 3+ VPCs / multi-region / hybrid → TGW. Access to AWS services → Gateway endpoints (S3/DynamoDB) or Interface endpoints. Cross-account services → PrivateLink.
An org with 15 VPCs across 3 accounts needs full mesh connectivity + access to S3 + a centralised auth service in a security VPC. Choice: Transit Gateway in each region with all 15 VPCs attached. Two TGW route tables: workload (full transitivity to all workload VPCs + auth via PrivateLink target) and shared (only allows the security VPC + S3). Each VPC has a Gateway endpoint for S3. Auth service exposed via PrivateLink from the security account.
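The scaling argument for Transit Gateway is plain arithmetic: a full mesh of N VPCs needs N×(N-1)/2 peering connections, while a hub needs one attachment per VPC. A minimal sketch, with hypothetical function names:

```python
def peering_connections(vpcs: int) -> int:
    """Peering connections needed for a full mesh of `vpcs` VPCs."""
    return vpcs * (vpcs - 1) // 2


def tgw_attachments(vpcs: int) -> int:
    """Attachments needed when every VPC connects to one Transit Gateway."""
    return vpcs
```

For the 15-VPC organisation this is 105 peering connections (each with route entries on both sides) versus 15 attachments, which is why the design reaches for a Transit Gateway.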
CloudFront is the AWS CDN. SOA-C02 tests origin security (OAC), cache behaviours, and invalidation patterns.
- Distributions: one CloudFront distribution per public site / app. Origins: S3 bucket, ALB, EC2, MediaPackage, custom origin (any HTTP server). Multi-origin with path-based routing.
- Origin Access Control (OAC): replaces legacy Origin Access Identity (OAI). CloudFront signs requests to S3; the bucket policy allows ONLY the distribution. Public access to the bucket can stay BLOCKED. The modern S3-origin pattern.
- Cache behaviours: per path-pattern: which origin, TTLs, allowed methods, cookie/header forwarding. Multiple behaviours allow different caching for /api/* (TTL 0) vs /static/* (TTL 1y).
- TTLs and invalidation: object TTL controlled by Cache-Control headers on origin response or by behaviour Min/Default/Max TTL. Invalidation forces edge cache to re-fetch; paid per path beyond the first 1000/month, slow (minutes).
- Origin Shield: regional shield layer in front of origin — collapses multiple POP requests into one. Reduces origin load and per-GB transfer. Worth enabling for any non-trivial CloudFront usage.
- Signed URLs / signed cookies: time-bound access to private content. Signed URLs for one-asset downloads (e.g., paywall video); signed cookies for multi-asset access (e.g., the user's whole gallery).
- CloudFront Functions vs Lambda@Edge: CloudFront Functions (lightweight, viewer request/response only, < 1 ms, ~$0.10/M) vs Lambda@Edge (full Node/Python runtime at edge, slower, pricier). Pick Functions for header rewrites; Lambda@Edge for richer logic.
A photo-sharing app: CloudFront distribution fronts an S3 bucket of user photos. OAC wired between distribution and bucket; bucket BPA stays ON. Behaviour A: /api/* with TTL 0 + cookie forwarding (dynamic). Behaviour B: /static/* with TTL 1 year + Origin Shield. Private user gallery served via signed cookies set by the auth Lambda after sign-in. CloudFront Functions add a security header on every response.
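The OAC pattern in the photo-sharing example comes down to a bucket policy that trusts the CloudFront service principal only when the request comes from one specific distribution. The sketch below builds such a policy as a Python dict; the bucket name, account ID, and distribution ID are placeholder assumptions, and the function name is hypothetical.

```python
import json


def oac_bucket_policy(bucket: str, account_id: str, distribution_id: str) -> dict:
    """Bucket policy allowing reads only from one CloudFront distribution (OAC)."""
    distribution_arn = f"arn:aws:cloudfront::{account_id}:distribution/{distribution_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCloudFrontServicePrincipalReadOnly",
                "Effect": "Allow",
                # The CloudFront service signs requests; no public principal is needed.
                "Principal": {"Service": "cloudfront.amazonaws.com"},
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                # Restrict the grant to this one distribution.
                "Condition": {"StringEquals": {"AWS:SourceArn": distribution_arn}},
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    policy = oac_bucket_policy("example-user-photos", "111122223333", "EDFDVBD6EXAMPLE")
    print(json.dumps(policy, indent=2))
```

Because the only principal is the CloudFront service scoped to one distribution ARN, the bucket's Block Public Access settings can stay enabled.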
Cost & Performance Optimization · 3 lessons
Cost optimization is 12% but deeply integrated into all other domains — every question type has a "most cost-effective" variant. Key concepts: Reserved Instances vs Savings Plans (Standard RI = max discount for fixed workloads; Compute Savings Plans = flexibility across families/regions), Spot Instances for fault-tolerant batch jobs with 2-minute interruption notice handling, S3 storage class selection and lifecycle policies (Standard → Standard-IA → Glacier Deep Archive), AWS Compute Optimizer for right-sizing recommendations, AWS Cost Anomaly Detection for ML-based spend alerts, Trusted Advisor cost checks, and inter-AZ vs cross-region data transfer costs.
Compute is the biggest line item; pricing model choice is the biggest lever. SOA-C02 tests when to use each and how to handle Spot interruption gracefully.
- Standard RI vs Convertible RI: Standard = max discount, locked instance family/size (with size flex inside family). Convertible = exchangeable for different family/OS, smaller discount. Almost always pick Compute Savings Plans over Convertible RI.
- EC2 Instance Savings Plan: commit to $/hr in a specific instance family + region. Max discount, less flexible.
- Compute Savings Plan: commit to $/hr across EC2 + Fargate + Lambda, any family/region/OS (SageMaker has its own separate Savings Plan). Slightly less discount than Instance SP, but the flexibility usually wins for evolving workloads.
- Spot Instances: up to 90% off, can be reclaimed with 2-minute notice. Best for fault-tolerant batch, dev/test, stateless. Use Spot via Auto Scaling Group with mixed instance types + multiple AZs for resilience.
- Spot interruption handling: listen for the 2-minute notice via instance metadata (http://169.254.169.254/latest/meta-data/spot/instance-action). Drain LB target, flush in-memory state, exit gracefully. Or use ASG lifecycle hooks to react automatically.
- Cost-anomaly tools: AWS Cost Anomaly Detection (ML alerts on unusual spend), Cost Explorer (cube-style analytics), Cost & Usage Reports (detailed CSV for Power BI / QuickSight).
A 24/7 web tier (10-instance baseline + 4× peak) and a batch-job tier: cover baseline with a 3-year Compute Savings Plan at $/hr matching 10 c5.large equivalents (~60% off On-Demand). Peak burst: On-Demand instances. Batch job tier: Spot via ASG with diversified instance types across 3 AZs. Cost Anomaly Detection alerts on +20% week-over-week. Total: ~40% lower than All-On-Demand.
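The Spot interruption handling described above can be sketched as a small polling loop. This is an illustration under stated assumptions, not a production agent: `fetch_from_imds` assumes IMDSv2 is enabled and that the endpoint returns 404 until a notice is issued, and `wait_for_interruption` and its `on_interrupt` callback are hypothetical names for whatever drain logic you run.

```python
import json
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone
from typing import Callable, Optional, Tuple

IMDS = "http://169.254.169.254/latest"
INSTANCE_ACTION_URL = f"{IMDS}/meta-data/spot/instance-action"


def fetch_from_imds(timeout: float = 1.0) -> Optional[str]:
    """Return the notice body, or None while no interruption is scheduled (HTTP 404)."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    token = urllib.request.urlopen(token_req, timeout=timeout).read().decode()
    req = urllib.request.Request(
        INSTANCE_ACTION_URL, headers={"X-aws-ec2-metadata-token": token}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def parse_instance_action(body: str) -> Tuple[str, datetime]:
    """Parse a notice such as {"action": "terminate", "time": "2017-09-18T08:22:00Z"}."""
    notice = json.loads(body)
    when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    return notice["action"], when.replace(tzinfo=timezone.utc)


def wait_for_interruption(
    fetch: Callable[[], Optional[str]],
    on_interrupt: Callable[[str, datetime], None],
    poll_seconds: float = 5.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until a notice appears, run the drain callback once, and report whether it ran."""
    polls = 0
    while max_polls is None or polls < max_polls:
        body = fetch()
        if body is not None:
            on_interrupt(*parse_instance_action(body))
            return True
        polls += 1
        sleep(poll_seconds)
    return False
```

On an instance, you might call `wait_for_interruption(fetch_from_imds, drain)` where `drain` deregisters from the load balancer and flushes state. Passing `fetch` and `sleep` as parameters keeps the loop testable off-instance.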
S3 storage cost differs by ~25× between Standard and Deep Archive. Picking the right class + automating tier transitions with lifecycle policies is the canonical S3 cost optimisation.
- Storage classes: Standard (frequent, default), Intelligent-Tiering (auto-tiers based on access patterns, monitoring fee per object), Standard-IA (30-day minimum storage duration, cheaper storage + retrieval fee), One Zone-IA (single AZ, cheaper than Standard-IA, for re-creatable data that can tolerate losing that AZ), Glacier Instant Retrieval (ms retrieval, 90-day minimum storage duration, cheaper still), Glacier Flexible Retrieval (minutes-to-hours retrieval), Glacier Deep Archive (hours retrieval, cheapest).
- Intelligent-Tiering: auto-moves objects between Frequent / Infrequent / Archive Instant Access tiers based on observed access. Avoids the "guess access pattern" problem. Monitoring fee per object (skip for huge numbers of tiny objects).
- Lifecycle policies: declarative rules to transition between classes and / or expire. Run daily, no extra cost. Standard pattern: transition Standard → Standard-IA at 30 days → Glacier Flexible at 90 → Deep Archive at 365.
- Lifecycle on incomplete multipart uploads: the most overlooked cost optimisation — abort incomplete multipart uploads after 7 days. Saves storage for failed uploads sitting in S3 forever.
- Storage Class Analysis: S3 feature that observes access patterns + recommends lifecycle transitions. Run for 30+ days on a bucket before crafting lifecycle policies.
- Glacier retrieval pricing: Standard ~3-5h, Bulk ~5-12h cheapest, Expedited < 5 min most expensive. Plan retrieval cost into compliance budgets.
Customer-data S3 bucket grows 1 TB/month. Access pattern: heavy first 30 days, sporadic for 90, near-zero after. Set up lifecycle policy: Standard → Standard-IA at 30 days → Glacier Flexible at 90 → Deep Archive at 365 → expire at 7 years. Add incomplete multipart upload abort at 7 days. Use Storage Class Analysis for new bucket types where pattern isn't yet known. Total cost: ~75% lower than All-Standard.
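The lifecycle policy in this scenario could be expressed as the configuration document below. It is a sketch under assumptions: the rule ID and empty prefix are placeholders, 7 years is approximated as 7 × 365 days, and `GLACIER` is the storage-class value for Glacier Flexible Retrieval. The dict follows the shape accepted by boto3's `put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=...)`.

```python
import json

DAYS_PER_YEAR = 365  # approximation; ignores leap days


def customer_data_lifecycle(prefix: str = "") -> dict:
    """Lifecycle configuration: tier down at 30/90/365 days, expire at ~7 years."""
    return {
        "Rules": [
            {
                "ID": "tier-then-expire",  # placeholder rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix = whole bucket
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
                "Expiration": {"Days": 7 * DAYS_PER_YEAR},
                # Clean up failed uploads that would otherwise be billed indefinitely.
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(customer_data_lifecycle(), indent=2))
```

Keeping the transitions and the multipart-upload abort in one rule means a single policy covers both the tiering and the most commonly missed cleanup.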
Right-sizing comes from data, not gut feel. AWS provides specific tooling for the data; SOA-C02 expects you to know which tool surfaces which signal — and the surprisingly common transfer-cost gotchas.
- AWS Compute Optimizer: ML-driven right-sizing recommendations for EC2 / EBS / Lambda / ECS on Fargate. Identifies over-provisioned or under-provisioned resources based on observed metrics. Free.
- Trusted Advisor: rules-based recommendations across cost / security / performance / fault tolerance / service limits. Business or Enterprise Support unlocks the full set. Surfaces idle ELBs, underutilised EBS volumes, idle RDS instances, and unassociated EIPs.
- Cost Explorer: cube-style analytics on cost and usage data — pivot by service, tag, account, time. Save reports, schedule emails. Use to find runaway services week-over-week.
- Cost & Usage Reports (CUR): detailed billing data delivered to S3 as CSV / Parquet. Ingest into Athena / QuickSight / Power BI for custom dashboards.
- Data transfer costs: the biggest hidden cost on AWS. In-AZ (over private IPs): free. Cross-AZ within a region: ~$0.01/GB in each direction, so ~$0.02/GB when both ends are yours. Cross-region: ~$0.02/GB. Internet egress: $0.05-$0.09/GB. NAT Gateway: ~$0.045/GB processed, on top of the per-hour charge. Pin in-AZ for chatty pipelines.
- S3 Inventory + S3 Storage Lens: Inventory = scheduled report of all objects + metadata to a target bucket. Storage Lens = account-wide usage dashboards with cost optimisation insights.
An audit identifies $50k/month avoidable spend. Pull Compute Optimizer recommendations — finds 30% of EC2 instances over-provisioned. Pull Trusted Advisor — 15 unassociated EIPs ($0.005/hr each), 8 idle RDS instances. Cost Explorer reveals NAT Gateway costs >$8k/month from a chatty cross-AZ Lambda; refactor Lambda to use VPC endpoints for S3 + DynamoDB (zero transfer cost). S3 Storage Lens shows 12 TB of incomplete multipart uploads in older buckets — add lifecycle to abort.
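A quick back-of-the-envelope calculator helps size findings like these. The sketch below uses the approximate rates quoted in this lesson, which vary by region and change over time, and assumes a 730-hour billing month. All function names are illustrative.

```python
HOURS_PER_MONTH = 730  # common billing approximation


def idle_eip_monthly_cost(count: int, hourly_rate: float = 0.005) -> float:
    """Monthly cost of unassociated Elastic IPs at the quoted hourly rate."""
    return count * hourly_rate * HOURS_PER_MONTH


def nat_processing_cost(gb: float, per_gb: float = 0.045) -> float:
    """NAT Gateway data-processing charge (excludes the hourly charge)."""
    return gb * per_gb


def nat_gb_for_spend(monthly_spend: float, per_gb: float = 0.045) -> float:
    """Roughly how much traffic a given NAT processing bill implies."""
    return monthly_spend / per_gb


def cross_az_cost(gb: float, per_gb_each_direction: float = 0.01) -> float:
    """Cross-AZ transfer is billed on both the sending and the receiving side."""
    return gb * per_gb_each_direction * 2


if __name__ == "__main__":
    print(f"15 idle EIPs: ${idle_eip_monthly_cost(15):.2f}/month")
    print(f"$8,000 of NAT processing is about {nat_gb_for_spend(8000):,.0f} GB/month")
    print(f"10 TB cross-AZ: ${cross_az_cost(10_000):.2f}")
```

At these rates the 15 idle EIPs are a small line item, while the NAT Gateway traffic dominates, which is why moving S3 and DynamoDB traffic to Gateway endpoints is the higher-value fix.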
Exam Lab Skills: Hands-On AWS Console · 3 lessons
SOA-C02 is unique among AWS Associate exams: it optionally includes exam labs where you perform real tasks in a live AWS environment. This module covers console skills you must be able to perform under time pressure: creating CloudWatch alarms and log metric filters, configuring Auto Scaling group lifecycle hooks, deploying CloudFormation stacks and detecting drift, running SSM Run Command and Session Manager sessions, configuring S3 bucket policies and lifecycle rules, creating VPC endpoints and updating route tables, and reviewing GuardDuty/Config findings. Practice these in a free-tier AWS account or AWS Skill Builder labs.
SOA-C02 labs put you in front of a live AWS console with a scenario, success criteria, and a time budget. Most candidates lose points not because they don't know AWS but because they ran out of time on a single lab while the next two went un-attempted.
- Lab format: typically 2-3 labs out of 65 questions, ~20% of the score weight. You have one shared time budget across labs + MCQs (~190 minutes total). Labs auto-grade against success criteria the moment you click "Done".
- Time budget rule of thumb: 20 minutes per lab MAX. If you hit 20 min without success criteria met, save what's done and move on — partial credit is awarded per criterion.
- Skim all labs first: when you reach a lab section, read all the success criteria across all labs before doing anything. Order your work — knock out the easiest first, leave the longest for last.
- Read criteria literally: success criteria are checked exactly. "Create a CloudWatch alarm named app-cpu-alarm" — the name MUST match. Wrong name = 0 credit on that criterion even if everything else works.
- Region / account context: the lab has its own AWS account + region. Check the region selector before clicking around. Resources you create are auto-deleted at lab end — no cleanup needed.
- Save and exit pattern: if you must move on, click "Save" not "Done" — Done marks the lab complete and starts grading. Save lets you return if time allows.
Lab 1: 4 criteria, 3 trivial + 1 complex CloudFormation change-set. Lab 2: 6 criteria, all medium. Lab 3: 3 criteria, all hard. Strategy: tackle Lab 1's 3 trivial criteria (10 min, 75% of Lab 1 score). Lab 2 in full (18 min). Lab 3's lowest-hanging criterion if time (5 min, 33% of Lab 3 score). Skip Lab 1's hard criterion and Lab 3's complex ones. Total > 60% of lab score in < 35 min vs going deep on one lab and timing out.
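The triage arithmetic in this example can be checked with a short sketch. It assumes that criteria within a lab are weighted equally and that the three labs carry equal weight; neither assumption is confirmed by the exam guide. The `LabPlan` type is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LabPlan:
    name: str
    total_criteria: int
    attempted_criteria: int  # criteria you expect to complete
    minutes: int             # time budgeted for this lab


def summarize(plans: List[LabPlan]) -> Tuple[int, float]:
    """Return (total minutes, expected fraction of the overall lab score)."""
    total_minutes = sum(p.minutes for p in plans)
    per_lab = [p.attempted_criteria / p.total_criteria for p in plans]
    return total_minutes, sum(per_lab) / len(per_lab)


if __name__ == "__main__":
    plans = [
        LabPlan("Lab 1", total_criteria=4, attempted_criteria=3, minutes=10),
        LabPlan("Lab 2", total_criteria=6, attempted_criteria=6, minutes=18),
        LabPlan("Lab 3", total_criteria=3, attempted_criteria=1, minutes=5),
    ]
    minutes, score = summarize(plans)
    print(f"{minutes} min for about {score:.0%} of the lab score")
```

Under those assumptions the plan costs 33 minutes and yields roughly 69% of the lab score, consistent with the "more than 60% in under 35 minutes" estimate.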
Specific console flows appear over and over in SOA-C02 labs. Practicing these to muscle-memory means you spend exam time on the scenario logic, not the click hunt.
- CloudWatch alarm + SNS: Alarms → Create alarm → pick metric → set threshold + period + M-of-N → action: SNS topic (existing or create new). The most-tested lab — practice the SNS topic creation flow inline.
- Log metric filter + alarm: Log groups → log group → Create metric filter → pattern + metric name → then create alarm on that metric. Two-step flow — easy to forget the alarm step.
- CloudFormation stack + Change Set: CloudFormation → Create stack → upload template → parameters → review → Create. For updates: Update → Replace template → Create change set → Execute. Always Change Set, never direct Update.
- Auto Scaling group lifecycle hook: ASG → Instance management → Lifecycle hooks → Create hook → state transition (launch / terminate) + notification target (SNS / SQS) + heartbeat timeout.
- SSM Session Manager: Systems Manager → Session Manager → Start session → pick instance → opens browser shell. Lab may require enabling KMS encryption on session data — check Preferences first.
- S3 bucket policy + lifecycle: S3 → bucket → Permissions tab (bucket policy in JSON editor — they often hand you the policy). Management tab → Create lifecycle rule → scope (whole bucket / prefix / tag) → actions (transition + expire).
- VPC endpoint: VPC → Endpoints → Create → pick service (S3, DynamoDB Gateway; everything else Interface) → VPC + route table (Gateway) or subnet + SG (Interface).
A typical lab: "Create a CloudWatch alarm named web-cpu-high that fires when EC2 CPUUtilization > 80% for 5 minutes (1 of 1), notifying SNS topic web-alerts (create if missing). Add a metric filter on log group /aws/ec2/web matching ERROR, name WebErrors, then create an alarm on it named web-error-alarm at threshold > 5 per 5 min." Practice this exact flow until it's < 4 minutes end-to-end.
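Although the lab is graded on console work, writing the same scenario as API parameters is a useful way to memorise which fields matter. The sketch below builds keyword arguments in the shape of boto3's `put_metric_alarm` and `put_metric_filter` calls. Several values are assumptions not given in the scenario: the `Average` statistic, the `Lab/Web` metric namespace, the instance ID, the SNS topic ARN, and the use of WebErrors as both the filter name and the metric name. The clients are passed in, so the code does not import boto3 itself.

```python
from typing import Any, Dict


def build_lab_requests(sns_topic_arn: str, instance_id: str) -> Dict[str, Dict[str, Any]]:
    """Parameters for the CPU alarm, the ERROR metric filter, and the error alarm."""
    cpu_alarm = {
        "AlarmName": "web-cpu-high",  # must match the criterion exactly
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",       # assumption: the scenario does not name a statistic
        "Period": 300,                # 5 minutes
        "EvaluationPeriods": 1,       # "1 of 1"
        "DatapointsToAlarm": 1,
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }
    metric_filter = {
        "logGroupName": "/aws/ec2/web",
        "filterName": "WebErrors",
        "filterPattern": "ERROR",
        "metricTransformations": [
            # "Lab/Web" is a hypothetical custom namespace.
            {"metricName": "WebErrors", "metricNamespace": "Lab/Web", "metricValue": "1"}
        ],
    }
    error_alarm = {
        "AlarmName": "web-error-alarm",
        "Namespace": "Lab/Web",
        "MetricName": "WebErrors",
        "Statistic": "Sum",           # count of matching log events per period
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": 5.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }
    return {"cpu_alarm": cpu_alarm, "metric_filter": metric_filter, "error_alarm": error_alarm}


def apply_lab(cloudwatch: Any, logs: Any, requests: Dict[str, Dict[str, Any]]) -> None:
    """Create the filter before the alarm that depends on its metric."""
    cloudwatch.put_metric_alarm(**requests["cpu_alarm"])
    logs.put_metric_filter(**requests["metric_filter"])
    cloudwatch.put_metric_alarm(**requests["error_alarm"])
```

With boto3 installed and credentials configured, `apply_lab(boto3.client("cloudwatch"), boto3.client("logs"), requests)` would issue the three calls. The second alarm's namespace and metric name must match the filter's transformation exactly, which mirrors the two-step console flow.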
Two weeks of focused lab practice is the difference between a confident attempt and a panic. Build a runbook of the 8-10 most common scenarios and execute each one cold under a timer.
- Free-tier practice account: create a separate AWS account just for lab practice. Set a $1 budget alert. Use only free-tier-eligible services (t3.micro, default S3, CloudWatch standard tier).
- AWS Skill Builder labs: some are free, others bundled with AWS-paid training. Skill Builder labs grade the same way as exam labs — closest possible simulation.
- Build a personal runbook: 1-page cheat-sheet per scenario (CloudWatch alarm, metric filter, lifecycle hook, Change Set, etc.) with exact click paths and gotchas. Review the night before.
- Common pitfalls: wrong region (the lab pins you to one — check before clicking); permissions errors (read the criterion — usually the IAM role is pre-provisioned and you just need to USE it); typos in resource names (criteria are exact-match).
- Skipping criteria: if one criterion is taking too long, MARK the lab Saved and move on. Coming back with fresh eyes often unblocks. Don't sink 25 min into a single criterion when you have 3 more across other labs.
- "Done" vs partial: remember Done = final grade. Some candidates leave one criterion unattempted and click Done, scoring partial credit — better than running out of time and getting 0 on that lab AND missing MCQs.
- Day-of habits: read all MCQs first (skip labs); flag the hard MCQs; do labs around the 60-minute mark when fresh but acclimatised; return to flagged MCQs in the final 30 min.
A 2-week prep plan: Days 1-3 — build the practice account, run through 5 canonical labs cold. Days 4-7 — Skill Builder lab pack, time-box each. Days 8-10 — write the personal runbook, drill the muscle-memory list from Lesson 7.2. Days 11-13 — full mock exams + scored lab simulation. Day 14 — rest, review runbook, sleep early. Exam-day: read all MCQs first, mark hards, do labs at minute 60, return to flagged at minute 150.
Top 4 mistakes candidates make on SOA-C02
- Confusing monitoring tools: CloudWatch = metrics/logs/alarms. CloudTrail = API audit history. AWS Config = resource configuration compliance. GuardDuty = threat detection. Knowing which tool answers which question type is critical.
- Skipping lifecycle hooks: The difference between health check grace period (prevents premature termination), default cooldown (prevents rapid scale-out), and lifecycle hooks (pauses instances for custom initialization) is heavily tested.
- Overlooking VPC routing: VPC Peering, Transit Gateway, and Gateway VPC Endpoints all require explicit route table entries (Interface endpoints rely on DNS and security groups instead). The most common trick question: "peering is set up but traffic doesn't flow" → missing routes.
- Ignoring the exam labs: Candidates who only study theory but never use the AWS console struggle with the lab portion. Spend at least 10 hours practicing the most-tested operations in a real AWS free-tier account.
Study roadmap
5-week study plan
Assumes 1 hour per weekday + 2 hours each weekend day (~9 hours/week). Adjust to your schedule.
Week 1: Monitoring + Foundations
Complete Module 1 (CloudWatch, CloudTrail, Config, EventBridge). Set up a free-tier AWS account and create your first CloudWatch alarms and log metric filters hands-on. Take the practice test once to establish your baseline score.
Week 2: Reliability + Deployment
Complete Modules 2–3. Practice creating Auto Scaling lifecycle hooks and CloudFormation drift detection in the console. Listen to CertQuests podcast episodes on disaster recovery strategies during commutes.
Week 3: Security + Compliance
Complete Module 4. Enable GuardDuty and Inspector on your free-tier account to see real findings. Practice creating AWS Config rules and reviewing compliance. Study SCP structure and cross-account IAM role patterns.
Week 4: Networking + Cost
Complete Modules 5–6. Build a VPC with public/private subnets, NAT Gateway, and VPC Endpoint in your account. Run a cost analysis using Cost Explorer to understand your spending patterns. Practice S3 lifecycle policy configuration.
Week 5: Exam Labs + Full Review
Complete Module 7. Take the practice test 2–3 more times targeting >85% score. Use AWS Skill Builder exam labs if available. Focus review on your consistently missed question categories. Schedule your exam.
Ready to test your SOA-C02 knowledge?
60 scenario-based practice questions covering all 6 exam domains. Free, no signup, instant feedback on every answer.
Related certifications
Complete the AWS path
SOA-C02 pairs well with SAA-C03 and DVA-C02 to cover all three AWS Associate specializations.