Reinforce Lambda patterns, DynamoDB design, and CI/CD concepts while commuting. New episodes covering DVA-C02 topics drop weekly.
About the exam
Why earn the AWS Developer Associate?
DVA-C02 is the go-to certification for developers building cloud-native and serverless applications on AWS. It validates that you can architect, build, deploy, and debug production-grade AWS workloads — not just use the console.
- Proves hands-on AWS development skills: Lambda, DynamoDB, API Gateway, SQS/SNS
- Validates CI/CD knowledge: CodePipeline, CodeBuild, CodeDeploy, Elastic Beanstalk
- Demonstrates security practices: IAM, Cognito, KMS, Secrets Manager
- Opens doors to senior developer and cloud engineer roles — median salary $130k+
- Natural next step after Cloud Practitioner (CLF-C02) or alongside SAA-C03
- Valid for 3 years; earns AWS re:Certify continuing education credits
Exam blueprint
DVA-C02 exam domains
Four domains weighted by importance. Development is the largest — make sure Lambda, DynamoDB, and API Gateway are your strongest areas.
Course content
8 modules · ~35 hours
Each module focuses on an exam domain cluster. Work through them in order or jump to your weak areas.
AWS Lambda: Deep Dive · 3 lessons
Lambda is the heart of DVA-C02. This module covers the full Lambda lifecycle: cold starts and provisioned concurrency, execution roles, environment variables, layers, VPC integration, event source mappings (SQS, Kinesis, DynamoDB Streams), synchronous vs asynchronous invocations, retries and DLQs, Lambda Destinations, concurrency limits, and traffic shifting with aliases and versions.
Every Lambda invocation runs inside a managed execution context. AWS reuses the context for subsequent invocations (a warm start) but creates a new one when scaling out or after idle time (a cold start). Knowing what gets reused — and what doesn't — is the difference between a 30ms response and a 2-second response in the same function.
- Cold start phases: download package, start runtime, initialise handler code (this is where module-level imports and SDK clients run), then invoke the handler. Total cold-start time depends on package size + runtime: small Node.js ~150ms, Python ~250ms, .NET ~500ms-1s, Java >1s.
- Reusable execution context: code OUTSIDE the handler runs once per cold start and stays warm across invocations. Cache SDK clients, DB connections, and config there. Anything INSIDE the handler runs on every invocation — keep it lean.
- Provisioned Concurrency: pre-warms a set number of execution environments so they never cold-start. Configured per alias (you can't apply it to $LATEST). Costs continuously even when idle, so only use it for latency-sensitive APIs.
- Reserved Concurrency: caps the maximum concurrent executions for a function. Different from Provisioned: Reserved is a limit (used to protect downstream systems), Provisioned is a pre-warm. You can set both.
- Account concurrency: the AWS account has a soft limit (default 1,000 concurrent executions across all functions in a region). Reserved Concurrency carves out a guaranteed slice from this account pool; unreserved functions share the remainder.
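- SnapStart (Java): a newer feature that snapshots the initialised JVM and resumes from the snapshot, reducing Java cold starts from 2-3 seconds to ~200ms. It launched for Java, where it carries no extra charge; check the current docs for other supported runtimes and their pricing. It applies only to published versions, so you must publish a new version to use it.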
A latency-sensitive API behind API Gateway sees p99 spikes of 1.5s at low traffic. Diagnosis: at low traffic Lambda reclaims idle execution environments between requests, so new requests often pay a cold-start tax. Fix: enable Provisioned Concurrency = 2 on the production alias and point API Gateway at that alias. P99 drops to ~80ms. To protect a downstream legacy database from a sudden Lambda fan-out, also set Reserved Concurrency = 50 so a runaway invocation count can't take down the DB.
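The execution-context reuse described above can be sketched in a few lines. This is a minimal illustration, not a production pattern: `FakeClient` is a hypothetical stand-in for an SDK client or database connection, and the two direct `handler` calls simulate two invocations landing on the same warm environment.

```python
import time

# Module scope: runs once per cold start, then is reused by warm invocations.
INIT_COUNT = 0
_client = None  # cached across invocations in the same execution environment


class FakeClient:
    """Hypothetical placeholder for an expensive-to-create SDK client."""

    def __init__(self):
        global INIT_COUNT
        INIT_COUNT += 1
        self.created_at = time.time()

    def get_greeting(self, name):
        return f"hello {name}"


def get_client():
    """Create the client lazily and cache it at module scope."""
    global _client
    if _client is None:
        _client = FakeClient()
    return _client


def handler(event, context):
    # Handler scope: runs on every invocation, so keep it lean.
    client = get_client()
    return {"message": client.get_greeting(event["name"]), "inits": INIT_COUNT}


# Two invocations in the same (warm) environment share one client.
first = handler({"name": "ada"}, None)
second = handler({"name": "bob"}, None)
```

Both calls report a single initialisation, which is the behaviour you rely on when caching clients outside the handler.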
Lambda's value comes from the breadth of trigger sources it integrates with. There are three invocation models — synchronous (API Gateway, ALB), asynchronous (S3, SNS, EventBridge), and event source mappings (SQS, Kinesis, DynamoDB Streams). The differences determine retry behaviour, error handling, and idempotency requirements.
- Synchronous invocation: caller waits for the response. API Gateway → Lambda is the canonical pattern. Errors propagate back to the caller — no automatic retries. Idempotency optional; you control retry logic on the client side.
- Asynchronous invocation: Lambda accepts the event and returns immediately; processing happens in the background. Two automatic retries with exponential backoff. After exhausted retries, the event can be sent to a Dead Letter Queue (SQS/SNS) or a Lambda Destination (preferred — supports success and failure routing to SQS/SNS/EventBridge/Lambda).
- Event source mapping (poll-based): Lambda's runtime polls the source (SQS, Kinesis, DynamoDB Streams, Kafka). For SQS, success deletes the message; failure leaves it in the queue (visibility timeout-based retry). For Kinesis/DynamoDB Streams, failure blocks the shard — use bisect-on-error and a DLQ to drain poison messages.
- SQS batch size and partial batch responses: Lambda receives up to 10 messages per invocation by default (Standard queues can be raised to 10,000 with a batching window; FIFO tops out at 10). With ReportBatchItemFailures enabled, return only the failed message IDs and the rest are deleted, which is vital to avoid reprocessing the whole batch on one bad message.
- Lambda Destinations vs DLQ: Destinations are richer than the legacy DLQ. They support routing both success and failure events to a target (SQS / SNS / EventBridge / Lambda), include full execution metadata, and are configured per alias/version. Prefer Destinations for new builds.
- Idempotency: any non-sync source can deliver the same event more than once. Use idempotency keys in the event payload to detect retries (cache key in DynamoDB with TTL). The AWS Lambda Powertools libraries provide this out of the box.
An order-processing pipeline reads from SQS and writes to DynamoDB. Occasional malformed orders break processing and the entire batch retries. Fix: enable ReportBatchItemFailures in the event source mapping; the function returns the failed message IDs and the other messages are deleted from the queue. Add a DLQ on the source queue with maxReceiveCount = 3 so poison messages drain to the DLQ after 3 failed receives. On-success Destinations fire only for asynchronous invocations, not for SQS event source mappings, so for the rare success-path notification have the handler publish to an SNS topic that fans out order-confirmation emails.
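A partial batch response could look roughly like the sketch below. It assumes ReportBatchItemFailures is enabled on the event source mapping; `process_order` is a hypothetical piece of business logic, and the sample event contains only the two record fields the handler reads.

```python
import json


def process_order(order):
    """Hypothetical business logic: reject orders without an orderId."""
    if "orderId" not in order:
        raise ValueError("malformed order")
    return order["orderId"]


def handler(event, context):
    """Process each SQS record and report only the failures back to Lambda."""
    failures = []
    for record in event["Records"]:
        try:
            process_order(json.loads(record["body"]))
        except Exception:
            # Only this message returns to the queue; the others are deleted.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


event = {
    "Records": [
        {"messageId": "m-1", "body": json.dumps({"orderId": "o-1"})},
        {"messageId": "m-2", "body": "not json"},
        {"messageId": "m-3", "body": json.dumps({"total": 10})},
    ]
}
result = handler(event, None)
```

Returning an empty `batchItemFailures` list signals that the whole batch succeeded.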
Lambda's deployment story is two primitives: immutable versions snapshot the code + configuration, and mutable aliases are named pointers (with weights) at one or two versions. Together they enable canary and linear deployments without re-platforming. VPC integration is the extra wrinkle: necessary for private resources, but with implications for cold start and IP planning.
- $LATEST and versions: every code change updates $LATEST in place. Publishing a version creates an immutable snapshot with a numeric version (1, 2, 3...). Pinning callers (API Gateway integration, event source mapping) to a version means upgrades require a deploy of the consumer, not just a Lambda push.
- Aliases: a named pointer at one OR two versions. Aliases support weighted routing, e.g., prod at 90% version 8 and 10% version 9. API Gateway / event source mappings target the alias so you control traffic shifting without changing consumers.
- Traffic shifting patterns: all-at-once (instant cut-over, fastest, riskiest), linear (X% every Y minutes, e.g., 10% every 10 minutes), canary (small % then jump, e.g., 10% for 10 min then 100%). All managed declaratively by CodeDeploy or SAM DeploymentPreference.
- Lambda Layers: shared library bundles attached to one or more functions. Max 5 layers per function, max 250 MB unzipped (including all layers + function code). Common use: shared SDK clients, common utilities, large native binaries (Pillow, libpq).
- VPC integration: attach the function to subnets + security groups to reach RDS, ElastiCache, EFS, internal ALBs. AWS auto-provisions Hyperplane ENIs in the subnet — cold-start impact is now negligible (used to be 10+ seconds). Plan subnet IP space: each Hyperplane ENI consumes one IP from the subnet.
- Internet egress from VPC: a VPC-attached Lambda loses public internet access because it lives in private subnets. To reach the internet (3rd-party API, Bedrock), add a NAT Gateway in the VPC and route the Lambda's subnets through it. Cost gotcha: a NAT Gateway bills per hour plus per GB processed.
A team deploys a new Lambda version that interacts with an internal RDS instance. Design: attach Lambda to two private subnets (different AZs) with a security group allowing 5432 to the RDS SG. Publish version 5, create an alias prod pointing at it with weight 1.0. CI/CD then publishes version 6, updates the alias to 90% v5 + 10% v6 via SAM DeploymentPreference: Canary10Percent10Minutes. CloudWatch Alarms watch error rate; if the alarm fires within the 10-minute canary window, CodeDeploy automatically rolls back to v5. After the window, alias jumps to 100% v6.
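To make the canary and linear patterns concrete, here is a small sketch that computes the traffic steps and the weighted-alias payload. The schedule functions are illustrative only (CodeDeploy does this for you), and the exact timing of the final step is an assumption. The dictionary returned by `alias_routing_config` follows the `RoutingConfig` shape of Lambda's UpdateAlias API.

```python
def alias_routing_config(new_version, new_weight):
    """Shape of the RoutingConfig argument accepted by Lambda's UpdateAlias API."""
    if not 0.0 <= new_weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return {"AdditionalVersionWeights": {str(new_version): new_weight}}


def canary_schedule(percent, minutes):
    """(minute, percent-on-new-version) steps for a canary shift."""
    return [(0, percent), (minutes, 100)]


def linear_schedule(step_percent, interval_minutes):
    """(minute, percent-on-new-version) steps for a linear shift."""
    steps = []
    percent, minute = step_percent, 0
    while percent < 100:
        steps.append((minute, percent))
        percent += step_percent
        minute += interval_minutes
    steps.append((minute, 100))
    return steps


canary = canary_schedule(10, 10)    # like Canary10Percent10Minutes
linear = linear_schedule(10, 10)    # like Linear10PercentEvery10Minutes
routing = alias_routing_config(6, 0.10)  # 10% of prod traffic to version 6
```

The alias itself keeps pointing at the old version; the additional weight is what sends a slice of traffic to the new one.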
API Gateway: REST, HTTP & WebSocket APIs · 3 lessons
Build and secure APIs the AWS way. Covers REST vs HTTP API differences, Lambda proxy vs custom integration, request/response mapping templates, stage variables, caching and throttling, usage plans and API keys, Lambda authorizers, Cognito User Pool authorizers, CORS configuration, canary deployments, and WebSocket APIs for real-time bidirectional communication.
API Gateway has two flavours that get confused all the time: REST API (the original, feature-complete) and HTTP API (newer, simpler, ~70% cheaper). The exam loves edge-case questions on which features exist in which flavour, because picking the wrong one costs you either money or capability.
- HTTP API: the modern, low-cost option. JWT authorizers via Cognito or any OIDC provider, simple request/response, auto-deployed stages, CORS first-class. NO request/response mapping templates, NO API keys / usage plans, NO caching, NO WAF integration (until recently — check the current docs). Use for serverless microservices that don't need transformation.
- REST API: the full-feature option. Lambda + Cognito authorizers, request/response mapping (VTL), API keys + usage plans, response caching, WAF, private API endpoints with VPC interface endpoints. Costs ~3.5× HTTP API. Use when you need any of those features.
- Lambda integrations: proxy integration (the default — full request passed as JSON, full response expected back) vs custom integration (you write mapping templates to translate). Proxy is what 95% of teams use; custom only when you must integrate a non-Lambda backend or transform the contract.
- Stages and stage variables: a stage is a named deployment of the API (dev, prod). Stage variables are key/value pairs accessible from integrations and authorizers. Use them to point at different Lambda aliases per environment (e.g., ${stageVariables.lambdaAlias}).
- Throttling and caching: account-level throttle (10,000 RPS soft limit), stage-level throttle (override per stage), method-level throttle (per-route override). Caching is available only on REST API: TTL up to 1 hour, cache key includes path + headers you opt into. Cache cost scales with size.
- Endpoint types: Edge-optimised (CloudFront in front, default for REST), Regional (your region only, faster from same-region clients), Private (REST only — accessible only via VPC interface endpoint). HTTP APIs are always Regional.
A SaaS startup needs a public API with three properties: (1) cost-efficient, (2) authenticated via Cognito User Pool JWT, (3) protected by simple per-stage rate limits. Choice: HTTP API with a JWT authorizer pointing at the Cognito issuer, stages dev and prod with stage-level throttling (e.g., 100 / 1000 RPS). Saves ~70% vs REST. If later the team needs request transformation or response caching, they migrate to a REST API — but only then.
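For the proxy integration mentioned in the list above, a handler might look like this minimal sketch. The `name` query parameter and the response payloads are made up for illustration; the `statusCode` / `headers` / `body` response shape is what a proxy integration expects back.

```python
import json


def handler(event, context):
    """Lambda proxy integration: the whole HTTP request arrives in `event`,
    and the return value must describe the whole HTTP response."""
    params = event.get("queryStringParameters") or {}
    name = params.get("name")
    if not name:
        return {
            "statusCode": 400,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"error": "name is required"}),
        }
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"greeting": f"hello {name}"}),  # body must be a string
    }


ok = handler({"queryStringParameters": {"name": "ada"}}, None)
bad = handler({"queryStringParameters": None}, None)
```

Note the `or {}` guard: the query-string field can be null when the request carries no parameters.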
API Gateway has four authorizer types, each suited to a different identity story. The exam asks you to pick the right one given a description of the client population — corporate users, consumer app, server-to-server, third-party integration.
- IAM authorizer (SigV4): the caller signs requests with AWS credentials. Right for service-to-service inside AWS, e.g., a Lambda in another account calling your API. Free, no Lambda invocation per call. Doesn't fit end-user clients.
- Cognito User Pool authorizer: the caller presents a Cognito JWT in the Authorization header. API Gateway validates the JWT against the configured Cognito pool. Right for consumer apps where Cognito handles sign-up / sign-in / MFA / federation.
- Lambda authorizer (custom): a Lambda function you write evaluates the request (token-based on Authorization header, OR request-based on multiple inputs) and returns an IAM policy that allows or denies. Used for third-party JWT validation (Auth0, Okta, Firebase), custom business rules, or PII-aware decisions. Cache results to avoid invoking on every request — TTL up to 1 hour.
- JWT authorizer (HTTP API only): built-in OIDC JWT validation against any issuer (Cognito, Auth0, Azure AD). Same feel as Lambda authorizer but no Lambda — config-only, free, fast. Use whenever the identity story is "any OIDC IdP".
- API keys + usage plans: distinct from authorizers — API keys identify the CALLING SYSTEM (not the user), bind to usage plans that enforce per-key throttle + quota (RPS + monthly request count). Use for paid-tier rate limiting or partner API access. Always combine with an authorizer for real auth — API keys alone are not authentication.
- CORS: handled by API Gateway via OPTIONS preflight responses. HTTP API has first-class CORS config (just declare allowed origins/methods/headers). REST API requires enabling CORS on each method — easy to miss and the most common "my API doesn't work in the browser" complaint.
A B2B SaaS API needs: (1) end-user auth via the customer's own SSO (any OIDC), (2) partner-level rate limits on shared keys, (3) admin endpoints callable only from internal AWS Lambda. Design: API keys + usage plans exist only on REST API, so use a REST API with a Lambda authorizer that validates JWTs from the customer's OIDC issuer on user routes; API keys + usage plans on top of the JWT check for per-partner quotas; IAM authorization on the internal admin routes. Three identity stories, three access-control mechanisms, one logical API.
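A token-based Lambda authorizer, as described in the list above, returns an IAM policy for the requested method. The sketch below shows that response shape. The `VALID_TOKENS` lookup is a hypothetical placeholder for real JWT validation against an identity provider, and the ARN is a made-up example.

```python
VALID_TOKENS = {"allow-me": "user-123"}  # hypothetical stand-in for real JWT validation


def build_policy(principal_id, effect, method_arn):
    """IAM policy document in the shape API Gateway expects from an authorizer."""
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "execute-api:Invoke",
                    "Effect": effect,
                    "Resource": method_arn,
                }
            ],
        },
    }


def handler(event, context):
    """Token-based authorizer: the token arrives in event['authorizationToken']."""
    token = event.get("authorizationToken", "")
    if token.startswith("Bearer "):
        token = token[len("Bearer "):]
    user = VALID_TOKENS.get(token)
    effect = "Allow" if user else "Deny"
    return build_policy(user or "anonymous", effect, event["methodArn"])


ARN = "arn:aws:execute-api:us-east-1:123456789012:abc123/prod/GET/orders"  # example only
allowed = handler({"authorizationToken": "Bearer allow-me", "methodArn": ARN}, None)
denied = handler({"authorizationToken": "Bearer nope", "methodArn": ARN}, None)
```

With result caching enabled, API Gateway reuses the returned policy for the configured TTL instead of invoking the function on every request.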
How you ship to API Gateway determines the blast radius of a bad release. Add CloudWatch + X-Ray and you can find regressions before customers do. WebSockets are the off-spec topic that always catches one or two candidates per exam.
- Canary deployments (REST API): deploy a new version to the same stage but route a small % of traffic to it. Run for a defined window; promote the canary if metrics are clean, or remove the canary settings to send all traffic back to the prior deployment. Rollback here is a manual or scripted step: alarm-driven rollback through CodeDeploy applies to Lambda alias shifting, not to API Gateway canaries.
- HTTP API auto-deploy: changes to an HTTP API can be auto-deployed to the configured stage — simpler but less control. Use stage variables and feature flags to decouple risky changes from the deploy.
- CloudWatch metrics: per-stage / per-method 4XXError, 5XXError, Count, Latency, IntegrationLatency. Build dashboards for p50/p99 latency and error rate; alarm on 5XX rate > 1% over 5 minutes.
- Access logging: per-request log entries (JSON) sent to CloudWatch Logs or Kinesis Firehose. Enable to debug specific requests by request ID. Costs ingestion fees, so sample if traffic is huge.
- X-Ray tracing: enable on the stage to capture per-segment latency through the integration. Lambda must opt in separately (active tracing on the function + IAM permissions to write to X-Ray). Surfaces "DynamoDB call took 800ms" inside an otherwise unexplainable 1.2s API response.
- WebSocket APIs: a separate API type for bidirectional, stateful connections. Routes are keyed by $connect, $disconnect, $default, and message-payload routes you define. Storage of connection IDs is your job; DynamoDB is the canonical backing store. Backend (typically Lambda) calls PostToConnection via the management API to push messages to specific clients.
A trading-platform team needs a real-time price-update feed to thousands of browsers. Design: WebSocket API with $connect Lambda that writes the connectionId + userId to a DynamoDB connections table, $disconnect Lambda that deletes the row. A separate price-publisher Lambda triggered by a Kinesis stream reads all active connections and calls PostToConnection per subscriber. Add X-Ray for end-to-end latency tracing. A per-stage CloudWatch alarm on sudden drops in ConnectCount signals a backend issue.
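The connection bookkeeping in this design can be sketched as follows. Several things here are assumptions for illustration: an in-memory dict stands in for the DynamoDB connections table, the userId is read from a hypothetical query-string parameter, and `FakeManagementClient` imitates only the `post_to_connection(ConnectionId=..., Data=...)` call, raising `ConnectionError` where a real client would raise its own stale-connection exception.

```python
connections = {}  # in-memory stand-in for the DynamoDB connections table


def on_connect(event, context):
    """$connect route: remember the connection."""
    params = event.get("queryStringParameters") or {}
    connection_id = event["requestContext"]["connectionId"]
    connections[connection_id] = {"userId": params.get("userId", "unknown")}
    return {"statusCode": 200}


def on_disconnect(event, context):
    """$disconnect route: forget the connection."""
    connections.pop(event["requestContext"]["connectionId"], None)
    return {"statusCode": 200}


def broadcast(client, payload):
    """Push payload to every known connection; drop the ones that fail."""
    delivered = 0
    for connection_id in list(connections):
        try:
            client.post_to_connection(ConnectionId=connection_id, Data=payload)
            delivered += 1
        except ConnectionError:
            # Stale connection: remove it so later broadcasts skip it.
            connections.pop(connection_id, None)
    return delivered


class FakeManagementClient:
    """Test double that mimics post_to_connection(ConnectionId=..., Data=...)."""

    def __init__(self, gone):
        self.gone, self.sent = set(gone), []

    def post_to_connection(self, ConnectionId, Data):
        if ConnectionId in self.gone:
            raise ConnectionError("stale connection")
        self.sent.append((ConnectionId, Data))


for cid in ("c1", "c2", "c3"):
    on_connect({"requestContext": {"connectionId": cid}}, None)
on_disconnect({"requestContext": {"connectionId": "c3"}}, None)
fake = FakeManagementClient(gone={"c2"})
delivered = broadcast(fake, b'{"price": 101.5}')
```

Cleaning up stale connections during the broadcast matters because $disconnect is not guaranteed to fire for every dropped client.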
DynamoDB: Design, Query & Optimization · 3 lessons
DynamoDB is the most-tested data store on DVA-C02. Master partition key and sort key design, avoiding hot partitions, LSI vs GSI, Query vs Scan and when to use each, ProjectionExpression, ConditionExpression, optimistic locking with version attributes, DynamoDB Streams, TTL, transactions (TransactGetItems/TransactWriteItems), DAX for caching, and on-demand vs provisioned capacity.
DynamoDB is designed for known access patterns — get this exact entity by this exact key. Get the key design right and your application scales linearly; get it wrong and you'll fight hot partitions for years. The exam tests partition-key design more than any other DynamoDB topic.
- Partition key (PK): required, hashes to determine which physical partition stores the item. Choose a PK with high cardinality and even access distribution. Bad: status ("active"/"deleted"), because all "active" items land in one partition. Good: userId, with millions of distinct values, evenly accessed.
- Sort key (SK): optional second part of the primary key. Items with the same PK are stored sorted by SK and Queryable as a range. Common pattern: PK = userId, SK = orderTimestamp. Query gives "all orders for this user, newest first" in O(log n).
- Hot partition: traffic concentrated on one PK value (or a few). DynamoDB's adaptive capacity helps but doesn't eliminate the issue. Mitigation: write sharding. Append a random suffix (userId#0...userId#9) to spread writes, then Query all 10 shards in parallel on read.
- Single-table design: store multiple entity types (users, orders, products) in one DynamoDB table, distinguished by PK/SK prefixes (USER#123, ORDER#456, PRODUCT#789). Lets one Query return a hierarchical record (user + their last 10 orders) in one request.
- Composite-key Query patterns: with PK + SK, Query supports begins_with, between, <, >, = on the SK. Set ScanIndexForward=false with a timestamp-formatted SK to query newest-first cheaply.
- Attribute size and item limits: max 400 KB per item. The total counts all attributes, including their names. For large objects, store metadata in DynamoDB and the blob in S3 with the S3 URL as an attribute.
An e-commerce app needs: (1) get user profile by userId, (2) list all orders for a user, newest first, (3) get a single order by orderId. Single-table design: PK = USER#{userId}, SK = PROFILE for the profile item; PK = USER#{userId}, SK = ORDER#{2026-05-16T18:34:00Z}#{orderId} for each order. Query (1) is GetItem with PK + SK=PROFILE. Query (2) is Query with PK + begins_with(SK, "ORDER#") and ScanIndexForward=false. Query (3) needs a GSI with PK=ORDER#{orderId}.
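The key layout above can be expressed as a few helper functions. This is a sketch under assumptions: the table name `app-table` and the helper names are invented, and the returned dictionary mirrors the low-level Query request shape without calling DynamoDB.

```python
from datetime import datetime, timezone


def user_pk(user_id):
    return f"USER#{user_id}"


def order_sk(placed_at, order_id):
    """ISO-8601 UTC timestamps sort lexicographically in chronological order."""
    stamp = placed_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"ORDER#{stamp}#{order_id}"


def orders_query_params(table, user_id, limit=10):
    """Keyword arguments in the shape of the low-level DynamoDB Query API."""
    return {
        "TableName": table,
        "KeyConditionExpression": "PK = :pk AND begins_with(SK, :prefix)",
        "ExpressionAttributeValues": {
            ":pk": {"S": user_pk(user_id)},
            ":prefix": {"S": "ORDER#"},
        },
        "ScanIndexForward": False,  # descending SK order, i.e. newest first
        "Limit": limit,
    }


older = order_sk(datetime(2026, 5, 16, 18, 34, tzinfo=timezone.utc), "o-1")
newer = order_sk(datetime(2026, 5, 17, 9, 0, tzinfo=timezone.utc), "o-2")
params = orders_query_params("app-table", "123")
```

Because the sort key starts with the timestamp, string order and time order agree, which is what makes the newest-first Query cheap.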
Once your keys are designed, the two read operations are Query (fast, indexed) and Scan (slow, sequential). Secondary indexes give you alternate access paths into the data. Capacity mode controls the cost model — on-demand pays per request, provisioned pays per reserved capacity unit.
- Query: must specify PK; can filter on SK; reads one partition. Returns up to 1 MB per page, paginated via LastEvaluatedKey. Use ProjectionExpression to fetch only needed attributes. Always the right answer for known-entity reads.
- Scan: reads every item in the table (or index). Filter expressions are applied AFTER the read, so you still pay RCU for every item read, even if filtered out. Avoid Scan except for periodic reporting jobs against small tables.
- Global Secondary Index (GSI): alternate (PK, SK) pair indexing the SAME table. Eventually consistent (lag is usually <1 second but can spike). Has its own provisioned capacity. Up to 20 per table. Use for "I need to look up an order by orderId but the table PK is userId".
- Local Secondary Index (LSI): alternate SK against the SAME PK. Must be created with the table (cannot add later). Limited to 5 per table. Strongly consistent reads available. Use for "all orders by user, sorted by total instead of by timestamp".
- Capacity modes: On-demand pays per million read/write request units — no capacity planning, scales to any traffic. Provisioned reserves RCU/WCU per second, supports auto-scaling and Reservations. Provisioned is ~5-7× cheaper at steady high traffic. On-demand is the default choice for variable / unpredictable workloads.
- RCU/WCU math: 1 RCU = 1 strongly-consistent read up to 4 KB, or 2 eventually-consistent. 1 WCU = 1 write up to 1 KB. Transactional read costs 2× RCU. Transactional write costs 2× WCU. Plan for the largest item, not the average.
The order-lookup pattern from Lesson 3.1 needs to support "find an order by orderId" without knowing the user. Create a GSI on the orders table with PK = orderId, no SK, and project KEYS_ONLY (cheapest). Lookup logic: Query the GSI by orderId to get the user/order key, then GetItem from the main table. Total cost: 0.5 RCU (eventually consistent GSI Query reading under 4 KB) + 0.5 RCU (eventually consistent GetItem on a small item) = 1 RCU. On-demand mode keeps this trivially cheap at unpredictable traffic.
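The RCU/WCU rules from the list above reduce to a little arithmetic, sketched here for a single item per operation. The function names are invented for this example, and sizes are in bytes with 1 KB taken as 1,024 bytes.

```python
import math


def read_capacity_units(item_bytes, consistency="eventual"):
    """RCUs for one read of an item: 4 KB blocks, scaled by read mode."""
    blocks = max(1, math.ceil(item_bytes / 4096))
    factor = {"eventual": 0.5, "strong": 1.0, "transactional": 2.0}[consistency]
    return blocks * factor


def write_capacity_units(item_bytes, transactional=False):
    """WCUs for one write of an item: 1 KB blocks, doubled for transactions."""
    blocks = max(1, math.ceil(item_bytes / 1024))
    return blocks * (2 if transactional else 1)


small_eventual = read_capacity_units(300)             # one 4 KB block, halved
large_strong = read_capacity_units(4097, "strong")    # spills into a second block
txn_write = write_capacity_units(1500, transactional=True)
```

Sizes round up to the next block, which is why planning for the largest item rather than the average is the safer estimate.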
The "everything else" lesson — features that DVA-C02 tests heavily but that don't fit the keys / queries / capacity story. Get these right and your data layer becomes resilient (transactions), event-driven (streams), self-cleaning (TTL), and microsecond-fast (DAX).
- DynamoDB Streams: an ordered change log per shard for the past 24 hours. Stream view types: KEYS_ONLY, NEW_IMAGE, OLD_IMAGE, NEW_AND_OLD_IMAGES. The standard event source for Lambda-driven cross-region replication, analytics fan-out, or trigger-style validation.
- Optimistic locking: add a version attribute, include ConditionExpression: version = :expected on every update, and increment version on each write. Concurrent updates with a stale version receive ConditionalCheckFailedException, and the client retries with the latest value. No actual locks taken.
- Transactions: TransactWriteItems and TransactGetItems group up to 100 actions into one ACID-compliant operation. All succeed or all roll back. Cost: 2× the equivalent non-transactional units. Use for cross-item invariants ("transfer money from account A to account B").
- TTL (Time To Live): add a numeric attribute holding a Unix timestamp. DynamoDB deletes items past their TTL within ~48 hours, asynchronously, free. Useful for session storage, idempotency keys, cache layers. Items past TTL but not yet deleted still appear in Query/Scan, so filter them out client-side if necessary.
- DAX (DynamoDB Accelerator): in-memory write-through cache cluster in front of DynamoDB. Microsecond reads for cache hits, sub-millisecond writes. Only works with DynamoDB native operations (not PartiQL with FetchAll, not all index types). Sit DAX in private subnets — connect via the cluster endpoint.
- Global Tables: multi-region active-active replication. Tables in each region accept writes; conflicts resolved by last-writer-wins on the per-region timestamp. Use for global low-latency reads/writes; not a substitute for backup (a Global Table replicates a delete to every region within seconds).
A funds-transfer API debits one account and credits another. Naive design with two UpdateItem calls can leave money in limbo if the second fails. Correct: TransactWriteItems with two updates, where the debit carries a ConditionExpression (only if balance >= :amount) and the credit is unconditional. Both succeed or both fail. Wrap in optimistic-locking version checks if concurrent transfers on the same account are possible. Write the transfer record to a table with DynamoDB Streams enabled so a Lambda can asynchronously notify the user via SNS.
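The transfer could be assembled as a single TransactWriteItems request along these lines. The table name `accounts`, the `accountId` key, and the `balance` attribute are assumptions for illustration; the sketch only builds the request dictionary in the low-level API shape and does not call DynamoDB.

```python
def transfer_request(table, from_account, to_account, amount):
    """Keyword arguments in the shape of the low-level TransactWriteItems API."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    values = {":amt": {"N": str(amount)}}
    names = {"#bal": "balance"}  # alias the attribute name inside expressions
    debit = {
        "Update": {
            "TableName": table,
            "Key": {"accountId": {"S": from_account}},
            "UpdateExpression": "SET #bal = #bal - :amt",
            "ConditionExpression": "#bal >= :amt",  # refuse to overdraw
            "ExpressionAttributeNames": names,
            "ExpressionAttributeValues": values,
        }
    }
    credit = {
        "Update": {
            "TableName": table,
            "Key": {"accountId": {"S": to_account}},
            "UpdateExpression": "SET #bal = #bal + :amt",
            "ExpressionAttributeNames": names,
            "ExpressionAttributeValues": values,
        }
    }
    return {"TransactItems": [debit, credit]}


request = transfer_request("accounts", "A", "B", 100)
```

If the debit condition fails, the whole transaction is cancelled and the credit is never applied.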
Messaging: SQS, SNS, Kinesis & EventBridge · 3 lessons
Decoupled event-driven architectures are a core DVA-C02 topic. Learn when to use SQS (durable queuing), SNS (fan-out pub/sub), Kinesis Data Streams (ordered high-throughput streaming), and EventBridge (event routing with schema registry). Key concepts: SQS visibility timeout and DLQ, FIFO vs Standard queues, SNS subscription filters, Kinesis shard management and Lambda parallelization factor, and EventBridge rules and targets.
SQS is the durable queue. Producers write, consumers poll, the queue keeps the message until acknowledged. Two queue types — Standard (high throughput, at-least-once, no ordering) and FIFO (exactly-once, ordered, lower throughput) — and a small set of operational levers that the exam tests heavily.
- Standard vs FIFO: Standard supports nearly unlimited throughput, best-effort ordering, at-least-once delivery (duplicates possible). FIFO gives strict ordering inside a MessageGroupId and exactly-once processing through deduplication (an explicit MessageDeduplicationId, or ContentBasedDeduplication to hash the body) within a 5-minute window. FIFO throughput is 300 API calls/sec per action by default (~3,000 msg/sec with batches of 10), and considerably more with high throughput mode. FIFO queue names must end with .fifo.
- Visibility timeout: when a consumer receives a message, it's invisible to others for the visibility timeout (default 30s, max 12h). If the consumer fails before deleting, the message becomes visible again. Tune to slightly longer than your slowest consumer; too short causes duplicate processing.
- Long polling: set ReceiveMessageWaitTimeSeconds = 20 (max). The Receive API call holds open until a message arrives or 20s elapse, which drastically cuts cost vs short polling (~$0.40/M reqs adds up fast on a hot consumer).
- Dead Letter Queue (DLQ): a second SQS queue that receives messages that have been received but not deleted N times (maxReceiveCount, set in the redrive policy). Lets poison messages drain off the main queue. DLQ for FIFO must also be FIFO; DLQ for Standard must be Standard.
- Message size: up to 256 KB. For larger payloads, store the body in S3 and put the S3 URL in the SQS message (the Extended Client Library does this automatically). This supports large-payload pipelines without changing producer/consumer code much.
- Delay queues and delays per message: a queue-level DelaySeconds (up to 15 minutes) postpones visibility for every new message. Per-message delay is supported on Standard only. Useful for scheduled retries or pacing downstream systems.
An e-commerce order pipeline processes ~5,000 orders/sec at peak. Orders for the same customer must be processed in order, but order between different customers doesn't matter. Choice: FIFO queue with high-throughput mode + MessageGroupId = customerId so each customer becomes a strict-order shard while different customers run in parallel. Visibility timeout 60s (consumer p99 is 30s). DLQ with maxReceiveCount = 5. ContentBasedDeduplication on, so the producer's retry of the same body doesn't double-process.
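On the producer side, the per-customer ordering and deduplication in this design come down to two message fields. The sketch below builds SendMessage arguments without calling SQS. The queue URL is a made-up example, and the explicit SHA-256 deduplication ID is an illustrative alternative to turning on ContentBasedDeduplication, which hashes the body on the service side.

```python
import hashlib
import json


def content_dedup_id(body):
    """SHA-256 hex digest of the body, used as an explicit deduplication ID."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def fifo_send_params(queue_url, order):
    """Keyword arguments in the shape of the SQS SendMessage API for a FIFO queue."""
    body = json.dumps(order, sort_keys=True)  # stable serialisation gives a stable hash
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        "MessageGroupId": order["customerId"],  # strict order per customer
        "MessageDeduplicationId": content_dedup_id(body),
    }


QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders.fifo"  # example only
first = fifo_send_params(QUEUE_URL, {"customerId": "c-1", "orderId": "o-1"})
retry = fifo_send_params(QUEUE_URL, {"orderId": "o-1", "customerId": "c-1"})
other = fifo_send_params(QUEUE_URL, {"customerId": "c-1", "orderId": "o-2"})
```

A producer retry of the same order yields the same deduplication ID, so the queue drops the duplicate while the second order for that customer still goes through.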
SNS and EventBridge both deliver events to multiple subscribers, but they target different patterns. SNS is the simple fan-out pub/sub — one topic, many subscribers, optional filter. EventBridge is the routing bus — one event, many rules that fire targets based on content matching against a schema. Picking between them is a frequent exam question.
- SNS topics: publish once, fan-out to many subscribers. Subscriber types: SQS, Lambda, HTTP(S) endpoint, email, SMS, mobile push, Kinesis Data Firehose, application endpoints. Standard topics (best-effort order, at-least-once) and FIFO topics (with FIFO SQS subscribers, deduplication, group-by-MessageGroupId).
- SNS subscription filters: JSON policy attached to a subscription that tests message attributes (or the message body when FilterPolicyScope is set to MessageBody). Only matching messages are delivered. This avoids sending every message to every subscriber when only a subset cares.
- Cross-region/cross-account fan-out: SNS topics can have subscribers in other regions and accounts. The standard pattern for "one publisher, multiple business-unit consumers".
- EventBridge: a managed event bus with rules. Each rule has an event pattern (content-based filter) and one or more targets (Lambda, Step Functions, SQS, SNS, Kinesis, ECS, Batch, etc.). Schema registry can auto-discover and version event schemas. Three bus types: default (AWS service events), custom (your events), partner (SaaS integrations like Auth0, Datadog, Shopify).
- Scheduled rules: EventBridge supports cron and rate expressions on rules (EventBridge is the successor to CloudWatch Events and keeps the same schedule syntax). Use them to trigger Lambda on a schedule; for time zones or one-time schedules, look at EventBridge Scheduler.
- EventBridge Pipes (newer): point-to-point integration with optional filter and enrichment between source (SQS/Kinesis/DynamoDB Stream) and target (Lambda/Step Functions/SNS/etc.). Replaces the "Lambda just to translate" anti-pattern for many integrations.
An order-published event needs to: (1) fan out to a billing service, an inventory service, and a notifications service; (2) only the notifications service cares about orderTotal > $500 (VIP customers). Choice A (SNS): one topic, three SQS subscribers, an SNS subscription filter on the notifications subscriber matching {"orderTotal": [{"numeric": [">", 500]}]}. Choice B (EventBridge): one event, three rules. SNS is simpler and free for the fan-out itself. EventBridge wins when content routing gets more complex or you want schema discovery.
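To see how a filter policy like the one in Choice A selects messages, here is a simplified local evaluator. It is an illustration of the matching idea only, covering exact matches and the numeric operator, and is not how SNS itself is implemented. The function names are invented for this example.

```python
import operator

NUMERIC_OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
}


def condition_matches(condition, value):
    """One condition: a literal for exact match, or {"numeric": [op, n, ...]}."""
    if isinstance(condition, dict) and "numeric" in condition:
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        spec = condition["numeric"]
        pairs = zip(spec[0::2], spec[1::2])  # e.g. [">", 0, "<=", 100]
        return all(NUMERIC_OPS[op](value, bound) for op, bound in pairs)
    return condition == value


def policy_matches(policy, attributes):
    """Every policy key must be present and satisfy at least one of its conditions."""
    return all(
        key in attributes
        and any(condition_matches(c, attributes[key]) for c in conditions)
        for key, conditions in policy.items()
    )


vip_filter = {"orderTotal": [{"numeric": [">", 500]}]}
vip_match = policy_matches(vip_filter, {"orderTotal": 600})
small_match = policy_matches(vip_filter, {"orderTotal": 120})
```

Keys in the policy are combined with AND, while the list of conditions under one key is combined with OR.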
When you need ordered, high-throughput, replayable streams (real-time clickstream, telemetry, financial ticks), Kinesis Data Streams is the answer. The mental model is fundamentally different from SQS — partitioned by shard, consumers track position, retention up to 365 days. Exam questions focus on shard scaling and Lambda integration parallelism.
- Shards: the unit of throughput. Each shard supports 1 MB/sec or 1000 records/sec write, 2 MB/sec read. Capacity scales by adding shards. Provisioned mode (you size manually) vs On-Demand mode (capacity is managed for you and accommodates up to roughly double the peak write throughput seen in the previous 30 days).
- Partition key: producer-supplied; hashed to determine which shard. Records with the same partition key always land in the same shard, preserving order WITHIN that key. Even distribution is your job — a hot partition key creates a hot shard exactly like DynamoDB.
- Retention: 24 hours by default, configurable up to 365 days. Replay-capable — a downstream consumer can re-process the last 7 days by resetting its position. Cost increases with retention.
- Lambda as consumer: Lambda polls the stream via event source mapping. Parallelization factor (1-10) increases how many Lambda invocations process records FROM ONE SHARD concurrently, which is useful when downstream work is slow and records-per-shard exceeds Lambda's throughput. Use carefully: order is still preserved per partition key, but records with different partition keys in the same shard may be processed out of their original shard order.
- Failure handling for stream consumers: a failing batch BLOCKS the shard (Lambda retries indefinitely by default). Mitigations: BisectBatchOnFunctionError (split the batch and retry halves), MaximumRetryAttempts (cap retries), OnFailure destination (SQS/SNS DLQ for unprocessable records). Without these, one poison record halts the entire shard.
- Enhanced Fan-out: per-consumer dedicated 2 MB/sec read throughput, push-based via HTTP/2. Use when multiple consumers compete for the same shard's read capacity. Costs more, so it is only justified for 3+ consumers.
A telemetry pipeline ingests sensor readings at 50 MB/sec, must preserve per-sensor ordering, and runs a real-time anomaly detector. Design: Kinesis Data Stream with PartitionKey = sensorId, sized at 50 shards in provisioned mode (50 MB/sec ÷ 1 MB/sec per shard) or run in on-demand mode if the load is variable. Anomaly Lambda as consumer with parallelizationFactor = 1 to start; Lambda keeps per-partition-key order at higher factors, so it can be raised if the detector falls behind. Add BisectBatchOnFunctionError + OnFailure destination SQS DLQ so a poison record doesn't block the shard. Retention bumped to 7 days for replay during incidents.
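The shard sizing and key-to-shard mapping behind this design can be sketched as below. `shards_needed` applies the per-shard write limits from the list above. `shard_for_key` relies on Kinesis hashing partition keys with MD5 into a 128-bit space, and it assumes the shards split that space evenly, which holds for a freshly created stream but not necessarily after resharding.

```python
import hashlib
import math


def shards_needed(write_mb_per_sec, records_per_sec):
    """Each shard takes 1 MB/sec or 1,000 records/sec of writes, whichever binds first."""
    return max(1, math.ceil(write_mb_per_sec / 1.0), math.ceil(records_per_sec / 1000))


def shard_for_key(partition_key, shard_count):
    """Map a key to a shard index, assuming the 128-bit hash space is split evenly."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * shard_count // 2**128


telemetry_shards = shards_needed(write_mb_per_sec=50, records_per_sec=20_000)
same_shard = shard_for_key("sensor-42", 50) == shard_for_key("sensor-42", 50)
```

The same sensorId always hashes to the same shard, which is what preserves per-sensor ordering.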
Security — Cognito, IAM, KMS & Secrets Manager · 3 lessons
Security is 26% of DVA-C02. This module covers Cognito User Pools vs Identity Pools and when to use each, ID token vs Access token vs Refresh token, fine-grained DynamoDB/S3 access with identity pool policy variables, IAM roles for Lambda and cross-account access, resource-based policies, KMS envelope encryption with GenerateDataKey, Secrets Manager automatic rotation, SSM Parameter Store SecureString, and S3 presigned URLs.
Cognito has two products that share a brand but solve different problems. User Pools are the directory; Identity Pools are the AWS-credential vending machine. Most apps use both together — the exam loves the "which one for which job" distinction.
- User Pool: a managed user directory — sign-up, sign-in, password reset, MFA, social logins, custom attributes, hosted UI. Issues JWTs (ID + Access + Refresh tokens) to authenticated users. Use it whenever your app needs an "account" concept.
- Identity Pool (Federated Identities): exchanges a token (from Cognito User Pool OR third-party IdP OR Google/Facebook/Apple) for TEMPORARY AWS credentials via STS AssumeRoleWithWebIdentity. Use to grant browser/mobile apps direct AWS access (S3 upload, DynamoDB read).
- JWT trio: the ID token contains user identity claims (use for "who is this user"). The access token authorises access to resources (use to call APIs). The refresh token is long-lived; swap it for new ID and access tokens. Default expiry: ID/Access 1h, Refresh 30d.
- Policy variables in Identity Pool roles: ${cognito-identity.amazonaws.com:sub} resolves to the user's unique identity ID. Use it in IAM policies ("Resource": "arn:aws:s3:::mybucket/${cognito-identity.amazonaws.com:sub}/*") to give each user their own folder without writing per-user policies.
- Cognito Triggers (User Pool Lambda): hook into pre-signup, post-confirmation, custom message, pre-auth, post-auth, pre-token-generation. Use pre-token-generation to inject custom claims (org_id, tier) into the JWT.
- SAML / OIDC federation: User Pools can federate to a corporate IdP (Okta, Azure AD via SAML or OIDC). Users authenticate at the IdP; Cognito issues its own JWT with the federation claims merged in. Common in B2B SaaS.
A mobile app needs (1) user accounts with email + Google sign-in, (2) per-user private S3 photo storage. Design: Cognito User Pool for sign-up/sign-in (Google federated). After auth, the app exchanges the User Pool JWT at the Identity Pool for temporary AWS credentials. The Identity Pool authenticated role grants s3:GetObject/s3:PutObject on mybucket/${cognito-identity.amazonaws.com:sub}/* — each user can read/write only their own folder, no Lambda required between client and S3.
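As a rough illustration of the per-user folder policy in this scenario, the sketch below builds the policy document as a Python dict. The helper names are hypothetical, and `resolve_for_identity` only mimics, for illustration, the substitution IAM performs at request time; the example identity ID is made up.

```python
import json

# Policy variable that IAM replaces with the caller's Cognito identity ID.
IDENTITY_VAR = "${cognito-identity.amazonaws.com:sub}"


def per_user_s3_policy(bucket: str) -> dict:
    """Identity-based policy for the Identity Pool authenticated role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{IDENTITY_VAR}/*",
            }
        ],
    }


def resolve_for_identity(policy: dict, identity_id: str) -> dict:
    """Illustration only: show what the policy means for one user."""
    return json.loads(json.dumps(policy).replace(IDENTITY_VAR, identity_id))


if __name__ == "__main__":
    policy = per_user_s3_policy("mybucket")
    print(json.dumps(policy, indent=2))
    # Hypothetical identity ID, in the usual "region:uuid" shape.
    print(resolve_for_identity(policy, "us-east-1:1111-2222")["Statement"][0]["Resource"])
```

One policy document covers every user because the variable is resolved per request, which is why no per-user policies are needed.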
IAM on the DVA-C02 exam is heavy on practical role design rather than policy syntax memorisation. Understanding execution roles, trust relationships, and cross-account access patterns is what separates a passing score from a fail.
- Execution role: the IAM role a Lambda (or ECS task, EC2, etc.) assumes when it runs. Grants the function permissions to call AWS APIs. The role's trust policy must allow lambda.amazonaws.com as principal.
- Identity-based vs resource-based policies: identity-based policies attach to IAM users/roles ("this role can do X"). Resource-based policies attach to resources ("anyone listed here can do X to me"), such as S3 bucket policies, Lambda resource policies, and KMS key policies. Resource-based policies enable cross-account access without assuming a role.
- STS AssumeRole: the cross-account mechanism. Account A's role has a trust policy allowing Account B's principal to assume it. Account B calls sts:AssumeRole and gets temporary credentials that act as Account A's role. Sessions last 1 hour by default and can be extended up to the role's maximum session duration (at most 12 hours).
- Lambda invocation permission (resource policy): for cross-account Lambda invokes, add a resource-based policy on the Lambda using lambda:AddPermission. That's how API Gateway in account B is allowed to invoke a Lambda in account A.
- Policy evaluation logic: an explicit deny anywhere wins. Otherwise an explicit allow is needed somewhere. Within one account, a resource-based policy CAN grant access without an identity-based policy; cross-account access requires an allow on both ends.
- Permissions boundary: the maximum permissions a role/user can have. Used to delegate role-creation safely — devs can create roles, but the permissions boundary caps what those roles can do.
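The "explicit deny wins, otherwise an allow is required" rule can be captured in a toy evaluator. This is a deliberately small sketch: it ignores principals, conditions, permissions boundaries, and organisation-level policies, and the `evaluate` function and its return strings are made up for illustration.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List, Union

Pattern = Union[str, List[str]]


def _matches(patterns: Pattern, value: str) -> bool:
    """True if value matches any pattern ('*' wildcards supported)."""
    if isinstance(patterns, str):
        patterns = [patterns]
    return any(fnmatchcase(value, pattern) for pattern in patterns)


def evaluate(statements: Iterable[dict], action: str, resource: str) -> str:
    """Return EXPLICIT_DENY, ALLOW, or IMPLICIT_DENY for one request."""
    allowed = False
    for statement in statements:
        applies = _matches(statement["Action"], action) and _matches(
            statement["Resource"], resource
        )
        if not applies:
            continue
        if statement["Effect"] == "Deny":
            return "EXPLICIT_DENY"  # a deny anywhere ends the evaluation
        allowed = True
    return "ALLOW" if allowed else "IMPLICIT_DENY"


if __name__ == "__main__":
    table = "arn:aws:dynamodb:us-east-1:111122223333:table/orders"
    statements = [
        {"Effect": "Allow", "Action": "dynamodb:*", "Resource": "*"},
        {"Effect": "Deny", "Action": "dynamodb:DeleteItem", "Resource": table},
    ]
    print(evaluate(statements, "dynamodb:GetItem", table))
    print(evaluate(statements, "dynamodb:DeleteItem", table))
    print(evaluate(statements, "s3:GetObject", "arn:aws:s3:::bucket/key"))
```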
A Lambda in account A needs to read DynamoDB in account B. Design: in account B, create an IAM role with dynamodb:GetItem on the target table; its trust policy allows the Lambda execution role from account A. The Lambda code calls STS AssumeRole at startup (cached for ~50min), uses the temporary credentials in a DynamoDB client. Alternative for one-off cross-account writes: a resource-based policy on the DynamoDB table itself — simpler but doesn't carry over to Streams or backups.
Encryption keys and secrets management round out the security domain. The exam distinguishes Secrets Manager from Parameter Store routinely, and asks about KMS envelope encryption mechanics whenever a workflow involves "encrypt this 500 MB file".
- KMS envelope encryption: the KMS Encrypt API accepts at most 4 KB of data, so KMS can't encrypt large objects directly. Instead, call GenerateDataKey, which returns a plaintext data key AND an encrypted version of that data key. Encrypt your large blob locally with the plaintext data key, discard the plaintext key, and store the encrypted blob + the encrypted data key together. To decrypt: Decrypt the data key with KMS, then use the plaintext data key to decrypt the blob.
- Customer-Managed Keys (CMK) vs AWS-managed keys vs AWS-owned keys: CMK is the only one where you control the key policy, rotation, deletion. CMK comes in symmetric (default), asymmetric (RSA/ECC), and HMAC variants.
- Secrets Manager: rotates secrets automatically via a Lambda rotation function. Built-in rotation for RDS, DocumentDB, Redshift, Redshift Serverless. Higher cost than Parameter Store. Use whenever automatic rotation matters.
- SSM Parameter Store SecureString: stores secrets encrypted with KMS. Free (Standard tier, up to 4 KB per parameter), no built-in rotation. Use for config that rotates manually or rarely (DB endpoints, feature flags, ENV-prefixed values).
- S3 presigned URLs: a URL carrying an embedded signature and a short expiry that allows the holder to GET or PUT a specific S3 object without AWS credentials of their own. Common pattern: client calls Lambda → Lambda generates presigned URL → client uploads directly to S3, bypassing Lambda payload limits (6 MB sync, 256 KB async).
- S3 server-side encryption options: SSE-S3 (AWS-managed key, AES256), SSE-KMS (KMS CMK, auditable, more expensive because S3 calls KMS on requests unless S3 Bucket Keys are enabled), SSE-C (you provide the key on every request, which is rare).
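The envelope-encryption flow from the list above can be traced end to end with a local stand-in. Everything here is an assumption made for illustration: `FakeKms` is not the AWS SDK, and `_toy_cipher` is an insecure XOR keystream standing in for a real cipher such as AES-GCM. Only the order of operations is the point.

```python
import hashlib
import os


def _toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR with a SHA-256 counter keystream. NOT secure; demo only."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        block = hashlib.sha256(key + counter).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, block))
    return bytes(out)


class FakeKms:
    """Hypothetical stand-in: the master key never leaves this object."""

    def __init__(self) -> None:
        self._master_key = os.urandom(32)

    def generate_data_key(self) -> tuple:
        plaintext_key = os.urandom(32)
        encrypted_key = _toy_cipher(self._master_key, plaintext_key)
        return plaintext_key, encrypted_key

    def decrypt(self, encrypted_key: bytes) -> bytes:
        return _toy_cipher(self._master_key, encrypted_key)


def envelope_encrypt(kms: FakeKms, blob: bytes) -> dict:
    plaintext_key, encrypted_key = kms.generate_data_key()
    ciphertext = _toy_cipher(plaintext_key, blob)  # encrypt locally
    del plaintext_key                              # discard the plaintext key
    return {"ciphertext": ciphertext, "encrypted_data_key": encrypted_key}


def envelope_decrypt(kms: FakeKms, envelope: dict) -> bytes:
    plaintext_key = kms.decrypt(envelope["encrypted_data_key"])
    return _toy_cipher(plaintext_key, envelope["ciphertext"])


if __name__ == "__main__":
    kms = FakeKms()
    report = b"quarterly report " * 1000
    envelope = envelope_encrypt(kms, report)
    print(envelope_decrypt(kms, envelope) == report)
```

Only the 32-byte data key ever goes through the key service; the large blob is encrypted locally, which is how the pattern sidesteps the 4 KB limit.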
A user uploads a 500 MB report through a web app. The Lambda backend can't accept 500 MB payloads. Design: Lambda exposes an API that, on request, calls S3.getSignedUrl('putObject', ...) with a 15-minute expiry and returns the URL to the client. The browser PUTs the file directly to S3. S3 is configured with SSE-KMS using a CMK. After upload, an S3 ObjectCreated event triggers a downstream Lambda. To read the report, that Lambda needs s3:GetObject on the object and kms:Decrypt on the CMK; with SSE-KMS, S3 handles the data key and decrypts the object itself, so GetObject returns plaintext and the function never touches a data key. Manual data-key handling applies only to client-side envelope encryption.
CI/CD — CodeCommit, CodeBuild, CodeDeploy & CodePipeline · 3 lessons
The deployment domain (24%) focuses on AWS-native CI/CD. Learn CodePipeline stage architecture (Source → Build → Deploy), CodeBuild buildspec.yml phases, compute types, and artifact caching, CodeDeploy deployment strategies for EC2/ECS/Lambda (AllAtOnce, Rolling, Blue/Green, Canary, Linear), AppSpec hooks (BeforeInstall, AfterInstall, ValidateService), CloudWatch alarm rollback, and CodeArtifact for private package management.
CodePipeline orchestrates the source → build → deploy flow. CodeBuild executes the actual build. Together they're AWS's native CI alternative to GitHub Actions / GitLab CI / Jenkins. Exam questions focus on stages, actions, and where artifacts flow.
- Pipeline anatomy: stages run sequentially; actions within a stage can run in parallel or sequentially. Each action has an input artifact and an output artifact (S3-backed). Common stages: Source → Build → Test → Deploy. Stages can include manual approval actions.
- Sources: CodeCommit (deprecated for new accounts), GitHub, Bitbucket, GitLab, S3, ECR (for container builds). Push triggers webhook events on connected repos; CodeStar Connections handles the OAuth.
- CodeBuild buildspec.yml: phases are install (runtime + tools), pre_build (login, fetch deps), build (compile/test), post_build (push artifacts). The artifacts section names what to upload. cache.paths caches dependency dirs across builds, a major time saving for npm/maven.
- CodeBuild compute types: Small (3 GB / 2 vCPU) → Large (15 GB / 8 vCPU) → 2XLarge (145 GB / 72 vCPU). Pay per minute, billed in 1-min increments. Use Arm Graviton compute for ~20% cost savings when your build supports it.
- Environment variables and Parameter Store / Secrets Manager: a CodeBuild buildspec can reference SSM Parameter Store or Secrets Manager secrets directly, so there is no need to bake credentials into the project. Use the parameter-store or secrets-manager keys in the env section.
- CodeArtifact: managed package repo for npm, Maven, PyPI, NuGet. Supports upstream proxies to public registries with caching. Use to enforce internal package use, mitigate supply-chain attacks, and lock versions across teams.
A Node.js Lambda needs a pipeline: GitHub → build (with cached node_modules) → deploy via CloudFormation. Design: CodePipeline with three stages. Source: GitHub via CodeStar connection. Build: CodeBuild project with buildspec phases (install nvm-aliased Node 20, pre_build runs npm ci, build runs npm test + sam build, artifacts publish the .aws-sam directory). Cache S3 location for node_modules. Deploy: CloudFormation action with the SAM-built template. Manual approval before the prod deploy stage. Total pipeline runtime: ~3 minutes after cache warm-up.
CodeDeploy automates the deploy phase. The exam tests deployment strategies (how traffic shifts) and AppSpec hooks (what runs at each phase). Get the strategy wrong and you either ship too fast or take down production.
- Lambda strategies: AllAtOnce (instant cut-over, riskiest), Canary (X% for N minutes, then 100%), Linear (X% every N minutes, e.g., 10% every 10 min). CodeDeploy shifts traffic between two versions of the function by adjusting the routing weights on a single alias.
- ECS strategies: Rolling (replace tasks in batches — supports minimumHealthyPercent / maximumPercent), Blue/Green (provision new task set, swap target group at the ALB). Blue/Green supports the same Canary/Linear shift percentages as Lambda.
- EC2/on-prem strategies: In-place (replace one or batches at a time; needs the CodeDeploy agent on the instance), Blue/Green (provision new ASG, swap behind ALB). In-place is the riskier choice for prod: the fleet runs mixed versions mid-deploy, and rolling back means redeploying the previous revision onto the same instances.
- AppSpec hooks (Lambda): BeforeAllowTraffic (run smoke tests before any traffic shifts), AfterAllowTraffic (validation after the shift completes). Each hook is itself a Lambda. Return a status to CodeDeploy via the SDK.
- AppSpec hooks (EC2): ApplicationStop, BeforeInstall, AfterInstall, ApplicationStart, ValidateService. Scripts in the appspec.yml run on the target instance via the CodeDeploy agent.
- Automatic rollback: CodeDeploy can roll back when a CloudWatch alarm fires during deployment. Configure the deployment group with the alarm; if it transitions to ALARM during the shift window, CodeDeploy reverses the alias weight back to the old version. Set this up: manual rollback during an incident is too slow.
Production Lambda deployment with safety: the deployment group uses Canary10Percent10Minutes. A BeforeAllowTraffic hook runs an integration test against the new version before any production traffic is shifted. An AfterAllowTraffic hook runs a smoke test against the production alias after the 100% shift. CloudWatch alarm: error rate > 1% on the new version over 5 minutes. If the alarm fires during the 10-min canary window, CodeDeploy auto-rolls back to v(N-1). Combined with the SAM DeploymentPreference in the template, this gives full GitOps deploys with no manual intervention.
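To make the canary and linear shapes concrete, the sketch below computes when each share of traffic reaches the new version. The function names are hypothetical and the timings follow the plain reading of the strategy names (first shift at minute 0); treat it as a mental model rather than CodeDeploy's exact scheduler.

```python
from typing import List, Tuple

Schedule = List[Tuple[int, int]]  # (minute, percent of traffic on new version)


def canary_schedule(first_percent: int, wait_minutes: int) -> Schedule:
    """Two steps: a small slice first, then everything."""
    return [(0, first_percent), (wait_minutes, 100)]


def linear_schedule(step_percent: int, interval_minutes: int) -> Schedule:
    """Equal increments at a fixed interval until 100% is reached."""
    schedule: Schedule = []
    percent, minute = 0, 0
    while percent < 100:
        percent = min(percent + step_percent, 100)
        schedule.append((minute, percent))
        minute += interval_minutes
    return schedule


def percent_at(schedule: Schedule, minute: int) -> int:
    """Traffic share on the new version at a given minute."""
    current = 0
    for start, percent in schedule:
        if minute >= start:
            current = percent
    return current


if __name__ == "__main__":
    canary = canary_schedule(10, 10)   # like Canary10Percent10Minutes
    linear = linear_schedule(10, 10)   # like Linear10PercentEvery10Minutes
    print(canary)
    print(linear)
    print(percent_at(linear, 35))
```

A rollback alarm is most valuable during the early steps, when only a small share of users is exposed to the new version.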
Blue/Green deployments for ECS and EC2 use load-balancer-level target-group swaps. The mental model is two parallel environments (blue = current, green = new) where CodeDeploy decides when to flip. Exam scenarios test whether you understand which traffic goes where during each phase.
- Two target groups: the ALB has two target groups (TG-blue and TG-green) registered on different listener rules. CodeDeploy controls which TG receives production traffic (the "production listener") and which receives test traffic (the "test listener", optional).
- Deployment flow (ECS): CodeDeploy creates the green task set, registers tasks into TG-green, runs your AppSpec test hook against the test listener (if configured), then shifts production listener from blue to green. After a configurable wait window, terminates the blue task set.
- Original revision termination: by default 5 minutes after successful shift. Tunable up to many hours — increase if you want a manual gate before tearing down the old environment.
- Test listener: optional separate port (typically 8080) on the ALB pointing at TG-green. Lets validation hooks call the new version with real traffic-shape requests before production users see it.
- EC2 Blue/Green: CodeDeploy provisions a new ASG with the new version, registers its instances into TG-green, and swaps. The old ASG can be retained (manual decommission) or destroyed automatically. Faster rollback than in-place — original ASG is still running.
- Cost vs safety trade-off: Blue/Green doubles infrastructure for the duration of the deploy window. For small services that cost is trivial. For massive ECS clusters, factor it into the deploy plan.
An ECS service runs 20 tasks behind an ALB. Deployment goal: shift to a new image with a 5-minute validation window, automatic rollback on 5XX rate > 0.5%. Setup: CodeDeploy deployment group references the ECS service. The ALB has TG-blue (current 20 tasks) and TG-green (empty). Listener rules: prod on :443 → TG-blue; test on :8080 → TG-green. On deploy, CodeDeploy starts 20 new tasks into TG-green, runs BeforeAllowTraffic hook against :8080, then shifts :443 from TG-blue to TG-green. CloudWatch alarm watches 5XX. After 5 minutes without alarm, old task set terminates; with alarm, traffic reverts and green tears down.
Infrastructure as Code — CloudFormation, SAM & CDK · 3 lessons
Every AWS developer needs IaC skills. This module covers CloudFormation template anatomy (Parameters, Mappings, Conditions, Resources, Outputs), Change Sets for safe updates, cross-stack references with Exports/ImportValue, custom resources, DeletionPolicy (Retain, Snapshot, Delete), and stack failure behavior. Then SAM: Transform declaration, AWS::Serverless::Function, sam local invoke/start-api for local testing. Finally CDK: L1/L2/L3 constructs, cdk synth/deploy, and why CDK synthesizes to CloudFormation.
CloudFormation is the substrate underneath every AWS IaC tool. Templates declare desired state; CloudFormation reconciles. The exam tests template sections (Parameters, Mappings, Conditions, Resources, Outputs), update mechanics (Change Sets), and resource lifecycle attributes (DeletionPolicy, UpdateReplacePolicy).
- Template sections: Parameters (input values), Mappings (lookup tables, e.g. region → AMI), Conditions (boolean expressions used to gate resources), Resources (the actual AWS resources to create), Outputs (values to expose, optionally exported for cross-stack use).
- Intrinsic functions: !Ref (resource ID or parameter value), !GetAtt Resource.Attribute (specific attribute), !Sub "${var}" (string substitution), !FindInMap, !If [Condition, ValueIfTrue, ValueIfFalse], !Join.
- Change Sets: a preview of what an update will do (added, modified, removed resources, and whether modifications require replacement). Create a Change Set; review; Execute (or Discard). Best practice: always create a Change Set for production stack updates instead of calling UpdateStack directly.
- Cross-stack references: a stack can Export outputs (names must be unique per region); other stacks ImportValue them. The downside: once an export is consumed, the producing stack can't update or delete that output. For loose coupling, use Systems Manager Parameter Store references instead.
- Lifecycle attributes: DeletionPolicy: Retain keeps a resource on stack delete (databases, S3 buckets with data); Snapshot creates a final snapshot first (RDS, EBS); Delete is the default. UpdateReplacePolicy applies when a property change forces resource replacement.
- Stack failure behaviour: the default is rollback on failure, which undoes any resources created in the failed operation. DisableRollback keeps the partial state for debugging. A stack whose initial create failed and rolled back must be deleted (DeleteStack) before retrying; failed-update stacks can be retried after fixing the template.
A stack defines an RDS instance and a Lambda that connects to it. Risk: updating the stack with a change to a property that requires replacement (for example DBInstanceIdentifier) creates a new instance and removes the old one, losing data. Mitigations: set DeletionPolicy: Snapshot AND UpdateReplacePolicy: Snapshot on the RDS resource. Before deploying any change touching the DB, create a Change Set to preview replacement actions. For the Lambda's DB endpoint reference, prefer Parameter Store (the Lambda reads /myapp/db-endpoint at runtime) over CloudFormation ImportValue, which decouples future RDS stack rebuilds from the Lambda stack.
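Reviewing a Change Set for destructive actions can be automated. The sketch below scans a dict shaped like a DescribeChangeSet response (a `Changes` list whose `ResourceChange` entries carry `Action` and `Replacement`) and flags removals and replacements. The function name and the sample data are hypothetical; fetching the change set from AWS is left out.

```python
from typing import List, Tuple


def resources_at_risk(change_set: dict) -> List[Tuple[str, str, str, str]]:
    """List resources a change set would remove or (possibly) replace."""
    risky = []
    for change in change_set.get("Changes", []):
        resource_change = change.get("ResourceChange", {})
        action = resource_change.get("Action", "")
        replacement = resource_change.get("Replacement", "False")
        # "Conditional" means replacement depends on values known only
        # at execution time, so it is treated as risky here.
        if action == "Remove" or replacement in ("True", "Conditional"):
            risky.append(
                (
                    resource_change.get("LogicalResourceId", ""),
                    resource_change.get("ResourceType", ""),
                    action,
                    replacement,
                )
            )
    return risky


if __name__ == "__main__":
    # Hypothetical sample in the shape of a DescribeChangeSet response.
    sample = {
        "Changes": [
            {"Type": "Resource", "ResourceChange": {
                "Action": "Modify", "LogicalResourceId": "AppDatabase",
                "ResourceType": "AWS::RDS::DBInstance", "Replacement": "True"}},
            {"Type": "Resource", "ResourceChange": {
                "Action": "Modify", "LogicalResourceId": "ApiFunction",
                "ResourceType": "AWS::Lambda::Function", "Replacement": "False"}},
        ]
    }
    for item in resources_at_risk(sample):
        print(item)
```

A CI step could fail the pipeline whenever this list contains a stateful resource type such as a database.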
Export/ImportValue creates tight coupling — use Parameter Store for loosely coupled references.
SAM (Serverless Application Model) is CloudFormation with shorthand for Lambda, API Gateway, and DynamoDB. Templates start with Transform: AWS::Serverless-2016-10-31; SAM CLI handles local testing and packaging. The exam asks about the SAM-specific resources and the local-dev workflow.
- Transform directive: Transform: AWS::Serverless-2016-10-31 at the top of the template tells CloudFormation to expand SAM shorthand into native resources during deployment. An AWS::Serverless::Function of a few lines expands to a full Lambda function plus its execution role and event wiring.
- AWS::Serverless::Function: the SAM Lambda resource. Properties: CodeUri, Handler, Runtime, Events (Api, S3, SQS, SNS, EventBridge, Schedule). Permissions go in the Policies property, which accepts AWS-managed policy names, ARNs, or inline policy docs.
- AWS::Serverless::Api: SAM API Gateway resource. Supports auth, CORS, stage variables. Often you don't need to declare it explicitly: declaring an Api event on a Function auto-creates a default API.
- SAM CLI commands: sam build (bundles dependencies into .aws-sam/build), sam local invoke (run Lambda locally in Docker), sam local start-api (run API Gateway + Lambdas locally), sam deploy (package + deploy via CloudFormation).
- DeploymentPreference: SAM resource attribute that wires CodeDeploy. Type: Canary10Percent10Minutes, Type: Linear10PercentEvery10Minutes, Type: AllAtOnce. Add an Alarms list to auto-rollback if any alarm fires during the deploy.
- SAM packaging: sam package uploads artifacts to S3 and outputs a deploy-ready template with S3 URIs. Used by CodePipeline-driven flows where a build step runs sam package and the deploy step uses the output template directly.
A team prototypes a new API locally. sam local start-api runs an API Gateway emulator plus the Lambdas on localhost:3000, executing the same code that will run in AWS. They iterate, then commit. CI runs sam build followed by a non-interactive sam deploy for staging (sam deploy --guided prompts for input, so it belongs in the first manual deploy, which saves the answers to samconfig.toml). The template's Function has DeploymentPreference: Type: Canary10Percent10Minutes with a 5XX-rate alarm, so every prod deploy is auto-canary'd. This cuts iteration time compared with deploying to a test environment for every change, and adds deployment safety on top.
sam local is the productivity lever for iteration. DeploymentPreference wires CodeDeploy without writing CodeDeploy resources manually.
CDK lets you define infrastructure in TypeScript, Python, Java, C#, or Go. The CDK compiler (cdk synth) produces a CloudFormation template that gets deployed. You get types, autocomplete, loops, and unit tests — but the substrate is still CloudFormation. The exam tests construct levels and the synth/deploy flow.
- Construct levels: L1 = raw CloudFormation resources (Cfn* classes), 1:1 mapping. L2 = curated AWS abstractions (e.g., Bucket, Function) with sensible defaults and fewer required props. L3 = patterns (e.g., ApplicationLoadBalancedFargateService), which are opinionated combos of L2 constructs solving a common problem.
- App, Stack, Construct hierarchy: a CDK App contains one or more Stacks; a Stack contains Constructs. Each Stack synthesises to one CloudFormation stack. Use one Stack per deployable unit / environment.
- cdk commands: cdk synth outputs CloudFormation. cdk diff compares your code's synth output to the deployed stack. cdk deploy deploys; cdk destroy tears down. cdk bootstrap creates the CDKToolkit stack (the S3 bucket and IAM roles CDK uses), once per account/region.
- Context and environment: the env property on a Stack ties it to a specific account and region. Without it, the Stack is environment-agnostic but can't use environment-specific Lookups. CDK caches account-specific lookup results (AMIs, VPCs) in cdk.context.json for reproducibility.
- Assets: CDK auto-uploads Lambda code, Docker images, and arbitrary files referenced in your code to the bootstrap S3 bucket / ECR. No manual sam package step.
- Testing: CDK supports unit testing of synthesised templates via the aws-cdk-lib/assertions module. You can assert "the stack contains an S3 bucket with versioning enabled" without deploying.
A team writes a CDK stack in TypeScript that creates a Lambda (from ./src), an API Gateway HTTP API integration, and a DynamoDB table. The stack uses L2 constructs: new lambda.Function(...), new apigatewayv2.HttpApi(...), new dynamodb.Table(...). A unit test asserts the table has BillingMode: PAY_PER_REQUEST. CI runs cdk synth to produce CloudFormation, then cdk deploy to push. cdk diff in the PR review shows exactly what will change before merging.
cdk diff in CI for safety. Same deployment substrate (CloudFormation) — same rollback / change-set guarantees.
Observability & Optimization — X-Ray, CloudWatch & Step Functions · 3 lessons
The troubleshooting domain (18%) demands deep observability skills. Master AWS X-Ray: active tracing, segments/subsegments, annotations (indexed, filterable) vs metadata (not indexed), sampling rules, and capturing AWS SDK calls with captureAWSv3Client. CloudWatch: Lambda standard metrics (Duration, Errors, Throttles, ConcurrentExecutions), custom metrics via PutMetricData, Log Insights query syntax, and alarms for automated rollbacks. Step Functions: state machine design patterns (sequential, parallel, wait, error catch/retry), Express vs Standard workflows, and when to use Step Functions vs SQS for orchestration.
X-Ray traces a request across all the AWS services it touches — API Gateway → Lambda → DynamoDB → SNS — so you can find the slow segment without guessing. The exam tests annotations vs metadata, sampling rules, and which services support native integration.
- Segments and subsegments: a segment is one service's contribution to the trace (one Lambda invocation, one API Gateway request). Subsegments are nested operations (an SDK call, a DB query). Segments aggregate into a trace identified by a trace ID propagated end-to-end.
- Active tracing: enable on Lambda and API Gateway via a single config flag (or SAM property). The runtime auto-captures segments + AWS SDK subsegments. No code change.
- Annotations vs metadata: annotations are key/value pairs indexed by X-Ray and filterable in the console (annotation.user = "alice"). X-Ray indexes up to 50 annotations per trace. Metadata is unindexed: visible in the trace UI but not searchable. Use annotations for facets (userId, region, route); metadata for debug context (request body, headers).
- captureAWSv3Client / captureHTTPsGlobal: SDK helpers that wrap AWS clients and Node's https module so all outgoing calls automatically generate subsegments. Without them, your downstream calls appear as black boxes.
- Sampling rules: default rule is 1 req/sec + 5% of additional reqs. Custom rules (priority-ordered) match by service name, http method, URL — set higher sample rates for low-traffic endpoints, low rates for chatty ones. Sampling decisions made at the entry point and propagated.
- ServiceLens and Insights: CloudWatch ServiceLens combines X-Ray traces with metrics and alarms in a service map view. X-Ray Insights surfaces anomaly-based notifications when error or latency suddenly deviates from baseline.
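The default sampling rule (one request per second plus 5% of the rest) can be modelled with a small reservoir. This is a local simplification for intuition, not the X-Ray SDK's sampler; the `Sampler` class and its parameters are hypothetical.

```python
import random
from typing import Optional


class Sampler:
    """Reservoir of N traces per second, then a fixed rate for the rest."""

    def __init__(
        self,
        reservoir_per_second: int = 1,
        fixed_rate: float = 0.05,
        seed: Optional[int] = None,
    ) -> None:
        self.reservoir = reservoir_per_second
        self.rate = fixed_rate
        self._rng = random.Random(seed)
        self._current_second: Optional[int] = None
        self._used = 0

    def should_sample(self, now_seconds: float) -> bool:
        second = int(now_seconds)
        if second != self._current_second:
            # New wall-clock second: refill the reservoir.
            self._current_second, self._used = second, 0
        if self._used < self.reservoir:
            self._used += 1
            return True
        return self._rng.random() < self.rate


if __name__ == "__main__":
    sampler = Sampler(seed=7)
    # 1000 requests spread evenly over 10 seconds.
    sampled = sum(sampler.should_sample(i / 100) for i in range(1000))
    print("sampled", sampled, "of 1000 requests")
```

The reservoir guarantees that quiet endpoints still produce traces, while the fixed rate keeps busy endpoints from flooding the trace store.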
An API has intermittent p99 spikes. Enable X-Ray active tracing on API Gateway + Lambda. In the Lambda code, const AWSXRay = require('aws-xray-sdk-core'); wrap the AWS SDK clients; open a custom subsegment and call subsegment.addAnnotation('userId', event.requestContext.authorizer.userId) (in Lambda the function's own segment is managed by the service, so annotations go on subsegments). After a day of traffic, query the trace UI: filter by responsetime > 1 AND annotation.userId = "X". The service map shows a DynamoDB Query taking 800ms of the 1.2s response. Now you have something to optimise.
CloudWatch is the operational telemetry plane — metrics for time series, Logs for searchable text, alarms for actions. The exam asks both about default Lambda metrics and about Log Insights query syntax.
- Lambda standard metrics: Invocations, Errors, Duration (avg / max / p50 / p99), Throttles (concurrency-limit denials), ConcurrentExecutions, UnreservedConcurrentExecutions, ProvisionedConcurrencyUtilization. Free metrics, 1-minute granularity.
- Custom metrics: PutMetricData writes custom metric values. Cost ~$0.30 per metric per month. Embed dimensions (function name, region) but keep dimension cardinality bounded: high cardinality = many metrics = expensive.
- Embedded Metric Format (EMF): write logs in a special JSON schema; CloudWatch auto-parses them into metrics. No PutMetricData API call is needed, which is cheaper at scale because you pay for log ingestion and the resulting metrics rather than per API request.
- Alarms: evaluate a metric (or expression) over N periods. Static threshold, anomaly detection, missing data treatment (Missing / NotBreaching / Breaching / Ignore). Multi-metric alarms use metric math expressions (m1 / m2 > 0.05).
- Logs Insights query syntax: pipe-style. fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 50. Aggregations: stats count() by bin(5m), stats avg(@duration) by @log. Queries can be saved and added to dashboards.
- Lambda Insights: opt-in deep telemetry (memory utilisation, network throughput, runtime details) via a managed layer. Adds ~$0.20/M invocations. Use when standard metrics aren't enough to diagnose performance.
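An Embedded Metric Format record is just a JSON log line with an `_aws` metadata block. The sketch below builds one with the standard library; the helper name, namespace, and metric values are made up, and in a Lambda function printing the line to stdout is typically enough for it to reach CloudWatch Logs.

```python
import json
import time
from typing import Dict, Optional


def emf_log_line(
    namespace: str,
    metric_name: str,
    value: float,
    unit: str = "Count",
    dimensions: Optional[Dict[str, str]] = None,
    timestamp_ms: Optional[int] = None,
) -> str:
    """Build one EMF record carrying a single metric value."""
    dimensions = dimensions or {}
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set listing the dimension *names*.
                    "Dimensions": [list(dimensions.keys())],
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        # The metric value and dimension values live at the top level.
        metric_name: value,
    }
    record.update(dimensions)
    return json.dumps(record)


if __name__ == "__main__":
    print(
        emf_log_line(
            "MyApp",                      # hypothetical namespace
            "CheckoutLatency",
            182.5,
            unit="Milliseconds",
            dimensions={"FunctionName": "checkout", "Stage": "prod"},
        )
    )
```

Each distinct combination of dimension values becomes its own metric, so the cardinality warning above applies to EMF as well.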
A Lambda has a rising error rate. Investigation: the Logs Insights query filter @message like /ERROR/ | stats count() by bin(5m) shows the spike's timeline. Drill in: filter @message like /ERROR/ | parse @message "error: *" as err | stats count() by err groups the failures by error string. The top error is "Throttling", meaning DynamoDB is throttling the function's requests. Confirm with the table's ReadThrottleEvents and WriteThrottleEvents metrics (Lambda's own Throttles metric tracks concurrency-limit rejections, not DynamoDB throttling). Fix: switch DynamoDB to on-demand mode or raise provisioned capacity, and set an alarm on an error rate above 5% for future regressions.
PutMetricData at scale. Log Insights' pipe syntax answers "what's failing and how often" in 30 seconds. Alarms turn metrics into actions.
Step Functions orchestrate Lambdas (and other AWS services) into state machines. They handle retries, parallel execution, human-approval waits, and compensating rollbacks across services using the Saga pattern. The exam asks when to use Step Functions vs SQS vs direct Lambda invocation.
- State machines: Amazon States Language (ASL) JSON defines states. State types: Task (invoke a Lambda / service), Choice (branch), Parallel (run branches concurrently), Map (iterate over a collection), Wait (pause for time/timestamp), Pass (no-op transformation), Succeed, Fail.
- Standard vs Express workflows: Standard — durable, up to 1-year runtime, exactly-once execution, $25/M state transitions. Express — high-throughput (100k/sec), 5-minute max, at-least-once, billed by duration + memory + count. Express for high-volume short-lived; Standard for long-running workflows.
- Retry and Catch: per-state Retry handles transient failures with exponential backoff. Catch matches specific error codes and routes to a different state. Cleaner than wrapping every Lambda in retry logic.
- Service integrations: three patterns. Request Response (call service, get response, move on). Run a Job (.sync) (call service, wait for completion before the next state; supports Batch jobs, Glue, ECS tasks). Wait for Callback (.waitForTaskToken) (pause until an external system calls back with a task token; for human approvals, third-party webhooks).
- Saga pattern: a transaction across multiple services modeled as a sequence of forward steps each with a compensating-undo step. If step 3 fails, run undo-step-2 and undo-step-1 in reverse order. Step Functions Parallel + Catch routes are the natural fit.
- When NOT to use Step Functions: simple "Lambda calls Lambda" doesn't need orchestration overhead — just invoke directly. High-volume fire-and-forget event processing — SNS or SQS is cheaper. Step Functions cost adds up at hundreds-of-thousands-per-day workflows.
An order-fulfillment workflow: validate order → reserve inventory → charge card → ship order. If charge fails, release inventory. If ship fails, refund card + release inventory. Design: a Step Functions Standard workflow. Each step is a Lambda task with Retry on transient errors and Catch on business errors routing to a compensating-action branch. Inventory release and card refund themselves are Lambda tasks. Standard workflows execute each state exactly once unless a Retry fires, so the charge step should still be idempotent (for example, by passing an idempotency key) to rule out a double charge.
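The compensation order in this workflow can be sketched without Step Functions. The runner below is a local illustration of the Saga idea: `run_saga` and the step names are hypothetical, and in the real design each action and compensation would be a Task state reached through Catch.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], None], Optional[Callable[[], None]]]


def run_saga(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order; on failure, undo completed steps in reverse."""
    completed: List[Tuple[str, Optional[Callable[[], None]]]] = []
    log: List[str] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                if undo is not None:
                    undo()
                    log.append(f"undone:{done_name}")
            return False, log
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return True, log


if __name__ == "__main__":
    def ok() -> None:
        pass

    def carrier_down() -> None:
        raise RuntimeError("carrier API unavailable")

    ok_saga, events = run_saga([
        ("validate", ok, None),          # nothing to undo
        ("reserve", ok, ok),             # compensation: release inventory
        ("charge", ok, ok),              # compensation: refund card
        ("ship", carrier_down, None),
    ])
    print(ok_saga, events)
```

Compensations run newest-first, so the card is refunded before the inventory is released, mirroring the order described in the scenario.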
Exam quick-reference
High-frequency DVA-C02 concepts
These patterns appear repeatedly on the exam. Knowing them cold will save minutes per question.
API Gateway Lambda proxy integration: the function must return { "statusCode": 200, "headers": {}, "body": "..." }, with body as a string. Any other shape causes a 502 Bad Gateway.
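A handler that always returns this shape is easy to get right with a small helper. The sketch assumes the line above refers to API Gateway's Lambda proxy integration; `proxy_response` is a hypothetical helper, and the key detail is that `body` must be a string, so JSON payloads are serialised first.

```python
import json
from typing import Any, Dict, Optional


def proxy_response(
    status_code: int,
    payload: Any,
    headers: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Build a response in the statusCode / headers / body shape."""
    return {
        "statusCode": status_code,
        "headers": {"Content-Type": "application/json", **(headers or {})},
        "body": json.dumps(payload),  # body must be a string, not a dict
    }


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    # queryStringParameters is None when the request has no query string.
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")
    return proxy_response(200, {"message": f"hello {name}"})


if __name__ == "__main__":
    print(handler({"queryStringParameters": {"name": "dev"}}, None))
```

Returning a raw dict as the body, or omitting statusCode, is a common way to end up with the 502 mentioned above.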
Test your DVA-C02 knowledge now
60 scenario-based questions covering all 4 exam domains. Progress saved locally — no signup required.
Study plan
Pass DVA-C02 in 6 weeks
A structured week-by-week plan for working developers who can dedicate 5–6 hours per week.
- Week 1: Module 1 (Lambda) + Module 2 (API Gateway). Build a simple REST API with Lambda and test locally with SAM.
- Week 2: Module 3 (DynamoDB). Design a table schema from scratch, practice Query vs Scan, implement optimistic locking.
- Week 3: Module 4 (SQS/SNS/Kinesis). Set up an SQS-triggered Lambda with a DLQ, implement an SNS fan-out pattern.
- Week 4: Module 5 (Security). Configure Cognito User Pools, add a Cognito authorizer to API Gateway, rotate a Secrets Manager secret.
- Week 5: Module 6 (CI/CD) + Module 7 (IaC). Build a CodePipeline: CodeCommit → CodeBuild → CodeDeploy to Lambda with Canary strategy.
- Week 6: Module 8 (Observability) + full practice test. Review wrong answers. Listen to CertQuests podcast episodes for topic reinforcement. Sit the exam.
The CertQuests podcast covers Lambda edge cases, DynamoDB design anti-patterns, and CI/CD war stories — all mapped to DVA-C02 exam objectives. Perfect for study sessions when you can't be at a screen.
Related certifications
Continue your AWS journey
DVA-C02 pairs well with these certifications — they share significant topic overlap.