Data engineering got its own AWS cert — here is what changed
In March 2024, AWS launched the Certified Data Engineer – Associate (DEA-C01), positioning data engineering as a first-class discipline in the AWS certification hierarchy alongside Solutions Architecture and SysOps Administration. The new cert replaced the Data Analytics Specialty, which AWS retired in April 2024, as the primary credential for AWS data practitioners, and the change was more than a tier demotion. The Specialty demanded expert-level breadth across domains many data engineers rarely touch in production. The Associate targets exactly the skills data engineers use daily: designing and running ETL pipelines, selecting and configuring the right data stores, orchestrating multi-step workflows, and enforcing governance that lets analysts access data safely at scale.
The timing reflects a structural shift in enterprise technology. Every large organization now operates a data platform — most commonly a data lake on Amazon S3 backed by AWS Glue for ETL, Amazon Redshift for warehousing, Amazon Athena for ad-hoc SQL, and AWS Lake Formation for access control. The data engineers who design and maintain these platforms are in acute shortage relative to enterprise demand. DEA-C01 gives hiring managers a standardized, validated signal: this candidate understands the full AWS data engineering stack at the production level, not just the names of the services.
From a career perspective, DEA-C01 holders in 2026 command $135k–$175k USD at the associate tier, with higher ranges for candidates who combine the credential with hands-on Redshift or Kinesis production experience in data-intensive industries (fintech, healthcare analytics, adtech). Senior data engineers holding DEA-C01 with four or more years of AWS pipeline experience are reporting total compensation of $180k–$220k in major US technology markets — comparable to the AWS SAP-C02 premium and reflecting genuine scarcity of qualified practitioners at the senior end of the data engineering talent pool.
Domain 1 — Data Ingestion and Transformation (34%)
The single highest-weighted domain. Anchor your Domain 1 preparation in these three service families.
AWS Glue — ETL engine, Data Catalog, and data quality
Glue is the primary exam service for Domain 1. Know three job types: Python Shell (lightweight Python scripts, no Spark, low DPU cost, suitable for small transformations); ETL (Apache Spark under the hood, PySpark or Scala, Worker Type G.1X/G.2X autoscaling); and Ray (Python-native distributed processing for ML-adjacent data workflows). Glue DynamicFrames differ from Spark DataFrames in a key way: DynamicFrames handle schema inconsistencies natively — each record can carry a different schema, which is critical for messy JSON or CSV where field sets vary across records. ResolveChoice handles ambiguous column types; apply_mapping and rename_field are transformation operations the exam tests directly.
Glue Crawlers scan S3, JDBC databases, and other sources to populate the Glue Data Catalog with table metadata (schema, partition info, S3 prefix). The Catalog decouples metadata from physical data — Athena, Redshift Spectrum, and EMR all query the same logical dataset without duplicating it. Know how partition metadata stays current: MSCK REPAIR TABLE in Athena, Crawler re-run, or direct Partition API calls when crawler overhead is unacceptable for large partition counts.
- Glue job bookmarks: track which data the job has already processed, enabling incremental loads that avoid reprocessing the entire dataset on every run.
- Glue Data Quality (built on Deequ): define completeness, uniqueness, value-range, and referential integrity rules; block or warn pipeline runs when rules fail.
- Glue DataBrew: no-code, visual data profiling and transformation; correct answer when the question asks for a solution that lets non-engineers clean and transform data without writing PySpark.
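To make the DynamicFrame idea concrete, here is a minimal plain-Python sketch of what "each record can carry a different schema" means and what resolving an ambiguous column type involves. It does not use the awsglue library; the helper names `find_choice_fields` and `resolve_by_cast` and the sample records are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def find_choice_fields(records):
    """Return fields that appear with more than one Python type across records."""
    seen = defaultdict(set)
    for record in records:
        for name, value in record.items():
            seen[name].add(type(value).__name__)
    return {name: sorted(types) for name, types in seen.items() if len(types) > 1}


def resolve_by_cast(records, field, cast):
    """Cast one ambiguous field to a single type, leaving other fields untouched."""
    resolved = []
    for record in records:
        out = dict(record)
        if field in out:
            out[field] = cast(out[field])
        resolved.append(out)
    return resolved


# Hypothetical messy input: "id" is sometimes int, sometimes str,
# and the set of fields varies from record to record.
records = [
    {"id": 1, "price": 9.99},
    {"id": "2", "price": 4.5, "coupon": "SPRING"},
    {"id": 3},
]

choices = find_choice_fields(records)      # only "id" has conflicting types
clean = resolve_by_cast(records, "id", int)
```

In a real Glue job the equivalent step would be a ResolveChoice transform on a DynamicFrame; the sketch only shows the kind of ambiguity that transform exists to settle.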
Amazon Kinesis — Streams vs Firehose vs Analytics
The most frequently tested streaming topic in DEA-C01. Know the three services clearly and the constraint that determines which one is correct:
- Kinesis Data Streams (KDS): shard-based, producer-consumer model. 1 MB/s write, 2 MB/s read per shard. Retention 24 hours to 365 days. Standard consumers share the 2 MB/s per-shard read bandwidth; enhanced fan-out consumers each get 2 MB/s per shard via HTTP/2 push. Use KDS when you need millisecond latency, multiple independent consumer applications reading the same stream, or the ability to replay data from a known sequence number or timestamp.
- Kinesis Data Firehose (KDF, renamed Amazon Data Firehose in 2024): fully managed delivery to S3, Redshift, OpenSearch, Splunk, or HTTP endpoints. Buffers data by time (up to 900 seconds) or size (1–128 MB) and delivers in batches. Supports inline Lambda transforms for format conversion. Use Firehose when the question specifies “minimum operational overhead” or “no consumer code”: it is a managed loading service, not a general streaming platform with custom consumers.
- Kinesis Data Analytics (now Amazon Managed Service for Apache Flink): real-time stream processing with SQL or Java/Python Flink applications. Correct when the scenario requires windowing, aggregation, anomaly detection, or stateful transformations on a live stream beyond what Firehose inline Lambda can deliver.
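The per-shard limits above translate directly into a sizing calculation. The sketch below is a rough estimate under stated assumptions: it uses the 1 MB/s write and 2 MB/s read limits quoted above, plus the 1,000 records/s per-shard write limit (an AWS limit not mentioned in this article), and it assumes every standard consumer reads the full stream. The function name `shards_needed` is hypothetical.

```python
import math

WRITE_MB_PER_SHARD = 1.0        # per-shard write throughput quoted above
WRITE_RECORDS_PER_SHARD = 1000  # per-shard write record limit (not from this article)
READ_MB_PER_SHARD = 2.0         # per-shard read throughput shared by standard consumers


def shards_needed(ingest_mb_s, records_s, standard_consumers):
    """Estimate the shard count from write volume and shared read bandwidth.

    Enhanced fan-out consumers get dedicated read throughput, so they are
    not counted in `standard_consumers`.
    """
    by_write_bytes = ingest_mb_s / WRITE_MB_PER_SHARD
    by_write_records = records_s / WRITE_RECORDS_PER_SHARD
    # Each standard consumer reads the whole stream from the shared 2 MB/s pipe.
    by_read = ingest_mb_s * standard_consumers / READ_MB_PER_SHARD
    return max(1, math.ceil(max(by_write_bytes, by_write_records, by_read)))


shared_read = shards_needed(ingest_mb_s=5, records_s=3000, standard_consumers=3)
fan_out = shards_needed(ingest_mb_s=5, records_s=3000, standard_consumers=0)
```

With three standard consumers the read side is the bottleneck (8 shards); moving those consumers to enhanced fan-out leaves the write side as the limit (5 shards), which is the trade-off exam scenarios tend to probe.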
Amazon MSK, AWS DMS, and columnar data formats
Amazon MSK (Managed Streaming for Apache Kafka): use when you need Kafka APIs, Kafka Connect for CDC pipelines, Schema Registry for Avro/Protobuf message schemas, or need to migrate an existing Kafka workload to AWS without rewriting producer/consumer code. MSK Serverless for unpredictable traffic patterns. The exam distinguishes MSK (Kafka API, ecosystem tools) from Kinesis (AWS-native, no ecosystem dependencies).
AWS DMS (Database Migration Service): streaming CDC replication from Oracle, SQL Server, MySQL, PostgreSQL into S3, Redshift, or Kinesis. DMS Serverless scales automatically for variable migration workloads. Aurora zero-ETL integration with Redshift removes the need for DMS in Aurora→Redshift pipelines.
Data formats and why they matter on the exam:
- Parquet / ORC: columnar, excellent compression, predicate pushdown — reduce Athena scan cost by 60–85% vs JSON. Use for data at rest in the lake that will be queried with column projections or filters.
- Avro: row-oriented, schema-evolution-friendly, standard in Kafka/MSK with Schema Registry for streaming pipelines where schema changes must be handled gracefully without pipeline rewrites.
- JSON / CSV: human-readable, row-oriented; use for raw landing zones. Convert to Parquet after initial ingestion to reduce all downstream query costs.
Partitioning: partition S3 data by date (year/month/day) at minimum. Over-partitioning, which leaves thousands of tiny files far below the roughly 128 MB size that query engines handle efficiently, causes S3 listing overhead and slow Athena queries. Use Glue compaction jobs to merge small files into right-sized files before querying.
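As an illustration of the partitioning and compaction advice, the following sketch builds a Hive-style date prefix and groups small files into batches of roughly 128 MB. The bucket path, the helper names, and the greedy grouping strategy are assumptions made for the example; a real compaction job would read and rewrite the files with Spark rather than just planning the batches.

```python
from datetime import date

MB = 1024 * 1024
TARGET_BYTES = 128 * MB  # target file size taken from the guidance above


def partition_prefix(table_prefix, day):
    """Build a Hive-style year/month/day prefix that crawlers and Athena recognise."""
    return f"{table_prefix}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"


def plan_compaction(file_sizes, target=TARGET_BYTES):
    """Greedily group file sizes into batches that stay at or under `target` bytes."""
    batches, current, current_size = [], [], 0
    for size in file_sizes:
        if current and current_size + size > target:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical lake path and seven 40 MB files in one partition.
prefix = partition_prefix("s3://example-lake/events", date(2024, 1, 15))
batches = plan_compaction([40 * MB] * 7)
```

Here seven 40 MB files collapse into three output files (two of 120 MB and one of 40 MB), which cuts the number of objects Athena has to list and open.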
Domain 2 — Data Store Management (26%)
Amazon Redshift — distribution, sort keys, and Spectrum
The most tested service in Domain 2. Know distribution styles and when each is correct:
- KEY distribution: rows with the same key hash land on the same compute slice — optimal when joining two large tables on a common column. Both tables must use the same distribution key to collocate rows and eliminate redistribution at query time.
- EVEN distribution: rows distributed round-robin across slices — use for tables not joined on a consistent key, or for staging tables that feed into larger transforms.
- ALL distribution: every node gets a full copy of the table — use for small dimension tables (< 5 million rows) that are frequently joined to large fact tables. Eliminates dimension-side redistribution in star-schema queries.
- AUTO distribution: Redshift chooses based on table size. The correct default unless performance evidence indicates a specific style is needed.
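The difference between KEY and EVEN distribution is easier to see in a small simulation. This sketch places rows on slices with a stable hash (crc32 is a stand-in; Redshift's actual hash function is internal) and checks whether a join on `customer_id` can be satisfied without moving rows between slices. The table contents, slice count, and function names are hypothetical.

```python
import zlib

NUM_SLICES = 4  # hypothetical slice count for the illustration


def slice_for(key, num_slices=NUM_SLICES):
    """Map a distribution-key value to a slice using a stable stand-in hash."""
    return zlib.crc32(str(key).encode("utf-8")) % num_slices


def distribute_by_key(rows, key_col, num_slices=NUM_SLICES):
    """KEY style: rows with the same key value always land on the same slice."""
    slices = [[] for _ in range(num_slices)]
    for row in rows:
        slices[slice_for(row[key_col], num_slices)].append(row)
    return slices


def distribute_even(rows, num_slices=NUM_SLICES):
    """EVEN style: round-robin placement, ignoring column values."""
    slices = [[] for _ in range(num_slices)]
    for index, row in enumerate(rows):
        slices[index % num_slices].append(row)
    return slices


def join_is_colocated(fact_slices, dim_slices, key_col):
    """True if every fact row finds its matching dimension key on its own slice."""
    for fact_rows, dim_rows in zip(fact_slices, dim_slices):
        local_keys = {row[key_col] for row in dim_rows}
        if any(row[key_col] not in local_keys for row in fact_rows):
            return False
    return True


orders = [{"order_id": i, "customer_id": i % 5} for i in range(20)]
customers = [{"customer_id": c, "name": f"cust-{c}"} for c in range(5)]
customers_by_key = distribute_by_key(customers, "customer_id")

key_colocated = join_is_colocated(
    distribute_by_key(orders, "customer_id"), customers_by_key, "customer_id")
even_colocated = join_is_colocated(
    distribute_even(orders), customers_by_key, "customer_id")
```

When both tables share the distribution key the join is local to each slice; with EVEN placement of the fact table, matching rows end up on different slices and would need redistribution at query time.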
Sort keys: compound sort keys accelerate range queries and ORDER BY on leading columns. Interleaved sort keys treat all sort columns equally — useful when queries filter on multiple columns in no consistent order, but VACUUM REINDEX is expensive. For most workloads, compound sort keys on date columns used in WHERE filters deliver the greatest query acceleration with the lowest maintenance overhead.
Redshift Spectrum: query S3 data lake files directly via external tables pointing to the Glue Data Catalog. Billed per TB scanned in S3. Use for cold or archival data that cannot be loaded into Redshift but still needs to be joined against warm Redshift data in a single SQL query.
S3, DynamoDB, Aurora zero-ETL, and OpenSearch
- Amazon S3 data lake: foundation for most AWS data architectures. S3 Intelligent-Tiering automatically moves objects with unknown or changing access patterns between access tiers. Lifecycle policies transition cold data to S3 Glacier storage classes after a configured age (for example, 90 days). S3 Select pushes simple row/column filter predicates directly to S3, reducing data transferred to compute for lightweight queries without a full Athena job.
- Amazon DynamoDB: use for single-table designs with predictable key-based access patterns and millisecond SLA requirements. DynamoDB Streams captures item-level changes for real-time downstream pipeline triggers. Global Tables for multi-region active-active replication. TTL for automatic expiration of session data or ephemeral staging records.
- Aurora zero-ETL with Redshift: replicates Aurora PostgreSQL or MySQL transactional data into Redshift automatically, eliminating the Glue ETL job previously required for OLTP→warehouse pipelines. Correct answer when the question includes an operational Aurora database that needs near-real-time availability in Redshift for analytics.
- Amazon OpenSearch Service: full-text search and log analytics. Use when queries require free-text search, geo-spatial queries, or aggregation on high-cardinality log fields at scale. Not the correct answer for columnar SQL analytics on structured history — that is Redshift or Athena.
Store selection decision matrix: millisecond key-based reads → DynamoDB. Complex SQL analytics on structured history → Redshift. Ad-hoc SQL on S3 data lake → Athena. Full-text and log search → OpenSearch. Relational OLTP feeding analytics → Aurora with zero-ETL.
Domain 3 — Data Operations and Support (22%)
Orchestration and pipeline health are the central topics. The exam asks you to choose the right orchestration layer given the constraints in the scenario.
Orchestration: Glue Workflows, MWAA, Step Functions, EventBridge
- AWS Glue Workflows: trigger a Glue Crawler, then a Glue ETL job, then a follow-on job on success — a simple, Glue-native DAG with no external dependencies. Correct when the entire pipeline consists of Glue services and minimum operational complexity is required.
- Amazon MWAA (Managed Workflows for Apache Airflow): use for complex, multi-service DAG orchestration requiring Python DAG code, custom Airflow operators, external service integrations (Spark, DBT, external APIs), or migration of existing Airflow DAGs. MWAA manages the scheduler, worker scaling, and metadata database — you supply DAG files in S3.
- AWS Step Functions: serverless state machine orchestration. Express Workflows for high-volume, short-duration, at-least-once execution. Standard Workflows for durable, exactly-once, long-running pipelines. Correct when you want visual workflow debugging, tight IAM integration, and no Python DAG code to maintain.
- Amazon EventBridge Scheduler: cron-based triggers for Glue jobs, Lambda functions, and ECS tasks. Correct for simple scheduled pipeline runs without the overhead of a full orchestration framework.
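For the Step Functions option, a pipeline is declared as an Amazon States Language document rather than as Python DAG code. The sketch below generates such a definition for a chain of Glue jobs using the `glue:startJobRun.sync` service integration, which waits for each job to finish and is intended for Standard Workflows. The job names, the retry settings, and the helper `glue_pipeline_definition` are illustrative assumptions, not a recommended production configuration.

```python
import json


def glue_pipeline_definition(job_names):
    """Build a linear state machine that runs Glue jobs one after another."""
    states = {}
    for index, job_name in enumerate(job_names):
        state = {
            "Type": "Task",
            # Service integration that starts the job and waits for completion.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": job_name},
            # Illustrative retry policy; tune the values for the real workload.
            "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 60,
                "MaxAttempts": 2,
                "BackoffRate": 2.0,
            }],
        }
        if index + 1 < len(job_names):
            state["Next"] = f"Run-{job_names[index + 1]}"
        else:
            state["End"] = True
        states[f"Run-{job_name}"] = state
    return {
        "Comment": "Sequential Glue jobs (illustrative sketch)",
        "StartAt": f"Run-{job_names[0]}",
        "States": states,
    }


# Hypothetical job names.
definition = glue_pipeline_definition(["clean_orders", "load_orders"])
definition_json = json.dumps(definition, indent=2)
```

The resulting JSON is what would be supplied as the state machine definition; compare it with an Airflow DAG for the same two steps to see why Step Functions suits teams that do not want Python orchestration code to maintain.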
Monitoring, data quality, and cost optimization
CloudWatch metrics and alarms: track Glue job duration, DPU hours consumed, and error counts. Set alarms for SLA breach notifications. Glue job run logs stream to CloudWatch Logs for transform debugging.
Glue job bookmarks: track which data the job has already processed to enable incremental ETL runs — only new records processed since the last successful run, avoiding full dataset reprocessing on every trigger.
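Conceptually, a bookmark is persisted state that lets each run skip input it has already handled. The sketch below models that idea with a single timestamp high-water mark; it is a simplification for intuition only and not how Glue stores bookmark state internally. The function name and the integer timestamps are hypothetical.

```python
def run_incremental(objects, bookmark):
    """Process only objects newer than the bookmark and return the advanced bookmark.

    `objects` is a list of (key, last_modified) pairs; `bookmark` is the
    last_modified value of the newest object already processed, or None.
    """
    pending = [obj for obj in objects if bookmark is None or obj[1] > bookmark]
    pending.sort(key=lambda obj: obj[1])
    new_bookmark = pending[-1][1] if pending else bookmark
    return [key for key, _ in pending], new_bookmark


# Hypothetical landing-zone listing; integers stand in for timestamps.
listing = [("a.json", 1), ("b.json", 2)]
first_keys, bookmark = run_incremental(listing, None)

listing.append(("c.json", 3))
second_keys, bookmark = run_incremental(listing, bookmark)

third_keys, bookmark = run_incremental(listing, bookmark)
```

The first run processes everything, the second picks up only `c.json`, and the third finds nothing new, which is the behaviour the exam expects you to associate with bookmarks.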
Cost optimization patterns:
- EMR Spot Instances for large-scale batch Spark workloads: 60–90% savings vs on-demand, with checkpoint-to-S3 for recovery if Spot capacity is reclaimed.
- Redshift reserved nodes for predictable warehouse workloads: up to 75% savings vs on-demand pricing.
- Athena cost control: partition pruning, Parquet format conversion (reduces scanned bytes 60–85%), workgroup data scan limits to cap per-query costs.
- Glue serverless pricing (billed per DPU-hour, metered by the second) is cost-effective for infrequent ETL jobs versus maintaining an always-on EMR cluster.
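The Athena saving is simple arithmetic once a per-TB price is fixed. The sketch below assumes a list price of $5 per TB scanned (an assumption; check current pricing for your region) and applies the 60–85% scan reduction quoted above to a query that would scan 2 TB as JSON. The function names are hypothetical.

```python
ATHENA_USD_PER_TB = 5.00  # assumed price per TB scanned; varies by region


def athena_cost(scanned_tb, usd_per_tb=ATHENA_USD_PER_TB):
    """Cost of a single query given the terabytes it scans."""
    return round(scanned_tb * usd_per_tb, 2)


def parquet_cost_range(json_scanned_tb, min_reduction=0.60, max_reduction=0.85):
    """Best-case and worst-case cost after converting the data to Parquet."""
    best = athena_cost(json_scanned_tb * (1 - max_reduction))
    worst = athena_cost(json_scanned_tb * (1 - min_reduction))
    return best, worst


json_cost = athena_cost(2.0)
best_case, worst_case = parquet_cost_range(2.0)
```

Under these assumptions a $10 query drops to somewhere between $1.50 and $4.00 per run, before any additional saving from partition pruning.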
Domain 4 — Data Security and Governance (18%)
AWS Lake Formation — fine-grained data lake access control
Lake Formation adds a permissions layer on top of S3 and the Glue Data Catalog, replacing direct S3 bucket policies plus IAM policy combinations for multi-team data lake access control. At scale, Lake Formation is always the correct answer over raw S3 ACLs when the question describes fine-grained data access across many IAM roles and teams.
- Column-level security: restrict access to specific columns. Analysts see the table structure but with PII columns (SSN, email, date of birth) excluded from query results.
- Row-level filters: restrict which rows a role can see. Regional analysts are limited to rows where region = 'eu-west-1'.
- Cell-level security: combine row filters and column permissions simultaneously, so a specific role sees only certain rows AND certain columns in a single access policy.
- LF-Tags (attribute-based access control): tag tables and columns with metadata labels (sensitivity=PII, domain=finance), then grant access based on tag values rather than individual resource ARNs. Scales to large catalogs without managing per-table grants.
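The effect of combining a row filter with column exclusions can be shown with a small in-memory model. This is only a conceptual sketch of what a query returns under cell-level security; it is not the Lake Formation API, and the table contents, column names, and `apply_cell_filter` helper are hypothetical.

```python
def apply_cell_filter(rows, row_predicate, excluded_columns):
    """Keep rows that pass the predicate, then drop the excluded columns."""
    visible = []
    for row in rows:
        if row_predicate(row):
            visible.append(
                {col: val for col, val in row.items() if col not in excluded_columns}
            )
    return visible


# Hypothetical customer table containing PII columns.
customers = [
    {"customer_id": 1, "region": "eu-west-1", "email": "a@example.com", "ssn": "000-00-0001"},
    {"customer_id": 2, "region": "us-east-1", "email": "b@example.com", "ssn": "000-00-0002"},
    {"customer_id": 3, "region": "eu-west-1", "email": "c@example.com", "ssn": "000-00-0003"},
]

# What an EU regional analyst role would see: EU rows only, PII columns removed.
eu_view = apply_cell_filter(
    customers,
    row_predicate=lambda row: row["region"] == "eu-west-1",
    excluded_columns={"email", "ssn"},
)
```

In Lake Formation the same outcome is configured declaratively as a data filter and granted to a principal, so the restriction applies consistently whether the role queries through Athena, Redshift Spectrum, or EMR.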
Amazon Macie, encryption, and VPC endpoints
Amazon Macie: ML-powered PII detection in S3. Macie runs discovery jobs, classifies S3 objects, and reports findings (names, SSNs, credit card numbers, health data). Integrates with Security Hub for centralized finding management. Critical exam distinction: Macie detects but does not remediate. Remediation happens via EventBridge rules that trigger Lambda automation in response to Macie findings.
S3 encryption options:
- SSE-S3: S3-managed keys, no KMS API calls, lowest cost. Default for data lake buckets without specific compliance requirements.
- SSE-KMS: AWS-managed or customer-managed KMS keys. Enables fine-grained KMS key policy control and CloudTrail audit of every object decryption event. Use for regulated data (HIPAA, PCI DSS).
- SSE-C: customer-provided keys; S3 applies encryption server-side while you manage the key entirely. Rare; used for compliance regimes requiring full customer key ownership.
- Client-side encryption: encrypt before sending to S3; AWS never sees plaintext. Maximum control, maximum operational complexity.
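The three server-side options map onto different request parameters when writing an object. The sketch below builds the encryption-related keyword arguments that boto3's S3 `put_object` call accepts for each mode; it deliberately stops short of calling AWS so it can run anywhere. The helper name `encryption_args` and the key alias are hypothetical.

```python
def encryption_args(mode, kms_key_id=None, customer_key=None):
    """Return the put_object keyword arguments for a server-side encryption mode."""
    if mode == "SSE-S3":
        # S3-managed keys: no KMS involvement.
        return {"ServerSideEncryption": "AES256"}
    if mode == "SSE-KMS":
        args = {"ServerSideEncryption": "aws:kms"}
        if kms_key_id:
            # Omit the key ID to fall back to the AWS-managed key for S3.
            args["SSEKMSKeyId"] = kms_key_id
        return args
    if mode == "SSE-C":
        if customer_key is None:
            raise ValueError("SSE-C requires a customer-provided key")
        # The caller supplies the key on every request; S3 does not store it.
        return {"SSECustomerAlgorithm": "AES256", "SSECustomerKey": customer_key}
    raise ValueError(f"unknown encryption mode: {mode}")


sse_s3 = encryption_args("SSE-S3")
sse_kms = encryption_args("SSE-KMS", kms_key_id="alias/example-lake-key")
sse_c = encryption_args("SSE-C", customer_key=b"0" * 32)
```

Assuming a boto3 S3 client named `s3`, these would be passed as `s3.put_object(Bucket=..., Key=..., Body=..., **sse_kms)`. Client-side encryption has no equivalent parameters because the object is already ciphertext before the request is made.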
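Redshift encryption: simplest to enable at cluster creation. An existing unencrypted cluster can be modified to use KMS encryption, but Redshift implements the change by migrating the data to a new encrypted cluster behind the scenes, and the cluster is typically read-only while the migration runs. It is not an instant in-place toggle, which makes this an exam-testable gotcha; older study material describes it as a manual rebuild into a new cluster.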
VPC endpoints for private data pipelines: Gateway Endpoints (free) for S3 and DynamoDB. Interface Endpoints (AWS PrivateLink) for Glue, Kinesis, STS, and CloudWatch. Production pipelines handling regulated data should route all traffic through VPC endpoints so no data transits the public internet.
The question pattern that surprises most DEA-C01 candidates: two technically correct service choices where one satisfies a hidden operational or cost constraint the other misses. Both Kinesis Firehose and a custom Lambda+S3 PUT pattern can land data in S3 — Firehose is correct when the question says “minimum operational overhead.” Both Glue ETL and EMR Spark can run the same transformation — Glue is correct for serverless, managed ETL; EMR is correct for very large-scale processing (> 10 TB) or existing Spark application code. Both MWAA and Step Functions can orchestrate a multi-step pipeline — MWAA is correct when the question mentions existing Airflow DAGs or Python-based custom operators.
The three knowledge gaps that separate passing scores from near-misses
Three specific gaps account for the majority of incorrect answers on DEA-C01 practice simulations:
- Kinesis Streams vs Firehose vs Analytics: candidates who know all three ingest streaming data but cannot articulate the consumer model, latency profile, and destination constraints for each will consistently pick the wrong service. Know: KDS for custom consumers and low-latency replay; Firehose for managed delivery with no consumer code; Analytics/Flink for stateful stream processing with SQL or Java.
- Lake Formation vs S3 bucket policies: S3 bucket policies cannot grant column-level access — that distinction is exam-critical. When a question describes a data lake with “multiple teams needing access to different tables and columns,” Lake Formation column-level security or LF-Tags is the correct answer. Direct S3+IAM combinations are the wrong answer and become unmanageable at lake scale.
- Redshift distribution styles for join performance: questions present a slow fact-dimension join and ask which distribution change fixes it. KEY distribution on the join column collocates rows and eliminates data redistribution. ALL distribution on a small dimension table eliminates dimension-side redistribution. Candidates who cannot distinguish these two solutions will miss a category of performance optimization questions entirely.
DEA-C01 is the clearest professional signal available to AWS data engineers in 2026. The investment (60–100 study hours and a $150 exam fee) can yield a $20k–$40k annual salary increase at first job change for engineers transitioning from general cloud roles into data engineering. For experienced data engineers, DEA-C01 paired with 3+ years of Redshift and Kinesis production experience is consistently the fastest path to senior data engineer roles at $165k–$200k in US technology markets. The cert validates production experience to hiring managers who cannot otherwise distinguish pipeline builders from data platform architects at the screening stage.
Prepare for DEA-C01 and every other AWS, Azure, GCP, and CompTIA cert with free expert-level practice questions on CertQuests.