| Field | Details |
|---|---|
| Exam Code | Professional Data Engineer (updated 2023) |
| Questions | 50–60 multiple-choice and multiple-select |
| Duration | 2 hours |
| Passing Score | ~70% (Google does not publish exact score) |
| Price | $200 USD |
| Recertification | Every 2 years |
| Recommended Experience | 3+ years industry experience, 1+ year on GCP |
| Prerequisites | None official (GCP ACE or PCA recommended) |
Exam Domain Weights
Course Modules
📖 Read in-depth chapter ▾
PDE is fundamentally a "pick the right data store" exam. Every domain leans on the same decision matrix. Memorise the access-pattern → service mapping.
- OLTP relational: Cloud SQL (MySQL / PostgreSQL / SQL Server, regional, lift-and-shift). AlloyDB for PostgreSQL with HTAP. Cloud Spanner for globally distributed strong consistency.
- OLAP analytics: BigQuery as the warehouse. Serverless, columnar, billed per query (on-demand) or by slot reservation. NOT for OLTP.
- NoSQL wide-column / time-series: Bigtable — low-latency at million-QPS, schemaless, horizontally scalable. The IoT / fintech / ad-tech default.
- NoSQL document / mobile: Firestore in Native mode for real-time sync + offline SDK. Datastore mode is legacy.
- Object storage / data lake: Cloud Storage. Multi-region or regional buckets. Tier classes (Standard / Nearline / Coldline / Archive).
- Cache: Memorystore (Redis / Memcached managed).
- Streaming: Pub/Sub for ingest. Dataflow for transform. BigQuery streaming inserts or Bigtable as sink.
A retailer's data platform: catalog (~1M products, low-latency lookup) → Cosmos-like NoSQL? No → Firestore (mobile sync) OR Bigtable (high-QPS reads). Order transactions (ACID, global stores) → Cloud Spanner. 5-year order history (analytics) → BigQuery with date partitioning. Real-time clickstream → Pub/Sub → Dataflow → BigQuery. PDFs / receipts → Cloud Storage with lifecycle Standard → Coldline → Archive.
The data lake is the canonical PDE foundation. GCS-based lake with BigQuery + Dataflow + Dataproc layered on top. PDE tests zone design + format choice + lifecycle.
- Bronze / Silver / Gold zones: Bronze = raw landed data, immutable. Silver = cleaned + deduped + typed. Gold = aggregated / consumer-ready datasets. Sometimes called Raw / Curated / Refined.
- File formats: Parquet + Avro for columnar / row analytics. Avro for schema-evolving streaming sinks. JSON Lines for human-readable / quick-debug. NEVER CSV in production lakes — no schema, no compression.
- Partitioning convention: Hive-style partitions on date —
gs://bucket/silver/orders/dt=2026-05-16/. BigQuery external tables + Dataflow auto-detect partitions. - Compression: Snappy (fast, splittable) for hot data. Gzip for cold archives. ZSTD when supported. Always compress — costs 5-10× less for the same data.
- Cloud Storage class choices: Bronze = Standard for first 30d → Nearline → Coldline → Archive. Silver / Gold = Standard (queried more). Lifecycle policies automate transitions.
- BigLake: unify Cloud Storage + BigQuery — query GCS Parquet/Avro as if BigQuery tables; column-level + row-level security applied at the lake layer.
A retailer's lake on GCS: gs://retail-lake/bronze/clickstream/dt=... (raw JSON from Pub/Sub → GCS via Dataflow, 7-day Standard → Nearline). Dataflow batch transforms write Parquet to silver/clickstream/dt=... (deduped, schema-typed). BigQuery external table over silver. Daily gold aggregations materialised as native BigQuery tables for sub-second BI dashboards. BigLake applies row-level security on silver so analysts only see their region's data.
Data engineering touches sensitive data. PDE expects governance from day one — IAM, encryption, Data Catalog, DLP.
- IAM for data: least-privilege role grants at project / bucket / dataset / table / column. Dataset-level roles in BigQuery (BigQuery Data Viewer / Editor) vs project-level (BigQuery User).
- Service accounts for pipelines: each pipeline (Dataflow / Composer / Dataproc) runs as its own SA with minimum required permissions. Avoid the default Compute Engine SA.
- CMEK: Customer-Managed Encryption Keys via Cloud KMS. BigQuery / Cloud Storage / Spanner / Dataflow / Bigtable all support CMEK. Required for compliance regimes.
- VPC Service Controls: perimeter around the data services to prevent exfiltration via leaked credentials. Mandatory for HIPAA / PCI lakes.
- Data Catalog: central metadata catalog. Tag templates for sensitivity classification (PII / PHI / Financial). Powers search + lineage.
- Cloud DLP: scans for PII / PHI / financial data; can de-identify (mask, redact, tokenize) or hash. Run on bronze data before promoting to silver.
Healthcare data platform: VPC Service Controls perimeter around the BigQuery + GCS + Dataflow project. Cloud DLP scans bronze landing zone on ingest — detects SSN + DOB + names; de-identifies via format-preserving tokenization before silver promotion. Data Catalog tag template "PHI" applied to silver tables; column-level access policies in BigQuery restrict raw PHI fields to specific analyst group. CMEK on every BQ dataset + GCS bucket. SA-per-pipeline for Dataflow / Composer with least-privilege roles.
📖 Read in-depth chapter ▾
BigQuery bills by data scanned. The single biggest cost lever is partitioning + clustering. Get this wrong and query bills 100×.
- Partition types: Ingestion-time (auto-populated by load time), Time-unit column (date / timestamp column you specify), Integer-range (numeric range buckets). Time-unit column is the modern default.
- Partition pruning: queries that filter on the partition column scan ONLY relevant partitions.
WHERE order_date BETWEEN '2026-05-01' AND '2026-05-15'reads ~15 partitions instead of all 2000+ days. - Require partition filter: table option that forces every query to include a partition filter. Prevents accidental full-table scans on huge tables.
- Clustering: sorts data within each partition by 1-4 columns. Queries filtering on cluster columns prune blocks even more. Stacks with partitioning.
- Cluster column order matters: most-filtered column first. Cardinality matters less than filter-pattern frequency.
- Cluster expiration: partition expiration deletes old partitions automatically — combine with date partitioning to enforce data-retention policies.
A 50TB orders table queried mostly by date range + customer + store. Design: partition on order_date (daily partitions). Cluster on customer_id, store_id. Require partition filter. Typical query SELECT ... WHERE order_date BETWEEN '2026-05-01' AND '2026-05-07' AND customer_id = 12345 scans ~7 partitions × a few blocks = ~100 MB instead of 50 TB. 99.99% cost reduction.
BigQuery's pricing comes in two flavours — on-demand (per-TB scanned) and Editions (capacity-based with slots). PDE tests when to pick each + how to combine them.
- Slot: a unit of CPU + memory + IO. Parallel query execution scales with available slots. On-demand has implicit dynamic slot pool. Editions reserves explicit slots.
- Editions tiers: Standard (basic), Enterprise (multi-statement transactions, customer-managed encryption, BI Engine), Enterprise Plus (cross-region replication, fine-grained access).
- Reservations: commit to N slots in a region for 1-month, 1-year, or 3-year. Assign to projects via assignments. Predictable cost for steady ETL workloads.
- Idle slots + autoscaling: reservations can autoscale up to a max + share idle slots across reservations in the same admin project.
- BI Engine: in-memory query acceleration. Sub-second dashboards. Configured as a separate reservation. Looker / Looker Studio / Tableau benefit.
- Materialised views: precomputed aggregations refreshed incrementally. BQ rewrites matching queries to use the MV. Cuts cost for common dashboard queries.
A retailer's predictable nightly ETL + spiky daytime ad-hoc: cover ETL with Enterprise Edition reservation, 500 slots committed 1-year. Daytime ad-hoc analysts use on-demand billing (separate project). Looker dashboards on the materialised views accelerated by a BI Engine reservation for sub-second response. Common 30-day aggregation queries replaced with materialised views — additional cost cut.
BQ's advanced features. External tables for federated query; row + column security for governance; BigQuery ML for in-DB modelling.
- External tables: BQ tables that read data from Cloud Storage / Cloud SQL / Bigtable / Google Sheets without copying. Use for federated query + data-lake patterns.
- BigLake: external tables with BigQuery's IAM + row/column security applied at the lake layer. Supports Parquet/Avro/ORC/JSON on GCS.
- Row-level security (RLS): per-table access policies that filter rows by a condition (e.g.,
region = SESSION_USER_REGION()). Applied transparently to every query. - Column-level access policies: tag columns with policy tags from Data Catalog. Role grants reference the tag — granting "PII reader" exposes the column to that user.
- Dynamic data masking: column-level masking rules (e.g., hash, last-4-digits) applied per role. User without unmask permission sees the masked value.
- BigQuery ML: train ML models with SQL.
CREATE MODEL ... OPTIONS(model_type='LINEAR_REG') AS SELECT .... Supports LINEAR_REG, LOGISTIC_REG, KMEANS, BOOSTED_TREE, ARIMA_PLUS, DNN, AutoML. Replace simple ML for analysts with just SQL.
A multinational has region-specific data-residency rules. External BigLake table over GCS lake. Row-level security policy: rows filtered by user's region claim. Column-level access: SSN column tagged "PII", only HR role can see it. Dynamic masking on email column shows masked email to analysts. Data scientists use BigQuery ML to train a churn-prediction LOGISTIC_REG model on the silver table — no Vertex AI dependency for simple cases.
📖 Read in-depth chapter ▾
Apache Beam is the unified batch + streaming SDK; Dataflow is GCP's managed runner. Same pipeline code runs both modes.
- PCollection: distributed dataset (immutable, parallel-processable). Bounded for batch; unbounded for streaming.
- PTransform: processing step — ParDo (per-element function), GroupByKey, Combine, Flatten, Window. Compose into DAGs.
- Pipeline: the DAG of PTransforms. Built once, runs anywhere — Direct Runner (local), Dataflow (managed), Flink, Spark.
- I/O connectors: built-in for GCS, Pub/Sub, BigQuery, Bigtable, Spanner, Kafka, JDBC. Source + sink primitives.
- Dataflow runner advantages: autoscaling (workers added/removed based on backlog), data sampling for debugging, GCP-native I/O.
- Templates vs Flex Templates: Templates = pre-baked pipeline image, parameters at run-time. Flex Templates = build pipeline into Docker image + push to Artifact Registry — supports parameters that change pipeline structure.
Streaming pipeline: read from Pub/Sub topic orders-stream → ParDo to parse JSON → GroupByKey on customerId → ParDo to compute customer running totals → write to BigQuery streaming inserts. Same code with --streaming=False + bounded GCS source can run as a batch backfill against historical data. Flex Template packages it as parameterised Docker image for ops + analyst self-service runs.
Streaming = unbounded data. Windows turn unbounded into bounded chunks for aggregation. Watermarks + triggers handle late data. PDE tests this on every streaming question.
- Fixed (tumbling) windows: non-overlapping fixed-size intervals — every 5 minutes. Best for "5-minute counts".
- Sliding windows: overlapping fixed-size intervals — 5-min window every 1 min. Smooth moving averages.
- Session windows: gap-based — close window when no activity for N minutes. Perfect for user-session analytics.
- Global window: single window for all elements. Requires triggers to emit (otherwise never closes for unbounded data).
- Watermarks: Dataflow's estimate of "all events with timestamp < W have arrived". Watermark advances as input streams progress. Used to close windows + emit results.
- Triggers: control when window results emit. Default = at watermark close. Custom = early (every N elements) + late (each late element).
- Allowed lateness: how long after watermark close to keep accepting late events. Pair with late-firing triggers to update aggregates.
E-commerce real-time conversion-rate dashboard: sliding window 30 minutes wide, sliding every 5 minutes. Watermark tracks event-time progress from Pub/Sub timestamps. Allowed lateness 10 minutes with late-firing trigger — dashboard updates retroactively when delayed mobile events arrive. After 10 min late, drop the event (alert if rate exceeds threshold).
Advanced Beam: enrich streams with reference data (side inputs), maintain per-key state (stateful DoFns), and use Streaming Engine to offload state from workers.
- Side input: auxiliary PCollection broadcast to a ParDo for enrichment lookups. E.g., enrich clickstream events with a user-profile lookup table.
- Slowly-changing side inputs: refresh side input on schedule (e.g., reload product catalog every hour) using GenerateSequence as a trigger.
- Stateful DoFns: per-key state stored across element invocations. Common pattern: count events per key within a window, emit when threshold hit.
- Timers: register a future event-time or processing-time callback. Used with state for windowed deduplication + custom triggers.
- Streaming Engine: Dataflow option that offloads state + windowing from workers to a Dataflow-managed backend. Faster autoscaling (no state to move), lower worker memory needs.
- Dataflow Prime: the newer execution model — automatic vertical autoscaling, right-fitting workers per stage, integrated with Streaming Engine.
Fraud-detection streaming pipeline: read transactions from Pub/Sub; side input = customer risk-tier lookup table refreshed hourly. ParDo enriches each transaction with risk tier. Stateful DoFn keyed by customerId tracks the last 10 transactions in state. Detect rapid spend increase → emit alert. Streaming Engine ON + Dataflow Prime for elastic autoscaling. Output to BigQuery streaming inserts + Pub/Sub alert topic.
📖 Read in-depth chapter ▾
Pub/Sub is the canonical PDE messaging primitive. Beyond fan-out (covered in PCA Lesson 2.2), this lesson covers the subtle delivery guarantees + Pub/Sub Lite.
- Ordering keys: messages with the same key are delivered in order WITHIN a subscription. Throughput per key is sequential — choose keys with enough diversity to parallelise.
- Exactly-once delivery: subscription option that guarantees each message delivered at most once within an acknowledgement deadline window. Subscriber must acknowledge inside the deadline; otherwise redelivery occurs.
- Dead Letter Topic (DLT): attach to a subscription with max-delivery-attempts. After N failed deliveries, message routes to DLT for forensic + replay.
- Acknowledgement deadline: default 10s, max 600s. Extend dynamically via
modifyAckDeadlinewhen processing takes longer than expected. - Pub/Sub Lite: regional, single-zone, partition-based (Kafka-like). Cheaper than Pub/Sub. Less durable. Use only for non-critical / cost-sensitive workloads.
- Subscription types: Pull (consumer requests), Push (Pub/Sub POSTs to HTTP endpoint), BigQuery subscription (writes directly to BQ, no intermediate Dataflow), Cloud Storage subscription (writes to GCS).
Order events: Pub/Sub topic orders.placed with ordering key = customerId (per-customer sequential). Subscription with exactly-once delivery + DLT after 5 failed attempts. Cloud Run consumer extends ack deadline if processing > 10s. BigQuery subscription on a separate "raw orders" subscription writes directly into BQ without Dataflow — cheapest path for simple persistence.
Dataproc is managed Hadoop / Spark. PDE tests the ephemeral cluster pattern + the newer Serverless Spark option.
- Ephemeral clusters: spin up a cluster for one job, shut down on completion. Avoids paying for idle clusters. The PDE-favoured pattern.
- Dataproc on GKE: run Spark / Flink on a shared GKE cluster instead of dedicated VMs. Better resource sharing across jobs.
- Dataproc Serverless: no cluster — submit a Spark batch job, pay per execution, no infrastructure management. The modern preferred pattern for one-off jobs.
- Persistent state via Metastore: Dataproc Metastore is a managed Hive metastore. Ephemeral clusters / Serverless jobs can read/write tables defined there.
- Autoscaling: Dataproc clusters can scale worker count based on YARN pending containers. Combine with secondary worker pool (preemptible).
- Workflow templates: declarative job DAG + cluster spec.
gcloud dataproc workflow-templates instantiatespins cluster → runs jobs → terminates cluster. The IaC pattern for Dataproc.
Nightly Spark ETL: define a workflow template — cluster spec (10 primary + 20 secondary preemptible workers, autoscaling policy), job (Spark JAR + arguments). Cloud Composer triggers instantiate at 02:00 UTC. Cluster spins up, runs the job, shuts down. Total cost: ~2 hours of cluster time per night vs 24h for a persistent cluster. Dataproc Metastore holds the Hive table catalog accessed by every job.
For teams without strong Spark / Beam skills, Data Fusion provides visual ETL. Datastream replicates change events from databases.
- Data Fusion: GUI-based ETL on top of CDAP. Drag-and-drop pipelines. Generates Dataflow / Spark / MapReduce under the hood. Use for analyst-built or low-code pipelines.
- Wrangler: Data Fusion's data-preparation interactive tool. Apply transformations on a sample; generates the pipeline transforms automatically.
- Datastream: Change Data Capture from MySQL / PostgreSQL / Oracle / SQL Server / AlloyDB into BigQuery / Cloud Storage / Spanner. Low-latency, schema-aware.
- Datastream → BigQuery direct mode: writes CDC changes directly into BQ with upserts via the BigQuery Storage Write API. Near-real-time replication.
- Use cases: migrate from on-prem MySQL to Cloud SQL with continuous replication, or stream operational data into BigQuery for analytics with sub-minute lag.
- Schema drift: Datastream detects source schema changes + propagates to BigQuery. Configurable handling of drop / add / type changes.
A company runs operational MySQL on-prem. Need real-time analytics in BigQuery without an ETL team. Design: Datastream connects to MySQL via private connectivity (Cloud VPN). Captures all changes via MySQL binlog. Writes to BigQuery direct mode with upserts. Sub-minute lag from OLTP to BQ. Schema drift handled — new columns auto-added to BQ. Analysts use Looker on the BQ tables, see operational data updated nearly real-time.
📖 Read in-depth chapter ▾
Bigtable's performance depends entirely on row-key design. PDE tests this more than any other Bigtable topic.
- Lexicographic key order: Bigtable sorts rows by key. Range scans use the order. Sequential keys = hot tablet.
- Hot-spot causes: monotonically increasing prefix (timestamp first), single common prefix (all rows start with same value), low-cardinality leading field.
- Reverse-timestamp: for "most recent first" access, prepend reverse-timestamp (
MAX_TIME - actual_time). Distributes writes across tablets + makes most-recent reads efficient. - Field promotion: move high-cardinality field to leading position. Instead of
timestamp#userId, useuserId#timestamp— writes spread by user. - Salting: prepend hash of key (mod N). Distributes writes uniformly. Cost: scans must hit all N salt prefixes — only useful for write-heavy workloads with random access.
- Tablet splits: Bigtable auto-splits hot tablets if access is even. If access is skewed (one tablet hot), splits don't help — fix the key design.
IoT telemetry: 1M devices, each writes once per second. Bad key: timestamp#deviceId — every write at time T hits the same tablet (timestamp prefix is shared). Hot tablet. Good key: deviceId#reverseTimestamp — writes spread across deviceIds; per-device range scans (most-recent-first) efficient. Hot device? Add salt: {hash(deviceId)%16}#deviceId#reverseTimestamp — distributes hot device across 16 tablets.
Cloud Spanner is the global ACID OLTP. PDE tests interleaved tables + commit timestamps + read-only timestamps.
- Interleaved tables: physically co-locate child rows with their parent row in the same split.
INTERLEAVE IN PARENTin CREATE TABLE. Used for parent-child relationships (Order → OrderLineItems). - Why interleave: parent + children read in one IO. Joins on parent key become local. Beats traditional foreign-key join for performance.
- Commit timestamps: system-managed timestamp set at commit. Use
OPTIONS (allow_commit_timestamp=true). Monotonically increasing across all replicas — perfect for CDC + audit logs. - Stale reads: read with
READ_TIMESTAMPorEXACT_STALENESSto get a read-only snapshot. Avoids contention with writers. Use for analytics queries against the OLTP DB. - Primary key design: avoid monotonically increasing PK (UUIDs work, sequential IDs cause hot splits). Bit-reverse sequential IDs if forced into them.
- Spanner Editions: Standard, Enterprise (multi-region, CMEK), Enterprise Plus (cross-region replication tiers, leader options).
E-commerce platform: Orders table with PK (customer_id, order_id). OrderLineItems interleaved INTERLEAVE IN PARENT Orders ON DELETE CASCADE. Reading order + lines = one IO. Commit timestamp column on every table for CDC streaming via Datastream → BQ. Analytics queries use stale reads with 10-second staleness — non-blocking for OLTP writers.
Two cost / availability features. GCS lifecycle policies for tiering; Bigtable replication for read scaling + DR.
- GCS lifecycle policies: declarative rules — age-based transitions (Standard → Nearline → Coldline → Archive), age-based deletion, noncurrent-version deletion (with versioning).
- Lifecycle on versioned buckets: rules can target current vs noncurrent versions independently. Common: keep 3 most-recent noncurrent versions, archive past N days.
- Object lock + retention policies: immutability (WORM) — bucket-level retention policy that locks down objects for N years. Compliance.
- Bigtable replication: add more cluster locations to an instance for read scaling + DR. App profile chooses single-cluster (consistent) or multi-cluster (eventual consistency for reads).
- App profiles: control routing — single-cluster routing pins all reads to one cluster (predictable). Multi-cluster fans out for HA.
- Failover: automatic if multi-cluster app profile; manual cluster switch if single-cluster. Replication lag observable via metrics.
A bank's transaction archive (regulated 7-year retention) lives in GCS. Lifecycle policy: transition Standard → Nearline at 30d → Coldline at 90d → Archive at 365d. Object retention 7 years (compliance). Bigtable for real-time transaction analytics: 2 clusters (us-east1 + us-west1). App profile = multi-cluster (HA), readers use either cluster. On cluster failure, traffic auto-routes to the other.
📖 Read in-depth chapter ▾
BQ ML lets data analysts build ML without leaving SQL. PDE expects you to know the model types + when BQ ML fits vs when Vertex AI is needed.
- Model types: LINEAR_REG (regression), LOGISTIC_REG (classification), KMEANS (clustering), MATRIX_FACTORIZATION (recsys), BOOSTED_TREE (XGBoost-like), DNN, ARIMA_PLUS (time-series forecasting), AUTOML_CLASSIFIER / REGRESSOR (delegates to Vertex AutoML).
- Training:
CREATE MODEL ... OPTIONS(model_type='...') AS SELECT ... FROM training_table. Single SQL statement. - Evaluation:
ML.EVALUATE(MODEL ..., (SELECT ...))returns metrics (accuracy, AUC, log loss). Built-in test/train split via OPTIONS. - Prediction:
ML.PREDICT(MODEL ..., (SELECT ...)). Batch prediction inline. - Hyperparameter tuning:
ML.NUM_TRIALS+ML.HPARAM_TUNING_OBJECTIVES. Built-in Vizier-based search. - BQ ML vs Vertex AI: BQ ML for SQL analysts + simple models + data already in BQ. Vertex AI for production custom models, online serving, more frameworks, MLOps pipelines.
- Export to Vertex: BQ ML models can be exported to Vertex AI Model Registry for online endpoint serving.
A marketing analyst wants to predict customer churn. Data already in BigQuery. CREATE MODEL marketing.churn_model OPTIONS(model_type='LOGISTIC_REG', input_label_cols=['churned']) AS SELECT customer_features, churned FROM customer_history. ML.EVALUATE shows AUC 0.87. ML.PREDICT against the current customer table emits churn probabilities — marketing team uses the probabilities for targeted campaigns. No data engineering or Python required.
Vertex AI is GCP's full ML platform. PDE tests AutoML vs custom training + the model deployment options.
- Vertex AI AutoML: upload labelled data → Google trains a model + finds best architecture. Tabular, Image, Video, Text, Forecasting. Hands-free.
- Custom training: bring your own training code (Python, TF/PyTorch/scikit-learn/XGBoost). Submit to Vertex training service; specify machine type + GPU + region.
- Training service: serverless training — pay per training hour. Supports distributed training (parameter server, all-reduce).
- Pre-built containers: Vertex provides containers for major frameworks. Custom containers for niche stacks.
- Hyperparameter tuning: Vertex Vizier-based HPT — submit a config + budget; Vertex runs the search.
- Online vs batch prediction: Online = HTTP endpoint, low-latency, autoscaling. Batch = bulk predictions against a file. Trade-off cost vs latency.
- Model Registry: versioned model storage + lineage. Deploy specific versions to endpoints. Roll back by deploying previous version.
Product-image classification: small dataset, no ML team. → Vertex AI AutoML Image — upload labelled images, AutoML trains, returns model. For recommendation system with custom embedding model: custom training in PyTorch, GPU machine type, hyperparameter tuning. Model registered in Vertex Model Registry, deployed to online endpoint at v1.0. New v1.1 deployed to 10% traffic split for canary; promote to 100% after validation.
Production ML requires consistent features + reproducible training. Feature Store and Vertex AI Pipelines are the MLOps primitives PDE expects.
- Vertex Feature Store: central feature repository. Online (low-latency for serving) + offline (BigQuery-backed for training). Solves training-serving skew — both reads the same feature.
- Feature ingestion: batch from BigQuery / Dataflow, or streaming from Pub/Sub. Features versioned with timestamps for point-in-time correctness.
- Vertex AI Pipelines: Kubeflow-based or TFX-based pipeline runner. Define training workflow as DAG; Vertex executes serverlessly.
- Pipeline components: reusable steps — data prep, training, evaluation, deployment. Build once, compose into many pipelines.
- Continuous training (CT): trigger pipelines on schedule or data drift. Retrain + evaluate + deploy automatically if metrics improve.
- Model Monitoring: Vertex monitors deployed models for prediction drift + skew + outliers. Alerts on degradation.
- Vertex AI Workbench: managed Jupyter for data scientists. Integrated with BQ + GCS + Vertex services.
Fraud detection MLOps pipeline: Feature Store with online + offline. Streaming ingestion from Pub/Sub keeps online features current. Vertex AI Pipeline triggered weekly: pull training data from Feature Store offline → train new model → evaluate against holdout → if AUC improves, deploy to canary endpoint at 10% traffic. Model Monitoring flags feature drift → triggers retraining outside the weekly schedule. Vertex Workbench for data scientist exploration.
📖 Read in-depth chapter ▾
Cloud Composer is managed Airflow. PDE tests DAG design + operators + secret management.
- DAG (Directed Acyclic Graph): Airflow's pipeline primitive. Python file defining tasks + dependencies. Loaded by scheduler + executed by workers.
- Operators: task building blocks. Standard: BashOperator, PythonOperator. GCP-specific: BigQueryInsertJobOperator, DataflowStartFlexTemplateOperator, GCSToBigQueryOperator, BigQueryCheckOperator.
- XCom: cross-task communication. Small payload (e.g., row count from BQ → branch decision). Avoid passing huge datasets via XCom — use GCS pointers instead.
- Pools + concurrency: limit how many tasks of a type run simultaneously. Prevents overloading downstream (e.g., max 5 concurrent BQ jobs).
- Secret backend: Composer integrates with Secret Manager — Airflow connections + variables stored as secrets, never in code or env vars.
- Composer 2: GKE Autopilot-based, autoscaling environment. Composer 1 is legacy (manual GKE node management).
- Sensors + deferrable operators: sensors wait for external events (file lands, query completes). Deferrable variants release worker slot while waiting — scales to thousands of concurrent waits.
Nightly ETL DAG: GCSObjectExistenceSensor (deferrable) waits for landing-zone file → DataflowStartFlexTemplateOperator runs transformation pipeline → BigQueryInsertJobOperator appends to silver → BigQueryCheckOperator validates row count → BashOperator sends Slack notification. Connection strings + API keys stored in Secret Manager + referenced via Airflow variable. Pool bq_pool limits concurrent BQ jobs to 5.
Dataplex is the data fabric layer that unifies governance across BQ + GCS + Bigtable. PDE tests its lake/zone model + data quality tasks.
- Lake / Zone / Asset model: Lake = logical domain (e.g., "Retail"). Zones inside lakes — Raw + Curated (analogous to bronze + silver). Assets = the actual storage (GCS buckets, BQ datasets).
- Auto-discovery: Dataplex scans assets, infers schemas, registers them as BigQuery external tables for querying. Schema drift detected + propagated.
- Data quality tasks: declarative rules (not null, range, regex, uniqueness, custom SQL) run on schedule. Failures emit Cloud Logging events + can fail downstream pipelines.
- Data lineage: integrated with Vertex AI + Dataflow + Composer — Dataplex tracks where data came from + where it went. Auditable for compliance.
- Data profiling: automated statistical profiles of tables — null rates, distinct counts, histograms. Surfaces data quality issues passively.
- Data Catalog integration: Dataplex pushes metadata to Data Catalog. Tag templates classify (PII, sensitivity). Search across the org for "all tables with column tagged SSN".
Enterprise data platform: Dataplex Lake "Retail" with Raw + Curated zones. GCS buckets + BigQuery datasets registered as assets. Auto-discovery publishes external BQ tables for every GCS Parquet partition. Data quality tasks: orders table must have not-null order_id, customer_id non-empty, amount > 0. Failures alert the data team via Cloud Logging. Data Catalog tag "PII" applied to customer_email column. Lineage shows: raw GCS → silver Parquet → BQ table → Vertex AI training dataset.
Sensitive-data discovery + de-identification is mandatory for regulated workloads. Cloud DLP scans + transforms PII inline or via Dataflow.
- InfoType detectors: 150+ built-in detectors (US_SSN, EMAIL, PHONE, CREDIT_CARD, etc.) + custom regex / dictionary detectors.
- Inspection: scan content for InfoTypes. Returns findings with confidence + offsets. Run on a BQ table / GCS bucket / Datastore kind.
- De-identification transformations: mask (replace with *), redact (remove), hash, replace-with-info-type, format-preserving encryption (FPE), date shifting, generalisation, bucketing.
- Crypto deterministic vs format-preserving: deterministic = same plaintext → same ciphertext (good for joins on de-identified columns). FPE preserves format (SSN-shape ciphertext, useful when downstream system expects format).
- Re-identification: calculate re-identification risk (k-anonymity / l-diversity) of a de-identified dataset. Tells you if combinations of quasi-identifiers can re-identify individuals.
- DLP in pipelines: integrate via Dataflow templates — every record passes through DLP de-identification on ingest. Or run async via storage job for batch.
Healthcare data ingestion: bronze landing zone GCS bucket → Dataflow pipeline with Cloud DLP de-identification. InfoTypes detected: US_SSN, PERSON_NAME, EMAIL_ADDRESS, DATE_OF_BIRTH. Transformations: SSN → format-preserving encrypted, name → hashed, email → masked (first 2 chars + domain), DOB → year only. De-identified output written to silver zone. Re-identification risk assessment confirms k-anonymity ≥ 5 across quasi-identifiers. Analysts query silver freely; HR re-identifies via FPE decryption key in Secret Manager when needed.
Key Concepts to Master
Bigtable vs BigQuery: The Exam Trap
The exam loves asking which storage system to use. The deciding factor: Bigtable for high-throughput, low-latency reads/writes of time-series or IoT data at millions of QPS. BigQuery for analytical queries over terabytes with SQL. If the scenario says "real-time sensor data at <10ms latency," it's Bigtable. If it says "analyze 3 years of sales data," it's BigQuery.
Dataflow Windowing: Which Window When?
Three window types tested heavily: Fixed (tumbling) — non-overlapping equal-size windows, e.g., aggregate sales per hour. Sliding — overlapping windows for moving averages, e.g., 1-hour window every 5 minutes. Session — gap-based, variable duration, ideal for user activity sessions. Late data is handled with allowedLateness and trigger strategies.
BigQuery ML: Choosing the Right Model Type
The PDE exam tests CREATE MODEL type selection: LINEAR_REG for numeric predictions (price forecasting). LOGISTIC_REG for binary/multi-class classification (churn prediction). KMEANS for unsupervised clustering (customer segmentation). BOOSTED_TREE_CLASSIFIER for high-accuracy tabular classification. ARIMA_PLUS for time-series forecasting with trend/seasonality decomposition.
6-Week Study Plan
CREATE MODEL and evaluate it with ML.EVALUATE. Explore Vertex AI Feature Store and understand the difference between online and offline serving. Study MLOps pipeline patterns.
Top 4 Mistakes on the PDE Exam
WHERE clause. If you wrap the partition column in a function (DATE(timestamp_col)), pruning is disabled. The exam tests this with cost optimization scenarios.
GCP PDE vs GCP PCA — What's the Difference?
PDE — Data Engineer
- BigQuery, Dataflow, Pub/Sub in depth
- Bigtable schema and hot-spot design
- Vertex AI, BigQuery ML, Feature Store
- Cloud Composer Airflow DAGs
- Dataplex governance & data quality
- Cloud DLP de-identification
- Datastream CDC pipelines
- Focus: data layer and ML pipelines
PCA — Cloud Architect
- Shared VPC, Cloud Spanner multi-region
- Anthos multi-cloud & GKE architecture
- VPC Service Controls, Binary Authorization
- Cloud Deploy, Config Sync GitOps
- SLO/SLI engineering, burn rate alerting
- Chaos engineering & DR strategies
- Workload Identity, CMEK, Cloud HSM
- Focus: infrastructure and platform design