Google Cloud · cloud

GCP Professional Data Engineer (PDE)

Master Google Cloud data engineering for the PDE exam: BigQuery partitioning/clustering/ML/BI Engine, Dataflow Apache Beam windowing and stateful processing, Pub/Sub ordering and exactly-once delivery, Dataproc ephemeral clusters and Serverless Spark, Cloud Spanner interleaving, Bigtable row-key design, Vertex AI Feature Store and Pipelines, Cloud DLP de-identification, Dataplex governance, and Cloud Composer Airflow orchestration. 60 scenario-based practice questions.

7Modules
40 hoursDuration
advancedLevel
Field Details
Exam CodeProfessional Data Engineer (updated 2023)
Questions50–60 multiple-choice and multiple-select
Duration2 hours
Passing Score~70% (Google does not publish exact score)
Price$200 USD
RecertificationEvery 2 years
Recommended Experience3+ years industry experience, 1+ year on GCP
PrerequisitesNone official (GCP ACE or PCA recommended)

Exam Domain Weights

Domain 1 — Designing Data Processing Systems ~22%
Domain 2 — Ingesting and Processing Data ~25%
Domain 3 — Storing the Data ~20%
Domain 4 — Preparing and Using Data for Analysis ~15%
Domain 5 — Maintaining and Automating Workloads ~18%

Course Modules

Module 013 lessons
Data Engineering Foundations & Service Selection
Understand the GCP data engineering landscape and master the critical skill of selecting the right service for each workload. Learn when to use BigQuery vs Bigtable vs Spanner vs Cloud SQL, and how to design a modern data lake architecture on Cloud Storage.
service selection data lake Cloud Storage BigQuery Bigtable Spanner
📖 Read in-depth chapter
Lesson 1.1 Storage selection — OLTP, OLAP, NoSQL, time-series

PDE is fundamentally a "pick the right data store" exam. Every domain leans on the same decision matrix. Memorise the access-pattern → service mapping.

Key concepts
  • OLTP relational: Cloud SQL (MySQL / PostgreSQL / SQL Server, regional, lift-and-shift). AlloyDB for PostgreSQL with HTAP. Cloud Spanner for globally distributed strong consistency.
  • OLAP analytics: BigQuery as the warehouse. Serverless, columnar, billed per query (on-demand) or by slot reservation. NOT for OLTP.
  • NoSQL wide-column / time-series: Bigtable — low-latency at million-QPS, schemaless, horizontally scalable. The IoT / fintech / ad-tech default.
  • NoSQL document / mobile: Firestore in Native mode for real-time sync + offline SDK. Datastore mode is legacy.
  • Object storage / data lake: Cloud Storage. Multi-region or regional buckets. Tier classes (Standard / Nearline / Coldline / Archive).
  • Cache: Memorystore (Redis / Memcached managed).
  • Streaming: Pub/Sub for ingest. Dataflow for transform. BigQuery streaming inserts or Bigtable as sink.
Concrete example

A retailer's data platform: catalog (~1M products, low-latency lookup) → Cosmos-like NoSQL? NoFirestore (mobile sync) OR Bigtable (high-QPS reads). Order transactions (ACID, global stores) → Cloud Spanner. 5-year order history (analytics) → BigQuery with date partitioning. Real-time clickstream → Pub/Sub → Dataflow → BigQuery. PDFs / receipts → Cloud Storage with lifecycle Standard → Coldline → Archive.

Key takeaway: Spanner = global ACID. BigQuery = analytics warehouse. Bigtable = time-series / IoT. Firestore = mobile real-time. Cloud SQL / AlloyDB = relational OLTP. Cloud Storage = objects.
⚡ Mini-quiz
Drill data-store-selection scenarios → study mode (10 questions).
Lesson 1.2 Data lake patterns on Cloud Storage

The data lake is the canonical PDE foundation. GCS-based lake with BigQuery + Dataflow + Dataproc layered on top. PDE tests zone design + format choice + lifecycle.

Key concepts
  • Bronze / Silver / Gold zones: Bronze = raw landed data, immutable. Silver = cleaned + deduped + typed. Gold = aggregated / consumer-ready datasets. Sometimes called Raw / Curated / Refined.
  • File formats: Parquet + Avro for columnar / row analytics. Avro for schema-evolving streaming sinks. JSON Lines for human-readable / quick-debug. NEVER CSV in production lakes — no schema, no compression.
  • Partitioning convention: Hive-style partitions on date — gs://bucket/silver/orders/dt=2026-05-16/. BigQuery external tables + Dataflow auto-detect partitions.
  • Compression: Snappy (fast, splittable) for hot data. Gzip for cold archives. ZSTD when supported. Always compress — costs 5-10× less for the same data.
  • Cloud Storage class choices: Bronze = Standard for first 30d → Nearline → Coldline → Archive. Silver / Gold = Standard (queried more). Lifecycle policies automate transitions.
  • BigLake: unify Cloud Storage + BigQuery — query GCS Parquet/Avro as if BigQuery tables; column-level + row-level security applied at the lake layer.
Concrete example

A retailer's lake on GCS: gs://retail-lake/bronze/clickstream/dt=... (raw JSON from Pub/Sub → GCS via Dataflow, 7-day Standard → Nearline). Dataflow batch transforms write Parquet to silver/clickstream/dt=... (deduped, schema-typed). BigQuery external table over silver. Daily gold aggregations materialised as native BigQuery tables for sub-second BI dashboards. BigLake applies row-level security on silver so analysts only see their region's data.

Key takeaway: bronze/silver/gold zones. Parquet/Avro + Snappy + Hive partitions. Lifecycle policies tier old bronze data. BigLake unifies GCS + BQ with security.
⚡ Mini-quiz
Practise data-lake design scenarios → quick quiz (5 questions).
Lesson 1.3 IAM + data governance fundamentals

Data engineering touches sensitive data. PDE expects governance from day one — IAM, encryption, Data Catalog, DLP.

Key concepts
  • IAM for data: least-privilege role grants at project / bucket / dataset / table / column. Dataset-level roles in BigQuery (BigQuery Data Viewer / Editor) vs project-level (BigQuery User).
  • Service accounts for pipelines: each pipeline (Dataflow / Composer / Dataproc) runs as its own SA with minimum required permissions. Avoid the default Compute Engine SA.
  • CMEK: Customer-Managed Encryption Keys via Cloud KMS. BigQuery / Cloud Storage / Spanner / Dataflow / Bigtable all support CMEK. Required for compliance regimes.
  • VPC Service Controls: perimeter around the data services to prevent exfiltration via leaked credentials. Mandatory for HIPAA / PCI lakes.
  • Data Catalog: central metadata catalog. Tag templates for sensitivity classification (PII / PHI / Financial). Powers search + lineage.
  • Cloud DLP: scans for PII / PHI / financial data; can de-identify (mask, redact, tokenize) or hash. Run on bronze data before promoting to silver.
Concrete example

Healthcare data platform: VPC Service Controls perimeter around the BigQuery + GCS + Dataflow project. Cloud DLP scans bronze landing zone on ingest — detects SSN + DOB + names; de-identifies via format-preserving tokenization before silver promotion. Data Catalog tag template "PHI" applied to silver tables; column-level access policies in BigQuery restrict raw PHI fields to specific analyst group. CMEK on every BQ dataset + GCS bucket. SA-per-pipeline for Dataflow / Composer with least-privilege roles.

Key takeaway: dataset-level IAM, SA per pipeline, CMEK + VPC-SC for compliance, Data Catalog for classification, Cloud DLP for de-identification at ingest. Governance is day-one work, not a phase-2 retrofit.
⚡ Mini-quiz
Drill governance + IAM scenarios → study mode (10 questions).
Module 023 lessons
BigQuery: Analytics at Scale
Deep dive into BigQuery — the most heavily tested service on the PDE exam. Master partitioning (ingestion time vs column-based), clustering, slot reservations, BI Engine for sub-second queries, materialized views, column-level and row-level security, external tables, and BigQuery ML.
partitioning clustering slot reservations BI Engine materialized views column security BigQuery ML
📖 Read in-depth chapter
Lesson 2.1 Partitioning + clustering — query-cost optimisation

BigQuery bills by data scanned. The single biggest cost lever is partitioning + clustering. Get this wrong and query bills 100×.

Key concepts
  • Partition types: Ingestion-time (auto-populated by load time), Time-unit column (date / timestamp column you specify), Integer-range (numeric range buckets). Time-unit column is the modern default.
  • Partition pruning: queries that filter on the partition column scan ONLY relevant partitions. WHERE order_date BETWEEN '2026-05-01' AND '2026-05-15' reads ~15 partitions instead of all 2000+ days.
  • Require partition filter: table option that forces every query to include a partition filter. Prevents accidental full-table scans on huge tables.
  • Clustering: sorts data within each partition by 1-4 columns. Queries filtering on cluster columns prune blocks even more. Stacks with partitioning.
  • Cluster column order matters: most-filtered column first. Cardinality matters less than filter-pattern frequency.
  • Cluster expiration: partition expiration deletes old partitions automatically — combine with date partitioning to enforce data-retention policies.
Concrete example

A 50TB orders table queried mostly by date range + customer + store. Design: partition on order_date (daily partitions). Cluster on customer_id, store_id. Require partition filter. Typical query SELECT ... WHERE order_date BETWEEN '2026-05-01' AND '2026-05-07' AND customer_id = 12345 scans ~7 partitions × a few blocks = ~100 MB instead of 50 TB. 99.99% cost reduction.

Key takeaway: partition on the filter column (typically date). Cluster on the next 1-4 columns by filter frequency. Require partition filter on large tables. Combined: 100× cost reduction.
⚡ Mini-quiz
Drill BQ partitioning + clustering → study mode (10 questions).
Lesson 2.2 Slots, reservations, Editions, BI Engine

BigQuery's pricing comes in two flavours — on-demand (per-TB scanned) and Editions (capacity-based with slots). PDE tests when to pick each + how to combine them.

Key concepts
  • Slot: a unit of CPU + memory + IO. Parallel query execution scales with available slots. On-demand has implicit dynamic slot pool. Editions reserves explicit slots.
  • Editions tiers: Standard (basic), Enterprise (multi-statement transactions, customer-managed encryption, BI Engine), Enterprise Plus (cross-region replication, fine-grained access).
  • Reservations: commit to N slots in a region for 1-month, 1-year, or 3-year. Assign to projects via assignments. Predictable cost for steady ETL workloads.
  • Idle slots + autoscaling: reservations can autoscale up to a max + share idle slots across reservations in the same admin project.
  • BI Engine: in-memory query acceleration. Sub-second dashboards. Configured as a separate reservation. Looker / Looker Studio / Tableau benefit.
  • Materialised views: precomputed aggregations refreshed incrementally. BQ rewrites matching queries to use the MV. Cuts cost for common dashboard queries.
Concrete example

A retailer's predictable nightly ETL + spiky daytime ad-hoc: cover ETL with Enterprise Edition reservation, 500 slots committed 1-year. Daytime ad-hoc analysts use on-demand billing (separate project). Looker dashboards on the materialised views accelerated by a BI Engine reservation for sub-second response. Common 30-day aggregation queries replaced with materialised views — additional cost cut.

Key takeaway: Editions reservations for predictable load. On-demand for spiky / ad-hoc. BI Engine for dashboards. Materialised views for common aggregations. Mix as appropriate.
⚡ Mini-quiz
Practise slots + reservations scenarios → quick quiz (5 questions).
Lesson 2.3 External tables, row + column security, BigQuery ML

BQ's advanced features. External tables for federated query; row + column security for governance; BigQuery ML for in-DB modelling.

Key concepts
  • External tables: BQ tables that read data from Cloud Storage / Cloud SQL / Bigtable / Google Sheets without copying. Use for federated query + data-lake patterns.
  • BigLake: external tables with BigQuery's IAM + row/column security applied at the lake layer. Supports Parquet/Avro/ORC/JSON on GCS.
  • Row-level security (RLS): per-table access policies that filter rows by a condition (e.g., region = SESSION_USER_REGION()). Applied transparently to every query.
  • Column-level access policies: tag columns with policy tags from Data Catalog. Role grants reference the tag — granting "PII reader" exposes the column to that user.
  • Dynamic data masking: column-level masking rules (e.g., hash, last-4-digits) applied per role. User without unmask permission sees the masked value.
  • BigQuery ML: train ML models with SQL. CREATE MODEL ... OPTIONS(model_type='LINEAR_REG') AS SELECT .... Supports LINEAR_REG, LOGISTIC_REG, KMEANS, BOOSTED_TREE, ARIMA_PLUS, DNN, AutoML. Replace simple ML for analysts with just SQL.
Concrete example

A multinational has region-specific data-residency rules. External BigLake table over GCS lake. Row-level security policy: rows filtered by user's region claim. Column-level access: SSN column tagged "PII", only HR role can see it. Dynamic masking on email column shows masked email to analysts. Data scientists use BigQuery ML to train a churn-prediction LOGISTIC_REG model on the silver table — no Vertex AI dependency for simple cases.

Key takeaway: external tables + BigLake for in-place federation. RLS + column tags + dynamic masking for governance. BQ ML for SQL-native models on simple problems.
⚡ Mini-quiz
Drill external + security + ML scenarios → study mode (10 questions).
Module 033 lessons
Dataflow & Apache Beam: Streaming and Batch Pipelines
Build and optimize data pipelines with Apache Beam on Dataflow. Understand windowing strategies (tumbling, sliding, session), watermarks, triggers for late data, side inputs for enrichment, stateful processing with DoFns, and the performance advantages of Dataflow Streaming Engine.
Apache Beam windowing watermarks side inputs stateful DoFns Streaming Engine Flex Templates
📖 Read in-depth chapter
Lesson 3.1 Apache Beam — PCollection, PTransform, runners

Apache Beam is the unified batch + streaming SDK; Dataflow is GCP's managed runner. Same pipeline code runs both modes.

Key concepts
  • PCollection: distributed dataset (immutable, parallel-processable). Bounded for batch; unbounded for streaming.
  • PTransform: processing step — ParDo (per-element function), GroupByKey, Combine, Flatten, Window. Compose into DAGs.
  • Pipeline: the DAG of PTransforms. Built once, runs anywhere — Direct Runner (local), Dataflow (managed), Flink, Spark.
  • I/O connectors: built-in for GCS, Pub/Sub, BigQuery, Bigtable, Spanner, Kafka, JDBC. Source + sink primitives.
  • Dataflow runner advantages: autoscaling (workers added/removed based on backlog), data sampling for debugging, GCP-native I/O.
  • Templates vs Flex Templates: Templates = pre-baked pipeline image, parameters at run-time. Flex Templates = build pipeline into Docker image + push to Artifact Registry — supports parameters that change pipeline structure.
Concrete example

Streaming pipeline: read from Pub/Sub topic orders-stream → ParDo to parse JSON → GroupByKey on customerId → ParDo to compute customer running totals → write to BigQuery streaming inserts. Same code with --streaming=False + bounded GCS source can run as a batch backfill against historical data. Flex Template packages it as parameterised Docker image for ops + analyst self-service runs.

Key takeaway: Beam = SDK, Dataflow = runner. PCollection + PTransform = the building blocks. Flex Templates for parameterised reusable pipelines.
⚡ Mini-quiz
Drill Beam + Dataflow basics → study mode (10 questions).
Lesson 3.2 Windowing, watermarks, triggers

Streaming = unbounded data. Windows turn unbounded into bounded chunks for aggregation. Watermarks + triggers handle late data. PDE tests this on every streaming question.

Key concepts
  • Fixed (tumbling) windows: non-overlapping fixed-size intervals — every 5 minutes. Best for "5-minute counts".
  • Sliding windows: overlapping fixed-size intervals — 5-min window every 1 min. Smooth moving averages.
  • Session windows: gap-based — close window when no activity for N minutes. Perfect for user-session analytics.
  • Global window: single window for all elements. Requires triggers to emit (otherwise never closes for unbounded data).
  • Watermarks: Dataflow's estimate of "all events with timestamp < W have arrived". Watermark advances as input streams progress. Used to close windows + emit results.
  • Triggers: control when window results emit. Default = at watermark close. Custom = early (every N elements) + late (each late element).
  • Allowed lateness: how long after watermark close to keep accepting late events. Pair with late-firing triggers to update aggregates.
Concrete example

E-commerce real-time conversion-rate dashboard: sliding window 30 minutes wide, sliding every 5 minutes. Watermark tracks event-time progress from Pub/Sub timestamps. Allowed lateness 10 minutes with late-firing trigger — dashboard updates retroactively when delayed mobile events arrive. After 10 min late, drop the event (alert if rate exceeds threshold).

Key takeaway: fixed for periodic counts, sliding for moving averages, session for user behaviour. Watermarks + triggers + allowed lateness handle out-of-order events.
⚡ Mini-quiz
Practise windowing + watermark scenarios → quick quiz (5 questions).
Lesson 3.3 Side inputs, stateful DoFns, Streaming Engine

Advanced Beam: enrich streams with reference data (side inputs), maintain per-key state (stateful DoFns), and use Streaming Engine to offload state from workers.

Key concepts
  • Side input: auxiliary PCollection broadcast to a ParDo for enrichment lookups. E.g., enrich clickstream events with a user-profile lookup table.
  • Slowly-changing side inputs: refresh side input on schedule (e.g., reload product catalog every hour) using GenerateSequence as a trigger.
  • Stateful DoFns: per-key state stored across element invocations. Common pattern: count events per key within a window, emit when threshold hit.
  • Timers: register a future event-time or processing-time callback. Used with state for windowed deduplication + custom triggers.
  • Streaming Engine: Dataflow option that offloads state + windowing from workers to a Dataflow-managed backend. Faster autoscaling (no state to move), lower worker memory needs.
  • Dataflow Prime: the newer execution model — automatic vertical autoscaling, right-fitting workers per stage, integrated with Streaming Engine.
Concrete example

Fraud-detection streaming pipeline: read transactions from Pub/Sub; side input = customer risk-tier lookup table refreshed hourly. ParDo enriches each transaction with risk tier. Stateful DoFn keyed by customerId tracks the last 10 transactions in state. Detect rapid spend increase → emit alert. Streaming Engine ON + Dataflow Prime for elastic autoscaling. Output to BigQuery streaming inserts + Pub/Sub alert topic.

Key takeaway: side inputs for enrichment lookups. Stateful DoFns + timers for per-key state. Streaming Engine + Dataflow Prime for elastic autoscaling.
⚡ Mini-quiz
Drill stateful Beam scenarios → study mode (10 questions).
Module 043 lessons
Pub/Sub, Dataproc & Data Fusion: Messaging and ETL
Master event streaming with Pub/Sub (ordering keys, dead-letter topics, exactly-once delivery, Pub/Sub Lite), big data processing with Dataproc (ephemeral clusters, autoscaling, Serverless Spark, Metastore), and code-free ETL with Data Fusion Wrangler. Learn when each tool excels.
Pub/Sub ordering dead-letter exactly-once Dataproc ephemeral Serverless Spark Data Fusion
📖 Read in-depth chapter
Lesson 4.1 Pub/Sub deep dive — ordering, exactly-once, DLT

Pub/Sub is the canonical PDE messaging primitive. Beyond fan-out (covered in PCA Lesson 2.2), this lesson covers the subtle delivery guarantees + Pub/Sub Lite.

Key concepts
  • Ordering keys: messages with the same key are delivered in order WITHIN a subscription. Throughput per key is sequential — choose keys with enough diversity to parallelise.
  • Exactly-once delivery: subscription option that guarantees each message delivered at most once within an acknowledgement deadline window. Subscriber must acknowledge inside the deadline; otherwise redelivery occurs.
  • Dead Letter Topic (DLT): attach to a subscription with max-delivery-attempts. After N failed deliveries, message routes to DLT for forensic + replay.
  • Acknowledgement deadline: default 10s, max 600s. Extend dynamically via modifyAckDeadline when processing takes longer than expected.
  • Pub/Sub Lite: regional, single-zone, partition-based (Kafka-like). Cheaper than Pub/Sub. Less durable. Use only for non-critical / cost-sensitive workloads.
  • Subscription types: Pull (consumer requests), Push (Pub/Sub POSTs to HTTP endpoint), BigQuery subscription (writes directly to BQ, no intermediate Dataflow), Cloud Storage subscription (writes to GCS).
Concrete example

Order events: Pub/Sub topic orders.placed with ordering key = customerId (per-customer sequential). Subscription with exactly-once delivery + DLT after 5 failed attempts. Cloud Run consumer extends ack deadline if processing > 10s. BigQuery subscription on a separate "raw orders" subscription writes directly into BQ without Dataflow — cheapest path for simple persistence.

Key takeaway: ordering keys for per-key sequential. Exactly-once when duplicates hurt. DLT always. BigQuery + GCS subscriptions for simple direct-write patterns.
⚡ Mini-quiz
Drill Pub/Sub delivery scenarios → study mode (10 questions).
Lesson 4.2 Dataproc — ephemeral clusters + Serverless Spark

Dataproc is managed Hadoop / Spark. PDE tests the ephemeral cluster pattern + the newer Serverless Spark option.

Key concepts
  • Ephemeral clusters: spin up a cluster for one job, shut down on completion. Avoids paying for idle clusters. The PDE-favoured pattern.
  • Dataproc on GKE: run Spark / Flink on a shared GKE cluster instead of dedicated VMs. Better resource sharing across jobs.
  • Dataproc Serverless: no cluster — submit a Spark batch job, pay per execution, no infrastructure management. The modern preferred pattern for one-off jobs.
  • Persistent state via Metastore: Dataproc Metastore is a managed Hive metastore. Ephemeral clusters / Serverless jobs can read/write tables defined there.
  • Autoscaling: Dataproc clusters can scale worker count based on YARN pending containers. Combine with secondary worker pool (preemptible).
  • Workflow templates: declarative job DAG + cluster spec. gcloud dataproc workflow-templates instantiate spins cluster → runs jobs → terminates cluster. The IaC pattern for Dataproc.
Concrete example

Nightly Spark ETL: define a workflow template — cluster spec (10 primary + 20 secondary preemptible workers, autoscaling policy), job (Spark JAR + arguments). Cloud Composer triggers instantiate at 02:00 UTC. Cluster spins up, runs the job, shuts down. Total cost: ~2 hours of cluster time per night vs 24h for a persistent cluster. Dataproc Metastore holds the Hive table catalog accessed by every job.

Key takeaway: ephemeral clusters or Serverless Spark for cost. Metastore for shared catalog. Workflow templates for declarative orchestration. Avoid persistent always-on clusters.
⚡ Mini-quiz
Practise Dataproc scenarios → quick quiz (5 questions).
Lesson 4.3 Data Fusion + Datastream — low-code ETL + CDC

For teams without strong Spark / Beam skills, Data Fusion provides visual ETL. Datastream replicates change events from databases.

Key concepts
  • Data Fusion: GUI-based ETL on top of CDAP. Drag-and-drop pipelines. Generates Dataflow / Spark / MapReduce under the hood. Use for analyst-built or low-code pipelines.
  • Wrangler: Data Fusion's data-preparation interactive tool. Apply transformations on a sample; generates the pipeline transforms automatically.
  • Datastream: Change Data Capture from MySQL / PostgreSQL / Oracle / SQL Server / AlloyDB into BigQuery / Cloud Storage / Spanner. Low-latency, schema-aware.
  • Datastream → BigQuery direct mode: writes CDC changes directly into BQ with upserts via the BigQuery Storage Write API. Near-real-time replication.
  • Use cases: migrate from on-prem MySQL to Cloud SQL with continuous replication, or stream operational data into BigQuery for analytics with sub-minute lag.
  • Schema drift: Datastream detects source schema changes + propagates to BigQuery. Configurable handling of drop / add / type changes.
Concrete example

A company runs operational MySQL on-prem. Need real-time analytics in BigQuery without an ETL team. Design: Datastream connects to MySQL via private connectivity (Cloud VPN). Captures all changes via MySQL binlog. Writes to BigQuery direct mode with upserts. Sub-minute lag from OLTP to BQ. Schema drift handled — new columns auto-added to BQ. Analysts use Looker on the BQ tables, see operational data updated nearly real-time.

Key takeaway: Data Fusion for visual / low-code ETL. Datastream for DB → BQ CDC replication. Sub-minute lag operational data in analytics warehouse.
⚡ Mini-quiz
Drill Data Fusion + Datastream → study mode (10 questions).
Module 053 lessons
Storage Systems: Bigtable, Spanner, and Data Lakes
Design optimal storage schemas for different data patterns. Master Bigtable row-key design to prevent hot spots (reverse timestamps, salting, field promotion), Cloud Spanner interleaved tables and commit timestamps, Cloud Storage lifecycle policies for cost optimization, and Bigtable replication for read scaling and DR.
Bigtable row key hot spots Spanner interleaving commit timestamps lifecycle policies replication
📖 Read in-depth chapter
Lesson 5.1 Bigtable — row key design + hot-spot avoidance

Bigtable's performance depends entirely on row-key design. PDE tests this more than any other Bigtable topic.

Key concepts
  • Lexicographic key order: Bigtable sorts rows by key. Range scans use the order. Sequential keys = hot tablet.
  • Hot-spot causes: monotonically increasing prefix (timestamp first), single common prefix (all rows start with same value), low-cardinality leading field.
  • Reverse-timestamp: for "most recent first" access, prepend reverse-timestamp (MAX_TIME - actual_time). Distributes writes across tablets + makes most-recent reads efficient.
  • Field promotion: move high-cardinality field to leading position. Instead of timestamp#userId, use userId#timestamp — writes spread by user.
  • Salting: prepend hash of key (mod N). Distributes writes uniformly. Cost: scans must hit all N salt prefixes — only useful for write-heavy workloads with random access.
  • Tablet splits: Bigtable auto-splits hot tablets if access is even. If access is skewed (one tablet hot), splits don't help — fix the key design.
Concrete example

IoT telemetry: 1M devices, each writes once per second. Bad key: timestamp#deviceId — every write at time T hits the same tablet (timestamp prefix is shared). Hot tablet. Good key: deviceId#reverseTimestamp — writes spread across deviceIds; per-device range scans (most-recent-first) efficient. Hot device? Add salt: {hash(deviceId)%16}#deviceId#reverseTimestamp — distributes hot device across 16 tablets.

Key takeaway: high-cardinality field first, reverse-timestamp for "most recent" patterns, salting only for extreme write hot keys. Tablet splits won't fix bad keys.
⚡ Mini-quiz
Drill Bigtable row-key scenarios → study mode (10 questions).
Lesson 5.2 Cloud Spanner — schema, interleaving, commit timestamps

Cloud Spanner is the global ACID OLTP. PDE tests interleaved tables + commit timestamps + read-only timestamps.

Key concepts
  • Interleaved tables: physically co-locate child rows with their parent row in the same split. INTERLEAVE IN PARENT in CREATE TABLE. Used for parent-child relationships (Order → OrderLineItems).
  • Why interleave: parent + children read in one IO. Joins on parent key become local. Beats traditional foreign-key join for performance.
  • Commit timestamps: system-managed timestamp set at commit. Use OPTIONS (allow_commit_timestamp=true). Monotonically increasing across all replicas — perfect for CDC + audit logs.
  • Stale reads: read with READ_TIMESTAMP or EXACT_STALENESS to get a read-only snapshot. Avoids contention with writers. Use for analytics queries against the OLTP DB.
  • Primary key design: avoid monotonically increasing PK (UUIDs work, sequential IDs cause hot splits). Bit-reverse sequential IDs if forced into them.
  • Spanner Editions: Standard, Enterprise (multi-region, CMEK), Enterprise Plus (cross-region replication tiers, leader options).
Concrete example

E-commerce platform: Orders table with PK (customer_id, order_id). OrderLineItems interleaved INTERLEAVE IN PARENT Orders ON DELETE CASCADE. Reading order + lines = one IO. Commit timestamp column on every table for CDC streaming via Datastream → BQ. Analytics queries use stale reads with 10-second staleness — non-blocking for OLTP writers.

Key takeaway: interleave parent-child for locality. Commit timestamps for CDC. Stale reads for analytics-on-OLTP. Avoid monotonic PKs.
⚡ Mini-quiz
Practise Spanner schema scenarios → quick quiz (5 questions).
Lesson 5.3 Cloud Storage lifecycle + Bigtable replication

Two cost / availability features. GCS lifecycle policies for tiering; Bigtable replication for read scaling + DR.

Key concepts
  • GCS lifecycle policies: declarative rules — age-based transitions (Standard → Nearline → Coldline → Archive), age-based deletion, noncurrent-version deletion (with versioning).
  • Lifecycle on versioned buckets: rules can target current vs noncurrent versions independently. Common: keep 3 most-recent noncurrent versions, archive past N days.
  • Object lock + retention policies: immutability (WORM) — bucket-level retention policy that locks down objects for N years. Compliance.
  • Bigtable replication: add more cluster locations to an instance for read scaling + DR. App profile chooses single-cluster (consistent) or multi-cluster (eventual consistency for reads).
  • App profiles: control routing — single-cluster routing pins all reads to one cluster (predictable). Multi-cluster fans out for HA.
  • Failover: automatic if multi-cluster app profile; manual cluster switch if single-cluster. Replication lag observable via metrics.
Concrete example

A bank's transaction archive (regulated 7-year retention) lives in GCS. Lifecycle policy: transition Standard → Nearline at 30d → Coldline at 90d → Archive at 365d. Object retention 7 years (compliance). Bigtable for real-time transaction analytics: 2 clusters (us-east1 + us-west1). App profile = multi-cluster (HA), readers use either cluster. On cluster failure, traffic auto-routes to the other.

Key takeaway: GCS lifecycle policies tier old data + delete versions. Object retention for WORM compliance. Bigtable multi-cluster replication for read scale + DR.
⚡ Mini-quiz
Drill GCS + Bigtable replication → study mode (10 questions).
Module 063 lessons
Vertex AI & BigQuery ML: Machine Learning Pipelines
Build production ML systems on GCP. Learn BigQuery ML model types (LINEAR_REG, LOGISTIC_REG, KMEANS, BOOSTED_TREE_CLASSIFIER, ARIMA_PLUS), Vertex AI AutoML vs custom training, Feature Store for feature sharing and online serving, Vertex AI Pipelines for MLOps, and Model Registry for versioning and deployment.
BigQuery ML Vertex AI AutoML Feature Store Vertex AI Pipelines Model Registry MLOps
📖 Read in-depth chapter
Lesson 6.1 BigQuery ML — train + predict in SQL

BQ ML lets data analysts build ML without leaving SQL. PDE expects you to know the model types + when BQ ML fits vs when Vertex AI is needed.

Key concepts
  • Model types: LINEAR_REG (regression), LOGISTIC_REG (classification), KMEANS (clustering), MATRIX_FACTORIZATION (recsys), BOOSTED_TREE (XGBoost-like), DNN, ARIMA_PLUS (time-series forecasting), AUTOML_CLASSIFIER / REGRESSOR (delegates to Vertex AutoML).
  • Training: CREATE MODEL ... OPTIONS(model_type='...') AS SELECT ... FROM training_table. Single SQL statement.
  • Evaluation: ML.EVALUATE(MODEL ..., (SELECT ...)) returns metrics (accuracy, AUC, log loss). Built-in test/train split via OPTIONS.
  • Prediction: ML.PREDICT(MODEL ..., (SELECT ...)). Batch prediction inline.
  • Hyperparameter tuning: ML.NUM_TRIALS + ML.HPARAM_TUNING_OBJECTIVES. Built-in Vizier-based search.
  • BQ ML vs Vertex AI: BQ ML for SQL analysts + simple models + data already in BQ. Vertex AI for production custom models, online serving, more frameworks, MLOps pipelines.
  • Export to Vertex: BQ ML models can be exported to Vertex AI Model Registry for online endpoint serving.
Concrete example

A marketing analyst wants to predict customer churn. Data already in BigQuery. CREATE MODEL marketing.churn_model OPTIONS(model_type='LOGISTIC_REG', input_label_cols=['churned']) AS SELECT customer_features, churned FROM customer_history. ML.EVALUATE shows AUC 0.87. ML.PREDICT against the current customer table emits churn probabilities — marketing team uses the probabilities for targeted campaigns. No data engineering or Python required.

Key takeaway: BQ ML for SQL-native model building when data lives in BQ + analysts own the workflow. Vertex AI when you need custom code, online serving, or MLOps.
⚡ Mini-quiz
Drill BQ ML model-type scenarios → study mode (10 questions).
Lesson 6.2 Vertex AI — AutoML vs custom training

Vertex AI is GCP's full ML platform. PDE tests AutoML vs custom training + the model deployment options.

Key concepts
  • Vertex AI AutoML: upload labelled data → Google trains a model + finds best architecture. Tabular, Image, Video, Text, Forecasting. Hands-free.
  • Custom training: bring your own training code (Python, TF/PyTorch/scikit-learn/XGBoost). Submit to Vertex training service; specify machine type + GPU + region.
  • Training service: serverless training — pay per training hour. Supports distributed training (parameter server, all-reduce).
  • Pre-built containers: Vertex provides containers for major frameworks. Custom containers for niche stacks.
  • Hyperparameter tuning: Vertex Vizier-based HPT — submit a config + budget; Vertex runs the search.
  • Online vs batch prediction: Online = HTTP endpoint, low-latency, autoscaling. Batch = bulk predictions against a file. Trade-off cost vs latency.
  • Model Registry: versioned model storage + lineage. Deploy specific versions to endpoints. Roll back by deploying previous version.
Concrete example

Product-image classification: small dataset, no ML team. → Vertex AI AutoML Image — upload labelled images, AutoML trains, returns model. For recommendation system with custom embedding model: custom training in PyTorch, GPU machine type, hyperparameter tuning. Model registered in Vertex Model Registry, deployed to online endpoint at v1.0. New v1.1 deployed to 10% traffic split for canary; promote to 100% after validation.

Key takeaway: AutoML for no-code / fast results. Custom training for control. Vertex Model Registry + online endpoint for production serving.
⚡ Mini-quiz
Practise AutoML vs custom training → quick quiz (5 questions).
Lesson 6.3 Feature Store + Vertex AI Pipelines — MLOps

Production ML requires consistent features + reproducible training. Feature Store and Vertex AI Pipelines are the MLOps primitives PDE expects.

Key concepts
  • Vertex Feature Store: central feature repository. Online (low-latency for serving) + offline (BigQuery-backed for training). Solves training-serving skew — both reads the same feature.
  • Feature ingestion: batch from BigQuery / Dataflow, or streaming from Pub/Sub. Features versioned with timestamps for point-in-time correctness.
  • Vertex AI Pipelines: Kubeflow-based or TFX-based pipeline runner. Define training workflow as DAG; Vertex executes serverlessly.
  • Pipeline components: reusable steps — data prep, training, evaluation, deployment. Build once, compose into many pipelines.
  • Continuous training (CT): trigger pipelines on schedule or data drift. Retrain + evaluate + deploy automatically if metrics improve.
  • Model Monitoring: Vertex monitors deployed models for prediction drift + skew + outliers. Alerts on degradation.
  • Vertex AI Workbench: managed Jupyter for data scientists. Integrated with BQ + GCS + Vertex services.
Concrete example

Fraud detection MLOps pipeline: Feature Store with online + offline. Streaming ingestion from Pub/Sub keeps online features current. Vertex AI Pipeline triggered weekly: pull training data from Feature Store offline → train new model → evaluate against holdout → if AUC improves, deploy to canary endpoint at 10% traffic. Model Monitoring flags feature drift → triggers retraining outside the weekly schedule. Vertex Workbench for data scientist exploration.

Key takeaway: Feature Store eliminates training/serving skew. Vertex AI Pipelines for reproducible workflows. Model Monitoring closes the loop with drift detection.
⚡ Mini-quiz
Drill MLOps + Feature Store scenarios → study mode (10 questions).
Module 073 lessons
Governance, Quality & Operations: Dataplex, Composer, DLP
Automate and govern your data platform. Learn Cloud Composer (managed Airflow) for DAG orchestration with XCom, pools, and secrets; Datastream for CDC replication from MySQL/PostgreSQL/Oracle to BigQuery; Dataplex for data governance (lakes, zones, data quality tasks); Cloud Data Catalog tag templates; and Cloud DLP for PII de-identification.
Cloud Composer Airflow DAGs Datastream CDC Dataplex Data Catalog Cloud DLP de-identification
📖 Read in-depth chapter
Lesson 7.1 Cloud Composer — Airflow DAG orchestration

Cloud Composer is managed Airflow. PDE tests DAG design + operators + secret management.

Key concepts
  • DAG (Directed Acyclic Graph): Airflow's pipeline primitive. Python file defining tasks + dependencies. Loaded by scheduler + executed by workers.
  • Operators: task building blocks. Standard: BashOperator, PythonOperator. GCP-specific: BigQueryInsertJobOperator, DataflowStartFlexTemplateOperator, GCSToBigQueryOperator, BigQueryCheckOperator.
  • XCom: cross-task communication. Small payload (e.g., row count from BQ → branch decision). Avoid passing huge datasets via XCom — use GCS pointers instead.
  • Pools + concurrency: limit how many tasks of a type run simultaneously. Prevents overloading downstream (e.g., max 5 concurrent BQ jobs).
  • Secret backend: Composer integrates with Secret Manager — Airflow connections + variables stored as secrets, never in code or env vars.
  • Composer 2: GKE Autopilot-based, autoscaling environment. Composer 1 is legacy (manual GKE node management).
  • Sensors + deferrable operators: sensors wait for external events (file lands, query completes). Deferrable variants release worker slot while waiting — scales to thousands of concurrent waits.
Concrete example

Nightly ETL DAG: GCSObjectExistenceSensor (deferrable) waits for landing-zone file → DataflowStartFlexTemplateOperator runs transformation pipeline → BigQueryInsertJobOperator appends to silver → BigQueryCheckOperator validates row count → BashOperator sends Slack notification. Connection strings + API keys stored in Secret Manager + referenced via Airflow variable. Pool bq_pool limits concurrent BQ jobs to 5.

Key takeaway: Composer 2 + DAG with GCP operators + Secret Manager backend + deferrable sensors. Pools limit concurrency on downstream services.
⚡ Mini-quiz
Drill Composer DAG scenarios → study mode (10 questions).
Lesson 7.2 Dataplex — data governance + quality

Dataplex is the data fabric layer that unifies governance across BQ + GCS + Bigtable. PDE tests its lake/zone model + data quality tasks.

Key concepts
  • Lake / Zone / Asset model: Lake = logical domain (e.g., "Retail"). Zones inside lakes — Raw + Curated (analogous to bronze + silver). Assets = the actual storage (GCS buckets, BQ datasets).
  • Auto-discovery: Dataplex scans assets, infers schemas, registers them as BigQuery external tables for querying. Schema drift detected + propagated.
  • Data quality tasks: declarative rules (not null, range, regex, uniqueness, custom SQL) run on schedule. Failures emit Cloud Logging events + can fail downstream pipelines.
  • Data lineage: integrated with Vertex AI + Dataflow + Composer — Dataplex tracks where data came from + where it went. Auditable for compliance.
  • Data profiling: automated statistical profiles of tables — null rates, distinct counts, histograms. Surfaces data quality issues passively.
  • Data Catalog integration: Dataplex pushes metadata to Data Catalog. Tag templates classify (PII, sensitivity). Search across the org for "all tables with column tagged SSN".
Concrete example

Enterprise data platform: Dataplex Lake "Retail" with Raw + Curated zones. GCS buckets + BigQuery datasets registered as assets. Auto-discovery publishes external BQ tables for every GCS Parquet partition. Data quality tasks: orders table must have not-null order_id, customer_id non-empty, amount > 0. Failures alert the data team via Cloud Logging. Data Catalog tag "PII" applied to customer_email column. Lineage shows: raw GCS → silver Parquet → BQ table → Vertex AI training dataset.

Key takeaway: Dataplex lake / zone / asset model. Auto-discovery + DQ tasks + lineage + Data Catalog integration. The unified governance layer.
⚡ Mini-quiz
Practise Dataplex scenarios → quick quiz (5 questions).
Lesson 7.3 Cloud DLP + de-identification

Sensitive-data discovery + de-identification is mandatory for regulated workloads. Cloud DLP scans + transforms PII inline or via Dataflow.

Key concepts
  • InfoType detectors: 150+ built-in detectors (US_SSN, EMAIL, PHONE, CREDIT_CARD, etc.) + custom regex / dictionary detectors.
  • Inspection: scan content for InfoTypes. Returns findings with confidence + offsets. Run on a BQ table / GCS bucket / Datastore kind.
  • De-identification transformations: mask (replace with *), redact (remove), hash, replace-with-info-type, format-preserving encryption (FPE), date shifting, generalisation, bucketing.
  • Crypto deterministic vs format-preserving: deterministic = same plaintext → same ciphertext (good for joins on de-identified columns). FPE preserves format (SSN-shape ciphertext, useful when downstream system expects format).
  • Re-identification: calculate re-identification risk (k-anonymity / l-diversity) of a de-identified dataset. Tells you if combinations of quasi-identifiers can re-identify individuals.
  • DLP in pipelines: integrate via Dataflow templates — every record passes through DLP de-identification on ingest. Or run async via storage job for batch.
Concrete example

Healthcare data ingestion: bronze landing zone GCS bucket → Dataflow pipeline with Cloud DLP de-identification. InfoTypes detected: US_SSN, PERSON_NAME, EMAIL_ADDRESS, DATE_OF_BIRTH. Transformations: SSN → format-preserving encrypted, name → hashed, email → masked (first 2 chars + domain), DOB → year only. De-identified output written to silver zone. Re-identification risk assessment confirms k-anonymity ≥ 5 across quasi-identifiers. Analysts query silver freely; HR re-identifies via FPE decryption key in Secret Manager when needed.

Key takeaway: DLP InfoType detectors + de-identification transformations. FPE for preserve-format. Re-identification risk for compliance proof. Pipeline-integrated for streaming PII protection.
⚡ Mini-quiz
Drill DLP scenarios → study mode (10 questions).
Test your knowledge as you study 60 scenario-based questions covering all 5 PDE domains. Instant explanations for every answer.
Take the Quiz Podcast

Key Concepts to Master

Concept 1

Bigtable vs BigQuery: The Exam Trap

The exam loves asking which storage system to use. The deciding factor: Bigtable for high-throughput, low-latency reads/writes of time-series or IoT data at millions of QPS. BigQuery for analytical queries over terabytes with SQL. If the scenario says "real-time sensor data at <10ms latency," it's Bigtable. If it says "analyze 3 years of sales data," it's BigQuery.

Concept 2

Dataflow Windowing: Which Window When?

Three window types tested heavily: Fixed (tumbling) — non-overlapping equal-size windows, e.g., aggregate sales per hour. Sliding — overlapping windows for moving averages, e.g., 1-hour window every 5 minutes. Session — gap-based, variable duration, ideal for user activity sessions. Late data is handled with allowedLateness and trigger strategies.

Concept 3

BigQuery ML: Choosing the Right Model Type

The PDE exam tests CREATE MODEL type selection: LINEAR_REG for numeric predictions (price forecasting). LOGISTIC_REG for binary/multi-class classification (churn prediction). KMEANS for unsupervised clustering (customer segmentation). BOOSTED_TREE_CLASSIFIER for high-accuracy tabular classification. ARIMA_PLUS for time-series forecasting with trend/seasonality decomposition.

6-Week Study Plan

Week 1
Foundations & BigQuery Core Complete Module 1 and Module 2. Hands-on: create partitioned and clustered tables, run EXPLAIN on large queries, experiment with BI Engine reservation. Take the first 20 practice questions.
Week 2
Dataflow & Apache Beam Complete Module 3. Build a streaming Dataflow pipeline with fixed and session windows. Practice watermark configuration and late-data triggers. Focus on Dataflow vs Dataproc decision scenarios.
Week 3
Pub/Sub, Dataproc & Data Fusion Complete Module 4. Set up a Pub/Sub topic with ordering keys and dead-letter queue. Spin up a Dataproc ephemeral cluster for a Spark job and configure autoscaling. Try a Data Fusion Wrangler transformation.
Week 4
Storage Systems: Bigtable & Spanner Complete Module 5. Design 3 different Bigtable row-key schemas for different access patterns and identify which would cause hot spots. Review Cloud Spanner interleaving documentation. Practice Cloud Storage lifecycle rule configuration.
Week 5
Vertex AI & Machine Learning Complete Module 6. Create a BigQuery ML model with CREATE MODEL and evaluate it with ML.EVALUATE. Explore Vertex AI Feature Store and understand the difference between online and offline serving. Study MLOps pipeline patterns.
Week 6
Governance, Operations & Full Practice Complete Module 7. Configure a Cloud Composer DAG with dependencies and retry logic. Set up a Dataplex lake and add a data quality task. Run Cloud DLP on a sample dataset. Take the full 60-question practice test and review all wrong answers.

Top 4 Mistakes on the PDE Exam

Confusing Bigtable row-key anti-patterns Sequential timestamps as the sole row key create a hot spot — all new writes go to the same tablet. The fix is field promotion (tenant+timestamp), reverse timestamp, or hash salting. The exam presents multiple row-key designs and asks which avoids hot spots.
Choosing Dataproc when Dataflow is correct If the question describes a new pipeline with no existing Spark/Hadoop code, choose Dataflow (Apache Beam). Dataproc is for lifting-and-shifting existing Spark/Hadoop workloads or when you need full Spark ecosystem access. "Unified batch and streaming" is a Dataflow keyword.
Misunderstanding BigQuery partition pruning Partition pruning only works when you filter on the partition column in the WHERE clause. If you wrap the partition column in a function (DATE(timestamp_col)), pruning is disabled. The exam tests this with cost optimization scenarios.
Mixing up Cloud DLP de-identification methods The exam distinguishes between: masking (replace with character, e.g., "***"), tokenization/pseudonymization (replace with surrogate, reversible), bucketing (generalize numbers into ranges), and date shifting (shift dates randomly while preserving duration). Each has different reversibility and utility tradeoffs.

GCP PDE vs GCP PCA — What's the Difference?

PDE — Data Engineer

  • BigQuery, Dataflow, Pub/Sub in depth
  • Bigtable schema and hot-spot design
  • Vertex AI, BigQuery ML, Feature Store
  • Cloud Composer Airflow DAGs
  • Dataplex governance & data quality
  • Cloud DLP de-identification
  • Datastream CDC pipelines
  • Focus: data layer and ML pipelines

PCA — Cloud Architect

  • Shared VPC, Cloud Spanner multi-region
  • Anthos multi-cloud & GKE architecture
  • VPC Service Controls, Binary Authorization
  • Cloud Deploy, Config Sync GitOps
  • SLO/SLI engineering, burn rate alerting
  • Chaos engineering & DR strategies
  • Workload Identity, CMEK, Cloud HSM
  • Focus: infrastructure and platform design
Start practicing →