Google Cloud — Professional Level

GCP Professional Data Engineer

Master the data engineering stack on Google Cloud: BigQuery analytics, Dataflow streaming pipelines, Pub/Sub event buses, Bigtable time-series storage, Vertex AI MLOps, and Cloud Composer orchestration. 60 scenario-based practice questions.

GCP PDE · 7 Modules · ~40 hours · Advanced · 60 practice questions · Updated 2026
Exam Details

Exam: Professional Data Engineer (exam guide updated 2023)
Questions: 50–60 multiple-choice and multiple-select
Duration: 2 hours
Passing Score: ~70% (Google does not publish the exact score)
Price: $200 USD
Recertification: Every 2 years
Recommended Experience: 3+ years industry experience, including 1+ year on GCP
Prerequisites: None official (GCP ACE or PCA recommended)

Exam Domain Weights

Domain 1 — Designing Data Processing Systems ~22%
Domain 2 — Ingesting and Processing Data ~25%
Domain 3 — Storing the Data ~20%
Domain 4 — Preparing and Using Data for Analysis ~15%
Domain 5 — Maintaining and Automating Workloads ~18%

Course Modules

Module 01
Data Engineering Foundations & Service Selection
Understand the GCP data engineering landscape and master the critical skill of selecting the right service for each workload. Learn when to use BigQuery vs Bigtable vs Spanner vs Cloud SQL, and how to design a modern data lake architecture on Cloud Storage.
service selection data lake Cloud Storage BigQuery Bigtable Spanner
Module 02
BigQuery: Analytics at Scale
Deep dive into BigQuery — the most heavily tested service on the PDE exam. Master partitioning (ingestion time vs column-based), clustering, slot reservations, BI Engine for sub-second queries, materialized views, column-level and row-level security, external tables, and BigQuery ML.
partitioning clustering slot reservations BI Engine materialized views column security BigQuery ML
Module 03
Dataflow & Apache Beam: Streaming and Batch Pipelines
Build and optimize data pipelines with Apache Beam on Dataflow. Understand windowing strategies (tumbling, sliding, session), watermarks, triggers for late data, side inputs for enrichment, stateful processing with DoFns, and the performance advantages of Dataflow Streaming Engine.
Apache Beam windowing watermarks side inputs stateful DoFns Streaming Engine Flex Templates
Module 04
Pub/Sub, Dataproc & Data Fusion: Messaging and ETL
Master event streaming with Pub/Sub (ordering keys, dead-letter topics, exactly-once delivery, Pub/Sub Lite), big data processing with Dataproc (ephemeral clusters, autoscaling, Serverless Spark, Metastore), and code-free ETL with Data Fusion Wrangler. Learn when each tool excels.
Pub/Sub ordering dead-letter exactly-once Dataproc ephemeral Serverless Spark Data Fusion
Module 05
Storage Systems: Bigtable, Spanner, and Data Lakes
Design optimal storage schemas for different data patterns. Master Bigtable row-key design to prevent hot spots (reverse timestamps, salting, field promotion), Cloud Spanner interleaved tables and commit timestamps, Cloud Storage lifecycle policies for cost optimization, and Bigtable replication for read scaling and DR.
Bigtable row key hot spots Spanner interleaving commit timestamps lifecycle policies replication
Module 06
Vertex AI & BigQuery ML: Machine Learning Pipelines
Build production ML systems on GCP. Learn BigQuery ML model types (LINEAR_REG, LOGISTIC_REG, KMEANS, BOOSTED_TREE_CLASSIFIER, ARIMA_PLUS), Vertex AI AutoML vs custom training, Feature Store for feature sharing and online serving, Vertex AI Pipelines for MLOps, and Model Registry for versioning and deployment.
BigQuery ML Vertex AI AutoML Feature Store Vertex AI Pipelines Model Registry MLOps
Module 07
Governance, Quality & Operations: Dataplex, Composer, DLP
Automate and govern your data platform. Learn Cloud Composer (managed Airflow) for DAG orchestration with XCom, pools, and secrets; Datastream for CDC replication from MySQL/PostgreSQL/Oracle to BigQuery; Dataplex for data governance (lakes, zones, data quality tasks); Cloud Data Catalog tag templates; and Cloud DLP for PII de-identification.
Cloud Composer Airflow DAGs Datastream CDC Dataplex Data Catalog Cloud DLP de-identification
Test your knowledge as you study: 60 scenario-based questions covering all 5 PDE domains, with instant explanations for every answer.

Key Concepts to Master

Concept 1

Bigtable vs BigQuery: The Exam Trap

The exam loves asking which storage system to use. The deciding factor: Bigtable for high-throughput, low-latency reads/writes of time-series or IoT data at millions of QPS. BigQuery for analytical queries over terabytes with SQL. If the scenario says "real-time sensor data at <10ms latency," it's Bigtable. If it says "analyze 3 years of sales data," it's BigQuery.

Concept 2

Dataflow Windowing: Which Window When?

Three window types tested heavily: Fixed (tumbling) — non-overlapping equal-size windows, e.g., aggregate sales per hour. Sliding — overlapping windows for moving averages, e.g., 1-hour window every 5 minutes. Session — gap-based, variable duration, ideal for user activity sessions. Late data is handled with allowedLateness and trigger strategies.
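The three window types can be sketched in plain Python (deliberately not using Apache Beam, so there is no dependency) to show how each one assigns an event timestamp to windows; the sizes and gaps below are illustrative values, not Beam defaults:

```python
# Illustrative sketch of Beam-style window assignment (stdlib only).
# Timestamps are seconds since epoch; window sizes/gaps are example values.

def fixed_windows(ts, size=3600):
    """Fixed (tumbling): each event lands in exactly one non-overlapping window."""
    start = ts - (ts % size)
    return [(start, start + size)]

def sliding_windows(ts, size=3600, period=300):
    """Sliding: overlapping windows; each event belongs to size/period windows."""
    last_start = ts - (ts % period)
    starts = range(last_start - size + period, last_start + period, period)
    return [(s, s + size) for s in starts if s <= ts < s + size]

def session_windows(timestamps, gap=1800):
    """Session: merge events separated by less than the gap into one window,
    then extend each window by the gap past its last event (as Beam does)."""
    windows = []
    for ts in sorted(timestamps):
        if windows and ts - windows[-1][1] < gap:
            windows[-1] = (windows[-1][0], ts)
        else:
            windows.append((ts, ts))
    return [(start, end + gap) for start, end in windows]
```

A 1-hour sliding window emitted every 5 minutes means each event appears in 12 overlapping windows, which is why sliding windows multiply output volume relative to fixed windows.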

Concept 3

BigQuery ML: Choosing the Right Model Type

The PDE exam tests CREATE MODEL type selection: LINEAR_REG for numeric predictions (price forecasting). LOGISTIC_REG for binary/multi-class classification (churn prediction). KMEANS for unsupervised clustering (customer segmentation). BOOSTED_TREE_CLASSIFIER for high-accuracy tabular classification. ARIMA_PLUS for time-series forecasting with trend/seasonality decomposition.
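A minimal sketch of the CREATE MODEL statements this tests, with the SQL built as Python strings; the `demo` dataset, table, and column names are hypothetical, and ARIMA_PLUS is omitted because it uses `time_series_*` options rather than a label column:

```python
# Hedged sketch: build BigQuery ML CREATE MODEL statements.
# Dataset, table, and column names below are hypothetical.

def create_model_sql(model_name, model_type, source_table, label_col=None):
    options = [f"model_type='{model_type}'"]
    if label_col:  # supervised models need a label; KMEANS is unsupervised
        options.append(f"input_label_cols=['{label_col}']")
    return (
        f"CREATE OR REPLACE MODEL `{model_name}`\n"
        f"OPTIONS({', '.join(options)}) AS\n"
        f"SELECT * FROM `{source_table}`"
    )

# Churn prediction -> binary classification -> LOGISTIC_REG
churn = create_model_sql("demo.churn_model", "LOGISTIC_REG",
                         "demo.customers", label_col="churned")

# Customer segmentation -> unsupervised clustering -> KMEANS (no label)
segments = create_model_sql("demo.segments", "KMEANS", "demo.customers")
```

After training, `ML.EVALUATE` returns model metrics and `ML.PREDICT` scores new rows, both as ordinary SQL queries.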

6-Week Study Plan

Week 1
Foundations & BigQuery Core Complete Module 1 and Module 2. Hands-on: create partitioned and clustered tables, review the execution details and dry-run cost estimates of large queries, and experiment with a BI Engine reservation. Take the first 20 practice questions.
Week 2
Dataflow & Apache Beam Complete Module 3. Build a streaming Dataflow pipeline with fixed and session windows. Practice watermark configuration and late-data triggers. Focus on Dataflow vs Dataproc decision scenarios.
Week 3
Pub/Sub, Dataproc & Data Fusion Complete Module 4. Set up a Pub/Sub topic with ordering keys and dead-letter queue. Spin up a Dataproc ephemeral cluster for a Spark job and configure autoscaling. Try a Data Fusion Wrangler transformation.
Week 4
Storage Systems: Bigtable & Spanner Complete Module 5. Design 3 different Bigtable row-key schemas for different access patterns and identify which would cause hot spots. Review Cloud Spanner interleaving documentation. Practice Cloud Storage lifecycle rule configuration.
Week 5
Vertex AI & Machine Learning Complete Module 6. Create a BigQuery ML model with CREATE MODEL and evaluate it with ML.EVALUATE. Explore Vertex AI Feature Store and understand the difference between online and offline serving. Study MLOps pipeline patterns.
Week 6
Governance, Operations & Full Practice Complete Module 7. Configure a Cloud Composer DAG with dependencies and retry logic. Set up a Dataplex lake and add a data quality task. Run Cloud DLP on a sample dataset. Take the full 60-question practice test and review all wrong answers.

Top 4 Mistakes on the PDE Exam

Confusing Bigtable row-key anti-patterns Sequential timestamps as the sole row key create a hot spot — all new writes go to the same tablet. The fix is field promotion (tenant+timestamp), reverse timestamp, or hash salting. The exam presents multiple row-key designs and asks which avoids hot spots.
Choosing Dataproc when Dataflow is correct If the question describes a new pipeline with no existing Spark/Hadoop code, choose Dataflow (Apache Beam). Dataproc is for lifting-and-shifting existing Spark/Hadoop workloads or when you need full Spark ecosystem access. "Unified batch and streaming" is a Dataflow keyword.
Misunderstanding BigQuery partition pruning Partition pruning only works when you filter on the partition column in the WHERE clause. If you wrap the partition column in a function (DATE(timestamp_col)), pruning is disabled. The exam tests this with cost optimization scenarios.
Mixing up Cloud DLP de-identification methods The exam distinguishes between: masking (replace with character, e.g., "***"), tokenization/pseudonymization (replace with surrogate, reversible), bucketing (generalize numbers into ranges), and date shifting (shift dates randomly while preserving duration). Each has different reversibility and utility tradeoffs.
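The three row-key fixes from the first mistake above can be sketched as key-builder functions; the sensor and tenant identifiers are hypothetical, and the keys are plain strings for illustration:

```python
# Hedged sketch of the three Bigtable hot-spot fixes for row-key design.
# sensor/tenant names are hypothetical; keys are illustrative strings.
import hashlib

MAX_TS = 10**13  # arbitrary ceiling for the reverse-timestamp arithmetic

def promoted_key(tenant, ts):
    """Field promotion: lead with a high-cardinality field, not the timestamp,
    so concurrent writers land on different tablets."""
    return f"{tenant}#{ts}"

def reversed_key(sensor_id, ts):
    """Reverse timestamp: newer events sort lexicographically first,
    making 'latest N readings per sensor' a cheap prefix scan."""
    return f"{sensor_id}#{MAX_TS - ts}"

def salted_key(sensor_id, ts, buckets=4):
    """Hash salting: a stable hash prefix spreads sequential writes across
    tablets, at the cost of needing `buckets` parallel scans on read."""
    salt = int(hashlib.md5(sensor_id.encode()).hexdigest(), 16) % buckets
    return f"{salt}#{sensor_id}#{ts}"
```

A bare `f"{ts}"` key is the anti-pattern: every new write has the latest timestamp, so all writes hit the tablet holding the end of the keyspace.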

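The partition-pruning pitfall can be shown as two versions of the same cost filter, held here as Python string constants; the table and column names are hypothetical:

```python
# Hedged sketch: the same filter written two ways. Per the pitfall above,
# BigQuery can prune partitions only when the partition column appears
# bare in the WHERE clause. Table/column names are hypothetical.

PRUNED = """
SELECT SUM(amount)
FROM `demo.sales`              -- partitioned on sale_date (DATE column)
WHERE sale_date BETWEEN '2026-01-01' AND '2026-01-31'
"""

NOT_PRUNED = """
SELECT SUM(amount)
FROM `demo.sales_ts`           -- partitioned on ts (TIMESTAMP column)
WHERE DATE(ts) = '2026-01-15'  -- function wraps the partition column
"""
```

For timestamp-partitioned tables, the safe pattern is a half-open range on the bare column, e.g. `ts >= TIMESTAMP('2026-01-15') AND ts < TIMESTAMP('2026-01-16')`.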
GCP PDE vs GCP PCA — What's the Difference?

PDE — Data Engineer

  • BigQuery, Dataflow, Pub/Sub in depth
  • Bigtable schema and hot-spot design
  • Vertex AI, BigQuery ML, Feature Store
  • Cloud Composer Airflow DAGs
  • Dataplex governance & data quality
  • Cloud DLP de-identification
  • Datastream CDC pipelines
  • Focus: data layer and ML pipelines

PCA — Cloud Architect

  • Shared VPC, Cloud Spanner multi-region
  • Anthos multi-cloud & GKE architecture
  • VPC Service Controls, Binary Authorization
  • Cloud Deploy, Config Sync GitOps
  • SLO/SLI engineering, burn rate alerting
  • Chaos engineering & DR strategies
  • Workload Identity, CMEK, Cloud HSM
  • Focus: infrastructure and platform design

Ready to Practice?

60 scenario-based questions covering all 5 PDE exam domains. Immediate feedback with detailed explanations. No signup, no paywall.

Start the Quiz — Free · Listen on Spotify
GCP PDE exam tips on the CertQuests podcast →