Google Cloud Professional Cloud Architect (PCA)

Master Google Cloud architecture for the PCA exam: Shared VPC, Cloud Spanner global consistency, Bigtable time-series design, Anthos multi-cloud, VPC Service Controls data perimeters, Binary Authorization supply chain security, Workload Identity, SLO/SLI engineering, multi-window burn rate alerting, Cloud Deploy with approval gates, and chaos engineering for resilience validation. Covers all 6 PCA exam domains.

Modules: 7
Duration: 40 hours
Level: Advanced
Exam facts
Exam code: Professional Cloud Architect (GCP PCA)
Full name: Google Cloud Certified – Professional Cloud Architect
Questions: 50–60 (mix of multiple-choice and case-study questions)
Passing score: not published by Google; roughly 70% is a common unofficial estimate
Duration: 120 minutes
Price: $200 USD
Prerequisites: none required; recommended 3+ years of industry experience, including 1+ year of GCP hands-on (GCP ACE recommended)
Renewal: recertify every 2 years via a 2-hour recertification exam

Exam domain weights

Domain 1 — Designing and Planning a Cloud Solution Architecture 24%
Domain 2 — Managing and Provisioning Solution Infrastructure 15%
Domain 3 — Designing for Security and Compliance 18%
Domain 4 — Analyzing and Optimizing Technical and Business Processes 18%
Domain 5 — Managing Implementation 11%
Domain 6 — Ensuring Solution and Operations Reliability 14%

Course modules

Module 1 · 3 lessons
Google Cloud Architecture Fundamentals

Build the mental model for GCP's global infrastructure before diving into services. Understand regions, zones, points of presence, and how Google's private backbone differs from the public internet. Learn the Resource Hierarchy (organization → folders → projects → resources) and how IAM policies are inherited. Master the shared responsibility model for GCP and the Well-Architected Framework's six pillars: operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability. Grasp network fundamentals: how a GCP VPC is global (not regional as in AWS), and how default routes and VPC flows differ from traditional networks.

  • Resource Hierarchy (Org → Folder → Project)
  • IAM policy inheritance and evaluation
  • GCP global vs regional services
  • VPC as a global construct
  • Shared VPC vs VPC Peering trade-offs
  • Google's private backbone (Premium vs Standard tier)
  • Compute options: Compute Engine, GKE, Cloud Run, App Engine
Lesson 1.1 Resource Hierarchy and IAM inheritance

GCP's resource hierarchy is the single biggest mental shift coming from AWS. Organization → Folder → Project → Resource — and IAM policies flow downward. Get the topology right on day one or you'll fight inheritance bugs forever.

Key concepts
  • Organization: top-level container tied to a Cloud Identity / Google Workspace domain. One per company. Owns all folders / projects / billing.
  • Folder: grouping for projects + sub-folders (up to 10 levels deep). Used for OUs / business units. Inherit IAM + Org Policies from parent.
  • Project: the resource container — every GCP resource lives in exactly one project. Has a unique project ID (immutable) and project number (immutable). Billing scoped per project.
  • IAM inheritance: additive. A role granted at the org level applies to every project beneath it. A more specific policy does NOT override an inherited one: you cannot revoke an inherited allow grant at a lower level (only a Deny policy can block it).
  • Deny Policies: separate from IAM Allow policies. Deny rules take precedence. Use sparingly — primary access control is via not-granting at the lower tier.
  • Org Policies: constraints enforced top-down (e.g., compute.vmExternalIpAccess: deny). Different from IAM. Override at child levels only if the parent allows.
Concrete example

A 200-employee SaaS organises GCP as: Org → folders prod + non-prod + sandbox → ~80 projects under each. IAM: org-level Viewer for the security team (cascades to everything). Prod folder: only specific roles. Sandbox: free-for-all developer + Viewer at the folder so devs can audit each others' work. Org Policy compute.vmExternalIpAccess: deny at org level — no project can grant a VM a public IP without explicit folder-level exception.
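
The additive, top-down evaluation described above can be sketched as a small model. This is an illustrative simplification, not Google's evaluator: the `Node` class, the node names, and the role-level `deny` map are all hypothetical (real Deny policies target permissions, not roles).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One node in the hierarchy: organization, folder, or project."""
    name: str
    parent: Optional["Node"] = None
    # principal -> roles granted directly on this node (allow policy)
    allow: dict[str, set[str]] = field(default_factory=dict)
    # principal -> roles blocked on this node (simplified deny policy)
    deny: dict[str, set[str]] = field(default_factory=dict)


def effective_roles(node: Node, principal: str) -> set[str]:
    """Union of allow grants from the node up to the root, minus any denies."""
    granted: set[str] = set()
    denied: set[str] = set()
    current: Optional[Node] = node
    while current is not None:
        granted |= current.allow.get(principal, set())
        denied |= current.deny.get(principal, set())
        current = current.parent
    return granted - denied


# Hypothetical topology mirroring the example: org -> prod / sandbox -> projects.
org = Node("org", allow={"security-team": {"roles/viewer"}})
prod = Node("prod", parent=org)
sandbox = Node("sandbox", parent=org, allow={"devs": {"roles/viewer", "roles/editor"}})
billing_api = Node("billing-api", parent=prod)
scratch = Node("scratch-42", parent=sandbox)
```

Calling `effective_roles(billing_api, "security-team")` yields the org-level Viewer grant even though nothing was granted on the project itself, which is the inheritance behaviour the lesson stresses.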

Key takeaway: Org → Folder → Project. IAM additive, top-down. Org Policies are separate and prescriptive. Pick the topology before granting anything.
Lesson 1.2 VPC as a global construct + network tiers

GCP VPC is GLOBAL — a single VPC spans every region. Subnets are regional. This is the opposite of AWS where VPCs are regional and subnets are zonal. The PCA exam tests this distinction relentlessly.

Key concepts
  • Global VPC: one VPC, many regions. Subnets are per-region inside the same VPC. Routes propagate across regions automatically.
  • Auto-mode vs custom-mode VPC: auto-mode auto-creates a /20 subnet in every region — convenient for demos, never for production. Custom-mode lets you carve CIDRs explicitly; production-only choice.
  • Shared VPC vs VPC Peering: Shared VPC = host project shares its VPC with multiple service projects — central network team owns; workload teams attach. VPC Peering = connect two VPCs (within or across orgs). Non-transitive like AWS.
  • Premium vs Standard network tier: Premium routes traffic over Google's global backbone, exits at the nearest edge to the destination. Standard hands traffic to the public internet at the exit region — cheaper but slower / less reliable.
  • Default firewall behaviour: implicit allow egress, implicit deny ingress. Explicit allow rules required for inbound. Tag-based + service-account-based firewall rules.
  • VPC Flow Logs: capture flow records per subnet. Filter expressions to control cost. Visible in Logs Explorer + exportable to BigQuery.
Concrete example

A global SaaS deploys to 5 regions. Single Shared VPC in the network host project, custom-mode. Five subnets, one per region, with non-overlapping CIDRs. Workload projects attach as service projects of the Shared VPC. Premium network tier for the customer-facing Global LB (traffic enters at the edge nearest the user and traverses Google's backbone to the nearest GCE backend). Cost-tolerant, internet-facing admin endpoints use Standard tier; network tiers apply only to traffic with external IPs, so VM-to-VM traffic inside the VPC stays on Google's network either way.
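
A quick way to sanity-check a "one subnet per region, non-overlapping CIDRs" plan is to test every pair of ranges before creating anything. The sketch below uses only Python's standard `ipaddress` module; the region list and CIDR values are illustrative assumptions, not taken from the course.

```python
import ipaddress
from itertools import combinations

# Hypothetical regional subnets for one custom-mode VPC (ranges are illustrative).
SUBNETS = {
    "us-central1": "10.10.0.0/20",
    "us-east1": "10.10.16.0/20",
    "europe-west1": "10.10.32.0/20",
    "asia-southeast1": "10.10.48.0/20",
    "australia-southeast1": "10.10.64.0/20",
}


def find_overlaps(subnets: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of regions whose CIDR ranges overlap."""
    parsed = {region: ipaddress.ip_network(cidr) for region, cidr in subnets.items()}
    return [
        (a, b)
        for a, b in combinations(parsed, 2)
        if parsed[a].overlaps(parsed[b])
    ]
```

An empty result from `find_overlaps(SUBNETS)` means the plan is safe to hand to whatever tool provisions the subnets.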

Key takeaway: VPC is global, subnets are regional. Use Shared VPC when a central network team owns the network and workload teams attach to it. Premium tier for customer-facing traffic; Standard for cost-tolerant external traffic.
Lesson 1.3 Compute family selection — GCE, GKE, Cloud Run, App Engine

Four compute primitives at different abstraction levels. PCA expects you to climb the ladder — pick the most-abstracted option that fits the workload.

Key concepts
  • Compute Engine (GCE): raw VMs. Full OS control. Use when you need a specific kernel, GPUs, custom drivers, or licence-tied software.
  • Managed Instance Groups (MIG): GCP's equivalent of AWS Auto Scaling Groups. Regional MIGs distribute instances across 3 zones. The autoscaler responds to CPU / LB capacity / custom metrics.
  • GKE Standard: managed Kubernetes — Google runs the control plane, you manage the worker nodes (or use Autopilot for fully-managed). Use when you need K8s primitives + helm + ecosystem tools.
  • GKE Autopilot: Google manages BOTH control plane AND nodes. Billed per pod request (CPU/memory). Right when you want K8s API without node operations.
  • Cloud Run: serverless containers. Scale 0-N on demand, billed per request + per CPU-second. HTTP / event-driven. Right for stateless web services, background workers via Pub/Sub.
  • Cloud Functions Gen 2: now built on Cloud Run under the hood. Trigger-driven (HTTP, Pub/Sub, Cloud Storage, Eventarc). Smaller unit than a Cloud Run service.
  • App Engine Standard: the legacy PaaS. Auto-scale, no VMs. Avoid for new builds (Cloud Run is the modern equivalent); maintain only if you're already on it.
Concrete example

A startup: marketing site = Cloud Run (stateless container, scale-to-zero for cost). Image-processing job = Cloud Functions Gen 2 (Cloud Storage trigger). Main app on Kubernetes = GKE Autopilot (no node-ops burden). Legacy Windows app needing specific build tools = Compute Engine VM with managed image lifecycle.
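
The "climb the ladder" rule can be written down as a tiny decision function. This is only a sketch of the lesson's heuristic; the `Workload` fields and the order of the checks are assumptions made for illustration, and real selection weighs more factors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    needs_os_access: bool = False       # specific kernel, drivers, licence-tied software
    needs_k8s_api: bool = False         # Kubernetes primitives, Helm, ecosystem tools
    event_driven_snippet: bool = False  # small trigger-driven function


def pick_compute(w: Workload) -> str:
    """Return the most abstracted option that still satisfies the workload."""
    if w.needs_os_access:
        return "Compute Engine"
    if w.needs_k8s_api:
        return "GKE Autopilot"
    if w.event_driven_snippet:
        return "Cloud Functions Gen 2"
    return "Cloud Run"
```

For the startup above, the marketing site maps to `pick_compute(Workload())` and the legacy Windows app to `pick_compute(Workload(needs_os_access=True))`.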

Key takeaway: Cloud Run for stateless containers. Cloud Functions Gen 2 for event-driven snippets. GKE Autopilot when K8s API matters. GCE only when you need OS access. App Engine is legacy.
Module 2 · 3 lessons
Designing Cloud Solutions — Compute, Storage, and Networking (Domain 1)

Domain 1 carries the highest exam weight (24%) and focuses on selecting the right service for complex architectural scenarios. Learn to choose between Cloud Spanner (global ACID transactions), Bigtable (high-throughput time-series), Firestore (mobile real-time sync), and Cloud SQL (relational, single-region). Design Pub/Sub fan-out patterns for decoupled event-driven architectures. Configure Global HTTPS Load Balancers with anycast TLS termination at Google's edge. Design ephemeral Dataproc clusters for batch ML workloads to minimize cost. Understand when Anthos multi-cloud is the answer for hybrid/multi-cloud Kubernetes management.

  • Cloud Spanner: multi-region, global consistency, write scaling
  • Bigtable row key design (device_id#reverse_timestamp; lead with the entity, not the timestamp, to avoid hot tablets)
  • Pub/Sub fan-out: one topic, multiple subscriptions
  • Global HTTPS LB: anycast, edge TLS termination
  • Dataproc ephemeral clusters via workflow templates
  • Anthos multi-cloud: unified control plane across GCP/AWS/on-prem
  • BigQuery partition + cluster optimization
  • Firestore Native mode: real-time listeners, offline SDK
  • Cloud Storage lifecycle tiering (Standard → Nearline → Archive)
Lesson 2.1 Picking the right GCP data store

GCP has more purpose-built databases than any other cloud. PCA tests the decision matrix obsessively — Spanner vs Bigtable vs Firestore vs Cloud SQL vs BigQuery, given access pattern + scale + consistency.

Key concepts
  • Cloud Spanner: globally-distributed relational with external strong consistency (TrueTime). Massive write scale, multi-region active-active. Expensive — only when you need global ACID writes.
  • Bigtable: wide-column NoSQL for high-throughput time-series / IoT / ad-tech. Single-digit-ms reads at million-QPS. Schema-less. Row key design is everything — bad keys = hot tablets.
  • Firestore: document NoSQL. Native mode (modern, real-time listeners, offline SDK, multi-region) or Datastore mode (legacy). Use for mobile / web app real-time sync.
  • Cloud SQL: managed MySQL / PostgreSQL / SQL Server. Single-region with HA via standby in another zone. Cross-region read replicas; not active-active. The lift-and-shift answer.
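  • AlloyDB: Google's PostgreSQL-compatible high-performance DB. Targets HTAP workloads that outgrow Cloud SQL; Google cites up to 4× faster transactional throughput and up to 100× faster analytical queries than standard PostgreSQL. Newer; appears in PCA scenarios with "transactional + analytical hybrid" framing.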
  • BigQuery: serverless analytics warehouse. Pay per query (on-demand) or reserved slots. NOT a transactional store; design dimension/fact tables, partitioning, clustering.
  • Cloud Storage: object storage. Classes: Standard, Nearline (≥30d), Coldline (≥90d), Archive (≥365d). Lifecycle policies move data; per-bucket multi-region option.
Concrete example

A FinTech: global trade ledger requiring strong consistency + multi-region writes → Cloud Spanner. Per-customer time-series transaction logs at millions of events/sec → Bigtable with row key customerId#reverseTimestamp. Mobile app real-time notifications → Firestore Native. Historical analytics → BigQuery with daily partitions clustered on customerId. Document attachments → Cloud Storage with lifecycle Nearline → Coldline → Archive at 30/90/365 days.
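
The `customerId#reverseTimestamp` row key in this example can be sketched as follows. The helper name and the choice of a 64-bit ceiling are assumptions for illustration; the point is that subtracting the timestamp from a fixed maximum makes the newest event sort first within each customer's key range.

```python
# Largest signed 64-bit integer; subtracting a timestamp from it reverses sort order.
MAX_TS_MILLIS = 2**63 - 1


def row_key(customer_id: str, event_millis: int) -> bytes:
    """Build a key that groups rows by customer and sorts newest events first."""
    reverse_ts = MAX_TS_MILLIS - event_millis
    # Zero-pad to 19 digits so lexicographic order matches numeric order.
    return f"{customer_id}#{reverse_ts:019d}".encode("utf-8")
```

Because the customer ID leads the key, writes spread across many key ranges instead of piling onto the tablet that holds "now", and a prefix scan on `customer_id#` returns that customer's most recent events first.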

Key takeaway: Spanner = global ACID. Bigtable = time-series throughput. Firestore = mobile real-time. Cloud SQL = lift-and-shift. AlloyDB = HTAP. BigQuery = analytics. Cloud Storage = objects.
Lesson 2.2 Pub/Sub + Eventarc — event-driven architecture

Pub/Sub is GCP's at-scale message bus. Eventarc bridges GCP services into Pub/Sub-style routing. PCA tests fan-out patterns and ordering guarantees.

Key concepts
  • Pub/Sub topics + subscriptions: producer publishes to topic, each subscription independently consumes. Multiple subscriptions on one topic = fan-out. Default at-least-once delivery.
  • Pull vs Push: Pull = consumer requests messages (default, more control over flow). Push = Pub/Sub HTTP-POSTs to your endpoint (Cloud Run / Cloud Functions sweet spot). Push has stricter latency requirements.
  • Ordering keys: messages with the same ordering key are delivered in order within a subscription. Lower throughput than no-order (per-key serialised). Use for sequential-by-user / sequential-by-device flows.
  • Dead Letter Topics: failed messages route to a DLT after configurable redelivery attempts. Mandatory for production — without it, poison messages spin forever.
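  • Message retention: subscriptions retain unacknowledged messages for 7 days by default (the maximum for a subscription). Topic-level retention is configurable up to 31 days. Together these determine the replay window.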
  • Eventarc: routes GCP service events (Cloud Audit Log entries, Cloud Storage events, Firestore changes) to Cloud Run / Cloud Functions / Workflows. Built on top of Pub/Sub. Replaces hand-rolled trigger pipelines.
  • Pub/Sub Lite: zonal or regional, capacity-provisioned, cheaper alternative to Pub/Sub. Lower SLA. Use only for non-critical or cost-sensitive workloads; most PCA answers prefer full Pub/Sub.
Concrete example

An e-commerce platform's order-placed event needs to fan out to (1) billing, (2) inventory, (3) notifications. Design: Pub/Sub topic orders.placed. Three subscriptions, one per service, each a push subscription delivering to that service's Cloud Run endpoint. Ordering keys = customerId so a single customer's events stay sequential. DLT after 5 failed deliveries. For real-time customer-status updates, Eventarc routes Firestore document changes to a Cloud Run notification service, with no Pub/Sub topic to manage.
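
The fan-out and dead-letter behaviour can be illustrated with an in-memory stand-in. This is a toy model, not the Pub/Sub client library: the `Topic` class and its methods are hypothetical, retries happen immediately rather than with backoff, and ordering keys are not modelled.

```python
from collections import defaultdict
from typing import Callable

MAX_DELIVERY_ATTEMPTS = 5  # matches the example's dead-letter threshold


class Topic:
    """In-memory stand-in for one topic with independent subscriptions."""

    def __init__(self) -> None:
        self.subscribers: dict[str, Callable[[dict], bool]] = {}
        self.dead_letters: dict[str, list[dict]] = defaultdict(list)

    def subscribe(self, name: str, handler: Callable[[dict], bool]) -> None:
        self.subscribers[name] = handler

    def publish(self, message: dict) -> None:
        # Fan-out: every subscription gets its own copy and its own retry budget.
        for name, handler in self.subscribers.items():
            for _attempt in range(MAX_DELIVERY_ATTEMPTS):
                if handler(dict(message)):  # True means the handler acked
                    break
            else:
                # No ack after the final attempt: park the message for inspection.
                self.dead_letters[name].append(message)
```

A handler that always returns `False` ends up with its message in `dead_letters` while the other subscriptions are unaffected, which is why a poison message in one consumer does not block the others.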

Key takeaway: Pub/Sub for at-scale fan-out. Ordering keys for per-key sequential semantics. DLT always. Eventarc for service-event-driven workflows without managing topics.
Lesson 2.3 Global Load Balancing + CDN

Google's network is its biggest moat. The Global HTTP(S) LB is the canonical PCA answer for customer-facing apps. Understand the edge anycast + premium tier interaction.

Key concepts
  • Global HTTP(S) LB: one anycast IP, terminated at Google's nearest edge. TLS terminated at edge; backend traffic over Google's private backbone. The default customer-facing choice.
  • Backend services: serve HTTP from regional MIGs, GKE, Cloud Run, Cloud Storage buckets (for static), Internet NEG (proxying to external endpoints). Health-checked + load-balanced.
  • Cloud CDN: attach to backend service for edge caching. Cacheable signals via Cache-Control headers. Signed URLs for paywall content.
  • URL maps + path rules: route by host header / path prefix to different backend services. Path-based microservice routing without per-service public IPs.
  • SSL policies: set minimum TLS version + cipher suite per LB. Mandatory for PCI / FedRAMP compliance.
  • Regional LBs: regional external HTTPS LB (Standard tier, cheaper). Internal HTTPS LB for east-west traffic inside the VPC. Use when global is overkill.
  • Cloud Armor: WAF + DDoS protection in front of Global LBs. OWASP Core Rule Set + custom rate-limit rules.
Concrete example

A SaaS API: Global HTTP(S) LB with anycast IP, edge TLS termination via Google-managed cert. URL map: /api/* → backend MIG in 3 regions; /static/* → Cloud Storage bucket with Cloud CDN enabled, 1-day TTL. Backend services have HTTP health checks. Cloud Armor with OWASP CRS + custom rate-limit (1000 req/min per IP). SSL policy enforces TLS 1.2 minimum.
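
Path-based routing in a URL map boils down to matching the request path against configured prefixes. The sketch below is a simplified model of that idea; the backend names and the default backend are hypothetical, and real URL maps also match on host and support more rule types.

```python
# Hypothetical path rules mirroring the example: prefix -> backend name.
PATH_RULES = {
    "/api/": "api-backend-mig",
    "/static/": "static-bucket-cdn",
}
DEFAULT_BACKEND = "default-backend"  # assumed catch-all for unmatched paths


def route(path: str) -> str:
    """Pick the backend whose prefix matches the path, preferring the longest prefix."""
    matches = [prefix for prefix in PATH_RULES if path.startswith(prefix)]
    if not matches:
        return DEFAULT_BACKEND
    return PATH_RULES[max(matches, key=len)]
```

With these rules, `/static/logo.png` lands on the CDN-backed bucket and `/api/v1/orders` on the MIG backend, all behind one anycast IP.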

Key takeaway: Global HTTPS LB for customer-facing with anycast + edge TLS. URL maps for path routing. Cloud CDN for cacheable. Cloud Armor for WAF + DDoS.
Module 3 · 3 lessons
Infrastructure Provisioning and Management (Domain 2)

Domain 2 covers the provisioning and management of GCP infrastructure. Master Terraform with GCS backend for state locking and version history — the standard IaC approach on GCP. Configure GKE Autopilot (fully managed node infrastructure, billed per pod request) versus GKE Standard with Cluster Autoscaler (GPU node pools with minimum=0 for zero-cost idle). Design MIG autoscaling with warmup periods so new instances are ready before traffic hits them. Implement GKE regional clusters with pod anti-affinity for zone-resilient deployments. Configure Cloud Monitoring SLO burn rate alerts for proactive error budget management.

  • Terraform GCS backend: object locking + versioning for state
  • GKE Autopilot: fully managed nodes, pod-request billing
  • GPU node pool with min=0 + taints/tolerations
  • MIG autoscaling: warmup period + backend service signal
  • Cloud Deployment Manager: GCP-native YAML/Jinja2 IaC
  • GKE regional clusters: node distribution across 3 zones
  • Spot VMs with checkpointing for long ML jobs
  • Cloud Monitoring SLO burn rate alerting
Lesson 3.1 Terraform + GCS backend — the GCP IaC standard

GCP's preferred IaC is Terraform with GCS-backed state. PCA expects you to know the GCS-backend mechanics + module structure + state-locking patterns.

Key concepts
  • GCS backend: Terraform state file stored in a Cloud Storage bucket with versioning enabled. Object locking via the GCS object generation provides built-in state locking — no separate DynamoDB-style table needed (unlike AWS Terraform).
  • Bucket versioning: mandatory for state buckets. Lets you recover from accidental state corruption / deletion.
  • Workspace pattern: separate state file per environment (dev / staging / prod) via different state prefix in the same bucket. Terraform workspace command supports this.
  • Module structure: typical GCP Terraform repo: root module per environment, child modules per component (network / IAM / GKE / Cloud SQL). Modules version-pinned via Git tags.
  • Service Account impersonation: avoid long-lived service account keys. Terraform uses your gcloud identity to impersonate a Terraform service account via gcloud auth application-default login + --impersonate-service-account.
  • Cloud Deployment Manager (DM): GCP's native IaC alternative. YAML/Jinja2 templates. Less popular than Terraform but appears on PCA — know it exists for the "which tool" question.
Concrete example

A team standardises on Terraform: GCS state bucket per environment with versioning ON and Object Lifecycle policy "non-current version → archive at 30d". Terraform code in terraform/ with modules network, iam, gke. CI runs Terraform via a service account impersonated from the CI's Workload Identity Federation token. State locks via GCS automatically — multiple CI runs serialize on the state object.
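
Why concurrent CI runs serialize can be shown with a create-only-if-absent write, the kind of conditional operation an object store precondition makes possible. This is a simplified in-memory model, not Terraform's actual backend code; the `Bucket` class, the lock object name, and `run_with_lock` are hypothetical.

```python
class LockHeld(Exception):
    """Raised when another run already holds the state lock."""


class Bucket:
    """In-memory stand-in for a bucket that supports create-only-if-absent writes."""

    def __init__(self) -> None:
        self.objects: dict[str, str] = {}

    def create_if_absent(self, name: str, body: str) -> None:
        if name in self.objects:
            raise LockHeld(f"{name} is held by {self.objects[name]}")
        self.objects[name] = body

    def delete(self, name: str) -> None:
        self.objects.pop(name, None)


def run_with_lock(bucket: Bucket, run_id: str, lock_name: str = "prod/default.tflock") -> str:
    """Take the lock, do the work, and always release the lock afterwards."""
    bucket.create_if_absent(lock_name, run_id)
    try:
        return f"{run_id} applied"  # placeholder for plan/apply
    finally:
        bucket.delete(lock_name)
```

A second run that arrives while the lock object exists gets `LockHeld` and must retry later, so two applies never write the same state file at once.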

Key takeaway: Terraform + GCS backend = canonical GCP IaC. Bucket versioning mandatory. Service account impersonation over keys. DM exists but is legacy.
Lesson 3.2 GKE Autopilot vs Standard — operational model

Two GKE flavours. PCA tests the operational trade-offs and which fits a given scenario.

Key concepts
  • GKE Standard: you manage worker nodes (node pools, sizing, OS upgrades, taints). Full control + full burden. Use when you need bespoke node configs, GPU pools with custom drivers, or single-tenant node isolation.
  • GKE Autopilot: Google manages everything — control plane + workers. Billed per pod resource request (vCPU + memory + ephemeral storage). No node-level access. Pods scheduled on Google-managed infrastructure.
  • When Autopilot fits: standard stateless workloads, when ops team is small, when you want least-burden K8s. PCA's default answer unless the scenario implies node-level control.
  • When Standard fits: GPU/TPU pools you scale to zero (Autopilot has GPU support but more limited), custom node images, specific kernel versions, single-tenant compliance requirements, very high pod density.
  • Cluster Autoscaler (Standard only): scales node pools based on pending pods. Combine with HPA (pod-level) and VPA (right-sizing).
  • Regional vs zonal cluster: regional clusters spread nodes (and control plane replicas) across 3 zones — 99.95% SLA. Zonal cluster is cheaper but single-zone failure kills it. Production always regional.
Concrete example

A team runs 80% standard stateless services + 20% GPU ML training. Design: two clusters — GKE Autopilot for the standard services (lower ops burden); GKE Standard with a GPU node pool (min=0, autoscaler scales up under load) for ML. Both regional in 3 zones. The cost-justified split — Autopilot covers the bulk; Standard handles the niche.

Key takeaway: Autopilot is the default. Standard for GPU pools / custom nodes / single-tenant compliance. Regional cluster always for production.
Lesson 3.3 MIG autoscaling + Cloud Monitoring SLOs

For VM-based workloads, Managed Instance Groups + Cloud Monitoring SLOs give you the production-grade scaling + alerting pattern.

Key concepts
  • MIG autoscaler signals: CPU utilisation, HTTP LB serving capacity, Pub/Sub queue depth, custom Cloud Monitoring metrics. Combine signals for richer scaling.
  • Warmup period: grace window after instance creation before counted in autoscaler decision. Lets the app finish initialising before traffic hits.
  • Predictive autoscaling: ML-driven based on historical patterns. Pre-scales before predicted demand — useful for daily/weekly cycles.
  • Instance template + rolling updates: immutable template versions. gcloud compute instance-groups managed rolling-action start-update rolls instances to a new template version with configurable --max-surge + --max-unavailable.
  • Cloud Monitoring SLOs: declarative SLO objects with SLI definition + window + target. Cloud Monitoring computes the rolling error budget automatically.
  • Burn rate alerts: multi-window burn rate is the Google SRE Workbook pattern. Alert at 14.4x burn over 1h (fast) or 6x burn over 6h (slow), each confirmed by a shorter window (5m and 30m respectively). Catches both fast-burning and slow-degrading SLOs.
Concrete example

A web tier: regional MIG with min=4 / max=20 instances. Autoscaler signal = HTTP LB serving capacity. Warmup period 90s for app boot. Predictive scaling ON (daily traffic pattern). Cloud Monitoring SLO: 99.9% of requests respond in < 500ms over 28 days. Multi-window burn rate alert: page on-call when the 1h burn rate exceeds 14.4x or the 6h burn rate exceeds 6x. SLO dashboard surfaces error budget for the eng manager weekly.
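
Burn rate is the observed error ratio divided by the error budget ratio (1 minus the SLO target), so a burn rate of 1 spends the budget exactly over the SLO window. The sketch below applies the SRE Workbook's paging thresholds (14.4x over 1h, 6x over 6h, each confirmed by a shorter window); the function names and the shape of the `windows` input are assumptions for illustration, not a Cloud Monitoring API.

```python
def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error budget ratio (1 - target)."""
    if total == 0:
        return 0.0
    return (bad / total) / (1.0 - slo_target)


def should_page(windows: dict[str, tuple[int, int]], slo_target: float = 0.999) -> bool:
    """windows maps '5m', '30m', '1h', '6h' to (bad_requests, total_requests)."""
    rate = {name: burn_rate(bad, total, slo_target) for name, (bad, total) in windows.items()}
    # The short window confirms the problem is still happening right now.
    fast = rate["1h"] > 14.4 and rate["5m"] > 14.4
    slow = rate["6h"] > 6.0 and rate["30m"] > 6.0
    return fast or slow
```

For a 99.9% target, 2% of requests failing gives a burn rate of about 20, which trips the fast condition; 0.7% failing over six hours gives about 7, which trips only the slow one.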

Key takeaway: MIG autoscaler + warmup + predictive for production VMs. Cloud Monitoring SLOs + multi-window burn rate for SRE-style alerting. Page on burn, not on individual errors.
Module 4 · 3 lessons
Security, Compliance, and Zero Trust (Domain 3)

Domain 3 (18%) focuses on Google's zero-trust security model. VPC Service Controls create a security perimeter around GCP services, so even valid credentials used from outside the perimeter cannot exfiltrate data. Workload Identity eliminates service account key files by binding Kubernetes service accounts to GCP service accounts. Binary Authorization ensures that only attested, scanned container images can be deployed to GKE production. Cloud Armor with OWASP WAF rules protects against L7 attacks. Organization Policy constraints (compute.vmExternalIpAccess, sql.restrictPublicIp) enforce security at the org level. Cloud IAP enables zero-trust access to internal web applications without VPN.

  • VPC Service Controls: perimeter, ingress/egress policies
  • Workload Identity: KSA annotation → GCP SA binding
  • Binary Authorization: attestors, KMS-signed attestations
  • Cloud Armor: OWASP WAF rules + adaptive DDoS protection
  • CMEK with Cloud HSM (FIPS 140-2 Level 3)
  • Organization Policy: vmExternalIpAccess deny-all
  • Cloud IAP: zero-trust access without VPN
  • Secret Manager: centralized secrets + automatic rotation
  • Access Transparency + Access Approval
  • Security Command Center Premium: SCC findings dashboard
Lesson 4.1 VPC Service Controls + Org Policy

VPC Service Controls are GCP's unique answer to data exfiltration — a perimeter that even valid credentials can't escape. Combined with Org Policy constraints, you get preventive controls at scale.

Key concepts
  • VPC Service Controls (VPC-SC): defines a security perimeter around GCP services (BigQuery, GCS, etc.). Calls to those services from inside the perimeter succeed; from outside fail — even with valid IAM. Prevents exfil-via-leaked-credentials.
  • Ingress / Egress policies: explicit rules permitting cross-perimeter calls. E.g., "allow service X from project Y to read BQ tables in this perimeter". Without explicit rules, the perimeter is hermetic.
  • Access Levels: conditions (IP / device / identity) that grant access to the perimeter. Layered on top of IAM.
  • Org Policy: SCP-equivalent. Declarative constraints enforced top-down. Common: compute.vmExternalIpAccess: deny, iam.disableServiceAccountKeyCreation: enforced, gcp.resourceLocations to restrict the regions where resources (including Cloud SQL instances) can be created.
  • Boolean vs list constraints: boolean = on/off, list = allow/deny list of values. Override at child scope only if parent's enforcement allows.
  • Dry-run perimeter: evaluate VPC-SC violations without enforcing. Use during rollout to find legitimate cross-perimeter calls before flipping to enforce.
Concrete example

A regulated workload's GCS bucket holds PHI. Wrap in VPC Service Controls perimeter with the workload project. Ingress policy allowing only the workload project's service account + only from corporate IPs (Access Level). Egress policy denying all outbound except to approved analytics project. Org Policy at folder level: storage.publicAccessPrevention: enforced, iam.disableServiceAccountKeyCreation: enforced. Even if PHI service account credentials leak, they can't read from outside the perimeter — VPC-SC blocks.
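
The key idea, that IAM and the perimeter are separate checks and both must pass, can be modelled in a few lines. This is a deliberately simplified sketch: the project ID, service account, and the boolean standing in for an Access Level are hypothetical, and real ingress rules are far richer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    principal: str
    source_project: str
    source_ip_trusted: bool  # stands in for an Access Level (corporate IP range)


PERIMETER_PROJECTS = {"phi-workload"}  # hypothetical project inside the perimeter
IAM_READERS = {"phi-sa@phi-workload.iam.gserviceaccount.com"}  # hypothetical reader


def can_read_bucket(req: Request) -> bool:
    """Allow only when IAM grants access AND the perimeter admits the caller."""
    iam_ok = req.principal in IAM_READERS
    inside_perimeter = req.source_project in PERIMETER_PROJECTS
    ingress_ok = req.source_ip_trusted  # simplified ingress rule
    return iam_ok and (inside_perimeter or ingress_ok)
```

A leaked credential replayed from an attacker's project passes the IAM check but fails the perimeter check, so the read is denied.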

Key takeaway: VPC-SC = perimeter that valid creds can't escape. Org Policy = top-down preventive constraints. Pair both for defence in depth. Dry-run before enforce.
Lesson 4.2 Workload Identity + Cloud IAP

Two zero-trust primitives. Workload Identity eliminates service-account keys for workloads; Cloud IAP eliminates VPN for users.

Key concepts
  • Workload Identity (GKE): binds Kubernetes service accounts (KSAs) to GCP service accounts (GSAs) via OIDC federation. Pods authenticate as GSAs without storing keys.
  • Workload Identity Federation (non-GCP): federate external workloads (GitHub Actions, AWS, Azure) to GCP without service-account keys. The external token gets exchanged for a short-lived GCP token.
  • Service-account-key bans: Org Policy iam.disableServiceAccountKeyCreation blocks new keys. Forces teams to migrate to Workload Identity.
  • Cloud IAP: identity-aware proxy in front of HTTP(S) backends. Authenticates the user (Google account / Workforce Identity Federation) + authorises via IAM "IAP-secured Web App User" role. No VPN needed.
  • IAP for TCP forwarding: SSH into private VMs through IAP — no public IP, no bastion host. Uses gcloud's --tunnel-through-iap flag.
  • IAP for on-prem: BeyondCorp-style remote access pattern. Replaces traditional corporate VPN for internal app access.
Concrete example

A GKE workload accesses GCS + BigQuery. Old way: service-account key file mounted as a secret. New way: enable Workload Identity on the cluster. Bind the KSA app-sa to GSA app@project.iam.gserviceaccount.com. Pods automatically authenticate as the GSA, with no keys. Internal admin dashboard previously accessed via VPN is now behind Cloud IAP on the Global LB; users sign in with their Google identity, IAP validates + injects identity header. SSH into private GCE VMs via IAP TCP forwarding: gcloud compute ssh --tunnel-through-iap.

Key takeaway: Workload Identity = zero service-account keys for GKE. Workload Identity Federation = same for non-GCP workloads. Cloud IAP = zero VPN for users. Combine with Org Policy to ban key creation.
Lesson 4.3 Binary Authorization + Secret Manager + KMS

Supply chain security (Binary Authorization), secrets (Secret Manager), and encryption keys (Cloud KMS / Cloud HSM) — the cryptographic foundations of GCP security.

Key concepts
  • Binary Authorization: GKE admission control that enforces image attestation policies — only images that pass scans + are signed by an attestor get deployed. Closes the supply-chain attack vector.
  • Attestors: identities (KMS-backed signing keys) that vouch for images. Common pipeline: image built → vulnerability scan → if clean, attestor signs → Binary Auth permits deploy.
  • Cloud KMS: managed key management. Software-protected (default) or HSM-protected keys. Symmetric / asymmetric / MAC. Per-region keys; multi-region for global services.
  • CMEK (Customer-Managed Encryption Keys): use your KMS key to encrypt service data (GCS, BigQuery, GCE disks, Cloud SQL, etc.). Required for compliance regimes mandating key custody.
  • Cloud HSM: FIPS 140-2 Level 3 validated HSMs, operated by Google and used through the Cloud KMS API. Use when compliance requires Level 3.
  • Cloud External Key Manager (Cloud EKM): use keys hosted in a third-party HSM (Equinix SmartKey, Fortanix). For organisations that require keys OUTSIDE Google entirely.
  • Secret Manager: centralised secrets with versioning + IAM-based access. Rotation schedules publish a Pub/Sub notification; a Cloud Function you write performs the actual rotation. Replaces secret-in-config-files.
Concrete example

A regulated GKE deployment: Cloud Build pipeline scans image with Artifact Registry vulnerability scanning. If criticals = 0, attestor signs via KMS key. Binary Authorization policy on GKE prod cluster requires the attestation — block any other deploy. Secrets pulled at startup from Secret Manager (DB password rotates monthly via Cloud Function). All cluster persistent volumes encrypted with CMEK backed by Cloud HSM for FIPS 140-2 Level 3 compliance.
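
The scan, attest, admit pipeline can be sketched with a keyed signature over the image digest. This is a toy model: an HMAC with a hard-coded key stands in for the attestor's Cloud KMS asymmetric signing key, and the function names are hypothetical rather than part of any Binary Authorization API.

```python
import hashlib
import hmac
from typing import Optional

ATTESTOR_KEY = b"demo-only-key"  # hypothetical; a real attestor signs with a Cloud KMS key


def attest(image_digest: str, critical_vulns: int) -> Optional[str]:
    """Sign the digest only when the scan reports no critical vulnerabilities."""
    if critical_vulns > 0:
        return None
    return hmac.new(ATTESTOR_KEY, image_digest.encode(), hashlib.sha256).hexdigest()


def admit(image_digest: str, attestation: Optional[str]) -> bool:
    """Admission check: allow the deploy only with a valid attestation for this digest."""
    if attestation is None:
        return False
    expected = hmac.new(ATTESTOR_KEY, image_digest.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, attestation)
```

Because the attestation is bound to the digest, it cannot be reused to admit a different image, which is why policies should reference images by digest rather than by mutable tag.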

Key takeaway: Binary Auth = supply-chain enforcement at admission. Secret Manager + automatic rotation = no secrets in config. CMEK + Cloud HSM = compliance-grade key custody. Cloud EKM when keys must live outside GCP.
Module 5 · 3 lessons
Performance Optimization and Cost Engineering (Domain 4)

Domain 4 (18%) covers optimizing both performance and costs across the GCP portfolio. BigQuery Editions introduce reservation-based pricing for predictable ETL workloads — combine with on-demand pricing for ad-hoc analytics. BigQuery BI Engine accelerates Looker dashboards to sub-second response via in-memory caching. VPA (Vertical Pod Autoscaler) automatically right-sizes pod CPU/memory requests to eliminate over-provisioning waste. Cloud Profiler continuously profiles production services at under 1% overhead. Cloud Trace provides distributed tracing waterfall views for diagnosing multi-service latency. Committed Use Discounts (CUDs) for stable VMs, Spot VMs for seasonal burst workloads.

  • BigQuery Editions: Standard/Enterprise reservations vs on-demand
  • BigQuery BI Engine: in-memory acceleration for Looker
  • VPA Auto mode: right-sizing CPU/memory requests
  • Committed Use Discounts: 1-year (37%) vs 3-year (57%)
  • Spot VMs for seasonal/interruptible workloads
  • Cloud Profiler: flame graphs at <1% overhead
  • Cloud Trace: distributed tracing waterfall view
  • Eventarc: unified event bus for GCP services
  • Looker vs Looker Studio: semantic layer vs per-dashboard metrics
  • GKE NetworkPolicy: PCI DSS pod segmentation
Lesson 5.1 Compute cost levers — CUDs, Spot, sustained-use

GCP's compute pricing has three main discount mechanisms. PCA expects you to know which fits which workload pattern.

Key concepts
  • Sustained Use Discounts (SUD): automatic discount up to 30% if a VM runs continuously through the month. No commitment. Applies retroactively at month-end.
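  • Committed Use Discounts (CUD): 1-year (~37%) or 3-year (~57%) commitments for resource-based CUDs, which are tied to a specific machine family and region. Spend-based (flexible) CUDs apply across eligible machine families and regions but carry a smaller discount (roughly 28% for 1 year, 46% for 3 years).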
  • Spot VMs: 60-91% off on-demand. Can be preempted at any time with 30s notice. For fault-tolerant batch / dev-test / stateless workloads.
  • Preemptible VMs: legacy 24-hour-max version of Spot. Spot replaces it for new workloads.
  • GKE Spot pools: mix Spot + on-demand node pools. Node taints + pod tolerations + node affinity steer workloads to the right pool. Cost-optimised stateless workloads → Spot; stateful → on-demand.
  • Custom machine types: spec exactly the vCPU + memory you need vs paying for a fixed shape. Particularly useful for memory-heavy workloads where standard shapes over-provision CPU.
Concrete example

A 24/7 production tier (steady baseline) + nightly batch jobs (interruptible) + ML training (GPU, interruptible): cover the baseline with a 3-year CUD (up to ~57% off when resource-based; a spend-based commitment trades some of that discount for flexibility). Run nightly batch and ML training on Spot VMs with checkpointing. The GKE cluster has a tainted Spot node pool; stateless workloads carry the matching toleration. Total cost ~50% below on-demand.
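As a rough check on the "~50% below on-demand" figure, the blended discount can be worked out from each tier's share of the on-demand bill. This is a sketch under stated assumptions: the spend shares, the 70% Spot discount, and the on-demand remainder are illustrative values, not figures from the scenario. Only the 57% commitment discount comes from the text.

```python
def blended_discount(tiers):
    """Return the overall discount given (share_of_on_demand_spend, discount) pairs."""
    total_share = sum(share for share, _ in tiers)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("spend shares must sum to 1.0")
    # Cost remaining after discounts, as a fraction of the on-demand bill.
    remaining = sum(share * (1.0 - discount) for share, discount in tiers)
    return 1.0 - remaining


# Hypothetical split of the on-demand bill across workload types.
tiers = [
    (0.50, 0.57),  # 24/7 baseline covered by a 3-year commitment
    (0.30, 0.70),  # batch and ML training on Spot (assumed mid-range discount)
    (0.20, 0.00),  # assumed remainder left on on-demand for headroom
]
overall = blended_discount(tiers)
print(f"Blended discount: {overall:.1%}")
```

Changing the shares shows how sensitive the headline saving is to how much of the fleet is actually covered by a commitment or by Spot.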

Key takeaway: Spend-Based CUD for predictable spend. Spot for fault-tolerant. Custom machine types for unusual CPU:memory ratios. SUD is automatic — no action needed.
⚡ Mini-quiz
Drill GCP compute pricing scenarios → study mode (10 questions).
Lesson 5.2 BigQuery cost + performance optimization

BigQuery is the most exam-tested GCP service for cost optimisation. PCA tests Editions vs on-demand, partition + cluster design, and BI Engine for dashboards.

Key concepts
  • On-demand pricing: $6.25/TB scanned. Pay per query. Variable cost — runaway query can be expensive.
  • BigQuery Editions: Standard, Enterprise, Enterprise Plus. Reserve "slots" (parallel-execution units). Predictable cost; right when you know your steady ETL load. Mix-and-match with on-demand for ad-hoc.
  • Partitioning: partition tables by ingest time, date column, or integer range. Queries that filter on the partition column scan only relevant partitions — massive cost cut.
  • Clustering: sort table data by 1-4 columns. Queries filtering on cluster columns prune blocks. Stacks with partitioning.
  • Materialized views: precomputed query results stored automatically. Refreshes incrementally. Use for common aggregations on huge tables.
  • BI Engine: in-memory acceleration for BigQuery queries. Looker / Looker Studio dashboards become sub-second. Reservation-based pricing.
  • Query best practices: avoid SELECT *. Filter on partition columns. Use approximate aggregations (APPROX_COUNT_DISTINCT) where exact is unnecessary.
Concrete example

A retailer's 50TB orders table: partition by order_date (daily partitions) + cluster on customer_id, store_id. The daily ETL load is predictable; cover it with BigQuery Enterprise edition reserving 100 slots. Ad-hoc analyst queries stay on-demand. Looker dashboards on the orders table use a BI Engine reservation for sub-second response. The most common aggregation query has a materialized view: analysts query the view, and BigQuery keeps it current through incremental refresh.
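The effect of partition pruning on on-demand cost can be estimated with simple arithmetic. The sketch below assumes the 50 TB is spread evenly over 730 daily partitions (two years of orders), which is an illustrative assumption, and it ignores clustering, the free tier, and per-query billing minimums. The function name is hypothetical.

```python
ON_DEMAND_USD_PER_TB = 6.25  # on-demand rate quoted in this lesson


def query_cost_usd(table_tb, partitions_total, partitions_scanned, column_fraction=1.0):
    """Estimate on-demand cost for a query over an evenly partitioned table."""
    if not 0 < partitions_scanned <= partitions_total:
        raise ValueError("partitions_scanned must be between 1 and partitions_total")
    # Bytes billed scale with the partitions read and the columns selected.
    scanned_tb = table_tb * (partitions_scanned / partitions_total) * column_fraction
    return scanned_tb * ON_DEMAND_USD_PER_TB


# Assumed: 50 TB spread evenly over 730 daily partitions.
full_scan = query_cost_usd(50, 730, 730)
one_week = query_cost_usd(50, 730, 7)
print(f"Full scan: ${full_scan:.2f}, last 7 days: ${one_week:.2f}")
```

The `column_fraction` argument stands in for the saving from avoiding SELECT *: reading a quarter of the table's bytes per row would be `column_fraction=0.25`.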

Key takeaway: partition + cluster + materialized views = best query cost. Editions reservations for predictable load + on-demand for spikes. BI Engine for dashboards.
⚡ Mini-quiz
Practise BigQuery optimization scenarios → quick quiz (5 questions).
Lesson 5.3 Right-sizing + observability — VPA, Profiler, Trace

Right-sizing kills waste at source. Cloud Profiler + Cloud Trace tell you WHERE the spend is going. Together they close the performance / cost loop.

Key concepts
  • VPA (Vertical Pod Autoscaler): automatically right-sizes pod CPU + memory REQUESTS based on observed usage. Recommendation mode = suggest; Auto mode = applied automatically on pod restart.
  • VPA + HPA conflict: can't apply both to the same CPU metric. Workaround: HPA on a different metric (RPS) + VPA on CPU/memory. Or VPA in Recommendation mode + HPA on CPU.
  • Recommender (part of Active Assist): Google-managed recommendation engine across services. "These GCE VMs are over-provisioned by 40%". "This Cloud SQL instance is right-sized". Review the recommendations quarterly.
  • Cloud Profiler: continuous statistical profiling of production services. Flame graphs at < 1% overhead. Surfaces the actual hot code paths.
  • Cloud Trace: distributed tracing across microservices. Spans + traces + waterfall view. Diagnose "this API endpoint p99 = 2s" by showing the actual slow span.
  • Cloud Logging: centralised logs. Sinks export to BigQuery / GCS / Pub/Sub for long-term retention or downstream processing. Use exclusion filters to control ingest cost.
  • Cloud Monitoring: metrics + dashboards + alerts. Metrics scopes (formerly Workspaces) aggregate across projects. Custom metrics via OpenTelemetry SDK.
Concrete example

A microservices app: deploy Cloud Profiler agent in each Cloud Run service — flame graphs reveal that a JSON-serialisation library accounts for 60% of CPU. Replace with a faster lib; CPU drops 50%. Cloud Trace reveals one service waits 800ms on a Firestore query; add an index; latency drops to 50ms. VPA in Recommendation mode on the GKE cluster suggests pod CPU requests be cut by 35% on average. Apply via deploy — cluster scales down node count by 30% naturally.
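The idea behind a right-sizing recommendation can be illustrated with a toy calculation: take a high percentile of observed usage and add a safety margin. This is not VPA's actual algorithm, which is more sophisticated. The percentile, margin, and usage samples below are assumptions chosen for illustration.

```python
import math


def recommend_request(usage_samples, percentile=0.90, margin=0.15):
    """Suggest a resource request: a high percentile of usage plus a safety margin."""
    if not usage_samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(usage_samples)
    # Nearest-rank percentile: smallest sample with at least
    # `percentile` of all samples at or below it.
    rank = max(1, math.ceil(percentile * len(ordered)))
    return ordered[rank - 1] * (1.0 + margin)


# Hypothetical CPU usage samples in millicores for a pod that requests 500m today.
samples = [120, 135, 150, 160, 170, 180, 190, 200, 220, 400]
suggested = recommend_request(samples)
print(f"Suggested CPU request: {suggested:.0f}m (current: 500m)")
```

Using a percentile rather than the maximum keeps a single spike (the 400m sample here) from inflating the request.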

Key takeaway: VPA for right-sizing pods. Cloud Profiler for hot code paths. Cloud Trace for latency diagnosis. Cloud Recommender for ongoing waste audit. Observe before optimising.
⚡ Mini-quiz
Drill VPA + observability scenarios → study mode (10 questions).
Module 6 · 3 lessons
CI/CD, GitOps, and Implementation Management (Domain 5)

Domain 5 (11%) covers the tools for delivering software reliably. Cloud Build is GCP's native CI service — cloudbuild.yaml defines sequential steps for build, test, vulnerability scan, and artifact signing. Cloud Deploy provides a managed CD pipeline with promotion gates and requireApproval for production deployments, ensuring the same release artifact flows through all stages. Anthos Config Management's Config Sync watches a Git repository and continuously reconciles configuration across all registered GKE clusters. Binary Authorization enforcement at GKE admission ensures only internally built, attested images can be deployed — closing the supply chain security loop.

  • Cloud Build: cloudbuild.yaml, triggers, steps
  • Cloud Deploy: delivery pipeline, targets, requireApproval
  • Anthos Config Management: Config Sync GitOps across clusters
  • Artifact Registry: container registry + vulnerability scanning
  • kubectl rollout undo: fastest GKE rollback path
  • Cloud Run: 32GB RAM, 60-min timeout, Pub/Sub trigger
  • Binary Authorization: allowlist policies + supply chain enforcement
Lesson 6.1 Cloud Build + Artifact Registry — CI pipeline

Cloud Build is GCP's native CI service. Artifact Registry is the modern container + package registry. Together they form the CI half of the pipeline.

Key concepts
  • Cloud Build: serverless CI. cloudbuild.yaml defines sequential steps as containers. Triggers: Cloud Source Repositories, GitHub, GitLab, Bitbucket, manual.
  • Build steps: each step is a container that gets a shared workspace (/workspace) + env vars. Common builder: gcr.io/cloud-builders/docker for image build. gcloud builds submit is the CLI command that submits a build manually and waits for the result; it is not a build step.
  • Workload Identity Federation for CI: external CI (GitHub Actions) can obtain GCP tokens via OIDC federation — no service-account key file in the runner.
  • Artifact Registry: replacement for Container Registry (legacy). Supports Docker, Maven, npm, Python, apt, yum. Per-repo IAM. Vulnerability scanning runs automatically on push once the Container Scanning API is enabled.
  • Vulnerability scanning: Artifact Analysis scans images for CVEs. Findings via API + integration with Binary Authorization attestors.
  • Custom build pools: Cloud Build private pools run on private network — required when builds must access private VPC resources (private GKE, on-prem DB via VPN).
Concrete example

A team builds + deploys a Cloud Run service: Cloud Build trigger on GitHub push. cloudbuild.yaml steps: (1) run tests, (2) build image, (3) push to Artifact Registry, (4) scan the image (Trivy as a build step, plus Artifact Analysis on push), (5) sign with attestor if criticals=0, (6) trigger the Cloud Deploy delivery pipeline. The build runs as a dedicated Cloud Build service account, so no SA key file is involved; Workload Identity Federation would only be needed if CI ran outside GCP (e.g., GitHub Actions). Repo IAM grants that build service account push access.
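The "sign only if criticals = 0" gate in step (5) reduces to a small decision function. The sketch below is illustrative: the `Finding` type, the `may_attest` name, and the placeholder CVE identifiers are assumptions, and a real pipeline would read findings from the scanner's output and then call the signing step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    cve_id: str
    severity: str  # e.g. "LOW", "MEDIUM", "HIGH", "CRITICAL"


def may_attest(findings, blocking_severities=frozenset({"CRITICAL"})):
    """Return True only if no finding has a blocking severity."""
    return not any(f.severity.upper() in blocking_severities for f in findings)


# Hypothetical scan results; the CVE identifiers are placeholders, not real entries.
clean_scan = [Finding("CVE-0000-0001", "LOW"), Finding("CVE-0000-0002", "HIGH")]
dirty_scan = clean_scan + [Finding("CVE-0000-0003", "CRITICAL")]

print(may_attest(clean_scan))  # no criticals, so the image may be signed
print(may_attest(dirty_scan))  # a critical finding blocks signing
```

Passing `blocking_severities=frozenset({"CRITICAL", "HIGH"})` would tighten the gate for a stricter policy.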

Key takeaway: Cloud Build = serverless CI via cloudbuild.yaml. Artifact Registry replaces GCR. Vulnerability scanning + attestor signing closes the supply-chain loop.
⚡ Mini-quiz
Drill CI pipeline scenarios → study mode (10 questions).
Lesson 6.2 Cloud Deploy + progressive delivery

Cloud Deploy is GCP's managed CD service. It coordinates rollouts across environments with promotion gates + automatic rollback signal.

Key concepts
  • Delivery pipeline: declarative YAML describing stages (dev → staging → prod) + targets per stage. Same artifact promotes through every stage.
  • Targets: deployment destinations — GKE cluster, Cloud Run service, GKE Autopilot, Anthos cluster. One target per environment.
  • Releases + rollouts: a release is an artifact + pipeline run. A rollout is one stage's deploy. gcloud deploy releases create kicks off the pipeline.
  • requireApproval: set on a target so a human must approve before promotion. Critical for prod gates.
  • Canary deployment: Cloud Deploy supports canary strategies on GKE / Cloud Run — % traffic to new version, soak, then 100%.
  • Rollback: rollback to previous release with one command. For GKE specifically, kubectl rollout undo is the fastest emergency rollback when Cloud Deploy isn't available.
  • Cloud Deploy + Skaffold: Skaffold provides build / deploy plumbing locally; Cloud Deploy uses Skaffold internally for rendering manifests + invoking kubectl.
Concrete example

CD pipeline: 3 targets (dev / staging / prod). Dev auto-deploys on push. Staging requires manual approval + soak. Prod target sets requireApproval and uses a canary strategy: 25% → 50% → 100% with a 5-minute soak between phases. A Cloud Monitoring SLO alert is wired to abort the rollout if the error budget burns during the canary. Emergency rollback via gcloud deploy targets rollback.
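The canary progression with an abort signal can be modelled as a short loop. This is a simulation of the control flow only, not Cloud Deploy's API: `run_canary` and the `is_healthy` callback are hypothetical names, and in practice the health signal would come from the monitoring alert after each soak.

```python
def run_canary(phases, is_healthy):
    """Advance through traffic percentages; stop at the first unhealthy phase.

    `is_healthy` is a caller-supplied check evaluated after each phase's soak.
    Returns (last_percent_that_passed, completed).
    """
    reached = 0
    for percent in phases:
        if not is_healthy(percent):
            # Abort: report the last phase that passed its soak.
            return reached, False
        reached = percent
    return reached, True


# Simulated health signal: the new version starts failing above 50% traffic.
result = run_canary([25, 50, 100], is_healthy=lambda pct: pct <= 50)
print(result)
```

An aborted run returns `completed=False`, which is the point at which a rollback to the previous release would be triggered.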

Key takeaway: Cloud Deploy = managed CD with declarative pipeline + targets. Canary + requireApproval for safety. Combine with SLO alerts for auto-abort.
⚡ Mini-quiz
Practise Cloud Deploy scenarios → quick quiz (5 questions).
Lesson 6.3 Config Sync + Anthos — multi-cluster GitOps

For multi-cluster fleets, Config Sync is the GitOps reconciler. PCA tests the multi-cluster pattern + the Anthos umbrella scope.

Key concepts
  • Config Sync: reconciles cluster state from a Git repo. Watches the repo for changes; applies them to the cluster. Detects drift; re-applies.
  • Hierarchy: repo structure with system/ + cluster/ + namespaces/ directories. Cluster-scoped resources in cluster/; namespaced in namespaces/<ns>/.
  • Multi-cluster scope: register multiple GKE / Anthos clusters in a fleet; Config Sync pulls from one repo + applies to all. Per-cluster overrides via labels.
  • Policy Controller: OPA Gatekeeper-based admission policy. Enforce constraints (e.g., "no privileged containers", "must have resource limits") cluster-wide.
  • Anthos overall: umbrella branding for GCP's hybrid/multi-cloud K8s — Config Sync + Policy Controller + Service Mesh + GKE on AWS / on-prem. Pick when you have non-GCP Kubernetes that needs unified mgmt.
  • GKE Hub: registers clusters to a fleet. Required for Anthos features. Free to register GKE clusters; paid for non-GCP.
Concrete example

Multi-cluster fleet: 5 GKE clusters across 3 regions + 1 on-prem cluster. GKE Hub registers all 6 as a fleet. Config Sync watches the cluster-config Git repo. Common resources (namespaces, RBAC, NetworkPolicy) apply to ALL clusters; region-specific overrides via cluster-label selector. Policy Controller enforces "every pod must have resource limits" + "no privileged containers" cluster-wide. Audit logs via fleet observability.
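The reconcile loop at the heart of GitOps can be sketched as a diff between the state declared in Git and the state observed on a cluster. This is a simplified illustration, not Config Sync's implementation: the resource names and specs are placeholders, and a real reconciler only prunes resources it manages.

```python
def reconcile(desired, actual):
    """Return the actions needed to make `actual` match `desired`.

    Both arguments map a resource name to its spec (any comparable value).
    """
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))  # drift: live state diverged from Git
    for name in sorted(actual.keys() - desired.keys()):
        actions.append(("delete", name))  # resource no longer declared in Git
    return actions


# Hypothetical resources; names and specs are placeholders.
desired = {"ns/payments": {"labels": {"team": "pay"}}, "netpol/deny-all": {"ingress": []}}
actual = {"ns/payments": {"labels": {"team": "old"}}, "ns/scratch": {}}
print(reconcile(desired, actual))
```

Running the same function against each cluster in the fleet, with the same `desired` state, is what keeps the clusters converged on one repository.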

Key takeaway: Config Sync = GitOps reconciler. Policy Controller = OPA admission. GKE Hub + Anthos for multi-cluster fleet management. Anthos also covers non-GCP clusters, at a cost.
⚡ Mini-quiz
Drill Config Sync + Anthos scenarios → study mode (10 questions).
Module 7 · 3 lessons
SRE, Reliability Engineering, and Disaster Recovery (Domain 6)

Domain 6 (14%) is the SRE domain, and one of the highest-yield areas for the PCA exam. Master the SLI/SLO/Error Budget framework: define SLIs as ratio metrics (HTTP 2xx / total requests), express SLOs as a percentage target over a rolling window, and calculate the error budget. Implement multi-window burn rate alerting from the Google SRE Workbook: page at 14.4x burn over a 1-hour window or 6x burn over a 6-hour window, each confirmed by a shorter window (5 minutes and 30 minutes), which catches fast burns without noise. Configure Cloud SQL PITR for precise recovery from accidental data deletion. Design chaos engineering experiments to validate failover assumptions before incidents happen. Apply Production Readiness Reviews to gate production launches.

  • SLI definition: good events / total events (not uptime %)
  • Error budget calculation: (1 - SLO) × window minutes
  • Multi-window burn rate alert: 1h@14.4x + 6h@6x, each confirmed by a short window
  • Cloud SQL PITR: log-based recovery to any timestamp within retention
  • Chaos engineering: deliberate zone/instance failure + SLI observation
  • Cloud Logging exclusion filters: reduce ingestion cost
  • Pub/Sub message storage policy: EU data residency
  • Production Readiness Review (PRR) checklist
Lesson 7.1 SLI / SLO / Error Budget framework

Google invented the SLI/SLO/error-budget framework. PCA tests it more rigorously than any other domain — define metrics correctly + alert from burn rate, not from individual errors.

Key concepts
  • SLI (Service Level Indicator): a metric expressed as good_events / total_events. E.g., "HTTP 2xx responses / total HTTP requests". NOT "uptime %" (vague + uninformative).
  • SLO (Service Level Objective): a target for the SLI over a rolling window. "99.9% of HTTP requests over 28 days return 2xx within 500ms". Defines the success threshold.
  • Error budget: the allowable failure budget = (1 - SLO) × window. For 99.9% over 28 days = 40.32 minutes of "failure" allowed. Burning the budget too fast = engineering pause / postmortem.
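  • Multi-window burn rate alerts: Google SRE Workbook pattern. Page when the burn rate exceeds 14.4× over both the last 1h and the last 5m, or 6× over both the last 6h and the last 30m. The long window shows the burn is significant; the short window shows it is still happening, so the alert resets quickly once the problem ends.
  • Burn rate math: "burn rate X×" = consuming the budget X times faster than sustainable. 14.4× sustained for 1h consumes 14.4 hours' worth of budget, which is 2% of a 30-day budget.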
  • Cloud Monitoring SLOs: declarative SLO objects. Cloud Monitoring computes burn rate + provides built-in alerting policies for multi-window patterns.
Concrete example

A checkout API has SLO 99.9% requests return 2xx within 500ms over 28 days. Error budget = 0.1% × 40320 min = ~40 min per 28-day window. Multi-window burn rate alert: page if (1h AND 5m burn rate > 14.4×) OR (6h AND 30m burn rate > 6×). Goes to PagerDuty. Slow burns are caught by a secondary alert at (3d AND 6h burn rate > 1×) routed to email: a heads-up, not a page.
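The budget and burn-rate arithmetic in this lesson can be checked with a few lines of Python. This is a sketch: the function names and the 1.44% failure ratio are assumptions for illustration, and alert thresholds are passed in by the caller rather than hard-coded.

```python
def error_budget_minutes(slo, window_days):
    """Minutes of allowed failure in the window: (1 - SLO) x window length."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio, slo):
    """How many times faster than sustainable the budget is being spent."""
    return observed_error_ratio / (1.0 - slo)


def hours_to_exhaustion(rate, window_days):
    """Hours until the whole budget is gone if `rate` is sustained."""
    return window_days * 24 / rate


def all_windows_burning(measured, thresholds):
    """True only if every window's burn rate exceeds its own threshold."""
    return all(measured[w] > thresholds[w] for w in thresholds)


budget = error_budget_minutes(0.999, 28)  # checkout API SLO from the example
rate = burn_rate(0.0144, 0.999)           # assumed: 1.44% of requests failing
print(f"budget={budget:.2f} min, burn rate={rate:.1f}x, "
      f"exhausted in {hours_to_exhaustion(rate, 28):.1f} h")
```

Requiring every window to exceed its threshold, as `all_windows_burning` does, is what keeps a brief spike from paging anyone: the spike may lift one window's burn rate but not the others.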

Key takeaway: SLI = good/total. SLO = target. Error budget = allowance. Multi-window burn rate = page only on real problems. Cloud Monitoring SLOs build all of this in.
⚡ Mini-quiz
Drill SLO + error-budget scenarios → study mode (10 questions).
Lesson 7.2 DR patterns + Cloud SQL PITR

GCP's DR primitives mirror AWS's but with GCP-specific service flavours. PCA expects you to map RTO/RPO requirements to a specific GCP design.

Key concepts
  • DR strategy ladder: Backup & Restore (RTO hours), Pilot Light, Warm Standby (RTO minutes), Active/Active (RTO < 1 min). Same as AWS — design picks based on RTO/RPO targets.
  • Cloud SQL HA + cross-region read replica: primary in zone A, sync standby in zone B (HA). Async read replicas can be cross-region for DR. Promote replica = manual cross-region failover.
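  • Cloud SQL PITR (Point-in-Time Recovery): log-based recovery (binary logs on MySQL, write-ahead logs on PostgreSQL) to a chosen timestamp within the retention window (default 7 days). The restore creates a new instance. Use for accidental-deletion recovery.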
  • Spanner multi-region: 99.999% SLA, automatic cross-region failover, no RTO concept (writes globally consistent). For applications that can afford Spanner cost.
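  • GCS multi-region buckets: data replicated across 2+ regions automatically. Replication is asynchronous (default target: most objects within an hour); turbo replication on dual-region buckets gives a 15-minute RPO. RTO depends on app logic. All storage classes (Standard, Nearline, Coldline, Archive) are available in multi-region locations.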
  • Backup & DR service: GCP's centralised backup orchestration. Supports VMs, databases, file systems. Replaces hand-rolled backup scripts.
Concrete example

A SaaS needs RTO 30 min / RPO 5 min cross-region. Choice: warm standby. Primary in europe-west1; DR in europe-west4 scaled down to 25% capacity. Cloud SQL with cross-region read replica in DR region; promote on regional failure. GCS bucket configured multi-region (eu) for user uploads. Cloud DNS health checks shift traffic to the DR load balancer endpoint when the primary region fails. Cloud SQL PITR retained 7 days for accidental-deletion recovery.
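Mapping an RTO target onto the DR ladder can be written as a simple lookup. The cut-offs below are illustrative assumptions: the lesson gives "hours", "minutes", and "< 1 min" as rough bands and no figure for Pilot Light, so the 60- and 240-minute boundaries are placeholders a team would set for itself.

```python
def pick_dr_strategy(rto_minutes):
    """Map an RTO target to a rung on the DR ladder (cut-offs are illustrative)."""
    if rto_minutes < 1:
        return "Active/Active"
    if rto_minutes <= 60:
        return "Warm Standby"
    if rto_minutes <= 240:  # assumed boundary; the ladder gives no figure here
        return "Pilot Light"
    return "Backup & Restore"


for rto in (0.5, 30, 180, 720):
    print(rto, "->", pick_dr_strategy(rto))
```

RPO is deliberately left out of this sketch; it constrains the replication choice (synchronous, asynchronous replica, or backups) rather than the standby posture.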

Key takeaway: RTO/RPO drives strategy. Cloud SQL HA + cross-region replica + PITR covers most needs. Spanner for "no failover concept needed". GCS multi-region for object data.
⚡ Mini-quiz
Practise DR scenarios → quick quiz (5 questions).
Lesson 7.3 Chaos engineering + Production Readiness Review

Two SRE-discipline practices that PCA expects you to understand at the design-question level. Chaos engineering validates assumptions; PRR gates new launches.

Key concepts
  • Chaos engineering: deliberate fault injection in production (or production-like) to validate failover assumptions. Kill a VM → observe SLI; sever a zone → observe failover.
  • Game days: scheduled chaos exercises with the on-call team. Simulate an incident; team works the runbook. Catches gaps before real incidents.
  • Production Readiness Review (PRR): SRE-led review before production launch. Checklist: capacity plan, runbook, dashboard, alert rules, dependencies, SLO, error-budget plan, rollback procedure.
  • Postmortems: blameless writeup after every incident. Includes timeline, root cause, action items, monitoring gaps. Knowledge base for future incidents.
  • Toil reduction: SRE principle — engineering time should be 50%+ on automation/improvement, not toil (manual repetitive work). Toil > 50% = hire / automate.
  • Cloud Logging exclusion filters: exclude noisy log entries at ingest to control cost. Critical for high-volume services, where every ingested log line is billed and the total becomes significant at terabyte-to-petabyte scale.
Concrete example

Pre-launch PRR for a new service: SRE reviews capacity (predicted peak QPS = 5k, MIG max = 20 instances at 250 QPS each ✓), SLO (99.9% / 500ms ✓), runbook (5 common incident playbooks documented ✓), alerts (multi-window burn rate to PagerDuty ✓), dashboard (RPS, latency, error rate, dependency health ✓), rollback (Cloud Deploy auto-rollback on alarm ✓). Pre-launch game day: simulate zone failure; team failover-flips with the runbook in 8 minutes — well inside the 30-min RTO. Launch approved.
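The PRR in this example behaves like a gate in which every item must pass. A minimal sketch, with hypothetical function names and the checklist items reduced to booleans, might look like this:

```python
def capacity_ok(peak_qps, max_instances, qps_per_instance):
    """True if the fleet at maximum size can serve the predicted peak."""
    return max_instances * qps_per_instance >= peak_qps


def prr_approved(checklist):
    """Launch gate: every item must pass. Returns (approved, failed_items)."""
    failures = [item for item, passed in checklist.items() if not passed]
    return not failures, failures


checklist = {
    "capacity": capacity_ok(peak_qps=5000, max_instances=20, qps_per_instance=250),
    "slo": True,
    "runbook": True,
    "alerts": True,
    "dashboard": True,
    "rollback": True,
}
print(prr_approved(checklist))
```

Note that 20 × 250 equals the predicted peak exactly, so the capacity item passes with no headroom; a team could choose to require a margin by lowering `max_instances` in the check to the count expected to survive a zone failure.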

Key takeaway: chaos engineering + game days validate assumptions BEFORE incidents. PRR is the launch gate. Postmortem every incident blamelessly. Toil > 50% = automate or hire.
⚡ Mini-quiz
Drill SRE-practice scenarios → study mode (10 questions).
Test your knowledge as you learn
60 scenario-based GCP PCA questions, each mapped to real exam domains
🌐

Shared VPC vs VPC Peering

Shared VPC: multiple service projects share one host project's VPC. Centralized networking, routing, and firewall rules. VPC Peering: two VPCs connect bidirectionally but do NOT share routes transitively (A↔B, B↔C ≠ A↔C). Choose Shared VPC when you need centralized egress control or a common DNS/proxy architecture.
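The non-transitive rule for VPC Peering is easy to state as code: two VPCs can communicate only if they are directly peered. The sketch below uses hypothetical VPC names and models peerings as unordered pairs.

```python
def can_reach(peerings, src, dst):
    """VPC Peering is non-transitive: only directly peered VPCs can communicate."""
    return src == dst or frozenset((src, dst)) in peerings


# Hypothetical VPCs A, B, C with peerings A<->B and B<->C only.
peerings = {frozenset(("A", "B")), frozenset(("B", "C"))}
print(can_reach(peerings, "A", "B"))  # directly peered
print(can_reach(peerings, "A", "C"))  # no direct peering, so no path via B
```

Connecting A and C would need a third peering between them, which is why peering meshes grow quickly and Shared VPC is often the simpler design.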

🔒

VPC Service Controls vs IAM

IAM controls WHO can access a resource. VPC Service Controls controls FROM WHERE. Even with valid IAM credentials, a request from outside the VPC SC perimeter is denied. This is the data exfiltration defense-in-depth layer — stolen credentials used from an attacker's network cannot exfiltrate data from BigQuery or Cloud Storage.
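The two layers combine as a logical AND: a request succeeds only if IAM allows the identity and the request originates inside the perimeter. This is a deliberately simplified model with hypothetical names; real perimeters are defined over projects and services, with access levels and ingress/egress rules on top.

```python
def request_allowed(iam_allows, caller_network, perimeter_networks):
    """Both layers must agree: IAM (who) and the service perimeter (from where)."""
    inside_perimeter = caller_network in perimeter_networks
    return iam_allows and inside_perimeter


perimeter = {"prod-vpc"}  # hypothetical network inside the perimeter
print(request_allowed(True, "prod-vpc", perimeter))      # valid identity, inside
print(request_allowed(True, "attacker-net", perimeter))  # stolen credentials, outside
```

The second call is the exfiltration case: the credentials are valid, yet the request is denied because of where it comes from.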

🎯

SLO Burn Rate Alerting

Don't alert on raw error rate; alert on burn rate. A sustained 14.4x burn rate exhausts a 30-day error budget in about 50 hours (2% of the budget every hour). Configure multi-window alerts: the long window (1h at 14.4x, 6h at 6x) detects significant burns, and a paired short window (5m, 30m) clears the alert once the burn has stopped. This pattern from the Google SRE Workbook is the #1 high-value SRE topic on the PCA exam.

6-week study plan

Week 1 · Architecture Fundamentals + Compute & Storage
GCP resource hierarchy, IAM policy inheritance, global VPC model. Compute Engine instance types, GKE Standard vs Autopilot, Cloud Run vs App Engine Standard. Storage services: Cloud SQL, Cloud Spanner, Bigtable row key design, Firestore, Cloud Storage lifecycle tiers. Complete 15 practice questions from Domain 1.
Week 2 · Networking, Pub/Sub, and Load Balancing
Shared VPC vs VPC Peering vs Cloud Interconnect. Global HTTPS LB with anycast TLS termination. Cloud CDN + Cloud Armor WAF + DDoS protection. Pub/Sub fan-out patterns and message storage policies. Cloud DNS and Traffic Director. Complete 10 practice questions from the networking domain.
Week 3 · Security, Compliance, and Identity (Domain 3)
VPC Service Controls perimeters. Workload Identity for GKE. Binary Authorization with attestors and policies. Organization Policy constraints. CMEK with Cloud HSM. Cloud IAP zero-trust access. Secret Manager rotation. Security Command Center Premium. Complete 20 questions from Domain 3; this domain is heavily tested.
Week 4 · Infrastructure Provisioning + Cost Optimization (Domains 2 & 4)
Terraform GCS backend patterns. GKE Autopilot billing model. MIG autoscaling warmup periods. Cluster Autoscaler for GPU node pools. BigQuery Editions reservations. CUDs vs Spot VMs strategy. Cloud Profiler + Cloud Trace for performance analysis. VPA auto mode for right-sizing. BigQuery BI Engine for Looker acceleration.
Week 5 · CI/CD, GitOps, and Implementation Management (Domain 5)
Cloud Build cloudbuild.yaml syntax and triggers. Cloud Deploy pipeline with requireApproval. Anthos Config Management Config Sync across 15 clusters. kubectl rollout undo patterns. Cloud Run configuration (32GB RAM, 60-min timeout). Binary Authorization supply chain enforcement. Complete 10 questions on implementation management.
Week 6 · SRE Reliability + Full Mock Exam
SLI/SLO/Error Budget framework from scratch. Multi-window burn rate alert configuration. Cloud SQL PITR recovery scenarios. Chaos engineering methodology. Production Readiness Review checklist. Complete all 60 practice questions under timed conditions. Review every wrong answer. Target 80%+ before booking the exam.

Top 4 mistakes on the GCP PCA exam

  • Confusing Cloud Spanner with Cloud SQL for global scenarios. Cloud SQL is single-region (read replicas are not globally consistent). Cloud Spanner is the only GCP database that offers global strong consistency with ACID transactions at scale. Any question mentioning "multi-region", "global", and "strong consistency" together = Cloud Spanner.
  • Picking IAM over VPC Service Controls for data exfiltration prevention. IAM controls who, VPC SC controls from where. Compromised credentials used from an external network are blocked by VPC SC even if the IAM policy would allow the operation. The keyword "even if credentials are compromised" always points to VPC Service Controls.
  • Using burn rate alerting incorrectly in case studies. Many candidates configure simple error-rate threshold alerts and get tripped up on SRE questions. The Google SRE Workbook multi-window pattern (14.4x over 1h for fast burns + 6x over 6h for slower burns, each paired with a short confirmation window) is the expected answer. If a question asks about "proactive" error budget management, burn rate alerting is always the answer.
  • Forgetting to set minimum nodes to 0 for GPU node pool cost optimization. Simply enabling Cluster Autoscaler doesn't scale to zero. You must set the minimum node count to 0 on the GPU node pool and use taints/tolerations to ensure only training pods can schedule there. Without min=0, you pay for GPU nodes even when no training jobs run.

GCP PCA vs GCP ACE — Key differences

GCP Associate Cloud Engineer (ACE)

  • Configures and deploys GCP resources
  • Single-service questions (how to set up a GKE cluster)
  • Basic IAM and networking
  • $125 USD, 3 years validity
  • Recommended before attempting PCA

GCP Professional Cloud Architect (PCA)

  • Designs complete multi-service architectures
  • Multi-service case studies (design + justify tradeoffs)
  • Advanced: VPC SC, Anthos, SLO engineering
  • $200 USD, 2 years validity
  • Highest-impact GCP professional cert