| Exam fact | Details |
|---|---|
| Exam code | Professional Cloud Architect (GCP PCA) |
| Full name | Google Cloud Certified – Professional Cloud Architect |
| Questions | 50–60 (mix of MCQ and case study questions) |
| Passing score | Not published by Google; roughly 70% is a commonly cited estimate |
| Duration | 120 minutes |
| Price | $200 USD |
| Prerequisites | None formally required. Recommended: 3+ years of industry experience, including 1+ year of GCP hands-on (GCP ACE recommended) |
| Renewal | Recertify every 2 years via a 2-hour recertification exam |
Exam domain weights
Course modules
Build the mental model for GCP's global infrastructure before diving into services. Understand regions, zones, points-of-presence, and how Google's private backbone differs from the public internet. Learn the Resource Hierarchy (organization → folders → projects → resources) and how IAM policies are inherited. Master the shared responsibility model for GCP and the Well-Architected Framework's six pillars: operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability. Grasp network fundamentals: how GCP VPC is global (not regional like AWS), how default routes and VPC flows differ from traditional networks.
GCP's resource hierarchy is the single biggest mental shift coming from AWS. Organization → Folder → Project → Resource — and IAM policies flow downward. Get the topology right on day one or you'll fight inheritance bugs forever.
- Organization: top-level container tied to a Cloud Identity / Google Workspace domain. One per company. Owns all folders / projects / billing.
- Folder: grouping for projects + sub-folders (up to 10 levels deep). Used for OUs / business units. Inherit IAM + Org Policies from parent.
- Project: the resource container — every GCP resource lives in exactly one project. Has a unique project ID (immutable) and project number (immutable). Billing scoped per project.
- IAM inheritance: additive. A role granted at the org level applies to every project beneath. Most-specific does NOT override — you can't deny inheritance below.
- Deny Policies: separate from IAM Allow policies. Deny rules take precedence. Use sparingly — primary access control is via not-granting at the lower tier.
- Org Policies: constraints enforced top-down (e.g., `compute.vmExternalIpAccess: deny`). Different from IAM. Override at child levels only if the parent allows.
A 200-employee SaaS organises GCP as: Org → folders prod + non-prod + sandbox → ~80 projects under each. IAM: org-level Viewer for the security team (cascades to everything). Prod folder: only specific roles. Sandbox: free-for-all developer + Viewer at the folder so devs can audit each others' work. Org Policy compute.vmExternalIpAccess: deny at org level — no project can grant a VM a public IP without explicit folder-level exception.
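The additive inheritance rule can be illustrated with a small model. This is a minimal sketch, not the IAM API: the `Node` class, the node names, and the group names are hypothetical, and real IAM also involves conditions and deny policies that are ignored here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One level of the hierarchy: organization, folder, or project."""
    name: str
    parent: Optional["Node"] = None
    # member -> roles granted directly at this node
    bindings: dict = field(default_factory=dict)

    def grant(self, member: str, role: str) -> None:
        self.bindings.setdefault(member, set()).add(role)


def effective_roles(node: Node, member: str) -> set:
    """Union of the roles granted at this node and at every ancestor."""
    roles = set()
    current = node
    while current is not None:
        roles |= current.bindings.get(member, set())
        current = current.parent
    return roles


# Hypothetical hierarchy loosely following the example above.
org = Node("org")
prod = Node("prod", parent=org)
sandbox = Node("sandbox", parent=org)
payments = Node("payments-prod", parent=prod)
scratch = Node("dev-scratch", parent=sandbox)

org.grant("group:security", "roles/viewer")   # cascades to everything
sandbox.grant("group:devs", "roles/editor")   # sandbox folder only
sandbox.grant("group:devs", "roles/viewer")

print(effective_roles(payments, "group:security"))  # inherited from the org
print(effective_roles(payments, "group:devs"))      # nothing reaches prod
```

Because the result is a union, a lower level can only add access; nothing granted higher up is removed by a more specific binding.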
GCP VPC is GLOBAL — a single VPC spans every region. Subnets are regional. This is the opposite of AWS where VPCs are regional and subnets are zonal. The PCA exam tests this distinction relentlessly.
- Global VPC: one VPC, many regions. Subnets are per-region inside the same VPC. Routes propagate across regions automatically.
- Auto-mode vs custom-mode VPC: auto-mode auto-creates a /20 subnet in every region — convenient for demos, never for production. Custom-mode lets you carve CIDRs explicitly; production-only choice.
- Shared VPC vs VPC Peering: Shared VPC = host project shares its VPC with multiple service projects — central network team owns; workload teams attach. VPC Peering = connect two VPCs (within or across orgs). Non-transitive like AWS.
- Premium vs Standard network tier: Premium routes traffic over Google's global backbone, exits at the nearest edge to the destination. Standard hands traffic to the public internet at the exit region — cheaper but slower / less reliable.
- Default firewall behaviour: implicit allow egress, implicit deny ingress. Explicit allow rules required for inbound. Tag-based + service-account-based firewall rules.
- VPC Flow Logs: capture flow records per subnet. Filter expressions to control cost. Visible in Logs Explorer + exportable to BigQuery.
A global SaaS deploys to 5 regions. Single Shared VPC in the network host project, custom-mode. Five subnets — one per region, non-overlapping CIDRs. Workload projects attach as service projects of the Shared VPC. Premium network tier for the customer-facing Global LB (exit at user's nearest edge, traverse Google backbone to nearest GCE backend). Internal admin traffic via Standard tier for cost.
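A custom-mode subnet plan like the one above is easy to sanity-check before it goes anywhere near Terraform. The sketch below uses only Python's standard `ipaddress` module; the CIDR ranges and the choice of regions are illustrative assumptions, not values from the scenario.

```python
import ipaddress
from itertools import combinations

# Hypothetical plan: one custom-mode subnet per region in a single global VPC.
SUBNET_PLAN = {
    "us-central1": "10.10.0.0/20",
    "us-east1": "10.10.16.0/20",
    "europe-west1": "10.10.32.0/20",
    "asia-southeast1": "10.10.48.0/20",
    "australia-southeast1": "10.10.64.0/20",
}


def find_overlaps(plan):
    """Return every pair of regions whose CIDR ranges overlap."""
    nets = {region: ipaddress.ip_network(cidr) for region, cidr in plan.items()}
    return [
        (a, b)
        for a, b in combinations(nets, 2)
        if nets[a].overlaps(nets[b])
    ]


print(find_overlaps(SUBNET_PLAN))  # an empty list means the plan is safe
```

Running a check like this in CI catches overlapping ranges early, which matters because subnets in one global VPC share a single routing domain.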
Several compute options at different abstraction levels. PCA expects you to climb the ladder: pick the most-abstracted option that fits the workload.
- Compute Engine (GCE): raw VMs. Full OS control. Use when you need a specific kernel, GPUs, custom drivers, or licence-tied software.
- Managed Instance Groups (MIG): GCP's equivalent of AWS Auto Scaling Groups. Regional MIGs distribute across 3 zones. Autoscaler responds to CPU / LB capacity / custom metrics.
- GKE Standard: managed Kubernetes — Google runs the control plane, you manage the worker nodes (or use Autopilot for fully-managed). Use when you need K8s primitives + helm + ecosystem tools.
- GKE Autopilot: Google manages BOTH control plane AND nodes. Billed per pod request (CPU/memory). Right when you want K8s API without node operations.
- Cloud Run: serverless containers. Scale 0-N on demand, billed per request + per CPU-second. HTTP / event-driven. Right for stateless web services, background workers via Pub/Sub.
- Cloud Functions Gen 2: now built on Cloud Run under the hood. Trigger-driven (HTTP, Pub/Sub, Cloud Storage, Eventarc). Smaller unit than Cloud Run service.
- App Engine Standard: the legacy PaaS. Auto-scale, no VMs. Avoid for new builds (Cloud Run is the modern equivalent); maintain only if you're already on it.
A startup: marketing site = Cloud Run (stateless container, scale-to-zero for cost). Image-processing job = Cloud Functions Gen 2 (Cloud Storage trigger). Main app on Kubernetes = GKE Autopilot (no node-ops burden). Legacy Windows app needing specific build tools = Compute Engine VM with managed image lifecycle.
Domain 1 carries the highest exam weight (24%) and focuses on selecting the right service for complex architectural scenarios. Learn to choose between Cloud Spanner (global ACID transactions), Bigtable (high-throughput time-series), Firestore (mobile real-time sync), and Cloud SQL (relational, single-region). Design Pub/Sub fan-out patterns for decoupled event-driven architectures. Configure Global HTTPS Load Balancers with anycast TLS termination at Google's edge. Design ephemeral Dataproc clusters for batch ML workloads to minimize cost. Understand when Anthos multi-cloud is the answer for hybrid/multi-cloud Kubernetes management.
GCP has more purpose-built databases than any other cloud. PCA tests the decision matrix obsessively — Spanner vs Bigtable vs Firestore vs Cloud SQL vs BigQuery, given access pattern + scale + consistency.
- Cloud Spanner: globally-distributed relational with external strong consistency (TrueTime). Massive write scale, multi-region active-active. Expensive — only when you need global ACID writes.
- Bigtable: wide-column NoSQL for high-throughput time-series / IoT / ad-tech. Single-digit-ms reads at million-QPS. Schema-less. Row key design is everything — bad keys = hot tablets.
- Firestore: document NoSQL. Native mode (modern, real-time listeners, offline SDK, multi-region) or Datastore mode (legacy). Use for mobile / web app real-time sync.
- Cloud SQL: managed MySQL / PostgreSQL / SQL Server. Single-region with HA via standby in another zone. Cross-region read replicas; not active-active. The lift-and-shift answer.
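- AlloyDB: Google's PostgreSQL-compatible high-performance DB. Beats Cloud SQL for HTAP workloads; Google's published claims are up to 4× faster transactional throughput than standard PostgreSQL and much faster analytical queries via its columnar engine. Newer; appears in PCA scenarios with "transactional + analytical hybrid" framing.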
- BigQuery: serverless analytics warehouse. Pay per query (on-demand) or reserved slots. NOT a transactional store; design dimension/fact tables, partitioning, clustering.
- Cloud Storage: object storage. Classes: Standard, Nearline (≥30d), Coldline (≥90d), Archive (≥365d). Lifecycle policies move data; per-bucket multi-region option.
A FinTech: global trade ledger requiring strong consistency + multi-region writes → Cloud Spanner. Per-customer time-series transaction logs at millions of events/sec → Bigtable with row key customerId#reverseTimestamp. Mobile app real-time notifications → Firestore Native. Historical analytics → BigQuery with daily partitions clustered on customerId. Document attachments → Cloud Storage with lifecycle Nearline → Coldline → Archive at 30/90/365 days.
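The `customerId#reverseTimestamp` row key works because Bigtable stores rows in lexicographic key order. A minimal sketch of building such a key follows; the 13-digit millisecond ceiling and the function name are assumptions made for illustration.

```python
# Assumed ceiling: timestamps are milliseconds that fit in 13 decimal digits.
MAX_TS_MS = 10**13 - 1


def row_key(customer_id: str, event_ts_ms: int) -> str:
    """Build a key that sorts a customer's newest events first."""
    reverse_ts = MAX_TS_MS - event_ts_ms
    # Zero-padding keeps string order identical to numeric order.
    return f"{customer_id}#{reverse_ts:013d}"


older = row_key("cust42", 1_700_000_000_000)
newer = row_key("cust42", 1_700_000_060_000)
print(sorted([older, newer]))  # the newer event's key comes first
```

Leading with the customer ID spreads writes across customers instead of piling them onto one "latest timestamp" tablet, and the reversed timestamp makes "most recent N events for a customer" a short prefix scan.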
Pub/Sub is GCP's at-scale message bus. Eventarc bridges GCP services into Pub/Sub-style routing. PCA tests fan-out patterns and ordering guarantees.
- Pub/Sub topics + subscriptions: producer publishes to topic, each subscription independently consumes. Multiple subscriptions on one topic = fan-out. Default at-least-once delivery.
- Pull vs Push: Pull = consumer requests messages (default, more control over flow). Push = Pub/Sub HTTP-POSTs to your endpoint (Cloud Run / Cloud Functions sweet spot). Push has stricter latency requirements.
- Ordering keys: messages with the same ordering key are delivered in order within a subscription. Lower throughput than no-order (per-key serialised). Use for sequential-by-user / sequential-by-device flows.
- Dead Letter Topics: failed messages route to a DLT after configurable redelivery attempts. Mandatory for production — without it, poison messages spin forever.
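- Message retention: a subscription retains unacknowledged messages for 7 days by default, which is also the subscription maximum. Topic-level retention is configurable up to 31 days. Determines replay window.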
- Eventarc: routes GCP service events (Cloud Audit Log entries, Cloud Storage events, Firestore changes) to Cloud Run / Cloud Functions / Workflows. Built on top of Pub/Sub. Replaces hand-rolled trigger pipelines.
- Pub/Sub Lite: zonal or regional, cheaper alternative to Pub/Sub with capacity you provision yourself. Lower SLA. Use only for non-critical or cost-sensitive workloads; most PCA answers prefer full Pub/Sub.
An e-commerce platform's order-placed event needs to fan out to (1) billing, (2) inventory, (3) notifications. Design: Pub/Sub topic orders.placed. Three subscriptions, one per service, each push-based and delivering to that service's Cloud Run endpoint. Ordering keys = customerId so a single customer's events stay sequential. DLT after 5 failed deliveries. For real-time customer-status updates, Eventarc routes Firestore document changes to a Cloud Run notification service, with no Pub/Sub topic to manage.
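Fan-out and dead-lettering can be modelled in a few lines. This is an in-memory sketch only, not the Pub/Sub client library: the `Topic` and `Subscription` classes, the handlers, and the message fields are all hypothetical, and ordering keys are not modelled.

```python
MAX_DELIVERY_ATTEMPTS = 5  # mirrors "DLT after 5 failed deliveries"


class Subscription:
    """In-memory stand-in for one subscription and its consumer."""

    def __init__(self, name, handler):
        self.name = name
        self.handler = handler  # returns True to ack, False to nack
        self.acked = []
        self.dead_letters = []

    def deliver(self, message):
        for _attempt in range(MAX_DELIVERY_ATTEMPTS):
            if self.handler(message):
                self.acked.append(message)
                return
        # Every attempt was nacked: park the message instead of retrying forever.
        self.dead_letters.append(message)


class Topic:
    def __init__(self, name):
        self.name = name
        self.subscriptions = []

    def subscribe(self, name, handler):
        sub = Subscription(name, handler)
        self.subscriptions.append(sub)
        return sub

    def publish(self, message):
        # Fan-out: every subscription receives its own copy.
        for sub in self.subscriptions:
            sub.deliver(dict(message))


orders = Topic("orders.placed")
billing = orders.subscribe("billing", lambda m: True)
inventory = orders.subscribe("inventory", lambda m: m["sku"] != "UNKNOWN")
notifications = orders.subscribe("notifications", lambda m: True)

orders.publish({"order_id": 1, "customer_id": "c-1", "sku": "A-100"})
orders.publish({"order_id": 2, "customer_id": "c-1", "sku": "UNKNOWN"})
print(len(billing.acked), len(inventory.acked), len(inventory.dead_letters))
```

The poison message only affects the inventory subscription; billing and notifications process both orders, which is the isolation that per-service subscriptions buy.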
Google's network is its biggest moat. The Global HTTP(S) LB is the canonical PCA answer for customer-facing apps. Understand the edge anycast + premium tier interaction.
- Global HTTP(S) LB: one anycast IP, terminated at Google's nearest edge. TLS terminated at edge; backend traffic over Google's private backbone. The default customer-facing choice.
- Backend services: serve HTTP from regional MIGs, GKE, Cloud Run, Cloud Storage buckets (for static), Internet NEG (proxying to external endpoints). Health-checked + load-balanced.
- Cloud CDN: attach to backend service for edge caching. Cacheable signals via Cache-Control headers. Signed URLs for paywall content.
- URL maps + path rules: route by host header / path prefix to different backend services. Path-based microservice routing without per-service public IPs.
- SSL policies: set minimum TLS version + cipher suite per LB. Mandatory for PCI / FedRAMP compliance.
- Regional LBs: regional external HTTPS LB (Standard tier, cheaper). Internal HTTPS LB for east-west traffic inside the VPC. Use when global is overkill.
- Cloud Armor: WAF + DDoS protection in front of Global LBs. OWASP Core Rule Set + custom rate-limit rules.
A SaaS API: Global HTTP(S) LB with anycast IP, edge TLS termination via Google-managed cert. URL map: /api/* → backend MIG in 3 regions; /static/* → Cloud Storage bucket with Cloud CDN enabled, 1-day TTL. Backend services have HTTP health checks. Cloud Armor with OWASP CRS + custom rate-limit (1000 req/min per IP). SSL policy enforces TLS 1.2 minimum.
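The URL-map routing in this example amounts to prefix matching with a fallback. The sketch below is a simplified model, not the load balancer's actual matching engine; the backend names and the default backend are assumptions added for illustration.

```python
# Path prefixes from the example above, mapped to hypothetical backend names.
PATH_RULES = {
    "/api/": "api-backend-service",      # MIGs in 3 regions
    "/static/": "static-bucket-backend",  # Cloud Storage bucket + Cloud CDN
}
DEFAULT_BACKEND = "default-backend-service"  # assumed fallback


def route(path: str) -> str:
    """Pick the backend whose prefix matches; the longest prefix wins."""
    matches = [prefix for prefix in PATH_RULES if path.startswith(prefix)]
    if not matches:
        return DEFAULT_BACKEND
    return PATH_RULES[max(matches, key=len)]


for p in ("/api/v1/orders", "/static/logo.png", "/healthz"):
    print(p, "->", route(p))
```

One public IP fronts every backend, so adding a microservice is a new path rule rather than a new load balancer.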
Domain 2 covers the provisioning and management of GCP infrastructure. Master Terraform with GCS backend for state locking and version history — the standard IaC approach on GCP. Configure GKE Autopilot (fully managed node infrastructure, billed per pod request) versus GKE Standard with Cluster Autoscaler (GPU node pools with minimum=0 for zero-cost idle). Design MIG autoscaling with warmup periods so new instances are ready before traffic hits them. Implement GKE regional clusters with pod anti-affinity for zone-resilient deployments. Configure Cloud Monitoring SLO burn rate alerts for proactive error budget management.
GCP's preferred IaC is Terraform with GCS-backed state. PCA expects you to know the GCS-backend mechanics + module structure + state-locking patterns.
- GCS backend: Terraform state file stored in a Cloud Storage bucket with versioning enabled. The backend supports state locking natively, so no separate lock table is needed (unlike the S3 backend, which traditionally relied on DynamoDB).
- Bucket versioning: mandatory for state buckets. Lets you recover from accidental state corruption / deletion.
- Workspace pattern: separate state file per environment (dev / staging / prod) via different state prefix in the same bucket. Terraform `workspace` command supports this.
- Module structure: typical GCP Terraform repo: root module per environment, child modules per component (network / IAM / GKE / Cloud SQL). Modules version-pinned via Git tags.
- Service Account impersonation: avoid long-lived service account keys. Terraform uses your gcloud identity to impersonate a Terraform service account via `gcloud auth application-default login` + `--impersonate-service-account`.
- Cloud Deployment Manager (DM): GCP's native IaC alternative. YAML/Jinja2 templates. Less popular than Terraform but appears on PCA; know it exists for the "which tool" question.
A team standardises on Terraform: GCS state bucket per environment with versioning ON and Object Lifecycle policy "non-current version → archive at 30d". Terraform code in terraform/ with modules network, iam, gke. CI runs Terraform via a service account impersonated from the CI's Workload Identity Federation token. State locks via GCS automatically — multiple CI runs serialize on the state object.
Two GKE flavours. PCA tests the operational trade-offs and which fits a given scenario.
- GKE Standard: you manage worker nodes (node pools, sizing, OS upgrades, taints). Full control + full burden. Use when you need bespoke node configs, GPU pools with custom drivers, or single-tenant node isolation.
- GKE Autopilot: Google manages everything — control plane + workers. Billed per pod resource request (vCPU + memory + ephemeral storage). No node-level access. Pods scheduled on Google-managed infrastructure.
- When Autopilot fits: standard stateless workloads, when ops team is small, when you want least-burden K8s. PCA's default answer unless the scenario implies node-level control.
- When Standard fits: GPU/TPU pools you scale to zero (Autopilot has GPU support but more limited), custom node images, specific kernel versions, single-tenant compliance requirements, very high pod density.
- Cluster Autoscaler (Standard only): scales node pools based on pending pods. Combine with HPA (pod-level) and VPA (right-sizing).
- Regional vs zonal cluster: regional clusters spread nodes (and control plane replicas) across 3 zones — 99.95% SLA. Zonal cluster is cheaper but single-zone failure kills it. Production always regional.
A team runs 80% standard stateless services + 20% GPU ML training. Design: two clusters — GKE Autopilot for the standard services (lower ops burden); GKE Standard with a GPU node pool (min=0, autoscaler scales up under load) for ML. Both regional in 3 zones. The cost-justified split — Autopilot covers the bulk; Standard handles the niche.
For VM-based workloads, Managed Instance Groups + Cloud Monitoring SLOs give you the production-grade scaling + alerting pattern.
- MIG autoscaler signals: CPU utilisation, HTTP LB serving capacity, Pub/Sub queue depth, custom Cloud Monitoring metrics. Combine signals for richer scaling.
- Warmup period: grace window after instance creation before counted in autoscaler decision. Lets the app finish initialising before traffic hits.
- Predictive autoscaling: ML-driven based on historical patterns. Pre-scales before predicted demand — useful for daily/weekly cycles.
- Instance template + rolling updates: immutable template versions. `gcloud compute instance-groups managed rolling-action` rolls instances to a new template version with configurable surge + max-unavailable.
- Cloud Monitoring SLOs: declarative SLO objects with SLI definition + window + target. Cloud Monitoring computes the rolling error budget automatically.
- Burn rate alerts: multi-window, multi-burn-rate alerting is the Google SRE Workbook pattern: page at a 14.4x burn rate over 1h (fast) or a 6x burn rate over 6h (slow). Catches both fast-burning and slow-degrading SLOs.
A web tier: regional MIG with min=4 / max=20 instances. Autoscaler signal = HTTP LB serving capacity. Warmup period 90s for app boot. Predictive scaling ON (daily traffic pattern). Cloud Monitoring SLO: 99.9% of requests respond in < 500ms over 28 days. Multi-window burn rate alert: 1h > 14.4x OR 6h > 6x → page on-call. SLO dashboard surfaces error budget for the eng manager weekly.
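Burn rate is simply the observed error rate divided by the error budget, so the arithmetic behind these alerts is short. The sketch below assumes a request-based SLI and made-up request counts; the function names are illustrative and this is not how Cloud Monitoring is configured.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to be bad, e.g. 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    if total == 0:
        return 0.0
    return (bad / total) / error_budget(slo_target)


def budget_consumed(rate: float, window_hours: float, slo_period_hours: float) -> float:
    """Share of the whole period's error budget burned during the window."""
    return rate * window_hours / slo_period_hours


# Hypothetical hour: 1,000,000 requests, 14,400 slower than 500 ms.
rate_1h = burn_rate(bad=14_400, total=1_000_000, slo_target=0.999)
share = budget_consumed(rate_1h, window_hours=1, slo_period_hours=28 * 24)
print(f"burn rate {rate_1h:.1f}x, {share:.1%} of the 28-day budget used in one hour")
```

At a sustained 14.4x burn, a 28-day budget would be exhausted in under two days, which is why that threshold justifies a page rather than a ticket.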
Domain 3 (18%) focuses on Google's zero-trust security model. VPC Service Controls create a security perimeter around GCP services — even valid credentials from outside the perimeter cannot exfiltrate data. Workload Identity eliminates service account key files by binding Kubernetes service accounts to GCP service accounts. Binary Authorization ensures that only attested, scanned container images can be deployed to GKE production. Cloud Armor with OWASP WAF rules protects against L7 attacks. Organization Policy constraints (vmExternalIpAccess, compute.restrictCloudSQLInstances) enforce security at the org level. Cloud IAP enables zero-trust access to internal web applications without VPN.
VPC Service Controls are GCP's unique answer to data exfiltration — a perimeter that even valid credentials can't escape. Combined with Org Policy constraints, you get preventive controls at scale.
- VPC Service Controls (VPC-SC): defines a security perimeter around GCP services (BigQuery, GCS, etc.). Calls to those services from inside the perimeter succeed; from outside fail — even with valid IAM. Prevents exfil-via-leaked-credentials.
- Ingress / Egress policies: explicit rules permitting cross-perimeter calls. E.g., "allow service X from project Y to read BQ tables in this perimeter". Without explicit rules, the perimeter is hermetic.
- Access Levels: conditions (IP / device / identity) that grant access to the perimeter. Layered on top of IAM.
- Org Policy: SCP-equivalent; declarative constraints enforced top-down. Common: `compute.vmExternalIpAccess: deny`, `iam.disableServiceAccountKeyCreation: enforced`, `compute.restrictCloudSQLInstances` regions.
- Boolean vs list constraints: boolean = on/off, list = allow/deny list of values. Override at child scope only if parent's enforcement allows.
- Dry-run perimeter: evaluate VPC-SC violations without enforcing. Use during rollout to find legitimate cross-perimeter calls before flipping to enforce.
A regulated workload's GCS bucket holds PHI. Wrap in VPC Service Controls perimeter with the workload project. Ingress policy allowing only the workload project's service account + only from corporate IPs (Access Level). Egress policy denying all outbound except to approved analytics project. Org Policy at folder level: storage.publicAccessPrevention: enforced, iam.disableServiceAccountKeyCreation: enforced. Even if PHI service account credentials leak, they can't read from outside the perimeter — VPC-SC blocks.
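The key idea is that a request must pass IAM and the perimeter check independently. The sketch below is a deliberately simplified decision model, not how VPC-SC is implemented; the project names, the service account, and the single ingress rule are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    principal: str
    source_project: str
    source_ip_trusted: bool  # stands in for an Access Level (corporate IPs)
    target_project: str


PERIMETER_PROJECTS = {"phi-workload"}
IAM_ALLOWED = {"phi-sa@phi-workload.iam.gserviceaccount.com"}
INGRESS_PRINCIPALS = {"phi-sa@phi-workload.iam.gserviceaccount.com"}


def is_allowed(req: Request) -> bool:
    # 1. IAM must grant access regardless of any perimeter.
    if req.principal not in IAM_ALLOWED:
        return False
    # 2. Resources outside the perimeter are governed by IAM alone.
    if req.target_project not in PERIMETER_PROJECTS:
        return True
    # 3. Calls that originate inside the perimeter pass.
    if req.source_project in PERIMETER_PROJECTS:
        return True
    # 4. Crossing the boundary needs an ingress rule AND the access level.
    return req.principal in INGRESS_PRINCIPALS and req.source_ip_trusted


sa = "phi-sa@phi-workload.iam.gserviceaccount.com"
inside = Request(sa, "phi-workload", False, "phi-workload")
leaked = Request(sa, "attacker-project", False, "phi-workload")
print(is_allowed(inside), is_allowed(leaked))
```

The leaked-credential request carries a principal that IAM would accept, yet it is still refused at step 4, which is the exfiltration protection the perimeter adds.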
Two zero-trust primitives. Workload Identity eliminates service-account keys for workloads; Cloud IAP eliminates VPN for users.
- Workload Identity (GKE): binds Kubernetes service accounts (KSAs) to GCP service accounts (GSAs) via OIDC federation. Pods authenticate as GSAs without storing keys.
- Workload Identity Federation (non-GCP): federate external workloads (GitHub Actions, AWS, Azure) to GCP without service-account keys. The external token gets exchanged for a short-lived GCP token.
- Service-account-key bans: Org Policy `iam.disableServiceAccountKeyCreation` blocks new keys. Forces teams to migrate to Workload Identity.
- Cloud IAP: identity-aware proxy in front of HTTP(S) backends. Authenticates the user (Google account / Workforce Identity Federation) + authorises via IAM "IAP-secured Web App User" role. No VPN needed.
- IAP for TCP forwarding: SSH into private VMs through IAP, with no public IP and no bastion host. Uses gcloud's `--tunnel-through-iap` flag.
- IAP for on-prem: BeyondCorp-style remote access pattern. Replaces traditional corporate VPN for internal app access.
A GKE workload accesses GCS + BigQuery. Old way: service-account key file mounted as a secret. New way: enable Workload Identity on the cluster. Bind the KSA app-sa to GSA app@project.iam.gserviceaccount.com. Pods automatically authenticate as the GSA — no keys. Internal admin dashboard previously accessed via VPN — now behind Cloud IAP on the Global LB; users sign in with their Google identity, IAP validates + injects identity header. SSH into private GCE VMs via IAP TCP forwarding — gcloud compute ssh --tunnel-through-iap.
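The KSA-to-GSA binding comes down to two strings: an IAM member on the GSA and an annotation on the KSA. The helper below just assembles them for the names in the example above; the `default` namespace and the project ID `project` are assumptions, and the helper itself is hypothetical rather than part of any Google library.

```python
def workload_identity_binding(project_id: str, namespace: str,
                              ksa: str, gsa_email: str) -> dict:
    """Return the IAM binding and KSA annotation that link a KSA to a GSA."""
    return {
        # Granted on the GSA so the KSA may impersonate it.
        "role": "roles/iam.workloadIdentityUser",
        "member": f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa}]",
        # Added to the Kubernetes service account's metadata.
        "ksa_annotation": {"iam.gke.io/gcp-service-account": gsa_email},
    }


binding = workload_identity_binding(
    project_id="project",
    namespace="default",  # assumed namespace
    ksa="app-sa",
    gsa_email="app@project.iam.gserviceaccount.com",
)
print(binding["member"])
```

Nothing in this mapping is a secret, which is the point: the credential is minted at runtime from the pod's identity rather than stored in the cluster.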
Supply chain security (Binary Authorization), secrets (Secret Manager), and encryption keys (Cloud KMS / Cloud HSM) — the cryptographic foundations of GCP security.
- Binary Authorization: GKE admission control that enforces image attestation policies — only images that pass scans + are signed by an attestor get deployed. Closes the supply-chain attack vector.
- Attestors: identities (KMS-backed signing keys) that vouch for images. Common pipeline: image built → vulnerability scan → if clean, attestor signs → Binary Auth permits deploy.
- Cloud KMS: managed key management. Software-protected (default) or HSM-protected keys. Symmetric / asymmetric / MAC. Per-region keys; multi-region for global services.
- CMEK (Customer-Managed Encryption Keys): use your KMS key to encrypt service data (GCS, BigQuery, GCE disks, Cloud SQL, etc.). Required for compliance regimes mandating key custody.
- Cloud HSM: keys protected in FIPS 140-2 Level 3 validated HSMs that Google operates as a managed service. Use when compliance requires Level 3.
- Cloud External Key Manager (Cloud EKM): use keys hosted in a third-party HSM (Equinix SmartKey, Fortanix). For organisations that require keys OUTSIDE Google entirely.
- Secret Manager: centralised secrets with versioning + IAM-based access. Rotation schedules publish notifications to Pub/Sub; a Cloud Function subscribed to that topic performs the actual rotation. Replaces secret-in-config-files.
A regulated GKE deployment: Cloud Build pipeline scans image with Artifact Registry vulnerability scanning. If criticals = 0, attestor signs via KMS key. Binary Authorization policy on GKE prod cluster requires the attestation — block any other deploy. Secrets pulled at startup from Secret Manager (DB password rotates monthly via Cloud Function). All cluster persistent volumes encrypted with CMEK backed by Cloud HSM for FIPS 140-2 Level 3 compliance.
Domain 4 (18%) covers optimizing both performance and costs across the GCP portfolio. BigQuery Editions introduce reservation-based pricing for predictable ETL workloads — combine with on-demand pricing for ad-hoc analytics. BigQuery BI Engine accelerates Looker dashboards to sub-second response via in-memory caching. VPA (Vertical Pod Autoscaler) automatically right-sizes pod CPU/memory requests to eliminate over-provisioning waste. Cloud Profiler continuously profiles production services at under 1% overhead. Cloud Trace provides distributed tracing waterfall views for diagnosing multi-service latency. Committed Use Discounts (CUDs) for stable VMs, Spot VMs for seasonal burst workloads.
GCP's compute pricing has three main discount mechanisms. PCA expects you to know which fits which workload pattern.
- Sustained Use Discounts (SUD): automatic discount up to 30% if a VM runs continuously through the month. No commitment. Applies retroactively at month-end.
- Committed Use Discounts (CUD): 1-year (~37%) or 3-year (~57%) commitments. Resource-based CUDs (specific machine families) or spend-based CUDs (flexible — apply to any compute).
- Spot VMs: typically 60-91% off on-demand. Can be preempted at any time with 30s notice. For fault-tolerant batch / dev-test / stateless workloads.
- Preemptible VMs: legacy 24-hour-max version of Spot. Spot replaces it for new workloads.
- GKE Spot pools: mix Spot + on-demand node pools. Pod taints/tolerations + node affinity steer workloads to right pool. Cost-optimised stateless workloads → Spot; stateful → on-demand.
- Custom machine types: spec exactly the vCPU + memory you need vs paying for a fixed shape. Particularly useful for memory-heavy workloads where standard shapes over-provision CPU.
A 24/7 production tier (steady baseline) + nightly batch jobs (interruptible) + ML training (GPU, interruptible): cover the baseline with a 3-year resource-based CUD at ~57% off (spend-based flexible CUDs trade a smaller discount for portability across machine families). Nightly batch and ML on Spot VMs + checkpointing. GKE cluster has a Spot node pool for stateless workloads tagged with toleration. Total cost ~50% below on-demand.
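The blended saving from mixing commitments and Spot is a weighted average, which the following sketch works through. The monthly dollar figures and the 70% Spot discount are made-up inputs, so the result is not meant to reproduce the ~50% figure above; only the 57% committed-use discount comes from the text.

```python
def discounted_cost(on_demand: float, discount_pct: float) -> float:
    """Cost after applying a percentage discount to the on-demand price."""
    return on_demand * (100 - discount_pct) / 100


# (label, hypothetical monthly on-demand cost in USD, discount in percent)
WORKLOADS = [
    ("baseline (committed use)", 10_000, 57),
    ("nightly batch (Spot)", 3_000, 70),
    ("ML training (Spot)", 2_000, 70),
]


def blended_saving(workloads) -> float:
    """Overall fraction saved versus paying on-demand for everything."""
    on_demand_total = sum(cost for _, cost, _ in workloads)
    discounted_total = sum(discounted_cost(cost, pct) for _, cost, pct in workloads)
    return 1 - discounted_total / on_demand_total


print(f"blended saving: {blended_saving(WORKLOADS):.1%}")
```

Swapping in real billing-export numbers shows quickly whether a larger commitment or more Spot coverage moves the total further.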
BigQuery is the most exam-tested GCP service for cost optimisation. PCA tests Editions vs on-demand, partition + cluster design, and BI Engine for dashboards.
- On-demand pricing: $6.25/TB scanned. Pay per query. Variable cost — runaway query can be expensive.
- BigQuery Editions: Standard, Enterprise, Enterprise Plus. Reserve "slots" (parallel-execution units). Predictable cost; right when you know your steady ETL load. Mix-and-match with on-demand for ad-hoc.
- Partitioning: partition tables by ingest time, date column, or integer range. Queries that filter on the partition column scan only relevant partitions — massive cost cut.
- Clustering: sort table data by 1-4 columns. Queries filtering on cluster columns prune blocks. Stacks with partitioning.
- Materialized views: precomputed query results stored automatically. Refreshes incrementally. Use for common aggregations on huge tables.
- BI Engine: in-memory acceleration for BigQuery queries. Looker / Looker Studio dashboards become sub-second. Reservation-based pricing.
- Query best practices: avoid SELECT *. Filter on partition columns. Use approximate aggregations (APPROX_COUNT_DISTINCT) where exact is unnecessary.
A retailer's 50TB orders table: partition by order_date (daily partitions) + cluster on customer_id, store_id. Daily ETL runs predictable; cover with BigQuery Enterprise edition reserving 100 slots. Ad-hoc analyst queries stay on-demand. Looker dashboards on the orders table use BI Engine reservation for sub-second response. Most-common aggregation query has a materialized view — analysts query the view, BQ uses fresh incremental data.
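Partition pruning pays off in direct proportion to the data skipped. The sketch below estimates on-demand cost for the 50 TB table at the $6.25/TB rate quoted above, assuming (hypothetically) that data is spread evenly over 500 daily partitions; it ignores clustering and column pruning, which reduce scanned bytes further.

```python
PRICE_PER_TB_USD = 6.25  # on-demand rate quoted above


def scanned_tb(table_tb: float, total_partitions: int, partitions_read: int) -> float:
    """TB scanned when a query's filter selects only some partitions."""
    return table_tb * partitions_read / total_partitions


def query_cost(tb: float) -> float:
    return tb * PRICE_PER_TB_USD


full_scan = query_cost(scanned_tb(50, 500, 500))  # no partition filter
five_days = query_cost(scanned_tb(50, 500, 5))    # WHERE order_date in a 5-day range
print(f"full scan ${full_scan:.2f} vs pruned ${five_days:.3f}")
```

The same query costs about a hundredth as much once it filters on `order_date`, which is why a missing partition filter is the usual cause of a surprise bill.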
Right-sizing kills waste at source. Cloud Profiler + Cloud Trace tell you WHERE the spend is going. Together they close the performance / cost loop.
- VPA (Vertical Pod Autoscaler): automatically right-sizes pod CPU + memory REQUESTS based on observed usage. Recommendation mode = suggest; Auto mode = applied automatically on pod restart.
- VPA + HPA conflict: can't apply both to the same CPU metric. Workaround: HPA on a different metric (RPS) + VPA on CPU/memory. Or VPA in Recommendation mode + HPA on CPU.
- Cloud Recommender (CCR): Google-managed recommendation engine across all services. "These GCE VMs are over-provisioned by 40%". "This Cloud SQL instance is right-sized". Quarterly review.
- Cloud Profiler: continuous statistical profiling of production services. Flame graphs at < 1% overhead. Surfaces the actual hot code paths.
- Cloud Trace: distributed tracing across microservices. Spans + traces + waterfall view. Diagnose "this API endpoint p99 = 2s" by showing the actual slow span.
- Cloud Logging: centralised logs. Sinks export to BigQuery / GCS / Pub/Sub for long-term retention or downstream processing. Use exclusion filters to control ingest cost.
- Cloud Monitoring: metrics + dashboards + alerts. Workspaces aggregate across projects. Custom metrics via OpenTelemetry SDK.
A microservices app: deploy Cloud Profiler agent in each Cloud Run service — flame graphs reveal that a JSON-serialisation library accounts for 60% of CPU. Replace with a faster lib; CPU drops 50%. Cloud Trace reveals one service waits 800ms on a Firestore query; add an index; latency drops to 50ms. VPA in Recommendation mode on the GKE cluster suggests pod CPU requests be cut by 35% on average. Apply via deploy — cluster scales down node count by 30% naturally.
Domain 5 (11%) covers the tools for delivering software reliably. Cloud Build is GCP's native CI service — cloudbuild.yaml defines sequential steps for build, test, vulnerability scan, and artifact signing. Cloud Deploy provides a managed CD pipeline with promotion gates and requireApproval for production deployments, ensuring the same release artifact flows through all stages. Anthos Config Management's Config Sync watches a Git repository and continuously reconciles configuration across all registered GKE clusters. Binary Authorization enforcement at GKE admission ensures only internally built, attested images can be deployed — closing the supply chain security loop.
Cloud Build is GCP's native CI service. Artifact Registry is the modern container + package registry. Together they form the CI half of the pipeline.
- Cloud Build: serverless CI. `cloudbuild.yaml` defines sequential steps as containers. Triggers: Cloud Source Repositories, GitHub, GitLab, Bitbucket, manual.
- Build steps: each step is a container that gets shared workspace + env vars. Common steps: `gcr.io/cloud-builders/docker` for image build; `gcloud builds submit` for submit-and-wait.
- Workload Identity Federation for CI: external CI (GitHub Actions) can obtain GCP tokens via OIDC federation, so no service-account key file sits in the runner.
- Artifact Registry: replacement for Container Registry (legacy). Supports Docker, Maven, npm, Python, apt, yum. Per-repo IAM. Vulnerability scanning auto-runs on push.
- Vulnerability scanning: Artifact Analysis scans images for CVEs. Findings via API + integration with Binary Authorization attestors.
- Custom build pools: Cloud Build private pools run on private network — required when builds must access private VPC resources (private GKE, on-prem DB via VPN).
A team builds + deploys a Cloud Run service: Cloud Build trigger on GitHub push. cloudbuild.yaml steps: (1) run tests, (2) build image, (3) push to Artifact Registry, (4) run Trivy scan (also via Artifact Analysis), (5) sign with attestor if criticals=0, (6) trigger Cloud Deploy delivery pipeline. CI runs via Workload Identity Federation from the build trigger — no SA key. Repo IAM grants build-pool SA push access.
Cloud Deploy is GCP's managed CD service. It coordinates rollouts across environments with promotion gates + automatic rollback signal.
- Delivery pipeline: declarative YAML describing stages (dev → staging → prod) + targets per stage. Same artifact promotes through every stage.
- Targets: deployment destinations — GKE cluster, Cloud Run service, GKE Autopilot, Anthos cluster. One target per environment.
- Releases + rollouts: a release is an artifact + pipeline run. A rollout is one stage's deploy. `gcloud deploy releases create` kicks off the pipeline.
- requireApproval: set on a target so a human must approve before promotion. Critical for prod gates.
- Canary deployment: Cloud Deploy supports canary strategies on GKE / Cloud Run — % traffic to new version, soak, then 100%.
- Rollback: rollback to previous release with one command. For GKE specifically, kubectl rollout undo is the fastest emergency rollback when Cloud Deploy isn't available.
- Cloud Deploy + Skaffold: Skaffold provides build / deploy plumbing locally; Cloud Deploy uses Skaffold internally for rendering manifests + invoking kubectl.
CD pipeline: 3 targets (dev / staging / prod). Dev auto-deploys on push. Staging requires manual approval + soak. Prod target uses canary: 25% → 50% → 100% with 5-minute soak between, plus requireApproval before promotion. Cloud Monitoring SLO alert triggers automatic abort if error budget burns during canary. Emergency rollback via gcloud deploy targets rollback.
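The canary progression with abort-on-burn can be modelled in a few lines. This is a toy simulation, not Cloud Deploy's implementation: `run_canary` and the `budget_burning` callback are hypothetical, and the soak wait is omitted.

```python
from typing import Callable, List, Sequence, Tuple

CANARY_PHASES = (25, 50, 100)  # percent of traffic, as in the example above


def run_canary(
    budget_burning: Callable[[int], bool],
    phases: Sequence[int] = CANARY_PHASES,
) -> Tuple[str, List[int]]:
    """Walk the phases in order; abort at the first phase where the alert fires."""
    completed: List[int] = []
    for percent in phases:
        # Assumption: the callback reports whether the SLO alert fired
        # while this share of traffic was on the new version.
        if budget_burning(percent):
            return "aborted", completed
        completed.append(percent)
    return "succeeded", completed


if __name__ == "__main__":
    print(run_canary(lambda percent: False))
    print(run_canary(lambda percent: percent >= 50))
```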
For multi-cluster fleets, Config Sync is the GitOps reconciler. PCA tests the multi-cluster pattern + the Anthos umbrella scope.
- Config Sync: reconciles cluster state from a Git repo. Watches the repo for changes; applies them to the cluster. Detects drift; re-applies.
- Hierarchy: repo structure with system/ + cluster/ + namespaces/ directories. Cluster-scoped resources in cluster/; namespaced in namespaces/<ns>/.
- Multi-cluster scope: register multiple GKE / Anthos clusters in a fleet; Config Sync pulls from one repo + applies to all. Per-cluster overrides via labels.
- Policy Controller: OPA Gatekeeper-based admission policy. Enforce constraints (e.g., "no privileged containers", "must have resource limits") cluster-wide.
- Anthos overall: umbrella branding for GCP's hybrid/multi-cloud K8s — Config Sync + Policy Controller + Service Mesh + GKE on AWS / on-prem. Pick when you have non-GCP Kubernetes that needs unified mgmt.
- GKE Hub: registers clusters to a fleet. Required for Anthos features. Free to register GKE clusters; paid for non-GCP.
Multi-cluster fleet: 5 GKE clusters across 3 regions + 1 on-prem cluster. GKE Hub registers all 6 as a fleet. Config Sync watches the cluster-config Git repo. Common resources (namespaces, RBAC, NetworkPolicy) apply to ALL clusters; region-specific overrides via cluster-label selector. Policy Controller enforces "every pod must have resource limits" + "no privileged containers" cluster-wide. Audit logs via fleet observability.
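The "common resources everywhere, overrides by label" idea reduces to a label-matching rule. The sketch below is illustrative only: `applies_to`, `resolve`, and the config names are hypothetical and do not reproduce Config Sync's selector objects.

```python
from typing import Dict, List, Tuple

Labels = Dict[str, str]


def applies_to(selector: Labels, cluster_labels: Labels) -> bool:
    """An empty selector matches every cluster; otherwise all labels must match."""
    return all(cluster_labels.get(k) == v for k, v in selector.items())


def resolve(configs: List[Tuple[str, Labels]], cluster_labels: Labels) -> List[str]:
    """Return the names of configs that should be applied to this cluster."""
    return [name for name, selector in configs if applies_to(selector, cluster_labels)]


if __name__ == "__main__":
    configs = [
        ("base-rbac", {}),                         # common: all clusters
        ("eu-network-policy", {"region": "eu"}),   # region-specific override
    ]
    print(resolve(configs, {"region": "eu", "env": "prod"}))
    print(resolve(configs, {"region": "us", "env": "prod"}))
```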
Domain 6 (14%) is the SRE domain — and one of the highest-yield areas for the PCA exam. Master the SLI/SLO/Error Budget framework: define SLIs as ratio metrics (HTTP 2xx / total requests), express SLOs as a percentage target over a rolling window, and calculate the error budget. Implement multi-window burn rate alerting from the Google SRE Workbook: 1-hour window at 14.4x burn + 6-hour window at 2x burn catches fast burns without noise. Configure Cloud SQL PITR for precise recovery from accidental data deletion. Design chaos engineering experiments to validate failover assumptions before incidents happen. Apply Production Readiness Reviews to gate production launches.
Google invented the SLI/SLO/error-budget framework. PCA tests it more rigorously than any other domain — define metrics correctly + alert from burn rate, not from individual errors.
- SLI (Service Level Indicator): a metric expressed as good_events / total_events. E.g., "HTTP 2xx responses / total HTTP requests". NOT "uptime %" (vague + uninformative).
- SLO (Service Level Objective): a target for the SLI over a rolling window. "99.9% of HTTP requests over 28 days return 2xx within 500ms". Defines the success threshold.
- Error budget: the allowable failure budget = (1 - SLO) × window. For 99.9% over 28 days = 40.32 minutes of "failure" allowed. Burning the budget too fast = engineering pause / postmortem.
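- Multi-window burn rate alerts: Google SRE Workbook pattern. Page when (1h burn > 14.4× AND 6h burn > 2×). Catches fast burn quickly without flapping on transient spikes. Note that the Workbook's own table pairs each long window with a shorter confirmation window (1h with 5m at 14.4×, 6h with 30m at 6×); the 1h + 6h pairing used here is a simplified variant.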
- Burn rate math: "burn rate X×" = consuming the budget X times faster than sustainable. 14.4× sustained for 1h = 14.4 hours' worth of budget gone in 1h, i.e. 2% of a 30-day budget (14.4 / 720).
- Cloud Monitoring SLOs: declarative SLO objects. Cloud Monitoring computes burn rate + provides built-in alerting policies for multi-window patterns.
A checkout API has an SLO of 99.9% of requests returning 2xx within 500ms over 28 days. Error budget = 0.1% × 40320 min = ~40 min per 28-day window. Multi-window burn rate alert: fire if (last 1h burn rate > 14.4× AND last 6h burn rate > 2×). Goes to PagerDuty. A slow burn (1% over a week) is caught by a secondary alert at (6h > 1× AND 24h > 0.5×) routed to email: a heads-up, not a page.
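The arithmetic in this example can be checked with a short sketch. The function names are hypothetical and the thresholds are simply the ones quoted above; Cloud Monitoring computes burn rate for you, so this only shows the underlying math.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed 'bad' minutes in the window: (1 - SLO) x window length."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being consumed."""
    return error_ratio / (1 - slo)


def should_page(err_1h: float, err_6h: float, slo: float,
                fast: float = 14.4, slow: float = 2.0) -> bool:
    """Page only when both the 1h and the 6h burn rates exceed their thresholds."""
    return burn_rate(err_1h, slo) > fast and burn_rate(err_6h, slo) > slow


if __name__ == "__main__":
    slo = 0.999
    print(round(error_budget_minutes(slo, 28), 2))          # budget in minutes
    print(should_page(err_1h=0.02, err_6h=0.005, slo=slo))  # sustained burn
    print(should_page(err_1h=0.02, err_6h=0.001, slo=slo))  # short spike only
```

Requiring both windows is what suppresses a brief spike: the 1h rate can jump, but the 6h rate stays low until the problem persists.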
GCP's DR primitives mirror AWS's but with GCP-specific service flavours. PCA expects you to map RTO/RPO requirements to a specific GCP design.
- DR strategy ladder: Backup & Restore (RTO hours), Pilot Light, Warm Standby (RTO minutes), Active/Active (RTO < 1 min). Same as AWS — design picks based on RTO/RPO targets.
- Cloud SQL HA + cross-region read replica: primary in zone A, sync standby in zone B (HA). Async read replicas can be cross-region for DR. Promote replica = manual cross-region failover.
- Cloud SQL PITR (Point-in-Time Recovery): binary-log-based recovery to any second within retention (default 7 days). Use for accidental-deletion recovery.
- Spanner multi-region: 99.999% SLA, automatic cross-region failover, no RTO concept (writes globally consistent). For applications that can afford Spanner cost.
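- GCS multi-region buckets: data replicated across 2+ regions automatically. Default replication is asynchronous, with a target of 99.9% of new objects replicated within 1 hour; a 15-minute RPO requires turbo replication, which is available on dual-region buckets. RTO depends on app logic. Standard, Nearline, Coldline, and Archive classes are all available in multi-region locations.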
- Backup & DR service: GCP's centralised backup orchestration. Supports VMs, databases, file systems. Replaces hand-rolled backup scripts.
A SaaS needs RTO 30 min / RPO 5 min cross-region. Choice: warm standby. Primary in europe-west1; DR in europe-west4 with scaled-down 25% capacity. Cloud SQL with cross-region read replica in DR region; promote on regional failure. GCS bucket configured multi-region (eu) for user uploads. Cloud DNS health checks flip the LB endpoint when the primary region fails its health checks. Cloud SQL PITR retained 7 days for accidental-deletion recovery.
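Mapping an RTO target onto the strategy ladder can be sketched as a lookup. The cut-offs below are assumptions chosen for illustration (the ladder above only says "hours", "minutes", and "< 1 min"), and `choose_dr_strategy` is a hypothetical helper; a real design also weighs RPO and cost.

```python
def choose_dr_strategy(rto_minutes: float) -> str:
    """Pick the cheapest ladder rung whose typical RTO fits the target."""
    if rto_minutes < 1:
        return "Active/Active"
    if rto_minutes <= 60:       # assumed boundary for "RTO minutes"
        return "Warm Standby"
    if rto_minutes <= 240:      # assumed boundary; the source gives no figure
        return "Pilot Light"
    return "Backup & Restore"   # "RTO hours"


if __name__ == "__main__":
    for target in (0.5, 30, 120, 600):
        print(target, "->", choose_dr_strategy(target))
```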
Two SRE-discipline practices that PCA expects you to understand at the design-question level. Chaos engineering validates assumptions; PRR gates new launches.
- Chaos engineering: deliberate fault injection in production (or production-like) to validate failover assumptions. Kill a VM → observe SLI; sever a zone → observe failover.
- Game days: scheduled chaos exercises with the on-call team. Simulate an incident; team works the runbook. Catches gaps before real incidents.
- Production Readiness Review (PRR): SRE-led review before production launch. Checklist: capacity plan, runbook, dashboard, alert rules, dependencies, SLO, error-budget plan, rollback procedure.
- Postmortems: blameless writeup after every incident. Includes timeline, root cause, action items, monitoring gaps. Knowledge base for future incidents.
- Toil reduction: SRE principle — engineering time should be 50%+ on automation/improvement, not toil (manual repetitive work). Toil > 50% = hire / automate.
- Cloud Logging exclusion filters: exclude noisy log entries at ingest to control cost. Critical for high-volume services, where ingestion charges grow with every log line written.
Pre-launch PRR for a new service: SRE reviews capacity (predicted peak QPS = 5k, MIG max = 20 instances at 250 QPS each ✓), SLO (99.9% / 500ms ✓), runbook (5 common incident playbooks documented ✓), alerts (multi-window burn rate to PagerDuty ✓), dashboard (RPS, latency, error rate, dependency health ✓), rollback (Cloud Deploy auto-rollback on alarm ✓). Pre-launch game day: simulate zone failure; team failover-flips with the runbook in 8 minutes — well inside the 30-min RTO. Launch approved.
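The capacity line of the PRR is simple arithmetic that is worth making explicit, because 20 instances at 250 QPS covers the 5k peak with no spare headroom. The helpers below are a hypothetical sketch; the headroom percentage is an assumed parameter, not something the review above specifies.

```python
def instances_needed(peak_qps: int, qps_per_instance: int, headroom_pct: int = 0) -> int:
    """Ceiling of peak * (1 + headroom) / per-instance capacity, in integer math."""
    numerator = peak_qps * (100 + headroom_pct)
    denominator = 100 * qps_per_instance
    return -(-numerator // denominator)  # ceiling division without floats


def fits_mig(peak_qps: int, qps_per_instance: int, mig_max: int, headroom_pct: int = 0) -> bool:
    """True when the managed instance group's max size covers the requirement."""
    return instances_needed(peak_qps, qps_per_instance, headroom_pct) <= mig_max


if __name__ == "__main__":
    print(instances_needed(5000, 250))             # exact fit at peak
    print(fits_mig(5000, 250, mig_max=20))         # passes with zero headroom
    print(fits_mig(5000, 250, mig_max=20, headroom_pct=20))
```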
Shared VPC vs VPC Peering
Shared VPC: multiple service projects share one host project's VPC. Centralized networking, routing, and firewall rules. VPC Peering: two VPCs connect bidirectionally but do NOT share routes transitively (A↔B, B↔C ≠ A↔C). Choose Shared VPC when you need centralized egress control or a common DNS/proxy architecture.
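The non-transitive rule is easy to see in a toy model where reachability requires a direct peering. `can_reach` and the network names are hypothetical; this is not a GCP API.

```python
from typing import FrozenSet, Set


def can_reach(peerings: Set[FrozenSet[str]], src: str, dst: str) -> bool:
    """VPC Peering is non-transitive: only a direct peering grants reachability."""
    return frozenset((src, dst)) in peerings


if __name__ == "__main__":
    peerings = {frozenset(("A", "B")), frozenset(("B", "C"))}
    print(can_reach(peerings, "A", "B"))  # direct peering
    print(can_reach(peerings, "A", "C"))  # no path via B
```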
VPC Service Controls vs IAM
IAM controls WHO can access a resource. VPC Service Controls controls FROM WHERE. Even with valid IAM credentials, a request from outside the VPC SC perimeter is denied. This is the data exfiltration defense-in-depth layer — stolen credentials used from an attacker's network cannot exfiltrate data from BigQuery or Cloud Storage.
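The "who AND from where" rule can be sketched as two independent checks that must both pass. The `Request` type, `is_allowed`, and the sample values are hypothetical and greatly simplified compared with real IAM policies and service perimeters.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class Request:
    principal: str        # who is calling
    source_network: str   # where the call originates


def is_allowed(req: Request, iam_members: Set[str], perimeter_networks: Set[str]) -> bool:
    """Valid credentials are necessary but not sufficient outside the perimeter."""
    iam_ok = req.principal in iam_members
    perimeter_ok = req.source_network in perimeter_networks
    return iam_ok and perimeter_ok


if __name__ == "__main__":
    members = {"analyst@example.com"}
    perimeter = {"corp-vpc"}
    stolen = Request("analyst@example.com", "attacker-network")
    print(is_allowed(stolen, members, perimeter))  # blocked despite valid identity
```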
SLO Burn Rate Alerting
Don't alert on raw error rate; alert on burn rate. A 14.4x burn rate on a 99.9% SLO consumes 2% of a 30-day budget per hour and exhausts the whole budget in about 50 hours. Configure multi-window alerts: short window (1h) detects fast burns, long window (6h) suppresses single-spike noise. This pattern from the Google SRE Workbook is the #1 high-value SRE topic on the PCA exam.
6-week study plan
Top 4 mistakes on the GCP PCA exam
- Confusing Cloud Spanner with Cloud SQL for global scenarios. Cloud SQL is single-region (read replicas are not globally consistent). Cloud Spanner is the only GCP database that offers global strong consistency with ACID transactions at scale. Any question mentioning "multi-region", "global", and "strong consistency" together = Cloud Spanner.
- Picking IAM over VPC Service Controls for data exfiltration prevention. IAM controls who, VPC SC controls from where. Compromised credentials used from an external network are blocked by VPC SC even if the IAM policy would allow the operation. The keyword "even if credentials are compromised" always points to VPC Service Controls.
- Using burn rate alerting incorrectly in case studies. Many candidates configure simple error-rate threshold alerts and get tripped up on SRE questions. The Google SRE Workbook multi-window pattern (14.4x fast burn + 2x slow burn) is the expected answer. If a question asks about "proactive" error budget management, burn rate alerting is always the answer.
- Forgetting to set minimum nodes to 0 for GPU node pool cost optimization. Simply enabling Cluster Autoscaler doesn't scale to zero. You must set the minimum node count to 0 on the GPU node pool and use taints/tolerations to ensure only training pods can schedule there. Without min=0, you pay for GPU nodes even when no training jobs run.
GCP PCA vs GCP ACE — Key differences
GCP Associate Cloud Engineer (ACE)
- Configures and deploys GCP resources
- Single-service questions (how to set up a GKE cluster)
- Basic IAM and networking
- $125 USD, 3 years validity
- Recommended before attempting PCA
GCP Professional Cloud Architect (PCA)
- Designs complete multi-service architectures
- Multi-service case studies (design + justify tradeoffs)
- Advanced: VPC SC, Anthos, SLO engineering
- $200 USD, 2 years validity
- Highest-impact GCP professional cert