GCP Associate Cloud Engineer

Master Google Cloud Platform fundamentals: compute, storage, networking, IAM, GKE, serverless, and operations. Covers all ACE exam objectives.

6 modules · 30 hours · Intermediate level

Course Modules

01
Google Cloud Foundations & Resource Hierarchy
3 lessons · ~3 hours
Every GCP scenario hangs off one diagram: Organization → Folder → Project → Resource, with IAM policies inheriting top-down and resources living inside regions and zones. Master that hierarchy plus the four billing levers (SUDs, CUDs, Spot, free tier) and the day-1 admin tasks — gcloud CLI, enabling APIs, setting quotas — and the ACE exam's foundations domain becomes a series of "where does this policy attach?" questions with obvious answers.
Cloud Concepts & GCP Architecture

What is Google Cloud?

  • GCP is Google's public cloud — the same infrastructure that runs Search, YouTube, and Gmail
  • Available in 40+ regions, each with multiple zones (typically 3) for high availability
  • A region is a geographic area (e.g., us-central1); a zone is a single deployment area within a region (e.g., us-central1-a)
  • Google's private fiber backbone connects all regions for low-latency global routing; Jupiter (the data-center fabric) and Andromeda (the network virtualization stack) operate inside that network
  • GCP follows the shared responsibility model: Google manages physical infrastructure; you manage your workloads, data, and access controls

GCP Resource Hierarchy

  • Organization — top-level node, maps to a Google Workspace or Cloud Identity domain
  • Folder — optional grouping layer (e.g., by department or environment); enables IAM/Org Policy inheritance
  • Project — the primary unit: billing, API enablement, and IAM boundaries. Every resource belongs to a project
  • Resources — VMs, buckets, databases, etc. within a project
  • IAM allow policies applied at a higher level inherit downward and are additive: a lower level can grant additional access but cannot remove access granted above it
Think of it as: Organization > Folder(s) > Project > Resource. When you want to isolate dev/staging/prod, use separate projects. When you want to apply a policy to an entire department, use a folder.
The ACE exam frequently asks about the resource hierarchy and where IAM policies should be applied. Understand inheritance: a role granted at the Organization level propagates to all child resources.
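The inheritance rule above can be modeled in a few lines. This is a minimal sketch, not how IAM is implemented: the `Node` class, the `effective_roles` helper, and the names `example.com`, `backend-folder`, `backend-dev`, and `alice@example.com` are all hypothetical. It only illustrates that the effective access on a node is the union of the bindings on that node and on every ancestor.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One node in the hierarchy: organization, folder, project, or resource."""
    name: str
    parent: Optional["Node"] = None
    # role -> principals bound directly on this node (hypothetical model)
    bindings: dict = field(default_factory=dict)

    def grant(self, role: str, principal: str) -> None:
        self.bindings.setdefault(role, set()).add(principal)


def effective_roles(node: Node, principal: str) -> set:
    """Union of the roles bound on the node and on each of its ancestors."""
    roles = set()
    current = node
    while current is not None:
        for role, members in current.bindings.items():
            if principal in members:
                roles.add(role)
        current = current.parent
    return roles


org = Node("example.com")
folder = Node("backend-folder", parent=org)
project = Node("backend-dev", parent=folder)

org.grant("roles/viewer", "user:alice@example.com")
project.grant("roles/storage.objectViewer", "user:alice@example.com")

print(sorted(effective_roles(project, "user:alice@example.com")))
print(sorted(effective_roles(folder, "user:alice@example.com")))
```

The project sees both roles, while the folder sees only the role granted at the organization. That is why a broad role granted high in the tree is hard to contain further down.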
Cloud SDK & gcloud CLI Essentials

Setting Up Your Environment

  • Install the Cloud SDK (gcloud CLI): provides gcloud, gsutil (Storage), and bq (BigQuery); kubectl is available as an optional component
  • gcloud init — interactive setup: authenticate, set default project and region
  • gcloud config set project PROJECT_ID — set active project
  • gcloud config set compute/region us-central1 — set default region
  • gcloud config configurations create my-config — manage multiple environments

Essential gcloud Commands

  • gcloud compute instances list — list all VMs in current project
  • gcloud compute instances create NAME --zone=ZONE --machine-type=e2-medium
  • gcloud compute ssh INSTANCE --zone=ZONE — SSH with automatic key management
  • gcloud services enable compute.googleapis.com — enable APIs (required before use)
  • gcloud projects list — list all accessible projects
Most APIs are disabled by default in new projects. Always enable the required API (Compute Engine API, Kubernetes Engine API, etc.) before making API calls. The exam tests this.
Billing, Quotas & Cost Management

Billing Concepts

  • Each project is linked to a billing account; billing accounts can cover multiple projects
  • Labels (key-value pairs on resources) enable cost allocation and reporting per team/environment
  • Set budgets and alerts in Cloud Billing to receive email or Pub/Sub notifications at spending thresholds (e.g., 50%, 80%, 100%)
  • Use Cloud Cost Management and Recommender for rightsizing suggestions

Pricing Models

  • Sustained Use Discounts (SUDs) — automatic discounts of up to 30% for eligible VMs that run most of the month; no commitment required
  • Committed Use Discounts (CUDs) — 1- or 3-year commitments for up to 57% off most machine types (up to 70% for memory-optimized)
  • Spot VMs — 60–91% off; can be preempted at any time with a 30-second notice; ideal for fault-tolerant batch workloads
  • Free Tier — always-free products include: 1 e2-micro VM/month, 5 GB Cloud Storage, Cloud Functions invocations, BigQuery queries up to 1 TB/month
Know the difference between SUDs (automatic, no action), CUDs (commitment-based), and Spot VMs (interruptible). The exam tests when to recommend each pricing model.
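A quick arithmetic sketch makes the gap between the pricing models concrete. The `discounted_cost` helper and the $100 list price are hypothetical, and the discount fractions are the upper bounds quoted in this lesson, not quotes for any specific machine type.

```python
def discounted_cost(list_price: float, discount: float) -> float:
    """Return the price after applying a fractional discount (0.0 to 1.0)."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be a fraction between 0 and 1")
    return round(list_price * (1.0 - discount), 2)


LIST_PRICE = 100.00  # hypothetical on-demand cost per month, in dollars

# Upper-bound discounts quoted in the lesson (assumed best case).
committed_use = discounted_cost(LIST_PRICE, 0.57)
spot = discounted_cost(LIST_PRICE, 0.91)

print(f"CUD at 57% off:  ${committed_use:.2f}")
print(f"Spot at 91% off: ${spot:.2f}")
```

Spot is the cheapest line, but only the committed-use price comes with a guarantee that the VM keeps running, which is the trade-off the exam questions probe.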

☁ Scenario — structuring a GCP resource hierarchy for a startup

Situation: A startup has 3 teams (frontend, backend, data). Each needs isolated billing and separate IAM boundaries, but all engineers share a single Google Workspace account.

Design: One Organization node (tied to the Google Workspace domain). One Folder per team (frontend-folder, backend-folder, data-folder). One Project per environment per team (e.g., backend-dev, backend-prod). Resources (VMs, buckets, databases) live inside projects. IAM policies applied at the folder level propagate to all child projects automatically.

Why projects matter on the ACE exam: Projects are the billing unit and IAM boundary. All GCP resources belong to exactly one project. The gcloud config set project PROJECT_ID command sets the default project for CLI commands; forgetting it is a common cause of commands running against the wrong project.

Key takeaways
  • The hierarchy is Organization → Folder → Project → Resource; IAM policies inherit downward — bind at the lowest level that satisfies the requirement (least privilege).
  • Projects are the unit of billing, quota, and API enablement; APIs are off by default in new projects (gcloud services enable first or expect a 403).
  • Discount stack: SUDs apply automatically, CUDs need a 1- or 3-year commitment, Spot/Preemptible VMs are 60–91% off but can be evicted at any time — pick by workload tolerance to interruption.
⚡ Mini-quiz — Drill resource hierarchy, IAM inheritance, project APIs, and pricing models.
02
Compute Engine & Managed Instance Groups
3 lessons · ~6 hours
Compute Engine is the IaaS workhorse: VMs with machine types and disks, Managed Instance Groups (MIGs) for autoscaling + self-healing, and Load Balancers in front (Global HTTP(S), Internal, Network, TCP/SSL Proxy). The ACE exam loves "which LB and which disk type?" — answers depend on traffic shape (internal vs internet, HTTP vs TCP) and persistence needs (pd-balanced default, local SSD ephemeral, pd-ssd for high IOPS).
VM Instance Fundamentals

Machine Types

  • General purpose (E2, N2, N2D, T2D) — balanced price/performance for most workloads
  • Compute-optimized (C2, C3) — high CPU frequency for compute-intensive apps
  • Memory-optimized (M2, M3) — large in-memory databases, SAP HANA
  • Custom machine types — specify exact vCPU and memory for right-sizing
  • Accelerator-optimized (A2, G2) — NVIDIA GPUs for ML/AI workloads

Boot Disks & Persistent Storage

  • Standard Persistent Disk (pd-standard) — HDD, cost-efficient, sequential workloads
  • Balanced Persistent Disk (pd-balanced) — SSD, good general purpose (recommended default)
  • SSD Persistent Disk (pd-ssd) — high IOPS for databases
  • Local SSDs — ephemeral NVMe attached directly to the host; very fast but data lost on VM stop
  • Snapshots — incremental backups of persistent disks; stored in Cloud Storage; used for disaster recovery

VM Lifecycle

  • States: Provisioning → Staging → Running → Stopping → Terminated
  • Stopped VMs do not incur compute charges but retain disk storage costs
  • Metadata server at 169.254.169.254 — VMs access instance metadata and service account tokens without needing key files
Know when to use each disk type. For most ACE scenarios: pd-balanced is the default recommendation. Local SSDs are fast but ephemeral — don't use them for persistent data.
Instance Templates & Managed Instance Groups

Instance Templates

  • Define VM configuration once (machine type, disk, network, service account, startup script) — reuse for MIGs and Spot VMs
  • Templates are immutable — create a new version to update; MIGs rolling updates use the new template

Managed Instance Groups (MIGs)

  • MIGs deploy identical VM instances from a template, enabling autoscaling and autohealing
  • Autoscaling — adds/removes VMs based on CPU utilization, HTTP load balancing capacity, or custom metrics
  • Autohealing — uses health checks to detect and automatically replace unhealthy VMs
  • Rolling updates — gradually deploy new templates across the MIG with configurable maxSurge and maxUnavailable
  • Regional MIGs — spread instances across multiple zones for high availability
MIGs are the backbone of scalable, resilient Compute Engine architecture. Pair a regional MIG with a Global HTTP(S) Load Balancer for a highly available web application.
Load Balancing on GCP

Load Balancer Types

  • Global External HTTP(S) Load Balancer — Layer 7, URL routing, global Anycast IP, integrates with Cloud CDN and Cloud Armor
  • Regional External TCP/UDP Network LB — Layer 4, non-proxy, preserves client IP, for non-HTTP protocols
  • Internal TCP/UDP Load Balancer — Layer 4, private VPC traffic only
  • Internal HTTP(S) Load Balancer — Layer 7, for microservices within VPC
  • SSL Proxy and TCP Proxy LB — terminates SSL/TCP connections globally

Key Concepts

  • Health checks — LBs use health checks to route only to healthy backends
  • Backend services — define the backend (MIG, NEG) and health check for the LB
  • URL maps — HTTP(S) LB routing rules (host/path-based)
  • Cloud Armor — WAF and DDoS protection; attaches to the Global HTTP(S) LB
For internet-facing web apps needing global routing and DDoS protection: Global External HTTP(S) LB. For internal microservices: Internal HTTP(S) LB. For UDP/non-HTTP external: Regional Network LB.
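The selection rule above reduces to two questions: is the traffic internet-facing, and is it HTTP(S)? A minimal sketch of that lookup follows. The table and the `pick_load_balancer` name are illustrative, and the proxy load balancers (SSL Proxy, TCP Proxy) are deliberately left out to keep the decision two-dimensional.

```python
# (internet_facing, speaks_http) -> load balancer named in this lesson
LOAD_BALANCERS = {
    (True, True): "Global External HTTP(S) Load Balancer",
    (True, False): "Regional External TCP/UDP Network Load Balancer",
    (False, True): "Internal HTTP(S) Load Balancer",
    (False, False): "Internal TCP/UDP Load Balancer",
}


def pick_load_balancer(internet_facing: bool, speaks_http: bool) -> str:
    """Map the two traffic-shape questions to a load balancer type."""
    return LOAD_BALANCERS[(internet_facing, speaks_http)]


print(pick_load_balancer(internet_facing=True, speaks_http=True))
print(pick_load_balancer(internet_facing=False, speaks_http=True))
print(pick_load_balancer(internet_facing=True, speaks_http=False))
```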

☁ Scenario — deploying a preemptible VM for batch processing

Situation: A data pipeline needs to process 500 GB of logs nightly. The job takes ~2 hours and can restart from a checkpoint if interrupted. Cost matters — this job runs every night.

Walk:
1) Create the VM: gcloud compute instances create batch-worker-1 --zone=us-central1-a --machine-type=n1-standard-4 --preemptible --image-family=debian-11 --image-project=debian-cloud. Preemptible VMs cost ~80% less but can be reclaimed by GCP with 30 seconds' notice.
2) The script handles SIGTERM: it saves a checkpoint to Cloud Storage before shutdown.
3) A Cloud Scheduler job retriggers the pipeline each night; if the VM was preempted, the job resumes from the last checkpoint.
4) After migration: cost drops from ~$150/night (standard) to ~$30/night (preemptible).

ACE exam note: Preemptible VMs are ideal for fault-tolerant batch jobs. Spot VMs (the successor) use the same pricing model but drop the 24-hour maximum runtime that preemptible VMs have. For long-running services, use standard or committed-use VMs instead.
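The checkpoint-and-resume step in the walk can be sketched as follows. This is a simplified illustration under stated assumptions: it assumes the process receives SIGTERM during the shutdown window, as the walk describes, and it writes the checkpoint to a local JSON file as a stand-in for Cloud Storage. The `CheckpointedJob` class and the file name are hypothetical.

```python
import json
import signal
import tempfile
from pathlib import Path


class CheckpointedJob:
    """Processes items in order and remembers how far it got."""

    def __init__(self, checkpoint_path: Path):
        self.checkpoint_path = checkpoint_path
        self.stop_requested = False

    def request_stop(self, signum=None, frame=None) -> None:
        # Keep the signal handler tiny: only set a flag.
        self.stop_requested = True

    def load_offset(self) -> int:
        if self.checkpoint_path.exists():
            return json.loads(self.checkpoint_path.read_text())["next_index"]
        return 0

    def save_offset(self, next_index: int) -> None:
        self.checkpoint_path.write_text(json.dumps({"next_index": next_index}))

    def run(self, items, handle) -> bool:
        """Return True when every item was processed, False if stopped early."""
        index = self.load_offset()
        while index < len(items):
            if self.stop_requested:
                self.save_offset(index)  # resume point for the next run
                return False
            handle(items[index])
            index += 1
        # Finished: clear the checkpoint so the next nightly run starts fresh.
        self.checkpoint_path.unlink(missing_ok=True)
        return True


if __name__ == "__main__":
    path = Path(tempfile.gettempdir()) / "batch-worker.checkpoint.json"
    job = CheckpointedJob(path)
    signal.signal(signal.SIGTERM, job.request_stop)
    done = job.run(["log-000", "log-001", "log-002"], handle=print)
    print("finished" if done else "stopped early; will resume from checkpoint")
```

Checking the flag between items rather than doing work inside the handler keeps the shutdown path short, which matters when the notice window is only 30 seconds.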

Key takeaways
  • Pick the machine type by workload: e2 general-purpose / cost-optimized, n2/n2d balanced, c2 compute-intensive, m2 memory-optimized; default disk = pd-balanced, never local SSD for persistent data.
  • MIGs = autoscaling + self-healing + rolling updates; regional MIGs span zones for HA, instance templates declare the immutable spec, and stateful MIGs preserve per-VM disks/IPs across recreation.
  • LB choice cheat-sheet: Global External HTTP(S) for internet web apps (Cloud CDN + Armor friendly), Internal HTTP(S) for service-to-service inside the VPC, Network LB for non-HTTP/UDP external traffic.
⚡ Mini-quiz — Drill machine types, disk choices, MIG behaviours, and load-balancer selection.
03
Kubernetes Engine (GKE) & Containers
3 lessons · ~6 hours
GKE is the second-heaviest ACE domain. Three decisions drive every exam question: cluster mode (Autopilot fully managed vs Standard with node-level control), workload type (Deployment / StatefulSet / DaemonSet / Job), and identity (Workload Identity binding KSAs to GCP service accounts — never raw JSON keys). Master those and the rest is service exposure and node-pool sizing.
GKE Cluster Architecture

Cluster Modes

  • GKE Standard — you manage node configuration, machine types, node pools; full control
  • GKE Autopilot — Google manages node infrastructure; you only define pod specs; pay per pod not node
  • Regional clusters — control plane and nodes replicated across 3 zones; no single zone is a SPOF; recommended for production
  • Zonal clusters — single control plane in one zone; lower cost but less resilient

Node Pools

  • A cluster can have multiple node pools with different machine types (e.g., standard pool + GPU pool)
  • Node pools can be independently upgraded and scaled
  • Cluster Autoscaler — automatically adds nodes when pods are pending; removes nodes when underutilized
  • Node auto-provisioning — creates new node pools automatically for pending pods requiring specific resources
For the ACE exam: use Regional clusters for production HA. Use GKE Autopilot when the team wants minimal infrastructure management. Use Standard when you need GPU nodes or specific OS configurations.
Kubernetes Workload Objects

Core Objects

  • Pod — smallest deployable unit; one or more containers sharing network/storage
  • Deployment — manages stateless Pods with rolling updates and rollbacks; use for web apps and APIs
  • StatefulSet — stateful workloads with stable network identity and persistent per-pod volumes; use for databases
  • DaemonSet — ensures one Pod per node; use for log collectors, monitoring agents (Fluentd, Prometheus node exporter)
  • CronJob — scheduled batch jobs on a cron schedule

Scaling

  • Horizontal Pod Autoscaler (HPA) — scales Pod replicas based on CPU/memory or custom metrics
  • Vertical Pod Autoscaler (VPA) — adjusts Pod resource requests/limits automatically
  • kubectl scale deployment nginx --replicas=5 — manual scaling

Networking

  • ClusterIP — internal service, reachable only within the cluster
  • NodePort — exposes service on each node's IP at a static port
  • LoadBalancer — provisions a GCP External Load Balancer for the service
  • Ingress — HTTP(S) routing rules; on GKE creates a Global HTTP(S) Load Balancer
Know which Kubernetes object to use for each scenario. Deployment = stateless. StatefulSet = stateful with stable identity. DaemonSet = one pod per node. This is heavily tested.
Workload Identity & GKE Security

Workload Identity

  • Workload Identity is the recommended way to grant GKE workloads access to GCP APIs
  • Maps a Kubernetes Service Account (KSA) to a GCP Service Account (GSA)
  • Pods use the KSA to impersonate the GSA — no key files stored in Secrets
  • Enable at cluster creation: --workload-pool=PROJECT_ID.svc.id.goog

Other GKE Security Best Practices

  • Use Private clusters — nodes have no external IPs; API server accessible only via authorized networks
  • Enable Binary Authorization — only signed, approved container images can run
  • Use Network Policies — restrict pod-to-pod traffic
  • Cloud SQL Auth Proxy as a sidecar for database connections — handles IAM auth and TLS
  • Enable Shielded GKE Nodes for protection against rootkits and bootkits
Workload Identity replaces the pattern of downloading a service account JSON key and mounting it as a Kubernetes Secret — which is a security risk if the Secret is misconfigured or exposed.
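The KSA-to-GSA mapping comes down to two strings: the IAM member that represents the Kubernetes Service Account, and the annotation on that KSA pointing at the Google service account. The sketch below only assembles those strings. The helper name and the example values (`my-project`, `default`, `api-ksa`, `api-gsa`) are hypothetical.

```python
def workload_identity_binding(project_id: str, namespace: str,
                              ksa_name: str, gsa_name: str) -> dict:
    """Build the identifiers needed to link a KSA to a GSA."""
    gsa_email = f"{gsa_name}@{project_id}.iam.gserviceaccount.com"
    return {
        # Role granted on the GSA so the KSA may impersonate it.
        "role": "roles/iam.workloadIdentityUser",
        # IAM member that stands for the KSA in the cluster's workload pool.
        "member": f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]",
        # Annotation placed on the KSA object in Kubernetes.
        "ksa_annotation": {"iam.gke.io/gcp-service-account": gsa_email},
        "gsa_email": gsa_email,
    }


binding = workload_identity_binding("my-project", "default", "api-ksa", "api-gsa")
for key, value in binding.items():
    print(f"{key}: {value}")
```

The member string reuses the workload pool name from the cluster flag (PROJECT_ID.svc.id.goog), which is why the pool must be enabled on the cluster before the binding has any effect.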

☁ Scenario — deploying a containerised API to GKE

Situation: A REST API packaged as a Docker image needs to run on GKE, scale from 2 to 10 replicas based on CPU, and be reachable via a public load balancer.

Walk:
1) Push image: docker tag api gcr.io/my-project/api:v1 && docker push gcr.io/my-project/api:v1.
2) Create cluster: gcloud container clusters create api-cluster --zone=us-central1-a --num-nodes=3.
3) Deploy: kubectl create deployment api --image=gcr.io/my-project/api:v1 --replicas=2.
4) Expose: kubectl expose deployment api --type=LoadBalancer --port=80 --target-port=8080. GKE provisions an external Layer 4 (passthrough) Network Load Balancer and assigns a public IP.
5) Autoscale: kubectl autoscale deployment api --cpu-percent=70 --min=2 --max=10. When average CPU exceeds 70% of the pods' CPU requests, new pods spin up automatically (the containers must declare CPU requests for this target to work).

ACE exam note: LoadBalancer Service type creates an external L4 load balancer. For L7 (HTTP routing, path-based, TLS termination) use an Ingress with a GKE Ingress controller.
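The autoscaling step in the walk follows the replica formula documented for the Kubernetes Horizontal Pod Autoscaler: desired = ceil(current × currentMetric / targetMetric), clamped to the configured bounds. The sketch below applies it with the scenario's numbers (70% target, 2 to 10 replicas). It is a simplification that ignores the controller's tolerance band and stabilization windows, and the function name is hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_cpu_percent: float,
                     target_cpu_percent: float,
                     min_replicas: int, max_replicas: int) -> int:
    """HPA-style sizing: scale in proportion to metric / target, then clamp."""
    raw = math.ceil(current_replicas * current_cpu_percent / target_cpu_percent)
    return max(min_replicas, min(max_replicas, raw))


# Scenario values: --cpu-percent=70 --min=2 --max=10
print(desired_replicas(2, 140, 70, 2, 10))  # load doubled relative to target
print(desired_replicas(4, 35, 70, 2, 10))   # load at half the target
print(desired_replicas(8, 140, 70, 2, 10))  # capped by the maximum
```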

Key takeaways
  • Autopilot for hands-off (pay per pod, Google manages nodes), Standard when you need GPU nodes, custom OS images, or per-node config; Regional clusters for production HA (multi-zone control plane).
  • Workload types: Deployment stateless / rolling updates, StatefulSet stable identity + ordered rollout, DaemonSet one pod per node (log shipper / agent), Job + CronJob for batch.
  • Identity = Workload Identity (KSA ↔ GSA mapping, no JSON keys); expose with ClusterIP (internal), NodePort (per-node), LoadBalancer (cloud LB), or Ingress (HTTP routing + TLS termination).
⚡ Mini-quiz — Drill Autopilot vs Standard, workload types, services, and Workload Identity.
04
Serverless, Storage & Databases
4 lessons · ~6 hours
The serverless + data domain is a giant decision tree: App Engine Std / Cloud Run / Cloud Functions for compute; Cloud Storage with four classes (Standard / Nearline / Coldline / Archive) + Lifecycle + Versioning for objects; Cloud SQL / Spanner / Firestore / Bigtable / Memorystore for databases. The exam asks the same question repeatedly: given these constraints (scale, consistency, latency, schema), which managed service? Module 04 walks each leaf of that tree.
Serverless Compute: Cloud Run, Cloud Functions & App Engine

Cloud Run

  • Runs stateless containers on a fully managed platform; scales to zero; pay per CPU/memory during request processing
  • Supports any language/runtime packaged as a Docker container
  • Traffic splitting — split traffic between revisions for canary deployments
  • Invoke via HTTP or Pub/Sub push subscriptions

Cloud Functions (2nd gen)

  • Event-driven serverless functions; trigger via HTTP, Pub/Sub, Cloud Storage, Firestore, etc.
  • 2nd gen is built on Cloud Run — longer timeouts (up to 60 min for HTTP-triggered functions), larger instances
  • Pair with Cloud Scheduler for cron-like scheduled execution

App Engine

  • Standard environment — language-specific runtimes (Python, Node.js, Go, Java, PHP, Ruby); scales to zero; fast startup
  • Flexible environment — custom Docker containers; minimum 1 instance (cannot scale to zero); use when Standard constraints are too limiting
  • Versions and traffic splitting enable canary and blue/green deployments
Scale-to-zero requires App Engine Standard or Cloud Run — not Flexible. This is a common exam trap. If cost optimization for idle apps is the goal, avoid Flexible.
Cloud Storage Deep Dive

Storage Classes

  • Standard — frequently accessed data; no minimum storage duration
  • Nearline — accessed at most once per month; 30-day minimum; ~50% cheaper than Standard
  • Coldline — accessed at most once per 90 days; 90-day minimum
  • Archive — long-term archive; <1 access/year; 365-day minimum; cheapest per GB

Key Features

  • Object Lifecycle Management — rules to auto-transition or delete objects based on age, version count, etc.
  • Versioning — retains every version with a generation number; enables accidental deletion recovery
  • Uniform bucket-level access — disables ACLs; IAM-only access control (recommended)
  • Signed URLs — time-limited, pre-signed URLs for unauthenticated access to specific objects
  • Retention policies — prevent objects from being deleted or modified before a minimum age
Lifecycle rules + Versioning are frequently tested together. A common question: "automatically delete objects older than 30 days" → Lifecycle rule with Age=30 + Delete action. "Prevent accidental deletion" → enable Versioning.
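The "Age=30 + Delete" rule from the exam question maps onto a small JSON lifecycle configuration. The sketch below builds that document in the format Cloud Storage lifecycle files use. The helper names are hypothetical, and the second rule (move to Nearline after 30 days) is an added illustration of the SetStorageClass action.

```python
import json


def delete_after(age_days: int) -> dict:
    """Lifecycle rule: delete objects older than age_days."""
    return {"action": {"type": "Delete"}, "condition": {"age": age_days}}


def move_to_class_after(age_days: int, storage_class: str) -> dict:
    """Lifecycle rule: change storage class once objects reach age_days."""
    return {
        "action": {"type": "SetStorageClass", "storageClass": storage_class},
        "condition": {"age": age_days},
    }


def lifecycle_config(*rules: dict) -> str:
    """Serialize rules into the JSON layout of a lifecycle file."""
    return json.dumps({"rule": list(rules)}, indent=2)


print(lifecycle_config(delete_after(30)))
print(lifecycle_config(move_to_class_after(30, "NEARLINE"), delete_after(365)))
```

A file with this content is what you would hand to the bucket's lifecycle setting (for example through the lifecycle file option of the storage CLI); the exact command is left out here on purpose.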
Relational Databases: Cloud SQL & Cloud Spanner

Cloud SQL

  • Fully managed MySQL, PostgreSQL, or SQL Server; regional (not global)
  • High Availability (HA) — synchronous standby in a different zone; automatic failover
  • Read replicas — asynchronous copies for read-heavy workloads; reduce primary load
  • Connect securely via Cloud SQL Auth Proxy (recommended) or authorized networks
  • Automated backups and point-in-time recovery (PITR) up to 7 days

Cloud Spanner

  • Globally distributed, horizontally scalable relational database with ACID transactions
  • 99.999% SLA for multi-region instances — use when Cloud SQL's regional scope is insufficient
  • Ideal for: global financial apps, inventory systems, gaming leaderboards requiring strong consistency at scale
  • Significantly more expensive than Cloud SQL — use it only when global distribution is truly required
Cloud SQL = regional relational DB. Cloud Spanner = global relational DB. If the scenario mentions "global", "multiple regions", and "strong consistency", the answer is Spanner.
NoSQL Databases: Firestore, Bigtable & Memorystore

Firestore

  • Serverless NoSQL document database; real-time sync; offline support
  • Best for: mobile apps, web apps, user profiles, content management
  • Two modes: Native mode (new apps, real-time) and Datastore mode (server-side, legacy)

Cloud Bigtable

  • Fully managed, wide-column NoSQL database; petabyte scale; millisecond latency
  • Best for: time-series data, IoT sensor data, financial data, ML training datasets
  • HBase-compatible API; integrates with Hadoop, Dataflow, Dataproc
  • NOT suitable for: transactions, complex queries, small datasets (<1 TB)

Memorystore

  • Fully managed Redis and Memcached — no infrastructure management
  • Use for: session caching, real-time leaderboards, message queuing, rate limiting
  • In-VPC only — not publicly accessible
Database choice questions: Firestore = mobile/web app document data. Bigtable = time-series/IoT at massive scale. Memorystore = in-memory caching/session. Cloud SQL/Spanner = relational/transactional.
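The database cheat-sheet above can be read as an ordered set of checks. This sketch encodes that order. The function and parameter names are hypothetical, and real selection involves more factors (cost, latency targets, query patterns) than four booleans.

```python
def pick_database(relational: bool, global_scale: bool = False,
                  in_memory_cache: bool = False,
                  time_series_at_scale: bool = False) -> str:
    """Walk the lesson's decision order and return the matching service."""
    if in_memory_cache:
        return "Memorystore"
    if relational:
        # Regional relational vs globally distributed relational.
        return "Cloud Spanner" if global_scale else "Cloud SQL"
    if time_series_at_scale:
        return "Cloud Bigtable"
    # Default for document-style mobile/web application data.
    return "Firestore"


print(pick_database(relational=True))
print(pick_database(relational=True, global_scale=True))
print(pick_database(relational=False, time_series_at_scale=True))
print(pick_database(relational=False))
```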

☁ Scenario — event-driven thumbnail generation with Cloud Functions + Cloud Storage

Situation: Users upload images to a Cloud Storage bucket. Every upload should trigger automatic thumbnail creation and save the thumbnail to a second bucket — no server should be provisioned or managed.

Walk:
1) Create two buckets: gs://uploads-raw and gs://uploads-thumbs.
2) Write a Cloud Function (Python or Node.js) triggered by google.storage.object.finalize on uploads-raw. The function downloads the uploaded file, generates a 200×200 thumbnail using Pillow/Sharp, and writes it to uploads-thumbs.
3) Deploy: gcloud functions deploy generate-thumbnail --runtime=python311 --trigger-bucket=uploads-raw --entry-point=handler.
4) Test: upload a JPG to uploads-raw → Cloud Function triggers → thumbnail appears in uploads-thumbs within 2 seconds.

ACE exam note: Cloud Functions = event-driven, serverless, per-invocation billing. Cloud Run = containerised, HTTP-triggered, also serverless. App Engine Standard = managed runtime, scales to zero. App Engine Flex = custom runtime (Docker), always warm instance.

Key takeaways
  • Serverless compute: Cloud Run for containerised stateless services (scale to zero), Cloud Functions for event-driven snippets, App Engine Standard for sandbox-friendly runtimes — App Engine Flexible cannot scale to zero.
  • Cloud Storage classes split on access frequency: Standard (hot), Nearline (~monthly), Coldline (~quarterly), Archive (yearly); add Lifecycle rules for auto-tiering/delete and Versioning for accidental-delete recovery.
  • DB picker: Cloud SQL regional relational (MySQL/Postgres/SQL Server), Spanner global relational with strong consistency, Firestore document for mobile/web, Bigtable wide-column for time-series/IoT, Memorystore for Redis/Memcached caching.
⚡ Mini-quiz — Drill serverless choice, storage classes & lifecycle, and DB selection.
05
Networking & IAM Security
3 lessons · ~5 hours
Two pieces of trivia define this domain: a VPC is global in GCP (subnets are regional and the VPC spans them, which differs from AWS), and IAM grants roles to principals on resources; you never assign individual permissions directly to users. From there, master firewall rules (stateful, evaluated by priority), VPC peering vs Shared VPC vs VPN/Interconnect, and the three IAM role types: Basic / Predefined / Custom. Least privilege via predefined roles is the recurring exam answer.
VPC Networking Fundamentals

VPC Concepts

  • GCP VPCs are global — a single VPC spans all regions (unlike AWS where VPCs are regional)
  • Subnets are regional — each subnet has an IP range in a specific region
  • Auto mode VPC — one /20 subnet per region created automatically; easy to start but can complicate peering
  • Custom mode VPC — you define all subnets; recommended for production (avoid IP overlap)
  • VMs in the same VPC communicate using internal IPs regardless of region — no VPC peering needed

Firewall Rules

  • VPCs have an implicit deny-all ingress and allow-all egress by default
  • Rules are stateful — established connections are tracked; return traffic is automatically allowed
  • Target with tags or service accounts to apply rules to specific VMs
  • Priority 0–65535 (lower = higher priority); 0.0.0.0/0 = all sources

Hybrid Connectivity

  • Cloud VPN — IPsec tunnels over the public internet; up to 3 Gbps per tunnel; simple setup
  • Dedicated Interconnect — direct physical connection to Google's network; 10 or 100 Gbps; 99.99% SLA with redundancy
  • Partner Interconnect — connect via a service provider; for locations without Dedicated Interconnect PoPs
  • Cloud NAT — allows VMs without external IPs to make outbound internet connections
A VPC is global in GCP (unlike AWS). This means a VM in us-central1 and a VM in europe-west1 in the same VPC can communicate via internal IPs without VPC peering.
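The firewall behaviour described in this lesson (priority order, implied deny on ingress) can be modeled with a short evaluator. This is a teaching sketch, not the real data plane: it matches only on source range and a single TCP port, the `IngressRule` type and rule names are hypothetical, and it assumes a deny wins when an allow and a deny share the same priority.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressRule:
    name: str
    priority: int       # 0-65535, lower number = evaluated first
    action: str         # "allow" or "deny"
    source_range: str   # CIDR, e.g. "10.0.1.0/24"
    port: int


def evaluate_ingress(rules, source_ip: str, port: int) -> str:
    """Return the action of the highest-priority matching rule."""
    src = ipaddress.ip_address(source_ip)
    matching = [
        r for r in rules
        if r.port == port and src in ipaddress.ip_network(r.source_range)
    ]
    if not matching:
        return "deny"  # implied deny-all ingress
    # Lowest priority number wins; on a tie, deny sorts ahead of allow.
    winner = min(matching, key=lambda r: (r.priority, r.action != "deny"))
    return winner.action


rules = [
    IngressRule("allow-frontend", 1000, "allow", "10.0.1.0/24", 8080),
    IngressRule("deny-all-8080", 2000, "deny", "0.0.0.0/0", 8080),
]

print(evaluate_ingress(rules, "10.0.1.5", 8080))     # frontend subnet
print(evaluate_ingress(rules, "203.0.113.9", 8080))  # internet source
print(evaluate_ingress(rules, "10.0.1.5", 22))       # no rule for SSH
```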
IAM: Identity & Access Management

IAM Principals

  • Google Account — individual user account (user@gmail.com)
  • Service Account — machine identity for workloads (apps, VMs, functions)
  • Google Group — set of users/service accounts; apply one IAM binding to many principals
  • Workspace/Cloud Identity Domain — all users in your organization's domain
  • allUsers — anyone on the internet (unauthenticated); use cautiously
  • allAuthenticatedUsers — any signed-in Google account

Roles

  • Basic roles — Owner, Editor, Viewer; coarse-grained; avoid in production
  • Predefined roles — curated by Google for specific services (e.g., roles/storage.objectViewer)
  • Custom roles — define exact permissions needed; enforce least privilege

Best Practices

  • Principle of least privilege — grant only the minimum permissions required
  • Prefer predefined roles over basic roles
  • Use service accounts for workloads — never use personal accounts
  • Avoid creating service account keys when possible — use Workload Identity or metadata server instead
  • Organization Policy Service — enforce constraints organization-wide (e.g., prevent public IPs, restrict allowed regions)
IAM questions often test least privilege. When asked which role to grant, pick the most specific predefined role that covers only what's needed. Don't grant Editor or Owner unless explicitly required.
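Granting "the most specific predefined role" means adding one member to one role binding in the resource's allow policy. The sketch below shows that merge on a policy represented as a plain dictionary with a `bindings` list of role/members entries. The `add_binding` helper and the example principals are hypothetical, and conditions and etags are ignored.

```python
def add_binding(policy: dict, role: str, member: str) -> dict:
    """Return a copy of the policy with member added to role (no duplicates)."""
    bindings = [
        {"role": b["role"], "members": list(b["members"])}
        for b in policy.get("bindings", [])
    ]
    for binding in bindings:
        if binding["role"] == role:
            if member not in binding["members"]:
                binding["members"].append(member)
            break
    else:
        # Role not present yet: create a new binding for it.
        bindings.append({"role": role, "members": [member]})
    return {**policy, "bindings": bindings}


policy = {"bindings": [{"role": "roles/viewer", "members": ["group:devs@example.com"]}]}
policy = add_binding(policy, "roles/storage.objectViewer",
                     "serviceAccount:app-sa@my-project.iam.gserviceaccount.com")
policy = add_binding(policy, "roles/viewer", "group:ops@example.com")
print(policy)
```

Binding the narrow role to a service account or group, as in this example, keeps the policy short and avoids reaching for Editor or Owner.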
Security Services: Cloud KMS, VPC Service Controls & Cloud Armor

Cloud KMS (Key Management Service)

  • Manages encryption keys for GCP services
  • Google-managed keys — default; Google handles rotation; no visibility to customer
  • Customer-managed keys (CMEK) — you create/manage keys in Cloud KMS; GCP services use them to encrypt your data
  • Customer-supplied keys (CSEK) — you provide raw key material; used for Compute Engine persistent disks
  • Key rotation, audit logs, and IAM-controlled access to keys

VPC Service Controls

  • Creates a security perimeter around GCP services (Storage, BigQuery, etc.)
  • Restricts access to resources to only requests from authorized VPCs or IP ranges
  • Prevents data exfiltration by blocking data from leaving the perimeter

Cloud Armor

  • WAF (Web Application Firewall) and DDoS mitigation attached to the Global HTTP(S) LB
  • Rules for: IP allowlisting/blocklisting, SQL injection protection, XSS protection, rate limiting, geo-based access
  • Adaptive Protection — ML-based detection for volumetric DDoS attacks

☁ Scenario — locking down a VM with IAM + firewall rules

Situation: A backend VM should only be reachable on port 8080 from the frontend subnet (10.0.1.0/24) and only via SSH from a bastion host. No public IP. The VM's service account may read from its Cloud Storage bucket but must not be able to write to it.

Walk:
1) No public IP: create the VM with the --no-address flag.
2) Firewall rules: gcloud compute firewall-rules create allow-frontend --allow=tcp:8080 --source-ranges=10.0.1.0/24 --target-tags=backend, then gcloud compute firewall-rules create allow-bastion-ssh --allow=tcp:22 --source-tags=bastion --target-tags=backend.
3) Assign the tag backend to the VM.
4) IAM: create service account deploy-sa@project.iam.gserviceaccount.com. Grant it roles/storage.objectViewer on the specific bucket (not the project). Attach the SA to the VM.
5) Verify: SSH from bastion works; direct SSH from internet fails; frontend can reach :8080; VM can read the bucket but not write to it.

ACE exam note: Firewall rules are stateful. Every VPC has an implied deny-all ingress rule and an implied allow-all egress rule, so inbound traffic must be explicitly allowed. Prefer service account-based IAM over user-based IAM for VM workloads.

Key takeaways
  • VPC is global, subnets are regional; firewall rules are stateful, evaluated by priority (lower number = higher), default deny on ingress, default allow on egress.
  • Identity: prefer Predefined roles over Basic (Owner/Editor/Viewer are too broad); bind to groups instead of individuals; for workloads use service accounts + Workload Identity, never personal credentials or downloaded JSON keys.
  • Defence-in-depth: Cloud KMS for CMEK/CSEK, VPC Service Controls for service perimeters (data exfiltration protection), Cloud Armor for L7 WAF + DDoS on the Global HTTP(S) LB.
⚡ Mini-quiz — Drill VPC scope, firewall priority, IAM least privilege, KMS / VPC-SC / Armor selection.
06
Operations: Monitoring, Logging & Deployment
3 lessons · ~4 hours
Day-2 operations on GCP rest on three pillars: Cloud Monitoring (metrics, dashboards, alerting policies), Cloud Logging (log buckets, sinks, exports to BigQuery/GCS for retention), and a CI/CD pipeline built on Cloud Build + Artifact Registry + Cloud Deploy. The exam frequently asks "what do I need to install on a Compute Engine VM to get in-guest metrics and logs?" and the answer is always the Ops Agent (managed services auto-emit).
Cloud Monitoring & Alerting

Cloud Monitoring (formerly Stackdriver)

  • Collects metrics from GCP resources, AWS, and on-premises with the Ops Agent
  • Metrics Explorer — query and visualize any metric
  • Dashboards — custom or pre-built resource dashboards
  • Alerting policies — trigger notifications via email, Pub/Sub, PagerDuty, Slack when metrics breach thresholds
  • Uptime checks — periodic HTTP/HTTPS/TCP checks to verify service availability globally
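  • Ops Agent — required on Compute Engine VMs to collect in-guest metrics (memory, disk usage, processes) and logs; basic CPU, disk I/O, and network metrics are reported without it; install with one command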

Cloud Trace & Profiler

  • Cloud Trace — distributed tracing; analyzes latency across microservices; identifies slow operations
  • Cloud Profiler — continuous CPU and memory profiling for production workloads
  • Error Reporting — aggregates application exceptions and errors; groups similar errors; notifies on new error types
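Compute Engine reports basic hypervisor-level metrics (CPU, disk I/O, network) automatically, but it does NOT send memory and other in-guest metrics, or system and application logs, until you install the Ops Agent. Managed services (GKE, App Engine, Cloud Run) auto-send logs.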
Cloud Logging & Audit Logs

Cloud Logging

  • Centralized log management for GCP services, VMs (with Ops Agent), and custom applications
  • Log sinks — route log entries to Cloud Storage, BigQuery, Pub/Sub, or another log bucket for archiving and analytics; third-party tools such as Splunk are typically fed through Pub/Sub
  • Log-based metrics — create custom metrics from log patterns to trigger alerts
  • Retention: Admin Activity logs = 400 days; Data Access logs = 30 days (default)

Cloud Audit Logs

  • Admin Activity — records API calls that modify resources (always on, no charge)
  • Data Access — records API calls that read resource configurations or data (disabled by default; can generate very high volume)
  • System Event — Google-generated system events (always on)
  • Policy Denied — records when access is denied by VPC Service Controls
For compliance, export Admin Activity and Data Access logs to a Cloud Storage bucket with a locked retention policy (Bucket Lock, WORM) or to BigQuery for long-term retention. This is a common ACE exam scenario.
Infrastructure as Code & CI/CD on GCP

Deployment Options

  • Cloud Deployment Manager — GCP-native IaC using YAML/Python/Jinja templates; older tooling
  • Terraform — industry standard IaC; GCP provider; state stored in a Cloud Storage bucket through the gcs backend; multi-cloud capable
  • Use Terraform for new projects — better community support, state management, and multi-cloud portability

CI/CD on GCP

  • Cloud Build — fully managed CI/CD service; runs build steps in containers; triggered by GitHub, Cloud Source Repositories, or manually
  • Artifact Registry — stores Docker images, Maven, npm, Python packages; replaces Container Registry
  • Cloud Deploy — managed continuous delivery to GKE and Cloud Run; enforces deployment pipelines with approvals
  • Typical pipeline: git push → Cloud Build (build + test + push image) → Artifact Registry → Cloud Deploy/gcloud deploy

Exam-Relevant Scenarios

  • "Store Terraform state for team collaboration" → GCS backend bucket with versioning
  • "Build and deploy containers automatically on merge" → Cloud Build + Artifact Registry + Cloud Run
  • "Enforce only approved images run on GKE" → Binary Authorization
Cloud Build = CI (build/test). Cloud Deploy = CD (release management). Artifact Registry = image storage. Know how these three connect for a complete GCP-native pipeline.
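To show how the three services connect, the sketch below generates a Cloud Build configuration in JSON form with three steps: build the image, push it to Artifact Registry, and deploy it to Cloud Run. The function name and the project, repository, image, and service names are hypothetical; `$SHORT_SHA` is a substitution that Cloud Build fills in for triggered builds. Treat it as one plausible layout, not the only valid pipeline.

```python
import json


def build_and_deploy_config(region: str, project_id: str, repository: str,
                            image: str, service: str) -> dict:
    """Assemble a build -> push -> deploy pipeline as a Cloud Build config."""
    image_uri = f"{region}-docker.pkg.dev/{project_id}/{repository}/{image}:$SHORT_SHA"
    return {
        "steps": [
            # CI: build the container image from the repository root.
            {"name": "gcr.io/cloud-builders/docker",
             "args": ["build", "-t", image_uri, "."]},
            # Store the image in Artifact Registry.
            {"name": "gcr.io/cloud-builders/docker",
             "args": ["push", image_uri]},
            # CD (simplified): roll the new image out to Cloud Run.
            {"name": "gcr.io/google.com/cloudsdktool/cloud-sdk",
             "entrypoint": "gcloud",
             "args": ["run", "deploy", service,
                      "--image", image_uri, "--region", region]},
        ],
        "images": [image_uri],
    }


config = build_and_deploy_config("us-central1", "my-project", "my-repo", "api", "api")
print(json.dumps(config, indent=2))
```

Deploying straight from the build keeps the example short; a Cloud Deploy delivery pipeline would replace the last step when promotion stages and approvals are required.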

☁ Scenario — setting up an uptime check alert with Cloud Monitoring

Situation: A public-facing web app at https://app.example.com must page the on-call engineer within 2 minutes if the site goes down. Currently there is no alerting.

Walk:
1) Cloud Console → Monitoring → Uptime Checks → Create. Target: HTTPS, hostname app.example.com, path /health, check interval 1 min, timeout 10s. Checker regions: USA, Europe, Asia (checking from multiple regions confirms it's not a regional blip).
2) Create an Alerting Policy: condition = uptime check failure from ≥2 out of 3 regions for ≥1 min. Notification channel: PagerDuty webhook + email.
3) Test: temporarily block the health endpoint → Monitoring flags failure after 1 min → alert fires → PagerDuty page sent.
4) Review logs: Cloud Logging → Logs Explorer → resource.type="http_load_balancer" (assuming the app sits behind an HTTP(S) load balancer) to see which requests errored.

ACE exam note: Cloud Monitoring = metrics + uptime + alerting. Cloud Logging = structured logs, queryable with Logging Query Language. Error Reporting = aggregates application exceptions automatically. Cloud Trace = distributed request tracing for latency analysis.
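The alert condition from the walk ("failure from at least 2 of 3 regions for at least 1 minute") can be expressed as a small check over recent results. This is a conceptual model only: Cloud Monitoring evaluates the condition for you, and the `should_page` function, the region labels, and the one-check-per-minute assumption are all illustrative.

```python
def should_page(history, min_failing_regions: int = 2,
                min_consecutive_checks: int = 1) -> bool:
    """history: per-interval dicts {region: passed}, oldest first.

    Pages when the most recent intervals each had enough failing regions.
    """
    streak = 0
    for results in reversed(history):
        failing = sum(1 for passed in results.values() if not passed)
        if failing < min_failing_regions:
            break
        streak += 1
    return streak >= min_consecutive_checks


one_region_blip = [{"usa": True, "europe": False, "asia": True}]
real_outage = [{"usa": False, "europe": False, "asia": True}]
recovered = [
    {"usa": False, "europe": False, "asia": False},
    {"usa": True, "europe": True, "asia": True},
]

print(should_page(one_region_blip))
print(should_page(real_outage))
print(should_page(recovered))
```

Requiring two regions to agree is what filters out a single checker's network problem, which is the reason the walk selects three checker regions.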

Key takeaways
  • Cloud Monitoring needs the Ops Agent on Compute Engine VMs to ship in-guest metrics (such as memory) and logs, while basic CPU, disk I/O, and network metrics arrive without it; managed services (GKE, App Engine, Cloud Run) auto-emit; alerting policies fire on thresholds and route to channels (email, Slack, PagerDuty).
  • Cloud Logging stores logs in _Default + _Required buckets; for long-term compliance, create a sink exporting Admin Activity / Data Access logs to GCS (with a locked retention policy, i.e. Bucket Lock) or BigQuery, a recurring ACE scenario.
  • GCP-native CI/CD chain: Cloud Build (CI) → Artifact Registry (image / package storage) → Cloud Deploy (CD with progression policies); pair with Binary Authorization on GKE to only run signed/approved images.
⚡ Mini-quiz — Drill Ops Agent vs auto-emit, log sinks, and the Cloud Build/Deploy/Registry pipeline.