CNCF · devops

Certified Kubernetes Administrator (CKA)

Build, secure, and operate production Kubernetes clusters. Hands-on kubectl + kubeadm workflows for the CKA performance-based exam. Cluster architecture, RBAC, workloads, networking, storage, and the troubleshooting domain that decides the exam.

Modules: 7
Course duration: 35 hours
Level: advanced
Exam code: CKA
Exam duration: 2 hours
Passing score: 66%
Exam fee (USD): $445
Validity: 2 years
Format: Performance-based

Why earn the CKA?

The CKA is the de-facto Kubernetes credential — a vendor-neutral, hands-on performance-based exam where you type real commands in a live kubectl terminal rather than picking multiple-choice answers.

  • Hands-on performance-based exam — you type real commands in a live kubectl terminal, not multiple choice
  • CNCF-recognized and vendor-neutral — same value at AWS, GCP, on-prem, or any managed-K8s shop
  • The de-facto Kubernetes credential: many K8s job postings either require it or list it as preferred
  • Prerequisite mindset for CKAD (developers) and CKS (security) — clear cluster mechanics first, specialise after
  • Gateway to platform / SRE / DevOps roles (~$120-160k US, ~€80-110k EU for K8s-fluent engineers)
  • Validates the skills employers actually test in interviews — debugging a broken cluster live under time pressure
Exam strategy: the 2024+ blueprint weighs Cluster Architecture 25%, Workloads 15%, Services & Networking 20%, Storage 10%, Troubleshooting 30% — Troubleshooting is the single biggest domain and the one that decides pass/fail. The killer.sh practice environment is bundled with your exam fee — use both attempts. Universal manifest scaffolder: kubectl <create|run> ... --dry-run=client -o yaml | tee file.yaml. Never write YAML from scratch on the clock.

CKA exam domains

Five domains. Troubleshooting alone is 30% of the score — almost a third of the exam is "the cluster is broken, fix it". Cluster Architecture (kubeadm + etcd + RBAC) is the next-heaviest. Storage is the lightest but always tests StatefulSets + dynamic provisioning.

Domain 1 — Troubleshooting 30%
Domain 2 — Cluster Architecture, Installation & Configuration 25%
Domain 3 — Services & Networking 20%
Domain 4 — Workloads & Scheduling 15%
Domain 5 — Storage 10%

7 modules · ~35 hours

Each module maps to one or more exam domains. Work through them in order — Architecture and Installation lay the foundation every later module depends on. Module 7 (Troubleshooting) is where most exam points are won or lost.

01

Kubernetes Architecture & Core Concepts · 3 lessons

The mental model every later module depends on. Master the control-plane components (apiserver, etcd, scheduler, controller-manager), worker-node components (kubelet, kube-proxy, container runtime), and the kubectl muscle memory + imperative-then-edit workflow that wins exam time. Knowing which component is responsible for what is how you diagnose the troubleshooting domain.

control-plane kube-apiserver etcd kubelet kube-proxy kubectl cri static-pods
~5h
Lesson 1.1 Control plane components

The Kubernetes control plane is the brain of the cluster — every API call, scheduling decision, and reconciliation loop runs through these five components. CKA constantly asks "which component is responsible for X" or surfaces a broken control plane and expects you to identify which static pod is failing.

Key concepts
  • kube-apiserver: the front-end for the Kubernetes control plane; all internal and external communication passes through it; exposes the Kubernetes API over HTTPS on port 6443.
  • etcd: a consistent, highly-available key-value store used as Kubernetes' backing store for all cluster data; treat it as the source of truth. Lose etcd and you lose the cluster state.
  • kube-scheduler: watches for newly created Pods with no assigned node; selects the best node based on resource requirements, policies, taints/tolerations, and affinity rules.
  • kube-controller-manager: runs all controller processes as a single binary (Node Controller, Replication Controller, Endpoints Controller, ServiceAccount Controller, etc.) — the reconciliation engine.
  • cloud-controller-manager: interfaces with the underlying cloud provider API to manage nodes, routes, and load balancers; separates cloud-specific logic from the core controllers.
  • Static pods on kubeadm: control plane components run as static Pods in kube-system. Their manifests live in /etc/kubernetes/manifests/ and kubelet auto-restarts them on file change.
Concrete example

Task: the cluster's kubectl get pods hangs and times out. Diagnose: SSH to the control-plane node and run crictl ps | grep apiserver; the kube-apiserver container is missing. Inspect the static-pod manifest with cat /etc/kubernetes/manifests/kube-apiserver.yaml and find a typo in --etcd-servers. Fix the manifest and watch kubelet auto-recreate the pod; verify with kubectl get pods -n kube-system. The component mental model (apiserver gateway, etcd store, scheduler placement, controller-manager reconciliation, cloud-controller-manager for provider integration) is what lets you triage in under a minute.
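The triage steps in this example can be sketched as a small shell function. This is a sketch under assumptions, not a fixed procedure: triage_apiserver and the MANIFEST_DIR override are hypothetical names introduced here, and the function assumes crictl is already configured to reach the node's container runtime.

```shell
# Hypothetical helper: triage a kubeadm control-plane node whose API server is down.
# MANIFEST_DIR is an assumption added here so the function can be pointed at a copy.
MANIFEST_DIR="${MANIFEST_DIR:-/etc/kubernetes/manifests}"

triage_apiserver() {
  # kubectl cannot help while the API server is down, so ask the runtime directly.
  echo "== kube-apiserver containers (running or exited) =="
  crictl ps -a | grep kube-apiserver || echo "no kube-apiserver container found"

  # List the static-pod manifests kubelet is watching.
  echo "== static-pod manifests in $MANIFEST_DIR =="
  ls "$MANIFEST_DIR"

  # Print the flag from the example so a typo in the etcd endpoint stands out.
  echo "== --etcd-servers flag =="
  grep -n -e '--etcd-servers' "$MANIFEST_DIR/kube-apiserver.yaml"
}
```

Run triage_apiserver as root on the control-plane node. After fixing the manifest, kubelet picks up the change on its own; no restart command is needed.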

Key takeaway: apiserver = gateway, etcd = data store, scheduler = Pod placement, controller-manager = reconciliation loops. On kubeadm clusters they all live as static pods under /etc/kubernetes/manifests/.
⚡ Mini-quiz
Drill control-plane component scenarios → study mode (10 questions).
Lesson 1.2 Worker node components

Every worker node depends on three pieces of infrastructure: kubelet and the container runtime run under systemd, while kube-proxy runs as a DaemonSet Pod on kubeadm clusters. When a node goes NotReady on the exam, it is almost always one of these three, and kubelet or the container runtime is the usual culprit. Knowing the journalctl + systemctl reflexes turns a 15-minute hunt into a 90-second fix.

Key concepts
  • kubelet: the primary node agent; registers the node with the API server; ensures containers described in PodSpecs are running and healthy; communicates with the container runtime via CRI.
  • kube-proxy: maintains network rules (iptables or IPVS) on each node that implement Services; handles traffic routing to Pod endpoints. Runs as a DaemonSet in kube-system.
  • Container runtime: the software responsible for running containers (containerd, CRI-O); communicates with kubelet via the Container Runtime Interface (CRI). Kubernetes 1.24 removed dockershim, so Docker Engine is no longer supported directly (it needs the external cri-dockerd adapter).
  • Node status conditions: Ready, MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable — surface via kubectl describe node.
  • Inspection commands: kubectl describe node <node-name> for K8s-side view; systemctl status kubelet + journalctl -u kubelet -n 50 for OS-side. Kubelet config: /var/lib/kubelet/config.yaml.
  • crictl as a backup: when kubectl is down, crictl ps / crictl logs <id> talks directly to the runtime — your last-resort debugging tool on the node.
Concrete example

Task: kubectl get nodes shows worker-2 as NotReady. SSH to worker-2 and run systemctl status kubelet — the service is inactive (dead). Check logs with journalctl -u kubelet -n 100 --no-pager and find "failed to start container runtime: connection refused". Verify containerd: systemctl status containerd shows it stopped. Fix: systemctl start containerd && systemctl start kubelet && systemctl enable kubelet containerd. Within seconds the node flips back to Ready.
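The recovery sequence above can be wrapped in one function. A minimal sketch, assuming containerd is the runtime (as in the example) and that you are root on the affected node; recover_node_services is a hypothetical name.

```shell
# Hypothetical helper for a NotReady node: make sure the runtime and kubelet are up.
# Run as root on the affected node. containerd is assumed to be the runtime.
recover_node_services() {
  # Order matters: kubelet cannot start Pods until the runtime socket answers.
  for svc in containerd kubelet; do
    if systemctl is-active --quiet "$svc"; then
      echo "$svc: active"
    else
      echo "$svc: not active, enabling and starting"
      systemctl enable --now "$svc"
    fi
  done

  # Show the most recent kubelet log lines to confirm it came up cleanly.
  journalctl -u kubelet -n 20 --no-pager
}
```

systemctl enable --now both starts the unit and enables it at boot, which covers the separate start and enable commands from the example in one call.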

Key takeaway: kubelet and the container runtime live under systemd, not as static pods; kube-proxy runs as a DaemonSet in kube-system. When a node is NotReady, check systemctl status kubelet + journalctl -u kubelet first.
⚡ Mini-quiz
Practise worker-node debugging scenarios → quick quiz (5 questions).
Lesson 1.3 kubectl & core API resources

CKA is a typing exam. Every minute spent hand-writing YAML is a minute lost — and the candidates who pass have kubectl run / create / expose --dry-run=client -o yaml in their muscle memory. This lesson cements the imperative-then-edit workflow plus the core API resources every later lesson builds on.

Key concepts
  • kubectl command syntax: basic pattern kubectl [command] [TYPE] [NAME] [flags]. kubectl get pods, kubectl get pods -A for all namespaces, kubectl describe pod <name> for events.
  • Output formats: kubectl get pod <name> -o yaml dumps the full spec, -o wide shows node + IP, -o jsonpath='{.status.containerStatuses[0].state}' extracts precise fields.
  • Imperative resource creation: kubectl run nginx --image=nginx --restart=Never creates a Pod; kubectl create deployment app --image=nginx --replicas=3 a Deployment; kubectl expose deployment app --port=80 --type=ClusterIP a Service.
  • Dry-run scaffolder: kubectl run nginx --image=nginx --dry-run=client -o yaml > pod.yaml generates a valid manifest you edit. Universal shortcut for anything kubectl can imperatively create.
  • Inline documentation: kubectl explain pod.spec.containers shows valid fields without leaving the terminal — fastest reference during the exam.
  • Core API resources: Pod (smallest deployable unit), ReplicaSet (replica count), Deployment (declarative + rolling), Namespace (defaults: default, kube-system, kube-public, kube-node-lease).
Concrete example

Task: deploy nginx with 3 replicas exposed as a ClusterIP, all under 30 seconds. Scaffold + apply: kubectl create deployment web --image=nginx --replicas=3, then kubectl expose deployment web --port=80 --target-port=80. Verify: kubectl get deploy,svc,pods -l app=web. Need to edit something not in flags? Scaffold then patch: kubectl create deployment web --image=nginx --dry-run=client -o yaml > web.yaml, edit the file, kubectl apply -f web.yaml. This is the exam-winning rhythm.
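The scaffold-then-edit rhythm can be captured in a tiny wrapper. A sketch only: kdry and scaffold_web are hypothetical names, and the wrapper appends its flags at the end, so it does not suit commands that pass container arguments after a bare -- separator.

```shell
# Hypothetical wrapper: append the scaffolding flags to any imperative kubectl command.
kdry() {
  kubectl "$@" --dry-run=client -o yaml
}

# Scaffold, edit, apply: the rhythm from the example above.
scaffold_web() {
  kdry create deployment web --image=nginx --replicas=3 > web.yaml
  # ... edit web.yaml here for anything the flags cannot express ...
  kubectl apply -f web.yaml
}
```

The same wrapper works for other generators, for instance kdry expose deployment web --port=80 > svc.yaml.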

Key takeaway: the CKA is a hands-on performance-based exam. Master --dry-run=client -o yaml to generate resource templates instead of writing YAML from scratch. Use kubectl explain when you forget a field.
⚡ Mini-quiz
Drill kubectl + API resource scenarios → study mode (10 questions).
02

Cluster Installation & Configuration · 3 lessons

Building, upgrading, and securing the cluster from scratch with kubeadm. The 25%-weight Cluster Architecture domain leans heavily on this module: kubeadm init / join, one-minor-version-at-a-time upgrades with drain + uncordon, etcd snapshot save + restore, kubeconfig contexts, and RBAC + TLS CSR workflows. These tasks appear on virtually every CKA attempt.

kubeadm cluster-upgrade etcdctl kubeconfig rbac role-binding tls-csr cert-renewal
~6h
Lesson 2.1 kubeadm — bootstrapping & etcd backup/restore

kubeadm is the official cluster-bootstrap tool and the only one CKA tests. The etcd snapshot save / restore task is on virtually every exam. Memorise the full etcdctl command with all three cert flags or you will burn the entire question fishing through /etc/kubernetes/pki/.

Key concepts
  • Pre-flight requirements: swap disabled (swapoff -a), required ports open (6443, 10250, 2379-2380), container runtime installed and running, br_netfilter kernel module loaded.
  • kubeadm init: kubeadm init --pod-network-cidr=192.168.0.0/16 --apiserver-advertise-address=<IP> initializes the control plane and prints a join command + admin kubeconfig path.
  • Post-init setup: mkdir -p $HOME/.kube && cp -i /etc/kubernetes/admin.conf $HOME/.kube/config, then install a CNI (e.g. kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml).
  • kubeadm join: run the printed command on each worker as root. Bootstrap tokens expire after 24 hours — regenerate with kubeadm token create --print-join-command.
  • etcd snapshot save: ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key. Verify with etcdctl snapshot status.
  • etcd restore: ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db --data-dir=/var/lib/etcd-restore, then update /etc/kubernetes/manifests/etcd.yaml so the etcd-data hostPath volume points at the new directory (if you also change the --data-dir arg, change the matching volumeMount so all three agree). Kubelet then recreates the etcd pod against the restored state.
Concrete example

Task: take a snapshot of the running cluster, then prove you can restore it. Snapshot: run the full etcdctl snapshot save command above and verify etcdctl snapshot status /backup/etcd-snapshot.db returns hash + revision. Restore to a fresh data dir /var/lib/etcd-restore, then edit /etc/kubernetes/manifests/etcd.yaml and point the etcd-data volume's hostPath at it, keeping the --data-dir arg and the volumeMount path consistent with each other. Kubelet detects the manifest change and respawns etcd against the restored data. Verify with kubectl get pods -n kube-system and confirm any test workload from before the snapshot is back.
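The save and restore commands from this lesson can be kept as two functions so the cert flags are typed once. A sketch under the lesson's own assumptions (kubeadm paths, etcd listening on 127.0.0.1:2379); the function names are hypothetical.

```shell
# Sketch of the snapshot save / restore commands for a kubeadm control-plane node.
# Paths match the lesson text; the function names are hypothetical.
ETCD_PKI=/etc/kubernetes/pki/etcd

etcd_backup() {
  snapshot="$1"
  ETCDCTL_API=3 etcdctl snapshot save "$snapshot" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert="$ETCD_PKI/ca.crt" \
    --cert="$ETCD_PKI/healthcheck-client.crt" \
    --key="$ETCD_PKI/healthcheck-client.key" || return 1
  # Confirm the file is a readable snapshot (hash, revision, key count, size).
  ETCDCTL_API=3 etcdctl snapshot status "$snapshot" --write-out=table
}

etcd_restore() {
  snapshot="$1"
  new_data_dir="$2"
  # Restore is a local file operation, so no endpoint or cert flags are needed.
  ETCDCTL_API=3 etcdctl snapshot restore "$snapshot" --data-dir="$new_data_dir"
}
```

Typical use: etcd_backup /backup/etcd-snapshot.db, then later etcd_restore /backup/etcd-snapshot.db /var/lib/etcd-restore followed by the manifest edit.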

Key takeaway: always set ETCDCTL_API=3. The certs live in /etc/kubernetes/pki/etcd/. Restore to a fresh data dir and update the static pod manifest — never restore on top of the running data dir.
⚡ Mini-quiz
Drill kubeadm + etcd backup scenarios → study mode (10 questions).
Lesson 2.2 Cluster upgrade & kubeconfig management

kubeadm upgrade follows a strict order: one minor version at a time, control plane first, then workers, each preceded by a drain and followed by an uncordon. Skip drain and the upgrade still works — but you fail the exam criterion. Kubeconfig context-switching is the silent killer: running the wrong command on the wrong cluster costs you the question.

Key concepts
  • One minor version at a time: 1.28 → 1.29 is supported; 1.28 → 1.30 is not. Upgrade kubeadm package first, then run the planner/applier, then upgrade kubelet + kubectl on the same node.
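  • Control-plane upgrade: apt-mark unhold kubeadm && apt-get update && apt-get install -y kubeadm='1.29.0-*' && apt-mark hold kubeadm, then kubeadm upgrade plan and kubeadm upgrade apply v1.29.0. With the pkgs.k8s.io package repositories, point the apt source at the v1.29 repository first; the -00 version suffix seen in older guides comes from the retired apt.kubernetes.io repository.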
  • Drain / uncordon flow: kubectl drain <node> --ignore-daemonsets before upgrading kubelet, systemctl daemon-reload && systemctl restart kubelet, then kubectl uncordon <node>. Repeat per node.
  • Worker upgrade: on each worker run kubeadm upgrade node (not apply), upgrade kubelet + kubectl, restart kubelet. Drain from the control plane between workers.
  • kubeconfig basics: default at ~/.kube/config, override via the KUBECONFIG env var. kubectl config get-contexts / use-context / current-context. Set namespace: kubectl config set-context --current --namespace=dev.
  • Merge multiple kubeconfigs: KUBECONFIG=~/.kube/config:~/.kube/prod-config kubectl config view --merge --flatten > ~/.kube/merged-config.
Concrete example

Task: upgrade a 1-control-plane + 2-worker cluster from 1.28 → 1.29. Control plane: apt-mark unhold kubeadm, install kubeadm='1.29.0-*', re-hold, then kubectl drain cp1 --ignore-daemonsets + kubeadm upgrade apply v1.29.0 + upgrade kubelet/kubectl + restart kubelet + kubectl uncordon cp1. Each worker: from the control plane kubectl drain worker-N --ignore-daemonsets --delete-emptydir-data; on the worker upgrade kubeadm, run kubeadm upgrade node, upgrade kubelet + kubectl, restart kubelet; from the control plane kubectl uncordon worker-N. Verify with kubectl get nodes: every node should report v1.29.0.

Key takeaway: upgrade one minor version at a time, control plane before workers. Always kubectl config current-context before destructive commands — wrong context is the most common exam loss.
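The "check the context first" habit can be enforced with a small guard. A hypothetical sketch: require_context and drain_worker are names introduced here, and the context and node names passed in are placeholders.

```shell
# Hypothetical guard: refuse to continue unless kubectl points at the expected context.
require_context() {
  expected="$1"
  current="$(kubectl config current-context)"
  if [ "$current" != "$expected" ]; then
    echo "refusing: current context is '$current', expected '$expected'" >&2
    return 1
  fi
}

# Example use before a drain; the context and node names are placeholders.
drain_worker() {
  require_context "$1" && kubectl drain "$2" --ignore-daemonsets --delete-emptydir-data
}
```

Calling drain_worker <context> <node> runs the drain only when the current context matches, which is the cheap insurance the takeaway describes.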
⚡ Mini-quiz
Practise cluster upgrade + kubeconfig scenarios → quick quiz (5 questions).
Lesson 2.3 RBAC & TLS certificate management

RBAC is the most-tested part of cluster admin. The exam expects you to fluently scope Roles vs ClusterRoles and verify with kubectl auth can-i in seconds. Certificate management (renewal, custom CSRs) sits alongside it — both are namespace-versus-cluster decisions that candidates routinely get backwards.

Key concepts
  • Role: namespaced; grants permissions to resources within a specific namespace. ClusterRole: cluster-wide; can grant access to cluster-scoped resources (nodes, PVs) or any namespace.
  • RoleBinding: binds a Role or ClusterRole to subjects (users, groups, ServiceAccounts) within a namespace. ClusterRoleBinding: binds a ClusterRole to subjects across the entire cluster.
  • Imperative RBAC creation: kubectl create role pod-reader --verb=get,list,watch --resource=pods -n dev, kubectl create rolebinding dev-binding --role=pod-reader --user=jane -n dev.
  • Permission audit: kubectl auth can-i get pods --as=jane -n dev for a single check, kubectl auth can-i --list --as=jane -n dev for everything. Always verify before submitting.
  • Certificate lifecycle: kubeadm certs check-expiration shows all cluster certs + expiry; kubeadm certs renew all regenerates them. View a cert: openssl x509 -in /etc/kubernetes/pki/apiserver.crt -text -noout.
  • User CSR workflow: generate key with openssl genrsa -out jane.key 2048; build CSR openssl req -new -key jane.key -subj "/CN=jane/O=dev-team" -out jane.csr; submit as a Kubernetes CertificateSigningRequest object with the base64-encoded CSR; approve with kubectl certificate approve <csr-name>.
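The user CSR workflow in the last bullet leaves the manifest itself implicit. A minimal sketch of the whole step: make_user_csr is a hypothetical helper, the output file names are choices made here, and the signerName shown is the built-in signer for client certificates used to authenticate to the API server.

```shell
# Hypothetical helper: create a key + CSR for a user and wrap the CSR in a
# CertificateSigningRequest manifest. Writes <user>.key, <user>.csr, <user>-csr.yaml.
make_user_csr() {
  user="$1"
  group="$2"
  openssl genrsa -out "${user}.key" 2048
  openssl req -new -key "${user}.key" -subj "/CN=${user}/O=${group}" -out "${user}.csr"

  # spec.request must be the PEM CSR, base64-encoded on a single line.
  request="$(base64 < "${user}.csr" | tr -d '\n')"

  cat > "${user}-csr.yaml" <<EOF
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: ${user}
spec:
  request: ${request}
  signerName: kubernetes.io/kube-apiserver-client
  usages:
  - client auth
EOF
}
```

For the lesson's user this would be make_user_csr jane dev-team, followed by kubectl apply -f jane-csr.yaml and kubectl certificate approve jane.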
Concrete example

Task: grant user jane read-only access to Pods in the dev namespace, with no other permissions. Create Role: kubectl create role pod-reader --verb=get,list,watch --resource=pods -n dev. Bind: kubectl create rolebinding jane-pod-reader --role=pod-reader --user=jane -n dev. Verify: kubectl auth can-i list pods --as=jane -n dev returns yes; kubectl auth can-i delete pods --as=jane -n dev returns no; kubectl auth can-i list pods --as=jane -n production returns no — scope is correct.

Key takeaway: Role + RoleBinding = namespaced. ClusterRole + ClusterRoleBinding = cluster-wide. You CAN bind a ClusterRole with a RoleBinding — it limits the ClusterRole's permissions to that namespace only. Always verify with kubectl auth can-i.
⚡ Mini-quiz
Drill RBAC + cert scenarios → study mode (10 questions).
03

Workloads & Scheduling · 3 lessons

The 15%-weight Workloads domain: Deployments + rolling updates + rollback, the DaemonSet / StatefulSet / Job / CronJob picker, resource requests + limits + ResourceQuota + LimitRange, and the scheduling triad — Taints, Tolerations, and Node Affinity. The exam keeps asking "Pod is Pending, why?" and the answer is almost always one of these three.

deployment rolling-update daemonset statefulset resource-limits resource-quota taints-tolerations node-affinity
~5h
Lesson 3.1 Deployments — rolling updates, rollback, scaling

Deployments are the default workload type and the exam's default question. You need fluency in kubectl set image, kubectl rollout, and picking the right workload kind (Deployment vs DaemonSet vs StatefulSet vs Job vs CronJob) for a given scenario.

Key concepts
  • RollingUpdate (default): gradually replaces old Pods with new ones; configurable with maxSurge (extra new Pods) and maxUnavailable (how many old Pods can be down). Recreate: kills all old Pods first — causes downtime.
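  • Image update: kubectl set image deployment/webapp nginx=nginx:1.25 rolls out the new image (the --record flag found in older guides is deprecated; set the kubernetes.io/change-cause annotation instead if you want a note in the rollout history); check progress with kubectl rollout status deployment/webapp.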
  • Rollback: kubectl rollout history deployment/webapp lists revisions; kubectl rollout undo deployment/webapp reverts to previous; --to-revision=2 picks a specific one.
  • Scaling: manual via kubectl scale deployment webapp --replicas=5; auto via kubectl autoscale deployment webapp --min=2 --max=10 --cpu-percent=70 (HorizontalPodAutoscaler).
  • Pause / resume: kubectl rollout pause deployment/webapp stops mid-rollout for staged inspection; resume continues.
  • Workload type picker: Deployment for stateless apps, StatefulSet for databases / ordered Pods, DaemonSet for node-level agents (logging, CNI), Job for one-shot work, CronJob for scheduled jobs (for example, schedule: "*/5 * * * *" runs every five minutes).
Concrete example

Task: deploy nginx with a rolling-update strategy that adds 25% extra Pods but never drops below baseline capacity. Scaffold: kubectl create deployment web --image=nginx:1.24 --replicas=4 --dry-run=client -o yaml > web.yaml; edit to add strategy.rollingUpdate.maxSurge: 25% + maxUnavailable: 0. Apply and set image to nginx:1.25; watch kubectl rollout status. If logs show errors, kubectl rollout undo deployment/web reverts instantly without manual YAML editing.
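The edited manifest from this example might look like the following. A sketch, assuming the container keeps the name nginx that kubectl create deployment derives from the image; roll_web is a hypothetical wrapper and the 120s timeout is an arbitrary choice.

```shell
# Write the edited manifest from the example: 4 replicas, surge by 25%,
# never drop below the desired replica count during a rollout.
cat > web.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  labels:
    app: web
spec:
  replicas: 4
  selector:
    matchLabels:
      app: web
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: nginx:1.24
EOF

# Hypothetical wrapper for the rollout itself; needs a reachable cluster.
roll_web() {
  kubectl apply -f web.yaml
  kubectl set image deployment/web nginx=nginx:1.25
  # rollout status exits non-zero on failure or timeout; undo in that case.
  kubectl rollout status deployment/web --timeout=120s || kubectl rollout undo deployment/web
}
```

With 4 replicas, maxSurge: 25% allows one extra Pod at a time, and maxUnavailable: 0 means an old Pod is removed only after its replacement is Ready.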

Key takeaway: Deployment for stateless, StatefulSet for databases, DaemonSet for node agents, Job for one-time tasks, CronJob for schedules. kubectl rollout + set image is the rolling-update toolkit; undo is free rollback.
⚡ Mini-quiz
Drill Deployment + rollout scenarios → study mode (10 questions).
Lesson 3.2 Resource management — requests, limits, quotas

Requests drive scheduling; limits drive throttling and OOMKill. ResourceQuota caps a namespace; LimitRange supplies defaults. CKA scenarios often surface a Pod stuck Pending or OOMKilled — both trace back to this lesson.

Key concepts
  • requests: the minimum resources a container needs; used by the scheduler for placement decisions. limits: the maximum resources a container can use; enforced on the node through cgroups, so CPU over-limit = throttle, memory over-limit = OOMKill.
  • YAML shape: resources: {requests: {cpu: "250m", memory: "128Mi"}, limits: {cpu: "500m", memory: "256Mi"}}. CPU units: 1 CPU = 1000m (millicores); memory: Ki, Mi, Gi.
  • LimitRange: set default requests/limits and min/max constraints per container / Pod / PVC within a namespace. Pods that omit requests get the LimitRange defaults applied automatically.
  • ResourceQuota: limits total resource consumption within a namespace — CPU, memory, object counts (Pods, Services, PVCs). If a namespace has a Quota for CPU/memory, every Pod MUST specify requests + limits or it is rejected.
  • Inspection: kubectl describe limitrange -n dev, kubectl describe resourcequota -n dev shows used vs hard.
  • QoS classes: Guaranteed (requests = limits), Burstable (requests < limits), BestEffort (no requests/limits) — kubelet evicts BestEffort first under DiskPressure / MemoryPressure.
Concrete example

Task: a Pod is rejected in namespace dev with the error "exceeded quota" (for a Deployment the error appears as a FailedCreate event on the ReplicaSet; the Pod is never created, so it never reaches Pending). Inspect: kubectl describe resourcequota -n dev shows hard requests.cpu: 2 and used 1.8; the Pod asks for 500m. Two fixes: shrink the Pod's request to 200m (fits the remaining 200m), or raise the quota. If the Pod has no requests at all, check kubectl describe limitrange -n dev: a LimitRange would normally supply defaults; if absent, the Pod is rejected because the quota demands explicit requests.
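The quota arithmetic in this example is easier to see in millicores. The sketch below pairs a hypothetical fits_quota helper with a quota and LimitRange manifest; the object names and the default values in the LimitRange are illustrative, not taken from the lesson.

```shell
# Worked arithmetic for the example, in millicores (hypothetical helper).
# Usage: fits_quota <hard> <used> <request>
fits_quota() {
  hard="$1"
  used="$2"
  request="$3"
  remaining=$((hard - used))
  if [ "$request" -le "$remaining" ]; then
    echo "fits: request ${request}m <= remaining ${remaining}m"
  else
    echo "exceeded quota: request ${request}m > remaining ${remaining}m"
    return 1
  fi
}

# A quota like the one in the example, plus a LimitRange that supplies a default
# CPU request so Pods without explicit requests are still admitted.
cat > dev-quota.yaml <<'EOF'
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-quota
  namespace: dev
spec:
  hard:
    requests.cpu: "2"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: dev-defaults
  namespace: dev
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 100m
    default:
      cpu: 500m
EOF
```

With hard 2000m and used 1800m, fits_quota 2000 1800 500 reports the request as exceeding the quota, while fits_quota 2000 1800 200 reports that it fits.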

Key takeaway: requests = scheduler input, limits = runtime enforcement. Quota caps the namespace, LimitRange supplies defaults. A Pod that is rejected at creation in a quota'd namespace is usually over quota or missing explicit requests/limits.
⚡ Mini-quiz
Practise resource limit + quota scenarios → quick quiz (5 questions).
Lesson 3.3 Taints, tolerations & node affinity

Three knobs control Pod placement: taints (repel from a node), tolerations (allow scheduling onto tainted nodes), and node affinity (attract to nodes with labels). The CKA scheduling questions almost always boil down to "this Pod is Pending — pick the right one of these three".

Key concepts
  • Taint: applied to a node to repel Pods that don't tolerate it: kubectl taint nodes node1 key=value:NoSchedule. Remove with the trailing minus: kubectl taint nodes node1 key=value:NoSchedule-.
  • Taint effects: NoSchedule (no new Pods), PreferNoSchedule (avoid if possible), NoExecute (also evict existing non-tolerating Pods).
  • Toleration: added to the Pod spec to allow scheduling onto tainted nodes. With operator Equal it must match key, value, and effect; with operator Exists it matches on key (and effect) and takes no value.
  • Node affinity: requiredDuringSchedulingIgnoredDuringExecution (hard — Pod won't schedule without match), preferredDuringSchedulingIgnoredDuringExecution (soft — best-effort). Uses node labels: kubectl label nodes node1 disktype=ssd.
  • Pod affinity / anti-affinity: schedule Pods relative to other Pods — co-locate an app with its cache (affinity) or spread replicas across nodes (anti-affinity).
  • Pod priority & preemption: a PriorityClass with a high value can preempt lower-priority Pods to free space. Built-in classes system-cluster-critical and system-node-critical protect core components.
Concrete example

Task: reserve one worker node for a "critical" workload only. Taint the node: kubectl taint nodes worker-3 dedicated=critical:NoSchedule. Build the Pod: scaffold with kubectl run critical-app --image=... --dry-run=client -o yaml and add a tolerations entry with key: dedicated, operator: Equal, value: critical, effect: NoSchedule. The toleration only permits the Pod on worker-3; it does not force it there. So also add a node affinity stanza that requires the label dedicated=critical (which you set on worker-3 with kubectl label), which keeps the critical Pod from landing on some other node. Verify with kubectl describe node worker-3 + kubectl get pods -o wide.
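A Pod manifest combining the toleration and the required node affinity could be generated as below. A sketch: write_critical_pod and the output file name are hypothetical, and the image is left as a parameter because the lesson does not name one.

```shell
# Hypothetical helper: write a Pod manifest that tolerates the dedicated=critical
# taint and also requires the dedicated=critical node label.
# Usage: write_critical_pod <image>   (the image is left to the caller)
write_critical_pod() {
  image="$1"
  cat > critical-app.yaml <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: critical-app
spec:
  containers:
  - name: critical-app
    image: ${image}
  tolerations:
  - key: dedicated
    operator: Equal
    value: critical
    effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: dedicated
            operator: In
            values:
            - critical
EOF
}
```

The taint keeps other Pods off the node and the affinity keeps this Pod on it; both halves are needed for a truly dedicated node.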

Key takeaway: taints repel, tolerations allow, node affinity attracts. A Pending Pod usually means (1) untolerated taint, (2) no node matches affinity, or (3) not enough resources. Check those three first.
⚡ Mini-quiz
Drill taint / toleration / affinity scenarios → study mode (10 questions).
04

Services & Networking · 3 lessons

The 20%-weight Networking domain. Service types (ClusterIP, NodePort, LoadBalancer, ExternalName, headless), CoreDNS resolution, Ingress + TLS termination, and NetworkPolicy with the CNI tradeoffs (Calico for native policy, Flannel without). Endpoints-mismatch debugging and DNS troubleshooting are the most-tested networking patterns.

clusterip nodeport loadbalancer coredns ingress network-policy cni calico
~6h
Lesson 4.1 Services & DNS resolution

Services are how stable network identity is layered on top of ephemeral Pods. CoreDNS is the resolver. CKA tests both: which Service type fits a scenario, and how to debug a Service that has zero endpoints because the selector doesn't match Pod labels.

Key concepts
  • ClusterIP (default): stable internal IP accessible only within the cluster; ideal for service-to-service communication.
  • NodePort: exposes the service on each node's IP at a static port (30000-32767); allows external access via <NodeIP>:<NodePort>.
  • LoadBalancer: provisions an external load balancer from the cloud provider; assigns a public IP. Used in managed Kubernetes environments.
  • ExternalName: maps the service to a DNS name (e.g., external database hostname); returns a CNAME record; no proxying or port mapping.
  • Headless Service: clusterIP: None; returns Pod IPs directly from DNS; used with StatefulSets for stable per-Pod addressing (<pod-name>.<service-name>.<namespace>.svc.cluster.local).
  • CoreDNS: runs as a Deployment in kube-system. Service DNS pattern: <service-name>.<namespace>.svc.cluster.local (just <service-name> within the same namespace). Pod DNS: <pod-ip-dashes>.<namespace>.pod.cluster.local. Debug: kubectl run tmp --rm -it --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default (the 1.28 tag is commonly used because nslookup in some later busybox images gives misleading results).
Concrete example

Task: front-end Pods can't reach the backend Service. Check Endpoints: kubectl get endpoints backend shows zero IPs. Diagnose: the Service selector is app: backend but Pods are labeled app: back-end — selector mismatch means no Endpoints, regardless of how many Pods are Ready. Fix either side (relabel Pods with kubectl label pod backend-xyz app=backend --overwrite or patch the Service selector). Verify: kubectl get endpoints backend now lists Pod IPs; kubectl exec frontend-pod -- nc -zv backend 80 succeeds.
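The selector check can be done in one pass by printing the three things that must agree. A sketch with a hypothetical helper name, debug_service; the namespace argument defaults to default.

```shell
# Hypothetical helper: put a Service's selector, its Endpoints, and the Pod
# labels side by side so a mismatch such as backend vs back-end is visible.
debug_service() {
  svc="$1"
  ns="${2:-default}"
  echo "== selector of service/$svc =="
  kubectl get service "$svc" -n "$ns" -o jsonpath='{.spec.selector}'
  echo
  echo "== endpoints for $svc =="
  kubectl get endpoints "$svc" -n "$ns"
  echo "== pod labels in $ns =="
  kubectl get pods -n "$ns" --show-labels
}
```

For the scenario above, debug_service backend would show the selector app: backend next to Pods labeled app=back-end and an empty Endpoints list.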

Key takeaway: Service selector must match Pod labels exactly. Empty Endpoints = selector typo. CoreDNS lives in kube-system — if DNS fails, check CoreDNS Pods + the Pod's /etc/resolv.conf.
⚡ Mini-quiz
Drill Service + DNS scenarios → study mode (10 questions).
Lesson 4.2 Ingress — HTTP routing & TLS termination

Ingress is the L7 router. Without an Ingress Controller installed, Ingress objects do nothing — the most common exam trap. Once a controller is in place, you wire host- and path-based rules and optionally terminate TLS via a Secret.

Key concepts
  • Ingress: API object that manages external HTTP/HTTPS access to services; provides load balancing, name-based virtual hosting, and SSL termination.
  • Ingress Controller required: nginx-ingress, Traefik, or others must be deployed in the cluster — Ingress objects without a controller are inert.
  • Host-based routing: route app.example.com to one service, api.example.com to another from a single Ingress object.
  • Path-based routing: route /api to backend-service, / to frontend-service. Path types: Prefix, Exact, ImplementationSpecific.
  • Imperative creation: kubectl create ingress my-ingress --rule="app.example.com/=webapp:80" --rule="app.example.com/api=api-svc:8080".
  • TLS termination: reference a Secret containing tls.crt and tls.key in the Ingress spec under tls:. Use kubectl create secret tls my-tls --cert=cert.pem --key=key.pem.
Concrete example

Task: expose webapp at app.example.com/ and api-svc at app.example.com/api, with TLS terminated at the ingress. Create a TLS Secret: kubectl create secret tls app-tls --cert=cert.pem --key=key.pem -n web. Scaffold the Ingress in the same namespace: kubectl create ingress app -n web --class=nginx --rule="app.example.com/=webapp:80" --rule="app.example.com/api=api-svc:8080" --dry-run=client -o yaml > ing.yaml; add a tls: block referencing app-tls. Apply and verify with kubectl describe ingress app -n web: the Address is populated and both backends resolve to endpoints.
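After the tls: block is added, the finished manifest might look like this. A sketch: pathType: Prefix is an assumption made here (the imperative command generates Exact for rules without a trailing *), so adjust it to what the task asks for.

```shell
# Finished Ingress for the example: two path rules on one host, TLS from app-tls.
# pathType Prefix is an assumption; the scaffolded output may say Exact.
cat > ing.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app
  namespace: web
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - app.example.com
    secretName: app-tls
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: webapp
            port:
              number: 80
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api-svc
            port:
              number: 8080
EOF
```

The Secret and the Ingress must live in the same namespace, which is why both carry web here.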

Key takeaway: Ingress needs an Ingress Controller. Always set ingressClassName (or the legacy kubernetes.io/ingress.class annotation) so the right controller picks the resource up.
⚡ Mini-quiz
Practise Ingress + TLS scenarios → quick quiz (5 questions).
Lesson 4.3 NetworkPolicy & CNI plugins

By default every Pod can talk to every other Pod. NetworkPolicy is how you lock that down — but only if your CNI actually enforces it. Calico does; Flannel alone does not. CKA NetworkPolicy questions reward careful reading of direction (ingress / egress), scope (which Pods are selected), and source (podSelector / namespaceSelector / ipBlock).

Key concepts
  • Default behaviour: all Pods can communicate with all other Pods. NetworkPolicy restricts this. Policies are namespace-scoped.
  • Selectors: a NetworkPolicy uses podSelector to pick which Pods it applies to, plus namespaceSelector and ipBlock for source/destination matching.
  • Ingress vs egress: ingress rules control incoming traffic to selected Pods; egress rules control outgoing traffic from selected Pods. A NetworkPolicy that selects Pods but has no ingress/egress rules blocks all traffic in that direction.
  • Default deny: create a NetworkPolicy selecting all Pods with empty ingress: [] to block all incoming traffic in a namespace; do the same with egress: [] for outgoing. Then explicitly re-allow what's needed.
  • Example rule shape: allow traffic to Pods with label app: backend only from Pods labeled app: frontend: podSelector: {matchLabels: {app: backend}} + ingress from: [{podSelector: {matchLabels: {app: frontend}}}].
  • CNI enforcement matters: Calico (most-used in CKA labs) supports policy natively. Flannel does not — policies are accepted by the API but never enforced. Weave Net supports policy + optional encryption.
Concrete example

Task: in the secure namespace, allow Pods labeled role=db to receive traffic only from Pods labeled role=api on port 5432, and block everything else. Scaffold: build a YAML with podSelector: {matchLabels: {role: db}}, policyTypes: [Ingress], and one ingress rule from: [{podSelector: {matchLabels: {role: api}}}], ports: [{port: 5432, protocol: TCP}]. Apply and test: kubectl exec api-pod -n secure -- nc -zv db-svc 5432 succeeds; kubectl exec other-pod -n secure -- nc -zv db-svc 5432 times out. Always test both the allowed and the denied path explicitly.
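Written out in full, the policy described above could look like this. A sketch; the policy name db-allow-api and the file name are hypothetical.

```shell
# NetworkPolicy for the example: Pods labeled role=db accept TCP 5432 only from
# Pods labeled role=api in the same namespace. All other ingress to them is dropped.
cat > db-allow-api.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: db-allow-api
  namespace: secure
spec:
  podSelector:
    matchLabels:
      role: db
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: api
    ports:
    - protocol: TCP
      port: 5432
EOF
```

Because from and ports sit in the same rule entry, both conditions must hold: the source must be a role=api Pod and the destination port must be 5432.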

Key takeaway: NetworkPolicy is default-allow until you write one. Selectors define WHICH Pods. Ingress + egress + direction matter. Calico enforces; Flannel does not. Always test with kubectl exec ... -- nc -zv.
⚡ Mini-quiz
Drill NetworkPolicy + CNI scenarios → study mode (10 questions).

Test your knowledge on Domains 1–4 before moving to storage, security, and troubleshooting.

05

Storage · 3 lessons

The 10%-weight Storage domain. PersistentVolume + PVC + StorageClass lifecycle and access modes, the ephemeral / node-level / configMap / Secret / NFS volume types, plus StatefulSet + volumeClaimTemplates + CSI driver model with binding modes (Immediate vs WaitForFirstConsumer). Stateful workload questions are the Storage domain's bread and butter.

pv-pvc storageclass access-modes emptydir configmap-volume statefulset csi-driver binding-modes
~4h
Lesson 5.1 PersistentVolumes, PVCs & StorageClasses

The PV/PVC pattern decouples storage from Pods. StorageClass enables dynamic provisioning. CKA tests the binding logic — size, access mode, storageClassName — and the diagnostic flow when a PVC is stuck Pending.

Key concepts
  • PersistentVolume (PV): a piece of storage provisioned by an administrator or dynamically; has its own lifecycle independent of any Pod.
  • PersistentVolumeClaim (PVC): a request for storage by a user; specifies size and access mode; Kubernetes binds the PVC to a matching PV.
  • StorageClass: defines the "class" of storage (provisioner, reclaim policy, parameters); enables dynamic provisioning. kubectl get storageclass lists available classes; the default is annotated.
  • Access modes: ReadWriteOnce (RWO: single node), ReadOnlyMany (ROX: many nodes read-only), ReadWriteMany (RWX: many nodes read-write, needs NFS/CephFS), ReadWriteOncePod (RWOP: single Pod; alpha in K8s 1.22, GA in 1.29).
  • Reclaim policies: Delete (PV deleted with PVC; default for dynamic), Retain (PV kept for manual cleanup), Recycle (deprecated).
  • Mount in a Pod: reference PVC name in volumes section, then volumeMounts in the container spec. PVC status Bound means it is ready to mount.
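The binding rules in the list above can be sketched with a statically provisioned pair. All names, the manual class name, the sizes, and the hostPath location are illustrative assumptions; hostPath is used only because it works on a single-node lab cluster.

```yaml
# Static PV: 5Gi, RWO, class "manual" (no StorageClass object is required
# for static binding; the string only has to match on both sides).
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-demo
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: manual
  hostPath:
    path: /mnt/data
---
# PVC: binds because 2Gi <= 5Gi, the access mode matches, and the class matches.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-demo
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: manual
  resources:
    requests:
      storage: 2Gi
---
# Pod: references the claim by name, then mounts the volume in the container.
apiVersion: v1
kind: Pod
metadata:
  name: pvc-consumer
spec:
  containers:
    - name: app
      image: busybox
      command: ["sh", "-c", "sleep 3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: pvc-demo
```

Changing any one of the three matching fields on the PVC (for example requesting 10Gi, or ReadWriteMany) is a quick way to reproduce a Pending claim for practice.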
Concrete example

Task: a PVC data-claim is stuck Pending. Diagnose: kubectl describe pvc data-claim shows a ProvisioningFailed event: storageclass.storage.k8s.io "fast" not found. Inspect: kubectl get sc shows standard (default) exists, but the PVC requested storageClassName: fast, which doesn't exist. Two fixes: create the missing fast StorageClass, or delete the PVC and recreate it with storageClassName: standard (or with the field omitted so the default applies); the field cannot be changed on an existing claim. After the fix, kubectl get pvc shows Bound and a dynamically provisioned PV.
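A recreated claim for this scenario could look like the sketch below. The 1Gi size and the access mode are assumptions, since the task does not state them.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  accessModes:
    - ReadWriteOnce          # assumed; the task does not specify a mode
  storageClassName: standard # the class that kubectl get sc actually lists
  resources:
    requests:
      storage: 1Gi           # assumed size
```

Omitting storageClassName entirely has the same effect here, because standard is annotated as the default class.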

Key takeaway: a Pending PVC usually means size, access mode, or storageClassName doesn't match (or the StorageClass uses WaitForFirstConsumer and no Pod has consumed the claim yet). Always check kubectl get pv,pvc + kubectl get sc. An RWX claim never binds to an RWO PV.
⚡ Mini-quiz
Drill PV / PVC binding scenarios → study mode (10 questions).
Lesson 5.2 Volume types — emptyDir, hostPath, ConfigMap, Secret, NFS

Not every volume needs a PV. ConfigMap and Secret volumes are the canonical way to inject configuration as files; emptyDir handles in-Pod scratch space; hostPath touches the node filesystem (dangerous, use sparingly); NFS gives you ReadWriteMany without a CSI driver.

Key concepts
  • emptyDir: created empty when a Pod is assigned to a node; exists for the lifetime of the Pod; shared between containers in the same Pod; useful for scratch space, caching, sidecar communication. Memory-backed: emptyDir: {medium: Memory}.
  • hostPath: mounts a file or directory from the host node filesystem into the Pod; powerful but dangerous; creates tight coupling to a specific node; use only for node-level system components (DaemonSets). Types: Directory, File, Socket, DirectoryOrCreate, FileOrCreate.
  • configMap volume: mounts ConfigMap data as files in the Pod; each key becomes a filename; updates to the ConfigMap propagate to the Pod (with a short delay, and not for subPath mounts), unlike env-var injection, which requires a Pod restart.
  • secret volume: mounts Secret data as files; stored in tmpfs (RAM) for security; each key becomes a filename in the mount path.
  • projected volume: combines ConfigMaps + Secrets + downward API + ServiceAccount tokens into one mount path.
  • NFS volumes: let multiple Pods across multiple nodes share the same filesystem (ReadWriteMany). Specify nfs: {server: <nfs-server-ip>, path: /exports/data}. No provisioner needed — the NFS server must be reachable from all nodes.
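As a small illustration of the emptyDir bullet above, the sketch below shares a memory-backed scratch volume between two containers in one Pod. The Pod name, container names, and the busybox loops are all hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: scratch-demo
spec:
  containers:
    - name: writer
      image: busybox
      # Writes a timestamp into the shared volume every 5 seconds.
      command: ["sh", "-c", "while true; do date > /scratch/now; sleep 5; done"]
      volumeMounts:
        - name: scratch
          mountPath: /scratch
    - name: reader
      image: busybox
      # Reads the same file through its own mount of the same volume.
      command: ["sh", "-c", "while true; do cat /scratch/now; sleep 5; done"]
      volumeMounts:
        - name: scratch
          mountPath: /scratch
  volumes:
    - name: scratch
      emptyDir:
        medium: Memory   # tmpfs; drop this line for disk-backed scratch space
```

The volume disappears with the Pod, which is exactly why emptyDir suits caches and sidecar hand-offs but not data you need to keep.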
Concrete example

Task: inject a ConfigMap of nginx config files into a Pod, and have it pick up updates without a restart. Create: kubectl create configmap nginx-conf --from-file=./conf.d/. Mount as volume (not env var) in the Pod spec: each file in conf.d/ becomes a file under /etc/nginx/conf.d. Edit the ConfigMap with kubectl edit configmap nginx-conf and within ~60s the mounted files update in the Pod (nginx itself still needs a reload, e.g. nginx -s reload, to apply them). If you had used valueFrom.configMapKeyRef as an env var, you'd need to delete and recreate the Pod to pick up changes.
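The volume mount described in this task might be written as follows. The Pod and volume names are assumptions; nginx-conf is the ConfigMap created in the task.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
    - name: nginx
      image: nginx
      volumeMounts:
        - name: conf
          mountPath: /etc/nginx/conf.d   # each ConfigMap key appears as a file here
          readOnly: true
  volumes:
    - name: conf
      configMap:
        name: nginx-conf                 # created with kubectl create configmap
```

Mounting the whole directory (rather than a single file via subPath) is what keeps the live-update behaviour.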

Key takeaway: ConfigMap as env var = restart to update. ConfigMap as volume = live update. Secret volumes are tmpfs-backed. hostPath ties a Pod to a node — only use in DaemonSets.
⚡ Mini-quiz
Practise volume-type scenarios → quick quiz (5 questions).
Lesson 5.3 StatefulSets, binding modes & CSI drivers

Stateful workloads need stable network IDs and per-Pod persistent storage — that's what StatefulSets give you. volumeClaimTemplates auto-creates a PVC per Pod; the CSI driver model is how cloud / on-prem storage plugs in. Binding modes (WaitForFirstConsumer vs Immediate) decide when the PV is actually provisioned and which zone it lives in.

Key concepts
  • StatefulSet vs Deployment: Deployment for stateless interchangeable Pods; StatefulSet for ordered, named Pods (pod-0, pod-1) with stable network identity and per-Pod persistent storage. Scale order is deterministic (0 → N-1 up, N-1 → 0 down).
  • volumeClaimTemplates: a StatefulSet field that auto-creates a PVC per Pod replica. Scale 1 → 3 and you get data-pod-0, data-pod-1, data-pod-2 claims, each bound to a distinct PV. Survives Pod deletion.
  • Headless service for stable network ID: StatefulSets require a headless Service (clusterIP: None) so each Pod gets a DNS name like pod-0.<service>.<namespace>.svc.cluster.local.
  • WaitForFirstConsumer binding mode: StorageClass attribute volumeBindingMode: WaitForFirstConsumer delays PV creation until a Pod actually consumes the PVC — the provisioner places the PV in the zone where the Pod is scheduled. Critical for multi-AZ clusters; the default Immediate mode can pin the PV to the wrong zone.
  • CSI driver model: Container Storage Interface is the plugin API every modern storage backend implements. Popular drivers: aws-ebs-csi, gcp-pd-csi, azure-disk-csi, longhorn, rook-ceph, openebs. The driver typically runs as a node DaemonSet plus a controller Deployment in the cluster, and a StorageClass references it by provisioner name.
  • Snapshots via VolumeSnapshotClass: CSI drivers can implement the snapshot API. VolumeSnapshot object captures a PVC; restore by creating a new PVC with dataSource referencing the snapshot. Used for backups and clone-for-test workflows.
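The snapshot and restore flow from the last bullet can be sketched as two objects. This assumes the CSI snapshot CRDs and a snapshot-capable driver are installed; the class names csi-snapclass and ebs-wffc, and the source claim data-postgres-0, are hypothetical.

```yaml
# 1. Capture an existing PVC.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: data-snap
spec:
  volumeSnapshotClassName: csi-snapclass   # hypothetical VolumeSnapshotClass
  source:
    persistentVolumeClaimName: data-postgres-0
---
# 2. Restore by creating a new PVC whose dataSource points at the snapshot.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-restore
spec:
  storageClassName: ebs-wffc               # hypothetical class backed by the same driver
  dataSource:
    name: data-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi                        # must be at least the snapshot's size
```

The snapshot and the new claim live in the same namespace as the source PVC.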
Concrete example

Task: deploy a 3-replica PostgreSQL StatefulSet with each Pod backed by its own 10Gi PVC, in a cluster running the AWS EBS CSI driver. StorageClass: ensure one with provisioner: ebs.csi.aws.com + volumeBindingMode: WaitForFirstConsumer exists. Headless Service: postgres with clusterIP: None for stable DNS. StatefulSet: serviceName: postgres, replicas: 3, volumeClaimTemplates with accessModes: [ReadWriteOnce] + storage: 10Gi. Apply and watch: kubectl get pods -l app=postgres shows Pods created sequentially (0, then 1, then 2); kubectl get pvc shows three PVCs data-postgres-0/1/2, each bound to a CSI-provisioned PV in the zone where the corresponding Pod landed.
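A compact sketch of the three objects in this task follows. The StorageClass name ebs-wffc, the postgres:16 tag, the pg-secret Secret, and the PGDATA subdirectory are assumptions added to make the example complete.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-wffc                       # hypothetical name
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  clusterIP: None                      # headless: per-Pod DNS records
  selector:
    app: postgres
  ports:
    - port: 5432
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres                # must name the headless Service
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16           # assumed tag
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: pg-secret      # hypothetical Secret, created separately
                  key: password
            - name: PGDATA             # subdirectory avoids lost+found on a fresh disk
              value: /var/lib/postgresql/data/pgdata
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data                     # PVCs become data-postgres-0/1/2
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: ebs-wffc
        resources:
          requests:
            storage: 10Gi
```

Note that this gives three independent PostgreSQL instances with their own disks; replication between them is a separate concern that the manifest does not address.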

Key takeaway: StatefulSet + headless Service + volumeClaimTemplates is the stateful trio. WaitForFirstConsumer is the right binding mode for multi-AZ. CSI drivers expose every modern storage backend through one common StorageClass interface.
⚡ Mini-quiz
Drill StatefulSet + CSI scenarios → study mode (10 questions).
06

Security · 3 lessons

Security context fields, Pod Security Admission (the post-1.25 replacement for PodSecurityPolicy), ServiceAccount + RBAC patterns for workloads, Secret types + image-pull-secret wiring, and a default-deny NetworkPolicy starting point. The exam routinely asks you to lock a Pod to non-root with read-only root filesystem, or to wire a Pod to a custom ServiceAccount with a specific RBAC binding.

security-context pod-security-admission runasnonroot service-account rbac secrets image-pull-secret default-deny
~5h
Lesson 6.1 Pod security & security contexts

Pod Security Admission replaced PodSecurityPolicy in 1.25 — it's a built-in admission controller you configure per namespace with labels. Security contexts are the per-Pod knobs (runAsNonRoot, readOnlyRootFilesystem, dropped capabilities) that satisfy the restricted PSA level.

Key concepts
  • Pod Security Admission (PSA): Kubernetes 1.25+ replaces PodSecurityPolicy with a built-in admission controller — no extra installation needed.
  • Three policy levels: privileged (unrestricted), baseline (minimally restrictive, prevents known privilege escalations), restricted (current hardening best practices — non-root, no escalation, dropped caps, seccomp).
  • Applied per namespace: label the namespace, e.g. pod-security.kubernetes.io/enforce: restricted, plus optional audit / warn modes that log or warn instead of reject.
  • Security context fields: runAsNonRoot: true, runAsUser: 1000, readOnlyRootFilesystem: true, allowPrivilegeEscalation: false, capabilities: {drop: ["ALL"], add: ["NET_BIND_SERVICE"]}.
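  • Pod-level vs container-level: pod-level securityContext applies to all containers; container-level overrides pod-level for fields that exist at both levels. allowPrivilegeEscalation, readOnlyRootFilesystem, and capabilities exist only at container level. Always check both when debugging.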
  • Discovery: when you forget a field name, kubectl explain pod.spec.securityContext or pod.spec.containers.securityContext shows valid keys + types.
Concrete example

Task: in namespace secure, run an nginx Pod that satisfies the restricted PSA level. Label the namespace: kubectl label ns secure pod-security.kubernetes.io/enforce=restricted. Scaffold the Pod: kubectl run web --image=nginxinc/nginx-unprivileged -n secure --dry-run=client -o yaml > web.yaml. Edit to add a container-level securityContext: {runAsNonRoot: true, runAsUser: 101, allowPrivilegeEscalation: false, capabilities: {drop: ["ALL"]}, readOnlyRootFilesystem: true, seccompProfile: {type: RuntimeDefault}}. Because the root filesystem is read-only, also mount an emptyDir at /tmp, where this image is expected to write its pid and temp files. Apply: Pod is admitted. Apply the unedited scaffold instead and it is rejected: PSA evaluates the Pod spec, not the image contents, and the spec is missing the fields the restricted level requires.
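One possible shape for the edited web.yaml is sketched below. The split between pod-level and container-level fields and the /tmp emptyDir are assumptions about how to make the image run with a read-only root filesystem; verify against your image version.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
  namespace: secure
spec:
  securityContext:                 # pod-level: inherited by every container
    runAsNonRoot: true
    runAsUser: 101
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: web
      image: nginxinc/nginx-unprivileged
      securityContext:             # container-level only fields
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      volumeMounts:
        - name: tmp
          mountPath: /tmp          # writable scratch for pid and temp files (assumption)
  volumes:
    - name: tmp
      emptyDir: {}
```

If the Pod is admitted but then crashes, kubectl logs web -n secure usually names the path it could not write to, which tells you where another emptyDir is needed.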

Key takeaway: PSA gates the namespace via labels. Security context fields you must memorise: runAsNonRoot, runAsUser, readOnlyRootFilesystem, allowPrivilegeEscalation, capabilities. Use kubectl explain to recall syntax.
⚡ Mini-quiz
Drill security context + PSA scenarios → study mode (10 questions).
Lesson 6.2 ServiceAccounts & RBAC best practices

Every Pod runs as a ServiceAccount — the question is which one. Three-step ritual: create the SA, create a Role/RoleBinding for it, wire the Pod via serviceAccountName. CKA frequently asks for exactly this chain and for least-privilege auditing with kubectl auth can-i.

Key concepts
  • Default SA: every Pod runs with a ServiceAccount; if you don't set one, it defaults to default in the namespace.
  • Token mount: the SA token is auto-mounted at /var/run/secrets/kubernetes.io/serviceaccount/token — used by applications to authenticate to the API server.
  • Create a dedicated SA: kubectl create serviceaccount my-app-sa -n production. Wire to a Pod: set spec.serviceAccountName: my-app-sa.
  • Disable auto-mount: set automountServiceAccountToken: false on the Pod or SA when the workload doesn't talk to the API.
  • Kubernetes 1.24+ tokens: SA tokens are no longer auto-created as long-lived Secrets. Use the TokenRequest API or manually create a Secret with type: kubernetes.io/service-account-token.
  • Least privilege: grant only the verbs and resources the workload needs; prefer namespace-scoped Roles over ClusterRoles. Audit: kubectl auth can-i --list --as=system:serviceaccount:<ns>:<sa>.
Concrete example

Task: give a Pod read-only access to Pods + Services in its own namespace via a custom ServiceAccount. Create SA: kubectl create serviceaccount app-sa -n web. Create Role: kubectl create role app-reader --verb=get,list,watch --resource=pods,services -n web. Bind: kubectl create rolebinding app-binding --role=app-reader --serviceaccount=web:app-sa -n web. Wire the Pod: add spec.serviceAccountName: app-sa. Verify: kubectl auth can-i list pods --as=system:serviceaccount:web:app-sa -n web = yes; same against another namespace = no.
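The imperative commands in this task correspond roughly to the declarative objects below, which is useful when the exam asks you to fix an existing manifest rather than create one. The Pod name and image are illustrative assumptions.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-sa
  namespace: web
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-reader
  namespace: web
rules:
  - apiGroups: [""]                  # "" = core API group (pods, services)
    resources: ["pods", "services"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-binding
  namespace: web
subjects:
  - kind: ServiceAccount
    name: app-sa
    namespace: web
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: app-reader
---
apiVersion: v1
kind: Pod
metadata:
  name: app                          # hypothetical Pod
  namespace: web
spec:
  serviceAccountName: app-sa
  containers:
    - name: app
      image: nginx                   # placeholder image
```

roleRef is immutable once created, so a binding that points at the wrong Role has to be deleted and recreated.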

Key takeaway: three steps for SA-RBAC — create SA, create Role + RoleBinding referencing the SA, set serviceAccountName in Pod spec. Verify with kubectl auth can-i --as=system:serviceaccount:<ns>:<sa>.
⚡ Mini-quiz
Practise ServiceAccount + RBAC scenarios → quick quiz (5 questions).
Lesson 6.3 Secrets management & image security

Secrets are base64-encoded, not encrypted by default — etcd encryption at rest is a separate kube-apiserver flag. The exam tests creation, both injection patterns (env var vs volume), and the imagePullSecrets dance for private registries.

Key concepts
  • Secret types: Opaque (generic key-value, base64-encoded, not encrypted by default); kubernetes.io/tls (TLS cert + key — fields tls.crt, tls.key); kubernetes.io/dockerconfigjson (registry credentials for private images).
  • Imperative creation: kubectl create secret generic db-creds --from-literal=password=s3cr3t, kubectl create secret tls my-tls --cert=cert.pem --key=key.pem, kubectl create secret docker-registry registry-creds --docker-server=registry.example.com --docker-username=... --docker-password=....
  • Inject as env var: valueFrom: {secretKeyRef: {name: db-creds, key: password}} — requires Pod restart to pick up changes.
  • Inject as volume: Secret keys become files at the mount path (tmpfs-backed) — updates propagate automatically.
  • Private registry pull: create a dockerconfigjson Secret, reference it with imagePullSecrets: [{name: registry-creds}] in the Pod spec. Or patch the SA: kubectl patch sa default -p '{"imagePullSecrets":[{"name":"registry-creds"}]}' so all Pods using that SA auto-use it.
  • Production hardening: enable etcd encryption at rest by configuring an EncryptionConfiguration object and referencing it in the kube-apiserver manifest with --encryption-provider-config. Pair with a default-deny NetworkPolicy + explicit DNS allow to limit blast radius.
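For the etcd encryption bullet above, a minimal EncryptionConfiguration might look like the sketch below. The key name key1 is arbitrary, and the secret value is a placeholder you must replace with your own base64-encoded 32-byte key.

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      # First provider is used for writes: new and updated Secrets are encrypted.
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded-32-byte-key>   # placeholder, not a real key
      # identity lets the API server still read Secrets stored before encryption.
      - identity: {}
```

The file has to be readable inside the kube-apiserver static Pod, so the manifest also needs a hostPath volume and mount for it in addition to the --encryption-provider-config flag.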
Concrete example

Task: pull a private image into a Pod, with the registry password stored as a Secret. Create the pull secret: kubectl create secret docker-registry regcreds --docker-server=registry.example.com --docker-username=ci --docker-password=$REG_PASS --docker-email=ci@example.com. Reference it in the Pod spec: imagePullSecrets: [{name: regcreds}] + containers[0].image: registry.example.com/app:1.0. Verify: kubectl describe pod app shows the image pulled successfully; without the secret you'd see ImagePullBackOff with auth errors.
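The Pod spec fragment from this task, written out in full. The Pod name app matches the kubectl describe pod app call in the task; the container name is an assumption.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  imagePullSecrets:
    - name: regcreds                        # must exist in the Pod's namespace
  containers:
    - name: app
      image: registry.example.com/app:1.0   # registry host must match --docker-server
```

A pull secret in a different namespace is invisible to the Pod, which is a common cause of ImagePullBackOff even when the credentials are correct.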

Key takeaway: Secret env var = restart to update; volume = live update. Private registries need kubernetes.io/dockerconfigjson + imagePullSecrets. Enable etcd encryption at rest for production clusters.
⚡ Mini-quiz
Drill Secret + image-pull-secret scenarios → study mode (10 questions).
07

Troubleshooting · 3 lessons

The 30%-weight Troubleshooting domain — the biggest single domain and the one that decides exam pass/fail. Three layers: node-level (kubelet, runtime, journalctl), application + networking (CrashLoopBackOff, ImagePullBackOff, Service Endpoints, DNS, NetworkPolicy), and control plane + etcd (static pod manifests, api-server failures, etcdctl snapshot restore). Master these and you protect 30 points.

journalctl-kubelet crictl crashloopbackoff imagepullbackoff endpoints-debug coredns-debug etcd-health static-pod-manifest
~4h
Lesson 7.1 Node-level troubleshooting

Node NotReady is the most common cluster-level failure on the exam. The fix is almost always one of three things: kubelet stopped, container runtime stopped, or disk pressure. Cordon + drain + uncordon is the maintenance triad you'll use whenever you need to take a node out of rotation safely.

Key concepts
  • Identify problematic node: kubectl get nodes — look for NotReady or Unknown status. kubectl describe node <node-name> shows the Conditions section: MemoryPressure, DiskPressure, NetworkUnavailable.
  • Kubelet first: systemctl status kubelet — is it running? journalctl -u kubelet -n 100 --no-pager for recent logs. Look for certificate errors, container runtime failures, configuration issues.
  • Container runtime second: systemctl status containerd (or cri-o). If down, kubelet can't start Pods even if kubelet itself is healthy.
  • Disk + network: df -h — a full disk stops kubelet from running new Pods. Verify the node can reach the API server IP on port 6443 with curl -k https://<cp-ip>:6443.
  • Taints applied automatically: the node lifecycle controller adds node.kubernetes.io/unreachable:NoExecute or node.kubernetes.io/disk-pressure:NoSchedule when conditions trip; Pods without matching tolerations get evicted or refused.
  • Cordon / drain / uncordon: kubectl cordon <node> (unschedulable, keep existing Pods); kubectl drain <node> --ignore-daemonsets --delete-emptydir-data (evict + cordon); kubectl uncordon <node> (back to schedulable).
Concrete example

Task: worker-1 is NotReady. SSH and run systemctl status kubelet: the service is dead. journalctl -u kubelet -n 100 --no-pager shows "failed to load Kubelet config file: open /var/lib/kubelet/config.yaml: no such file". Diagnose: someone deleted the config. kubeadm keeps the source of truth in the kubelet-config ConfigMap in kube-system rather than in a file under /etc/kubernetes/, so restore the file from a backup or regenerate it from that ConfigMap (for example with kubeadm upgrade node phase kubelet-config). Fix, then systemctl daemon-reload && systemctl restart kubelet. Within seconds kubectl get nodes flips worker-1 back to Ready. For planned maintenance, the canonical flow is kubectl drain worker-1 --ignore-daemonsets, do work, then kubectl uncordon worker-1.

Key takeaway: NotReady = kubelet, runtime, or disk. Always check systemctl status kubelet + journalctl -u kubelet first. Drain before maintenance, uncordon after.
⚡ Mini-quiz
Drill node-troubleshooting scenarios → study mode (10 questions).
Lesson 7.2 Application + networking troubleshooting

Once the cluster is healthy, the next layer of trouble lives in Pods and Services. CrashLoopBackOff and ImagePullBackOff are the everyday symptoms; empty Service Endpoints and broken DNS are the network ones. The nicolaka/netshoot debug Pod + kubectl exec ... -- nc -zv / nslookup is the toolkit.

Key concepts
  • Pod status states: Pending (not yet scheduled; check Events for resources, taints, PVC, image pull), CrashLoopBackOff (container repeatedly crashes with exponential backoff; kubectl logs --previous), OOMKilled (memory limit exceeded), ImagePullBackOff / ErrImagePull (wrong image, missing imagePullSecret, registry unreachable), CreateContainerConfigError (missing referenced ConfigMap or Secret), Terminating stuck (usually a finalizer; force delete with kubectl delete pod <pod> --grace-period=0 --force).
  • Inspection ladder: kubectl describe pod <pod> for events; kubectl get events --sort-by=.lastTimestamp -n <ns> for a chronological stream; kubectl logs <pod> -c <container> for a specific container; kubectl logs <pod> --previous for crashed instance.
  • Connectivity tests: kubectl exec -it <pod> -- /bin/sh for an interactive shell; kubectl exec <pod> -- nc -zv <service> <port> for one-shot TCP; kubectl exec <pod> -- nslookup <service>.<ns>.svc.cluster.local for DNS.
  • Debug pod pattern: kubectl run debug --image=nicolaka/netshoot --rm -it --restart=Never -- bash — includes curl, nslookup, netstat, tcpdump, nmap. Your network swiss-army knife.
  • Service Endpoints: kubectl get endpoints <service> — empty list means the selector doesn't match any Pod labels. Number-one cause of "Service doesn't work but everything is Running".
  • NetworkPolicy debugging: temporarily delete the policy and retest; if traffic flows, the policy is over-restrictive. Always allow DNS egress (UDP/TCP 53) or every Pod loses name resolution.
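The DNS-egress advice in the last bullet can be sketched as the policy below. The web namespace and policy name are assumptions, and the k8s-app: kube-dns label is the one kubeadm clusters normally put on CoreDNS Pods; confirm it with kubectl get pods -n kube-system --show-labels.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: web
spec:
  podSelector: {}                  # every Pod in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        # namespaceSelector and podSelector in ONE list item are ANDed:
        # CoreDNS Pods inside kube-system only.
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

On its own this policy also isolates egress for the selected Pods, so only DNS gets out until further allow rules are added.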
Concrete example

Task: front-end Pod can't reach backend.web.svc.cluster.local. Shell in: kubectl exec -it frontend-abc -- sh. Test DNS: nslookup backend.web.svc.cluster.local resolves to the Service's ClusterIP, so DNS is working (a ClusterIP Service resolves even when it has no endpoints). Confirm CoreDNS anyway: kubectl get pods -n kube-system | grep coredns shows Pods Running. Test the port: nc -zv backend.web.svc.cluster.local 8080 fails. Check the Service: kubectl get svc backend -n web exists, but kubectl get endpoints backend -n web returns zero IPs. Compare labels: Service selector is app: backend, Pod labels are app: back-end. Fix the typo on one side, retest with nc -zv backend.web.svc.cluster.local 8080, and it succeeds. If you deleted a NetworkPolicy while debugging, re-apply default-deny with explicit allows if security requires it.
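For reference, a Service whose selector lines up with the Pod labels in this scenario would look like the sketch below. The targetPort value is an assumption based on the port used in the nc test.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: backend
  namespace: web
spec:
  selector:
    app: backend        # must equal the Pod label exactly; "back-end" will not match
  ports:
    - port: 8080        # port clients connect to on the Service
      targetPort: 8080  # assumed container port on the backend Pods
```

A quick cross-check is kubectl get pods -n web -l app=backend: if that returns nothing, the Endpoints list will be empty too.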

Key takeaway: kubectl describe + kubectl logs --previous for Pod failures. Empty Endpoints = selector mismatch. nicolaka/netshoot debug Pod for any network puzzle. Always allow DNS egress in restrictive NetworkPolicies.
⚡ Mini-quiz
Practise application + network debugging scenarios → quick quiz (5 questions).
Lesson 7.3 Control plane + etcd troubleshooting

When the control plane is broken, kubectl itself stops working — you fall back to crictl on the control-plane node to inspect the static-pod containers directly. The exam routinely breaks a kube-apiserver, scheduler, or controller-manager manifest, and the etcd snapshot save + restore is on virtually every test. This is the highest-leverage lesson in the entire course.

Key concepts
  • Static-pod model: kube-apiserver, kube-controller-manager, kube-scheduler, etcd all run as static Pods. Their manifests live in /etc/kubernetes/manifests/ on the control-plane node. Kubelet watches the directory and recreates Pods when the file changes.
  • When kubectl is down: use crictl ps to list running containers, crictl logs <id> to read logs, crictl inspect <id> for detail. Use the runtime config at /etc/crictl.yaml or pass --runtime-endpoint unix:///run/containerd/containerd.sock.
  • API-server failure pattern: usually a typo in the static-pod manifest — wrong --etcd-servers, missing or misspelled cert path, wrong port. Compare against a known-good manifest from a working cluster.
  • etcd health checks: ETCDCTL_API=3 etcdctl endpoint health --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key; etcdctl member list for member topology.
  • etcd snapshot save + restore: etcdctl snapshot save /backup/etcd.db ...; restore with etcdctl snapshot restore /backup/etcd.db --data-dir=/var/lib/etcd-restore (on etcd 3.5+ the etcdctl restore subcommand is deprecated in favour of etcdutl snapshot restore); then edit /etc/kubernetes/manifests/etcd.yaml to point --data-dir at the new path. Restore to a fresh dir, never on top of running data.
  • Audit logs: configure via --audit-log-path + --audit-policy-file kube-apiserver flags. Levels: None, Metadata, Request, RequestResponse — invaluable for "who deleted this".
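A small audit policy for the file referenced by --audit-policy-file might look like the sketch below. The choice of which resources get which level is an illustrative assumption; rules are evaluated top to bottom and the first match wins.

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Keep Secret and ConfigMap payloads out of the log: metadata only.
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets", "configmaps"]
  # Record full request and response for deletions ("who deleted this").
  - level: RequestResponse
    verbs: ["delete", "deletecollection"]
  # Everything else: who did what, without bodies.
  - level: Metadata
```

As with the encryption config, the policy file and the log directory both need hostPath mounts in the kube-apiserver static Pod manifest, or the API server will fail to start.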
Concrete example

Task: kubectl get pods hangs forever because the api-server is down. SSH to the control plane and run crictl ps -a | grep apiserver: the container exists in Exited state. Logs: crictl logs <id> shows "failed to connect to etcd at https://127.0.0.1:2380", which is the wrong port. Inspect manifest: cat /etc/kubernetes/manifests/kube-apiserver.yaml shows --etcd-servers=https://127.0.0.1:2380; correct port is 2379. Fix: edit the file in place; kubelet detects the change, restarts the api-server static pod within ~10s. Verify: kubectl get pods -n kube-system now responds. For a destructive recovery (the api-server is healthy but the cluster state was corrupted), start from the most recent snapshot taken with etcdctl snapshot save: stop the static api-server (move the manifest aside), run etcdctl snapshot restore to a new data dir, edit the etcd static pod manifest's --data-dir + hostPath, then put the api-server manifest back.
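After the restore, the relevant parts of /etc/kubernetes/manifests/etcd.yaml would change roughly as in this fragment. It assumes the kubeadm default volume name etcd-data and the /var/lib/etcd-restore path used earlier; all other flags, mounts, and volumes stay as they were.

```yaml
# Fragment of the etcd static Pod manifest; unrelated fields omitted.
spec:
  containers:
    - name: etcd
      command:
        - etcd
        - --data-dir=/var/lib/etcd-restore   # was /var/lib/etcd
        # ...remaining flags unchanged...
      volumeMounts:
        - name: etcd-data
          mountPath: /var/lib/etcd-restore   # must match --data-dir
  volumes:
    - name: etcd-data
      hostPath:
        path: /var/lib/etcd-restore          # the restored directory on the node
        type: DirectoryOrCreate
```

An alternative that touches fewer lines is to change only hostPath.path and leave --data-dir and mountPath at /var/lib/etcd; either way the three values must stay consistent with each other.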

Key takeaway: control plane lives as static pods under /etc/kubernetes/manifests/. When kubectl fails, crictl ps + crictl logs on the node. Restore etcd to a fresh data dir + update the static-pod manifest — this is the single highest-yield CKA item.
⚡ Mini-quiz
Drill control-plane + etcd troubleshooting scenarios → study mode (10 questions).

Capstone labs — 4 end-to-end exercises

Build a Vagrant or kind cluster (or use killer.sh) and execute each lab end-to-end under a 25-minute timer. These mirror the actual CKA performance-based question style.

Lab 1 — Cluster install with kubeadm + CNI

Bootstrap a single control-plane + one worker with kubeadm init --pod-network-cidr=192.168.0.0/16 + kubeadm join. Install Calico as the CNI (kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml). Verify CoreDNS + kube-proxy Pods are Running with kubectl get pods -n kube-system. Then drain + remove the worker (kubectl drain worker --ignore-daemonsets && kubectl delete node worker) and re-add it with a freshly-generated kubeadm token create --print-join-command.
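If you prefer a config file over flags, the --pod-network-cidr setting can be expressed as a kubeadm ClusterConfiguration and passed with kubeadm init --config. This is a sketch: the v1beta3 API version is an assumption that fits many recent releases, and newer kubeadm versions may expect v1beta4, so check kubeadm config print init-defaults on your nodes.

```yaml
apiVersion: kubeadm.k8s.io/v1beta3   # assumed; verify against your kubeadm version
kind: ClusterConfiguration
networking:
  podSubnet: 192.168.0.0/16          # same value as --pod-network-cidr in the lab
```

The flag form is quicker under exam time pressure; the file form is easier to keep in version control for repeatable lab rebuilds.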

Lab 2 — Workload rollout + scheduling

Deploy an nginx Deployment with a rolling-update strategy (maxSurge: 25%, maxUnavailable: 0) and inject CPU/memory requests + limits. Apply a taint dedicated=critical:NoSchedule to one worker and add a matching Toleration + node affinity so only critical Pods can land there. Verify placement with kubectl describe node + kubectl get pods -o wide. Update the image with kubectl set image, watch kubectl rollout status, then kubectl rollout undo as a drill.
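A sketch of the Deployment this lab describes follows. It assumes the tainted worker also carries a node label dedicated=critical (added with kubectl label node), since node affinity matches labels, not taints; the replica count, image tag, and resource figures are illustrative.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0          # never drop below the desired replica count
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      tolerations:               # lets the Pods onto the tainted node
        - key: dedicated
          operator: Equal
          value: critical
          effect: NoSchedule
      affinity:                  # forces the Pods onto the labeled node
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: dedicated
                    operator: In
                    values: ["critical"]
      containers:
        - name: nginx
          image: nginx:1.27      # assumed tag
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 250m
              memory: 256Mi
```

The toleration alone only permits scheduling on the tainted node; it is the affinity rule that keeps these Pods off every other node.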

Lab 3 — Storage + StatefulSet

Install the local-path-provisioner (Rancher) for dynamic PV creation, with a StorageClass set to volumeBindingMode: WaitForFirstConsumer. Create a headless Service + a 3-replica StatefulSet with volumeClaimTemplates (10Gi each, ReadWriteOnce). Scale from 1 → 3 with kubectl scale and watch kubectl get pvc — three PVCs auto-created and bound only when each Pod schedules. Delete a Pod and confirm its PVC survives.

Lab 4 — etcd backup + restore

Take an etcdctl snapshot save of the running cluster (full command with the three cert flags). Deploy a test Deployment kubectl create deployment marker --image=nginx AFTER the snapshot. Stop kube-apiserver by moving its manifest aside, run etcdctl snapshot restore to /var/lib/etcd-restore, update /etc/kubernetes/manifests/etcd.yaml --data-dir + hostPath, restore the api-server manifest. Verify the test Deployment is GONE (proving you rolled back to the snapshot), and any pre-snapshot resources are back.

Top 4 mistakes candidates make on the CKA

  • Writing YAML from scratch: typing out a Deployment or NetworkPolicy by hand burns 5-10 minutes per question. Always scaffold with kubectl <create|run|expose> ... --dry-run=client -o yaml | tee file.yaml, then edit. Use kubectl explain when you forget a field.
  • Forgetting kubectl drain --ignore-daemonsets before upgrades: draining without the flag fails because DaemonSet Pods (CNI, kube-proxy) can't be evicted. The upgrade step looks complete but you've left the node in a half-broken state; the exam grader catches it.
  • Mis-binding RBAC at namespace scope when the request needs cluster scope: using a RoleBinding when the resource is cluster-scoped (nodes, PVs, namespaces themselves) silently fails. ClusterRole + ClusterRoleBinding for cluster-scoped; Role + RoleBinding for namespaced. Always verify with kubectl auth can-i.
  • Restoring etcd to the same data-dir instead of a fresh one: etcdctl snapshot restore refuses if the target exists, but the more common mistake is forgetting to update the static-pod manifest's --data-dir + hostPath after restoring — kubelet just restarts etcd against the OLD data and your restore appears to have failed. Always restore to a fresh dir + update the manifest.

Ready for the CKA?

Scenario-based practice questions covering all 5 exam domains, with extra weight on Troubleshooting (30%) and Cluster Architecture (25%). Free, no signup, instant feedback on every answer.

Continue the Kubernetes path

CKA is the foundation. CKAD adds the developer perspective, CKS adds the security specialist layer, KCNA is the entry-level associate cert, and Docker DCA covers the container runtime underneath it all.
