CNCF / Kubernetes

Certified Kubernetes Administrator (CKA)

Master every domain of the CNCF Certified Kubernetes Administrator exam. This course covers the full Kubernetes architecture, installing and configuring clusters with kubeadm, managing workloads and scheduling, services and networking, persistent storage, role-based access control, security contexts, and systematic troubleshooting — with real kubectl commands, YAML examples, and exam-aligned explanations throughout.

Intermediate · 7 modules · ~35 hours · 60 practice questions

Study on the go with our IT certification podcast

Tune in to Kubernetes tips, cluster architecture breakdowns, and CKA exam strategies while commuting or working out. New episodes weekly.

Listen on Spotify

Course Modules

01
Kubernetes Architecture & Core Concepts
3 lessons · ~4 hours
Control Plane Components

The Kubernetes Control Plane

  • kube-apiserver — the front-end for the Kubernetes control plane; all internal and external communication passes through it; exposes the Kubernetes API over HTTPS
  • etcd — a consistent, highly-available key-value store used as Kubernetes' backing store for all cluster data; treat it as the source of truth
  • kube-scheduler — watches for newly created Pods with no assigned node; selects the best node based on resource requirements, policies, taints/tolerations, and affinity rules
  • kube-controller-manager — runs all controller processes as a single binary (Node Controller, Replication Controller, Endpoints Controller, ServiceAccount Controller, etc.)
  • cloud-controller-manager — interfaces with the underlying cloud provider API to manage nodes, routes, and load balancers; separates cloud-specific logic from the core controllers
The control plane is the brain of Kubernetes. On a kubeadm cluster, control plane components run as static Pods in the kube-system namespace — you can inspect them with kubectl get pods -n kube-system. Their manifests live in /etc/kubernetes/manifests/.
Know which component is responsible for what. etcd = data store. apiserver = gateway. scheduler = Pod placement. controller-manager = reconciliation loops. The exam often asks you to identify which component is failing based on symptoms.
Worker Node Components

What Runs on Every Worker Node

  • kubelet — the primary node agent; registers the node with the API server; ensures containers described in PodSpecs are running and healthy; communicates with the container runtime via CRI
  • kube-proxy — maintains network rules (iptables or IPVS) on each node that implement Services; handles traffic routing to Pod endpoints
  • Container runtime — the software responsible for running containers (containerd, CRI-O); communicates with kubelet via the Container Runtime Interface (CRI)

Key Node Concepts

  • Node status conditions: Ready, MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable
  • Inspect a node: kubectl describe node <node-name>
  • Check kubelet status: systemctl status kubelet
  • Kubelet config: /var/lib/kubelet/config.yaml
Unlike control plane components that run as static Pods, kubelet and kube-proxy run as system services managed by systemd. When a node is NotReady, check kubelet logs first: journalctl -u kubelet -n 50.
The CKA exam often includes a broken-node scenario. The most common cause is a stopped kubelet service. Always check systemctl status kubelet and journalctl -u kubelet first.
kubectl & Core API Resources

kubectl Command Syntax

  • Basic pattern: kubectl [command] [TYPE] [NAME] [flags]
  • kubectl get pods — list Pods in current namespace
  • kubectl get pods -A — list Pods across all namespaces
  • kubectl describe pod <name> — detailed info including Events
  • kubectl delete pod <name> --force --grace-period=0 — immediate deletion
  • kubectl get pod <name> -o yaml — output full resource spec as YAML
  • kubectl explain pod.spec.containers — inline API documentation

Imperative Commands (exam speed tricks)

  • kubectl run nginx --image=nginx --restart=Never — create a Pod
  • kubectl run nginx --image=nginx --dry-run=client -o yaml > pod.yaml — generate YAML without creating
  • kubectl create deployment app --image=nginx --replicas=3 — create a Deployment
  • kubectl expose deployment app --port=80 --type=ClusterIP — create a Service
  • kubectl set image deployment/app nginx=nginx:1.25 — update container image

Core API Resources

  • Pod — smallest deployable unit; one or more containers sharing network and storage
  • ReplicaSet — ensures a specified number of Pod replicas are running; typically managed by a Deployment
  • Deployment — declarative Pod management with rolling update and rollback capabilities
  • Namespace — virtual cluster partition for isolating resources; default namespaces: default, kube-system, kube-public, kube-node-lease
The CKA is a hands-on performance-based exam. Master --dry-run=client -o yaml to generate resource templates quickly instead of writing YAML from scratch. This saves enormous time.
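For reference, the skeleton that kubectl run nginx --image=nginx --restart=Never --dry-run=client -o yaml produces looks like this (trimmed of empty fields like creationTimestamp and status):

```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: nginx          # kubectl adds a run=<name> label automatically
  name: nginx
spec:
  containers:
  - image: nginx
    name: nginx
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Never  # from --restart=Never; Always is the default
```

Redirect this into a file, edit the fields you need, then kubectl apply -f — far faster than writing the manifest by hand.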
02
Cluster Installation & Configuration
3 lessons · ~5 hours
kubeadm: Bootstrapping a Cluster

kubeadm init Workflow

  • Pre-flight checks: swap disabled, required ports open, container runtime running
  • kubeadm init --pod-network-cidr=192.168.0.0/16 --apiserver-advertise-address=<IP> — initialize control plane
  • After init, copy kubeconfig: mkdir -p $HOME/.kube && cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  • Install a CNI plugin (e.g. Calico): kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
  • Print join command: kubeadm token create --print-join-command

kubeadm join Workflow

  • Run the join command on each worker node as root: kubeadm join <apiserver-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>
  • Bootstrap tokens expire after 24 hours by default — regenerate with kubeadm token create
  • Verify nodes joined: kubectl get nodes

etcd Backup & Restore

  • Take a snapshot: ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
  • Verify snapshot: ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db
  • Restore: ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db --data-dir=/var/lib/etcd-restore, then update the etcd static Pod manifest to point to the new data directory
The etcd backup/restore task appears in almost every CKA exam. Memorize the full etcdctl snapshot save command with all three certificate flags. The certs are always in /etc/kubernetes/pki/etcd/.
Set ETCDCTL_API=3 before running etcdctl commands. etcdctl releases older than 3.4 default to the v2 API, where the snapshot commands differ or fail entirely; 3.4+ defaults to v3, but setting the variable explicitly works everywhere and is a safe exam habit.
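After a restore, the edit to the etcd static Pod manifest is typically just repointing the data volume. A sketch of the relevant fragment, using the restore path from the example above (the rest of /etc/kubernetes/manifests/etcd.yaml stays unchanged):

```yaml
# /etc/kubernetes/manifests/etcd.yaml — volumes fragment only
volumes:
- hostPath:
    path: /var/lib/etcd-restore   # was /var/lib/etcd before the restore
    type: DirectoryOrCreate
  name: etcd-data
```

The kubelet watches the manifests directory and recreates the etcd Pod automatically when the file changes.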
Cluster Upgrade & kubeconfig Management

Upgrading a Cluster with kubeadm

  • Upgrades must happen one minor version at a time (e.g., 1.28 → 1.29, never 1.28 → 1.30)
  • Control plane first: apt-mark unhold kubeadm && apt-get install -y kubeadm=1.29.0-00 && apt-mark hold kubeadm
  • Check upgrade plan: kubeadm upgrade plan
  • Apply upgrade: kubeadm upgrade apply v1.29.0
  • Drain control plane node: kubectl drain <node> --ignore-daemonsets
  • Upgrade kubelet and kubectl: apt-get install -y kubelet=1.29.0-00 kubectl=1.29.0-00, then systemctl daemon-reload && systemctl restart kubelet
  • Uncordon: kubectl uncordon <node>
  • For each worker node: drain it, upgrade the kubeadm package, run kubeadm upgrade node on the worker, upgrade kubelet and kubectl, restart kubelet, then uncordon

kubeconfig: Managing Cluster Access

  • Default location: ~/.kube/config — can be overridden with KUBECONFIG env var
  • List contexts: kubectl config get-contexts
  • Switch context: kubectl config use-context <context-name>
  • View current context: kubectl config current-context
  • Set default namespace for a context: kubectl config set-context --current --namespace=dev
  • Merge multiple kubeconfig files: KUBECONFIG=~/.kube/config:~/.kube/prod-config kubectl config view --merge --flatten > ~/.kube/merged-config
The CKA exam gives you multiple clusters. Always verify which cluster you are on with kubectl config current-context before running commands. Switching to the wrong context is a common exam mistake.
RBAC & TLS Certificate Management

Role-Based Access Control (RBAC)

  • Role — namespaced; grants permissions to resources within a specific namespace
  • ClusterRole — cluster-wide; can grant access to cluster-scoped resources (nodes, PVs) or any namespace
  • RoleBinding — binds a Role or ClusterRole to subjects (users, groups, ServiceAccounts) within a namespace
  • ClusterRoleBinding — binds a ClusterRole to subjects across the entire cluster

Creating RBAC Resources

  • Create a Role: kubectl create role pod-reader --verb=get,list,watch --resource=pods -n dev
  • Create a RoleBinding: kubectl create rolebinding dev-binding --role=pod-reader --user=jane -n dev
  • Test access: kubectl auth can-i get pods --as=jane -n dev
  • Inspect effective permissions: kubectl auth can-i --list --as=jane -n dev

TLS Certificates & CSRs

  • View cluster certificate details: openssl x509 -in /etc/kubernetes/pki/apiserver.crt -text -noout
  • Check certificate expiry: kubeadm certs check-expiration
  • Renew all certificates: kubeadm certs renew all
  • Create a user CSR: generate private key with openssl genrsa -out jane.key 2048, then openssl req -new -key jane.key -subj "/CN=jane/O=dev-team" -out jane.csr
  • Submit CSR to Kubernetes: create a CertificateSigningRequest object with the base64-encoded CSR
  • Approve CSR: kubectl certificate approve <csr-name>
Use kubectl auth can-i to verify your RBAC rules work before submitting your exam answer. A quick test like kubectl auth can-i create deployments --as=system:serviceaccount:dev:mysa -n dev confirms permissions instantly.
RBAC is heavily tested. Know the difference: Role + RoleBinding = namespaced. ClusterRole + ClusterRoleBinding = cluster-wide. You CAN bind a ClusterRole with a RoleBinding — it limits the ClusterRole's permissions to that namespace only.
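As a reference, the declarative equivalent of the imperative commands above — same names (pod-reader, jane, dev) — looks like this:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: dev
rules:
- apiGroups: [""]            # "" is the core API group; Pods live there
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dev-binding
  namespace: dev
subjects:
- kind: User
  name: jane
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role                 # could also be ClusterRole, scoped to dev by this binding
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```

Verify with kubectl auth can-i list pods --as=jane -n dev before moving on.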
03
Workloads & Scheduling
3 lessons · ~4 hours
Deployments: Rolling Updates, Rollbacks & Scaling

Deployment Strategy

  • RollingUpdate (default) — gradually replaces old Pods with new ones; configurable with maxSurge and maxUnavailable
  • Recreate — terminates all old Pods before creating new ones; causes downtime
  • Update image: kubectl set image deployment/webapp nginx=nginx:1.25 — the --record flag is deprecated; set the kubernetes.io/change-cause annotation instead to label revisions in rollout history
  • Check rollout status: kubectl rollout status deployment/webapp
  • View rollout history: kubectl rollout history deployment/webapp
  • Rollback to previous revision: kubectl rollout undo deployment/webapp
  • Rollback to specific revision: kubectl rollout undo deployment/webapp --to-revision=2

Scaling

  • Manual scale: kubectl scale deployment webapp --replicas=5
  • HorizontalPodAutoscaler: kubectl autoscale deployment webapp --min=2 --max=10 --cpu-percent=70
  • Pause a rollout: kubectl rollout pause deployment/webapp; resume it: kubectl rollout resume deployment/webapp

DaemonSets, StatefulSets, Jobs & CronJobs

  • DaemonSet — ensures one Pod runs on every node (or a subset via nodeSelector); used for log collectors, monitoring agents, CNI plugins
  • StatefulSet — for stateful applications; Pods get stable network identifiers (pod-0, pod-1) and persistent volume claims per Pod; ordered deployment and scaling
  • Job — runs one or more Pods to completion; use completions and parallelism to control batch execution
  • CronJob — creates Jobs on a cron schedule; schedule: "*/5 * * * *" runs every 5 minutes
Know when to use each workload type: Deployment for stateless apps, StatefulSet for databases, DaemonSet for node agents, Job for one-time tasks, CronJob for scheduled tasks. The exam will describe a scenario and ask you to pick or create the right resource.
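A minimal CronJob manifest, using the schedule from the bullet above (name and command are placeholders):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cleanup               # hypothetical name
spec:
  schedule: "*/5 * * * *"     # every 5 minutes
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure   # Jobs require OnFailure or Never
          containers:
          - name: cleanup
            image: busybox
            command: ["sh", "-c", "echo cleanup run"]
```

The jobTemplate nests a full Job spec, which in turn nests a Pod template — three levels of spec is the part candidates most often get wrong under time pressure.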
Resource Management: Requests, Limits & Quotas

Container Resources

  • requests — the minimum resources a container needs; used by the scheduler for placement decisions
  • limits — the maximum resources a container can use; enforced by the kubelet; exceeding CPU limit causes throttling; exceeding memory limit causes OOMKill
  • Example YAML: resources: {requests: {cpu: "250m", memory: "128Mi"}, limits: {cpu: "500m", memory: "256Mi"}}
  • CPU units: 1 CPU = 1000m (millicores); memory: Ki, Mi, Gi

LimitRange

  • Set default requests/limits and min/max constraints per container/Pod/PVC within a namespace
  • If a Pod has no resource requests/limits and a LimitRange exists in the namespace, the LimitRange defaults are applied automatically
  • Create: kubectl create -f limitrange.yaml, inspect: kubectl describe limitrange -n dev

ResourceQuota

  • Limits total resource consumption within a namespace: CPU, memory, number of objects (Pods, Services, PVCs, etc.)
  • If a namespace has a ResourceQuota for CPU/memory, all Pods MUST specify requests and limits — otherwise they will be rejected
  • Check quota usage: kubectl describe resourcequota -n dev
A common exam scenario: a Pod fails to schedule in a namespace. Check if a ResourceQuota exists and whether the Pod's requests are within the remaining quota. Also check if a LimitRange is requiring requests/limits that the Pod doesn't define.
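A sketch of a ResourceQuota and LimitRange for the dev namespace (names and values are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-quota
  namespace: dev
spec:
  hard:
    requests.cpu: "2"        # sum of all Pod requests may not exceed this
    requests.memory: 2Gi
    limits.cpu: "4"
    limits.memory: 4Gi
    pods: "10"               # object-count quota
---
apiVersion: v1
kind: LimitRange
metadata:
  name: dev-defaults
  namespace: dev
spec:
  limits:
  - type: Container
    default:                 # injected as limits when a container omits them
      cpu: 500m
      memory: 256Mi
    defaultRequest:          # injected as requests when omitted
      cpu: 250m
      memory: 128Mi
```

With both in place, Pods without explicit requests/limits still pass the quota check because the LimitRange fills in the defaults.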
Taints, Tolerations & Node Affinity

Taints & Tolerations

  • Taint — applied to a node to repel Pods that don't tolerate it: kubectl taint nodes node1 key=value:NoSchedule
  • Taint effects: NoSchedule (no new Pods), PreferNoSchedule (avoid if possible), NoExecute (evict existing Pods too)
  • Remove a taint: kubectl taint nodes node1 key=value:NoSchedule- (append minus)
  • Toleration — added to a Pod spec to allow scheduling onto tainted nodes

Node Affinity

  • requiredDuringSchedulingIgnoredDuringExecution — hard requirement; Pod won't schedule if no matching node exists
  • preferredDuringSchedulingIgnoredDuringExecution — soft preference; Pod schedules elsewhere if no match
  • Uses node labels; label a node: kubectl label nodes node1 disktype=ssd
  • Pod Affinity/Anti-affinity — schedule Pods relative to other Pods (e.g., co-locate app and cache, or spread replicas across nodes)

Pod Priority & Preemption

  • Define a PriorityClass with a value (higher = more important); assign to Pod with priorityClassName
  • When a high-priority Pod can't schedule due to resource constraints, the scheduler may evict lower-priority Pods to make room
  • Built-in priority classes: system-cluster-critical and system-node-critical for core Kubernetes components
Taints repel; tolerations allow. Node affinity attracts. If a Pod is stuck Pending, check: (1) Does a node have a taint the Pod doesn't tolerate? (2) Does node affinity match any available node? (3) Are there sufficient resources?
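Putting both mechanisms together: a Pod that tolerates the key=value:NoSchedule taint and requires a disktype=ssd node, using the exact key/label names from the commands above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ssd-pod
spec:
  tolerations:
  - key: "key"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:   # hard requirement
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values: ["ssd"]
  containers:
  - name: app
    image: nginx
```

Remember the asymmetry: the toleration only permits scheduling onto the tainted node, while the affinity rule actively requires a matching node.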
04
Services & Networking
3 lessons · ~5 hours
Services: Types & DNS Resolution

Service Types

  • ClusterIP (default) — stable internal IP accessible only within the cluster; ideal for service-to-service communication
  • NodePort — exposes the service on each node's IP at a static port (30000–32767); allows external access via <NodeIP>:<NodePort>
  • LoadBalancer — provisions an external load balancer from the cloud provider; assigns a public IP to the service; used in managed Kubernetes environments
  • ExternalName — maps the service to a DNS name (e.g., an external database hostname); returns a CNAME record; no proxying or port mapping
  • Headless Service — clusterIP: None; returns Pod IPs directly from DNS; used with StatefulSets for direct Pod addressing

DNS in Kubernetes (CoreDNS)

  • CoreDNS runs as a Deployment in kube-system; every Pod gets its DNS configured to use the CoreDNS ClusterIP
  • Service DNS pattern: <service-name>.<namespace>.svc.cluster.local
  • Within the same namespace, just <service-name> resolves correctly
  • Pod DNS: <pod-ip-dashes>.<namespace>.pod.cluster.local
  • Debug DNS: kubectl run tmp --image=busybox --restart=Never -- nslookup kubernetes.default
CoreDNS config is stored in a ConfigMap: kubectl get configmap coredns -n kube-system -o yaml. If DNS resolution fails in a Pod, check that CoreDNS Pods are running and that the Pod's /etc/resolv.conf points to the CoreDNS ClusterIP.
When a Pod cannot reach a Service, verify the Service selector matches Pod labels exactly. A single typo in the selector means zero endpoints: kubectl get endpoints <service-name> — an empty Endpoints list confirms a selector mismatch.
Ingress: HTTP Routing & TLS Termination

Ingress Concepts

  • Ingress — API object that manages external HTTP/HTTPS access to services in a cluster; provides load balancing, name-based virtual hosting, and SSL termination
  • Requires an Ingress Controller (e.g., nginx-ingress, Traefik) to be deployed in the cluster — Ingress objects do nothing without a controller
  • Host-based routing: route app.example.com to one service, api.example.com to another
  • Path-based routing: route /api to backend-service, / to frontend-service

Ingress YAML Patterns

  • Create with: kubectl create ingress my-ingress --rule="app.example.com/=webapp:80" --rule="app.example.com/api=api-svc:8080"
  • TLS termination: reference a Secret containing tls.crt and tls.key in the Ingress spec under tls:
  • Inspect: kubectl describe ingress my-ingress shows the routing rules and backend health
Always specify ingressClassName or the kubernetes.io/ingress.class annotation to tell Kubernetes which Ingress Controller should handle the resource. Without it, no controller may pick it up.
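The declarative form of the routing rules above, with TLS added (app-tls is a hypothetical Secret containing tls.crt and tls.key; the class name must match your deployed controller):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
spec:
  ingressClassName: nginx          # must match an installed Ingress Controller
  tls:
  - hosts: ["app.example.com"]
    secretName: app-tls            # Secret of type kubernetes.io/tls
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: webapp
            port:
              number: 80
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api-svc
            port:
              number: 8080
```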
NetworkPolicy & CNI Plugins

NetworkPolicy

  • By default, all Pods can communicate with all other Pods — NetworkPolicy allows you to restrict this
  • NetworkPolicy is namespace-scoped; works based on podSelector, namespaceSelector, and ipBlock
  • Ingress rules — control incoming traffic to selected Pods
  • Egress rules — control outgoing traffic from selected Pods
  • A NetworkPolicy that selects Pods but has no ingress/egress rules blocks all traffic in that direction for those Pods
  • Default deny all ingress: a NetworkPolicy with podSelector: {} (selects all Pods) and policyTypes: [Ingress] but no ingress rules

NetworkPolicy Example Pattern

  • Allow traffic to app=backend Pods only from app=frontend Pods: set podSelector: {matchLabels: {app: backend}} and ingress from: [{podSelector: {matchLabels: {app: frontend}}}]
  • Allow egress to DNS only: permit egress on UDP/TCP port 53 toward the CoreDNS Pods in the kube-system namespace
  • Test connectivity: kubectl exec -it <pod> -- nc -zv <service> <port>

CNI Plugins Overview

  • CNI (Container Network Interface) is the standard for cluster networking — every kubeadm cluster requires one
  • Calico — most popular for production; supports NetworkPolicy natively; BGP-based routing; used heavily in CKA exam environments
  • Flannel — simple VXLAN overlay; does not support NetworkPolicy natively (requires Calico for policy enforcement)
  • Weave Net — easy setup; supports NetworkPolicy; encrypted inter-node traffic optional
NetworkPolicy requires a CNI plugin that supports it. If you apply NetworkPolicy on a cluster running Flannel alone, the policies will be accepted by the API server but not enforced — traffic won't actually be blocked.
NetworkPolicy questions are common on the CKA. Read the scenario carefully to understand the direction (ingress vs. egress) and scope (namespace vs. pod selector). Drawing a quick diagram of what needs to communicate helps avoid mistakes.
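A sketch of the two policies described above — default-deny plus an explicit allow from frontend to backend (namespace dev is assumed):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: dev
spec:
  podSelector: {}              # empty selector = every Pod in the namespace
  policyTypes: ["Ingress"]     # Ingress listed but no rules -> deny all ingress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: dev
spec:
  podSelector:
    matchLabels:
      app: backend             # the Pods being protected
  policyTypes: ["Ingress"]
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend        # only these Pods may connect
```

Policies are additive: once any policy selects a Pod, only traffic explicitly allowed by some policy reaches it.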
05
Storage
2 lessons · ~4 hours
PersistentVolumes, PVCs & StorageClasses

The Storage Lifecycle

  • PersistentVolume (PV) — a piece of storage provisioned by an administrator or dynamically; has its own lifecycle independent of any Pod
  • PersistentVolumeClaim (PVC) — a request for storage by a user; specifies size and access mode; Kubernetes binds the PVC to a matching PV
  • StorageClass — defines the "class" of storage (provisioner, reclaim policy, parameters); enables dynamic provisioning
  • Binding is based on: storage size, access mode, and optionally storageClassName or label selectors

Access Modes

  • ReadWriteOnce (RWO) — can be mounted read-write by a single node; supported by most block storage (EBS, GCE PD)
  • ReadOnlyMany (ROX) — can be mounted read-only by many nodes simultaneously
  • ReadWriteMany (RWX) — can be mounted read-write by many nodes; requires shared filesystems (NFS, CephFS, Azure Files)
  • ReadWriteOncePod (RWOP) — mounted read-write by a single Pod only (Kubernetes 1.22+)

Dynamic Provisioning with StorageClass

  • PVC references a StorageClass; the provisioner automatically creates the underlying storage and a PV
  • Reclaim policies: Delete (PV deleted when PVC deleted; default for dynamic), Retain (PV kept for manual cleanup), Recycle (deprecated)
  • Create PVC: kubectl create -f pvc.yaml; check binding: kubectl get pvc — status should show Bound
  • Mount in a Pod: reference PVC name in volumes section, then volumeMounts in container spec
A PVC stuck in Pending state means no suitable PV was found. Check: (1) Does a PV exist with matching access mode and sufficient size? (2) Is the StorageClass name correct? (3) Does the provisioner have enough capacity or permissions?
On the CKA exam, always check kubectl get pv and kubectl get pvc to see the CAPACITY, ACCESS MODES, RECLAIM POLICY, and STATUS columns. An RWX claim will never bind to an RWO PV — access mode must match.
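A minimal end-to-end example — PV, matching PVC, and a Pod that mounts it (names, path, and size are illustrative; a hostPath PV is shown for simplicity):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-data
spec:
  capacity:
    storage: 1Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /mnt/data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  accessModes: ["ReadWriteOnce"]   # must match the PV for binding
  resources:
    requests:
      storage: 1Gi                 # PV capacity must be >= this
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - name: data
      mountPath: /usr/share/nginx/html
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data-claim        # references the PVC, never the PV directly
```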
Volume Types: emptyDir, hostPath, ConfigMap & Secret

Ephemeral Volumes

  • emptyDir — created empty when a Pod is assigned to a node; exists for the lifetime of the Pod; shared between containers in the same Pod; useful for scratch space, caching, or sidecar communication
  • Memory-backed emptyDir: emptyDir: {medium: Memory} — faster but counts against container memory limits

Node-Level Volumes

  • hostPath — mounts a file or directory from the host node filesystem into the Pod; powerful but dangerous; creates tight coupling to a specific node; use only for node-level system components (DaemonSets)
  • Types: Directory, File, Socket, DirectoryOrCreate, FileOrCreate

ConfigMap & Secret Volumes

  • configMap volume — mounts ConfigMap data as files in the Pod; each key becomes a filename; updates to the ConfigMap propagate to the Pod (with a short delay)
  • secret volume — mounts Secret data as files; stored in tmpfs (RAM) for security; each key becomes a filename in the mount path
  • Both can be projected together using the projected volume type

NFS Volumes

  • NFS volumes allow multiple Pods across multiple nodes to share the same filesystem (ReadWriteMany)
  • Specify in PV spec: nfs: {server: <nfs-server-ip>, path: /exports/data}
  • No special provisioner needed — the NFS server must be reachable from all nodes
Know how to inject a ConfigMap as environment variables vs. as a mounted file. As env vars: changes to the ConfigMap don't automatically reflect in running Pods (restart required). As a volume: changes propagate automatically.
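Both injection patterns in one Pod spec (app-config is a hypothetical ConfigMap assumed to exist in the namespace):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cm-demo
spec:
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "env && ls /etc/config && sleep 3600"]
    envFrom:
    - configMapRef:
        name: app-config       # each key becomes an env var (snapshot at start)
    volumeMounts:
    - name: config-vol
      mountPath: /etc/config   # each key appears as a file; updates propagate
  volumes:
  - name: config-vol
    configMap:
      name: app-config
```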
06
Security
3 lessons · ~4 hours
Pod Security & Security Contexts

Pod Security Admission (PSA)

  • Kubernetes 1.25+ replaces PodSecurityPolicy with Pod Security Admission — a built-in admission controller
  • Three policy levels: privileged (unrestricted), baseline (minimally restrictive; prevents known privilege escalations), restricted (heavily restricted; current hardening best practices)
  • Applied per namespace with labels: pod-security.kubernetes.io/enforce: restricted
  • Modes: enforce (reject Pods that violate), audit (log violations), warn (warn users)

Security Contexts

  • runAsNonRoot: true — prevent containers from running as root
  • runAsUser: 1000 — set the UID for the container process
  • readOnlyRootFilesystem: true — make the container's root filesystem read-only
  • allowPrivilegeEscalation: false — prevents processes from gaining more privileges than their parent
  • capabilities: {drop: ["ALL"], add: ["NET_BIND_SERVICE"]} — drop all Linux capabilities and add only what's needed
  • Pod-level vs. container-level: pod-level securityContext applies to all containers; container-level overrides pod-level
The restricted PSA level requires: non-root user, no privilege escalation, drop all capabilities, read-only root filesystem where possible, and specific seccomp profiles. If your Pods fail to schedule in a restricted namespace, check security context settings.
Security context fields are frequently tested. Know the field names exactly: runAsNonRoot, runAsUser, readOnlyRootFilesystem, allowPrivilegeEscalation, capabilities. Use kubectl explain pod.spec.securityContext during the exam.
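The fields above combined in a restricted-compliant Pod spec (the image is a placeholder — it must actually contain a process able to run as the given non-root UID):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened
spec:
  securityContext:                 # pod-level: applies to all containers
    runAsNonRoot: true
    runAsUser: 1000
    seccompProfile:
      type: RuntimeDefault         # required by the restricted PSA level
  containers:
  - name: app
    image: my-nonroot-app:1.0      # hypothetical non-root image
    securityContext:               # container-level overrides pod-level
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
```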
ServiceAccounts & RBAC Best Practices

ServiceAccounts

  • Every Pod runs with a ServiceAccount (defaults to default SA in the namespace)
  • The SA token is auto-mounted at /var/run/secrets/kubernetes.io/serviceaccount/token — used by applications to authenticate to the API server
  • Create a dedicated SA: kubectl create serviceaccount my-app-sa -n production
  • Disable auto-mount if not needed: automountServiceAccountToken: false in Pod or SA spec
  • Kubernetes 1.24+: SA tokens are no longer auto-created as Secrets; use TokenRequest API or create a Secret manually with type: kubernetes.io/service-account-token

RBAC Best Practices

  • Follow least privilege — grant only the verbs and resources the application actually needs
  • Prefer namespace-scoped Roles over ClusterRoles for namespaced workloads
  • Avoid binding ClusterRoles like cluster-admin to application ServiceAccounts
  • Audit existing bindings: kubectl get rolebindings,clusterrolebindings -A -o wide
  • Use kubectl auth can-i --list --as=system:serviceaccount:<ns>:<sa> to audit effective permissions
A common CKA task: "Create a ServiceAccount, bind it to a Role, and configure a Pod to use it." Remember three steps: create SA, create Role/RoleBinding referencing the SA, set serviceAccountName in Pod spec.
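The three steps as one manifest, reusing my-app-sa from the bullets above (the Role pod-reader is assumed to already exist in the namespace):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app-sa
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-app-sa-binding
  namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-reader             # assumed pre-existing Role
subjects:
- kind: ServiceAccount
  name: my-app-sa
  namespace: production
---
apiVersion: v1
kind: Pod
metadata:
  name: app
  namespace: production
spec:
  serviceAccountName: my-app-sa   # step 3: attach the SA to the Pod
  containers:
  - name: app
    image: nginx
```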
Secrets Management & Image Security

Kubernetes Secrets

  • Opaque — generic key-value data (base64-encoded, not encrypted by default); most common type
  • kubernetes.io/tls — TLS certificate and private key; fields: tls.crt and tls.key
  • kubernetes.io/dockerconfigjson — registry credentials for pulling private images
  • Create from literal: kubectl create secret generic db-creds --from-literal=password=s3cr3t
  • Create TLS secret: kubectl create secret tls my-tls --cert=cert.pem --key=key.pem
  • Inject as env var: valueFrom: {secretKeyRef: {name: db-creds, key: password}}
  • Inject as volume: Secret keys become files at the mount path

Image Security & Private Registries

  • Pull from private registry: create a kubernetes.io/dockerconfigjson Secret, then reference it with imagePullSecrets in Pod spec
  • Attach pull secret to a ServiceAccount: kubectl patch sa default -p '{"imagePullSecrets": [{"name": "registry-creds"}]}' — all Pods using that SA will auto-use the pull secret
  • Always use specific image tags (not :latest) in production — :latest defaults imagePullPolicy to Always and makes it impossible to know which version is actually running

Network Policies for Zero-Trust

  • Start with a default-deny NetworkPolicy in each namespace, then explicitly allow required traffic
  • Separate ingress and egress policies; explicitly allow DNS egress (port 53) or Pods can't resolve service names
  • Combine with RBAC to restrict who can create/modify NetworkPolicy objects
Secrets are base64-encoded, not encrypted, by default. For production security, enable etcd encryption at rest by configuring an EncryptionConfiguration object and referencing it in the kube-apiserver manifest with --encryption-provider-config.
The exam won't ask you to break base64 encoding. But it will ask you to create secrets and use them. Know both the env var injection pattern and the volume mount pattern — they're both tested.
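Both injection patterns side by side, using the db-creds Secret created above (the Pod name and mount path are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: secret-demo
spec:
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "echo $DB_PASSWORD && cat /etc/creds/password && sleep 3600"]
    env:
    - name: DB_PASSWORD             # pattern 1: env var injection
      valueFrom:
        secretKeyRef:
          name: db-creds
          key: password
    volumeMounts:
    - name: creds                   # pattern 2: volume mount
      mountPath: /etc/creds
      readOnly: true
  volumes:
  - name: creds
    secret:
      secretName: db-creds          # each key becomes a file under /etc/creds
```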
07
Troubleshooting
4 lessons · ~9 hours
Diagnosing Node Issues

Node NotReady: Step-by-Step

  • Identify problematic node: kubectl get nodes — look for NotReady or Unknown status
  • Inspect conditions: kubectl describe node <node-name> — check Conditions section for MemoryPressure, DiskPressure, NetworkUnavailable
  • Check kubelet service: systemctl status kubelet — is it running? Any error messages?
  • Read kubelet logs: journalctl -u kubelet -n 100 --no-pager — look for certificate errors, container runtime failures, or configuration issues
  • Verify container runtime: systemctl status containerd or systemctl status cri-o
  • Check disk: df -h — a full disk stops kubelet from running new Pods
  • Verify network: ensure the node can reach the API server IP on port 6443

Cordon, Drain & Uncordon

  • kubectl cordon <node> — marks node as unschedulable; existing Pods continue running
  • kubectl drain <node> --ignore-daemonsets --delete-emptydir-data — evicts all Pods and cordons the node; use for maintenance
  • kubectl uncordon <node> — marks node schedulable again after maintenance
In a CKA troubleshooting scenario, always SSH to the broken node and check kubelet first. The most common exam scenario: kubelet service is stopped — fix with systemctl start kubelet && systemctl enable kubelet.
Diagnosing Pod Failures

Pod Status States

  • Pending — Pod accepted but not yet scheduled; check Events in kubectl describe pod for: insufficient resources, taint/toleration mismatch, or an unbound PVC (image pull problems appear only after scheduling, as ImagePullBackOff)
  • CrashLoopBackOff — container repeatedly crashes; Kubernetes applies an exponential back-off delay between restarts; inspect logs with kubectl logs <pod> --previous
  • OOMKilled — container exceeded its memory limit and was killed by the kernel; increase memory limit or reduce application memory usage
  • ImagePullBackOff / ErrImagePull — Kubernetes cannot pull the container image; causes: wrong image name, missing imagePullSecrets, registry unreachable, rate limiting
  • CreateContainerConfigError — missing ConfigMap or Secret referenced by the Pod; verify the referenced resources exist in the same namespace
  • Terminating (stuck) — Pod stuck in Terminating state; usually a finalizer issue; force delete with kubectl delete pod <pod> --force --grace-period=0

Inspection Commands

  • kubectl describe pod <pod> — full details: events, conditions, container statuses, volume mounts
  • kubectl get events --sort-by=.lastTimestamp -n <namespace> — chronological event stream
  • kubectl logs <pod> -c <container> — logs from a specific container in a multi-container Pod
  • kubectl logs <pod> --previous — logs from the previous (crashed) container instance
  • kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].state}' — precise state in JSON
CrashLoopBackOff is one of the most common issues candidates see both in exams and in real clusters. Always check kubectl logs --previous first — the crashed container's last logs usually contain the root cause (missing env var, misconfigured mount, application startup error).
For ImagePullBackOff, check the exact image name (typos are common), verify the tag exists in the registry, and confirm imagePullSecrets are correctly configured if pulling from a private registry.
Network Connectivity Debugging

Testing Pod-to-Pod & Pod-to-Service Connectivity

  • Interactive shell in a running Pod: kubectl exec -it <pod> -- /bin/sh
  • One-shot connectivity test: kubectl exec <pod> -- nc -zv <target-service> <port>
  • DNS resolution test: kubectl exec <pod> -- nslookup <service-name>.<namespace>.svc.cluster.local
  • Deploy a debug pod on demand: kubectl run debug --image=nicolaka/netshoot --rm -it --restart=Never -- /bin/bash — netshoot includes curl, nslookup, netstat, tcpdump, and more
  • Check Service endpoints: kubectl get endpoints <service-name> — empty means no matching Pods
  • Check kube-proxy mode and config (iptables vs. IPVS): kubectl get configmap kube-proxy -n kube-system -o yaml — the actual rules live on each node

Common Network Failure Patterns

  • Empty Endpoints → Service selector doesn't match Pod labels; fix the selector or Pod labels
  • DNS resolution fails → CoreDNS is down or misconfigured; check kubectl get pods -n kube-system | grep coredns
  • NetworkPolicy blocking traffic → verify with temporary policy removal or by checking policy selectors
  • kube-proxy not running → no Service routing; check kube-proxy DaemonSet: kubectl get ds kube-proxy -n kube-system
The nicolaka/netshoot image is your best friend for network debugging. It contains every networking tool you need. Use kubectl run debug --image=nicolaka/netshoot --rm -it --restart=Never -- bash to get an instant debug environment.
etcd Health & Control Plane Troubleshooting

etcd Health Checks

  • Check etcd member health: ETCDCTL_API=3 etcdctl endpoint health --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
  • List etcd members: ETCDCTL_API=3 etcdctl member list --endpoints=... --cacert=... --cert=... --key=...
  • If etcd is unhealthy, the API server will be unresponsive — all kubectl commands will fail
  • Check etcd static Pod: cat /etc/kubernetes/manifests/etcd.yaml — verify data directory and cert paths are correct

API Server & Other Control Plane Issues

  • Control plane components run as static Pods in /etc/kubernetes/manifests/ — the kubelet monitors these files and restarts Pods on changes
  • If API server is down, check the static Pod manifest for misconfigurations (wrong port, bad cert path)
  • Logs for static Pods: kubectl logs kube-apiserver-<node> -n kube-system or crictl logs <container-id> if kubectl is unavailable
  • crictl commands (when kubectl is unavailable): crictl ps (list containers), crictl logs <id> (container logs), crictl inspect <id>

Audit Logs

  • Configure API server audit logging via --audit-log-path and --audit-policy-file flags in the kube-apiserver manifest
  • Audit policy levels: None, Metadata (request metadata only), Request (metadata plus request body), RequestResponse (metadata plus request and response bodies)
  • Audit logs are invaluable for security investigations — who deleted a resource, what ServiceAccount made an unauthorized API call
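A minimal audit policy file, as referenced by --audit-policy-file (the path and rule choices are illustrative):

```yaml
# e.g. /etc/kubernetes/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: RequestResponse       # full request and response bodies for Secret access
  resources:
  - group: ""                  # core API group
    resources: ["secrets"]
- level: Metadata              # everything else: metadata only
```

Rules are evaluated top to bottom; the first match wins, so put the most specific rules first.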
When kubectl doesn't work, use crictl directly on the node. It talks to the container runtime and can inspect running containers, pull logs, and check container health without going through the API server. Essential for recovering a broken control plane.
Control plane troubleshooting scenarios on the CKA often involve a deliberately broken kube-apiserver manifest. Common issues: wrong etcd endpoint, missing or misspelled certificate path, incorrect port. Always check /etc/kubernetes/manifests/ and compare against a known-good reference.

Ready to test your knowledge?

Challenge yourself with 60 Certified Kubernetes Administrator practice questions — scenario-based, hands-on style, and free.
