Interview Prep · Senior · Published June 2026

Top 10 advanced CKA interview questions for senior platform & staff SRE loops in 2026

Published June 2, 2026 · ~8 min read · No CNCF, Linux Foundation, or training-vendor revenue
Target level: Senior
K8s production experience: 5+ yrs
Senior platform (US): $180–240k
Staff SRE base (US): $220–320k
TL;DR — the 30-second version

The base CKA interview proves you can run a cluster. The senior interview proves you can run twenty. These ten questions come up in 2026 staff SRE, senior platform, and Kubernetes architect loops, one level above standard CKA interview prep. They test design judgment, scale economics, and the operational debugging that only shows up at the 500-node tier.

If you’re prepping for the base CKA exam, start with our standard CKA interview questions and CKA ROI breakdown first.

The 10 questions

1. Cluster Autoscaler vs Karpenter — when do you pick which?

Cluster Autoscaler scales pre-defined node groups (ASGs on AWS, MIGs on GCP, VMSS on Azure) up and down based on Pending pods. Predictable, well-understood, but constrained to instance shapes you decided in advance and slower to react (90 s+ to a new node).

Karpenter provisions nodes directly via the cloud fleet API, picking the cheapest instance type that fits the pending pod shape. On EKS in 2026 it usually wins on cost (20–40% bin-packing improvement) and node startup time (30–60 s). Trade-off: more moving parts, and the abstraction can hide bad pod requests. A workload requesting close to 64 vCPU will silently spin up an m7i.16xlarge (64 vCPU), or something larger once system reservations are subtracted from node capacity.

Pick CA on GKE/AKS or when compliance pins you to specific instance families. Pick Karpenter on EKS when cost matters and your platform team can own one more controller.
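To make the difference concrete, here is a minimal sketch of the two selection strategies. The instance names, sizes, and hourly prices are hypothetical placeholders, not real quotes, and the real controllers also weigh memory, zones, and system reservations that this toy model ignores.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpu: int
    hourly_usd: float  # hypothetical price, for illustration only


# Hypothetical catalogue; names and prices are assumptions.
CATALOGUE = [
    InstanceType("small-4", 4, 0.20),
    InstanceType("medium-16", 16, 0.80),
    InstanceType("large-64", 64, 3.20),
]


def fixed_group_pick(pod_vcpu: int, group: InstanceType) -> Optional[InstanceType]:
    """Cluster Autoscaler style: the node shape was chosen in advance."""
    return group if pod_vcpu <= group.vcpu else None


def cheapest_fit(pod_vcpu: int, catalogue: list[InstanceType]) -> Optional[InstanceType]:
    """Karpenter style: pick the cheapest shape that fits the pending pod."""
    fitting = [t for t in catalogue if t.vcpu >= pod_vcpu]
    return min(fitting, key=lambda t: t.hourly_usd) if fitting else None
```

In this toy model a 3 vCPU pod lands on small-4 under cheapest_fit but on medium-16 under a fixed medium-16 group, while a 64 vCPU request quietly selects the largest shape. That second case is the hidden-bad-request failure mode described above.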

2. How does API priority and fairness keep the API server alive at scale?

APF took over from the legacy max-inflight throttle when it went beta and on by default in v1.20, and graduated to GA in v1.29. Requests are classified by a FlowSchema (matched on user, group, namespace, verb, resource) into a PriorityLevelConfiguration with a share of total concurrency.

When a misbehaving controller starts spamming list pods --all-namespaces, its requests are queued or rejected inside its own flow, while kubelet heartbeats and the scheduler keep flowing. The signals: rising apiserver_flowcontrol_rejected_requests_total, the X-Kubernetes-PF-FlowSchema-UID response header (which shows which FlowSchema classified a request), and rising apiserver_request_duration_seconds on system flows. At >500 nodes, expect to tune APF so that the leader-election and system priority levels get more headroom.
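The "share of total concurrency" is plain arithmetic: per the Kubernetes flow-control documentation, each priority level's nominal limit is the server's total concurrency multiplied by its fraction of the summed shares, rounded up. The sketch below assumes 600 total seats (the default --max-requests-inflight of 400 plus --max-mutating-requests-inflight of 200) and share values that are illustrative only; check your own PriorityLevelConfigurations for the real numbers.

```python
import math


def nominal_limits(server_concurrency: int, shares: dict[str, int]) -> dict[str, int]:
    """Split the server's concurrency across priority levels by share.

    nominal_limit = ceil(server_concurrency * share / sum(shares))
    """
    total = sum(shares.values())
    return {
        level: math.ceil(server_concurrency * share / total)
        for level, share in shares.items()
    }


# Illustrative shares (assumptions); compare with
# `kubectl get prioritylevelconfigurations` on a real cluster.
shares = {
    "leader-election": 10,
    "system": 30,
    "workload-high": 40,
    "workload-low": 100,
}

limits = nominal_limits(600, shares)

# Doubling one level's share raises its limit and shrinks everyone else's.
tuned = nominal_limits(600, {**shares, "leader-election": 20})
```

With these inputs, leader-election gets 34 seats before tuning and 64 after, which is what "more headroom" means in practice: the extra seats come out of the other levels.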

3. Are namespaces enough for multi-tenant isolation?

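No, and confidently saying so is the answer interviewers want. Namespaces are a naming scope, not a security boundary. Hard multi-tenancy needs:

  - Separate node pools
  - A sandboxed RuntimeClass
  - Default-deny NetworkPolicies on an enforcing CNI
  - ResourceQuotas
  - No cluster-admin grants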

For genuinely hostile workloads — SaaS running customer code — the only honest answer is one cluster per tenant.

4. Walk me through cutting Kubernetes cost 30% without dropping reliability.

Order matters. Skip measurement and the rest is theatre.

  1. Measure: OpenCost or Kubecost to attribute spend by namespace, workload, and label. Most platform teams discover 20–40% of spend goes to one dev cluster nobody owns.
  2. Right-size requests: VPA in recommend-only mode for two weeks, then walk teams through the diff. Most pods over-request CPU 3–5x. This alone clears 20%.
  3. Bin-pack: Karpenter or Cluster Autoscaler with consolidation, topology-spread constraints to avoid wasted nodes.
  4. Spot/Preemptible: stateless workloads with PDBs, preStop hooks, and a small on-demand floor. 60–80% discount on the spot portion.
  5. Commit: Reserved Instances or Savings Plans on the steady-state baseline only — never on the variable layer.

30% on a first pass is normal. 50% is achievable on a cluster nobody’s ever tuned.
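As a worked example of how the steps compound, here is a small sketch that applies right-sizing first and then moves part of the remaining spend to spot. Every input is a hypothetical number chosen for illustration, and the model ignores commitments and bin-packing gains.

```python
def projected_monthly_cost(
    baseline_usd: float,
    rightsizing_cut: float,
    spot_share: float,
    spot_discount: float,
) -> float:
    """Estimate spend after right-sizing, then moving a share of it to spot."""
    after_rightsizing = baseline_usd * (1 - rightsizing_cut)
    spot_cost = after_rightsizing * spot_share * (1 - spot_discount)
    on_demand_cost = after_rightsizing * (1 - spot_share)
    return spot_cost + on_demand_cost


# Hypothetical inputs: $100k/month baseline, 20% saved by right-sizing,
# a quarter of what remains moved to spot at a 60% discount.
cost = projected_monthly_cost(100_000, 0.20, 0.25, 0.60)
reduction = 1 - cost / 100_000
```

Under these assumptions spend drops to roughly $68k, a reduction of about 32%, before any commitment discount on the baseline.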

5. A pod is stuck in ContainerCreating for 10 minutes. Senior debug.

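Read the Events in kubectl describe pod first. Then in order of probability at scale:

  - Image pull: private registry secret, Docker Hub rate limit
  - CNI IPAM exhaustion: AWS VPC CNI ENI pool
  - EBS attach across AZs
  - Slow admission webhook
  - Node still being provisioned

Then check journalctl -u kubelet for what kubectl hides.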
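A first pass over the Events output can be sketched as a lookup from event reason to the next thing to check. The reasons used here (FailedCreatePodSandBox, FailedAttachVolume, FailedMount, Pulling) are real Kubernetes event reasons; the advice strings and the function itself are illustrative assumptions, not an exhaustive runbook.

```python
# Illustrative mapping from event reason to the first thing to check.
# The advice text is an assumption, written for this example.
TRIAGE = {
    "FailedCreatePodSandBox": "CNI / IPAM: check CNI plugin logs and free IPs on the node",
    "FailedAttachVolume": "Volume attach: compare the volume's zone with the node's zone",
    "FailedMount": "Volume mount: check the CSI driver and any referenced Secret or ConfigMap",
    "Pulling": "Image pull still running: check registry credentials and rate limits",
}


def triage(event_reasons: list[str]) -> list[str]:
    """Return suggested checks for known reasons, in first-seen order, without repeats."""
    seen = set()
    hints = []
    for reason in event_reasons:
        if reason in TRIAGE and reason not in seen:
            seen.add(reason)
            hints.append(TRIAGE[reason])
    return hints
```

Feeding it the Reason column from the pod's events, for example triage(["Scheduled", "FailedCreatePodSandBox"]), returns only the CNI hint; unknown reasons are ignored rather than guessed at.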

6. Designing a zero-downtime upgrade for a 500-node multi-AZ production cluster.

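Plan the surge node group first: a parallel pool on the new minor version, brought up empty once the control plane has been upgraded, because kubelets must never run a newer version than the API server. Validate the control plane upgrade in a staging cluster against your CRDs and admission webhooks: v1.32, for example, stopped serving the flowcontrol.apiserver.k8s.io/v1beta3 APIs that v1.29-era manifests may still use. Then on prod: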

  1. Upgrade control plane (managed services do this for you in 5–15 min). Watch apiserver_request_total for 4xx spikes from client-go versions that don’t match.
  2. Roll workers: replace the old node group rather than upgrading in place. kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --grace-period=300. Honor PDBs.
  3. Watchdog: SLO-burn-rate alerts on every critical workload during the roll. Bail if any breaches.
  4. Soak: 48 h on the new minor before greenlighting the next environment.
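The "bail if any breaches" rule in the watchdog step can be expressed as a burn-rate check. This is a minimal sketch: the 14.4 threshold is an assumption borrowed from the common fast-burn paging convention (2% of a 30-day budget spent in one hour), and the function names are made up for this example.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    budget = 1 - slo
    if budget <= 0:
        raise ValueError("SLO must be below 1.0")
    return error_ratio / budget


def should_abort_roll(error_ratio: float, slo: float, threshold: float = 14.4) -> bool:
    """Bail out of the node roll when the burn rate reaches the threshold."""
    return burn_rate(error_ratio, slo) >= threshold
```

For a workload with a 99.9% SLO, a 2% error ratio during the roll is a burn rate of about 20 and trips the abort, while 0.5% (a burn rate of about 5) does not.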

7. CRDs and Operators — build vs. adopt?

Default to adopt. OperatorHub.io and the wider CNCF ecosystem have battle-tested operators and controllers for almost every domain (cert-manager, Strimzi, Crossplane, Argo, Flux, External Secrets). Building your own is justified only when:

If you build: scope the CRD narrowly, version it from day one with a conversion webhook strategy, ship a Status sub-resource with proper conditions, and write the controller to be idempotent and tolerant of out-of-order events. Most homegrown operators die because nobody planned the v1 → v2 migration.

8. Where does the kube-scheduler get expensive at 500+ nodes, and what do you do?

Scheduler latency is dominated by the Filter and Score phases. Symptoms: scheduler_pending_pods rising, scheduler_scheduling_attempt_duration_seconds p99 over 1 s. Mitigations:

9. Zero-trust pod-to-pod — NetworkPolicies, mesh, or both?

Both, with separate jobs. NetworkPolicies enforce L3/L4 deny-by-default at the CNI layer — cheap, fast, in-kernel with Cilium eBPF. A service mesh (Istio, Linkerd, or Cilium Service Mesh) handles mTLS identity, authZ policy on L7 (HTTP verb, path, header), and zero-trust workload identity via SPIFFE.

The mistake: doing only the mesh. Mesh policy fails open if the sidecar crashes or is bypassed (hostNetwork pods, init containers). NetworkPolicy at the CNI is your floor. Then layer mesh policy for L7 and identity.
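The NetworkPolicy floor is small enough to show in full. This sketch builds a default-deny policy as a plain dictionary and serializes it to JSON, which kubectl accepts alongside YAML; the policy name and the "payments" namespace are hypothetical.

```python
import json


def default_deny_policy(namespace: str) -> dict:
    """Build a NetworkPolicy that selects every pod and allows no traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            # An empty selector matches all pods in the namespace.
            "podSelector": {},
            # Both types are listed with no rules, so both directions are denied.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


# "payments" is a placeholder namespace for this example.
manifest = json.dumps(default_deny_policy("payments"), indent=2)
```

The resulting JSON can be piped to kubectl apply -f -. Denying egress also blocks DNS lookups, so a real rollout pairs this with explicit allow rules before it is enforced.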

10. What’s an SLO you’d set for the cluster control plane itself, and how do you enforce it?

The conversation interviewers actually want:

Quoting numbers is the easy half. The senior signal is owning the budget and the freeze.
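Owning the budget starts with knowing its size. The sketch below converts an availability SLO into minutes of tolerated downtime and turns that into a freeze decision; the 99.9% target and 30-day window are assumptions for illustration, not figures from this article.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    return window_days * 24 * 60 * (1 - slo)


def should_freeze(slo: float, downtime_minutes: float, window_days: int = 30) -> bool:
    """Freeze risky changes once the window's budget has been spent."""
    return downtime_minutes >= error_budget_minutes(slo, window_days)
```

A 99.9% API server availability target over 30 days leaves about 43 minutes of budget. After 50 minutes of downtime in the window, should_freeze(0.999, 50) says to stop shipping control plane changes until the window rolls forward.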

What these questions test

The base CKA interview asks “can you run a cluster?” The senior loop asks “can you decide what the cluster should look like, and own it when it breaks at 3 a.m. with 500 nodes?” Every answer above pivots on a trade-off (CA vs Karpenter, mesh vs NetworkPolicy, build vs adopt) and on instrumentation (which metric, what threshold, what action). Memorize the metric names: apiserver_flowcontrol_rejected_requests_total, scheduler_pending_pods, apiserver_admission_webhook_admission_duration_seconds. Senior interviewers screen on whether you reach for them unprompted.

Frequently asked questions

Cluster Autoscaler vs Karpenter — when do you pick which in 2026?

Cluster Autoscaler scales pre-defined node groups; predictable and the only good option on GKE/AKS. Karpenter provisions nodes directly via the EC2 fleet API and usually wins on cost and startup latency on EKS. Pick CA when compliance pins instance families; pick Karpenter on EKS when cost matters and your team can own another controller.

How does API priority and fairness keep the API server stable at scale?

APF classifies requests via FlowSchemas into PriorityLevelConfigurations with a share of concurrency. A misbehaving controller is queued in its own flow without starving kubelet heartbeats or the scheduler. Watch apiserver_flowcontrol_rejected_requests_total and the X-Kubernetes-PF-FlowSchema-UID response header.

Are namespaces enough for multi-tenant isolation?

No. Namespaces are a naming scope, not a security boundary. Hard multi-tenancy needs separate node pools, sandboxed RuntimeClass, default-deny NetworkPolicies on an enforcing CNI, ResourceQuotas, no cluster-admin grants, and for hostile workloads — one cluster per tenant.

How do you cut Kubernetes cost 30% without losing reliability?

Measure first with OpenCost or Kubecost, then right-size requests (most pods over-request CPU 3–5x), bin-pack with Karpenter, run stateless workloads on spot with PDBs and preStop hooks, and commit Reserved Instances or Savings Plans only on the baseline. A 30% reduction is normal on a first pass; 50% is achievable on a cluster nobody has ever tuned.

How do you debug a pod stuck in ContainerCreating for 10 minutes?

Read kubectl describe pod Events first. Common causes at scale: image pull (private registry secret, Docker Hub rate limit), CNI IPAM exhaustion (AWS VPC CNI ENI pool), EBS attach across AZs, slow admission webhook, or a node still being provisioned. Check journalctl -u kubelet for what kubectl hides.

How we wrote this

No CNCF, Linux Foundation, or training-vendor revenue. Questions were sourced from senior platform and staff SRE interview reports on Reddit, Hacker News, the CNCF Slack #sig-scalability and #sig-autoscaling channels, and LinkedIn interview threads from 2025–2026, cross-referenced against the official Kubernetes architecture docs and the BLS Occupational Outlook for compensation context.

Last reviewed: June 2, 2026.