☸️ Kubernetes

Kubernetes Debug Coach prompts

Battle-tested prompts to debug a misbehaving Kubernetes cluster — CrashLoopBackOff, pending pods, broken services, OOMKilled, mystery YAML. Paste your kubectl output, get a root-cause diagnosis and a fix you can run.

Tested 2026-06 · Claude 4.7 Opus · GPT-5 · Gemini 2.5 Pro · #kubernetes #debug #ops #cloud

Honest note — These prompts diagnose what kubectl is already telling you — they don't replace cluster access. Always re-run the suggested kubectl commands yourself; never apply a YAML the LLM writes without diffing it against your live manifests first. Behaviour also drifts between Kubernetes minor versions (1.28 vs 1.31 RBAC, sidecar containers, gateway-api, etc.) — name your version in the prompt.

Prompts in this set

1. Diagnose a CrashLoopBackOff in 60 seconds
2. Why is my Pod stuck in Pending?
3. Why can't traffic reach my Service / Pod?
4. Diagnose OOMKilled / right-size resources
5. Decode a wall of kubectl output
6. Write the YAML for the CKA / CKAD task (no shortcuts)

1. Diagnose a CrashLoopBackOff in 60 seconds

Your pod is restarting in a loop. You have the `kubectl describe pod` output and the logs from the failing container, but the root cause isn't obvious.

Claude 4.7 Opus (2026-06) · GPT-5 (2026-05)

You are a senior Kubernetes SRE. Diagnose this CrashLoopBackOff.

Kubernetes version: <K8S_VERSION_E_G_1_30>
Namespace: <NAMESPACE>

--- kubectl describe pod <POD> ---
<PASTE_DESCRIBE_OUTPUT>

--- kubectl logs <POD> -c <CONTAINER> --previous ---
<PASTE_PREVIOUS_LOGS>

Do this, in order:
1. **Root cause** in one sentence — the exact reason the container exited non-zero. Distinguish image/entrypoint bugs from runtime bugs from config bugs.
2. **Three pieces of evidence** from the inputs that prove your root cause (quote the line + explain why it's diagnostic).
3. **The fix** — the smallest possible change (one of: image tag, command/args, env var, volumeMount, resources, probe). Show the diff against the current Pod spec, not a from-scratch YAML.
4. **The 2-command verification** to run after applying the fix.
5. **If you are <70% confident**, list the single kubectl command I should run next to disambiguate — don't guess.

No "it could be many things" hedging. Pick the most likely cause from the evidence and commit.

Tip: Always include `--previous` logs. They are the logs from the run that just crashed, not the one currently starting, and most CrashLoopBackOff diagnoses come from them.
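
Before pasting anything, the exit code in the `Last State: Terminated` block of the `describe` output often narrows the category on its own. The sketch below is a rough triage helper under stated assumptions: `classify_exit_code` and its category wording are hypothetical (Kubernetes reports only the number), and the signal table assumes Linux signal numbering.

```python
# Common Linux signal numbers; exit codes above 128 mean "128 + signal".
SIGNALS = {6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def classify_exit_code(code: int) -> str:
    """Map a container exit code to a rough failure category (heuristic)."""
    if code == 0:
        return "clean exit: the command finished, check command/args and restartPolicy"
    if code in (126, 127):
        # Shell convention: 126 = found but not executable, 127 = not found.
        return "image/entrypoint: command missing or not executable"
    if code > 128:
        name = SIGNALS.get(code - 128, f"signal {code - 128}")
        if name == "SIGKILL":
            return "killed by SIGKILL: look for Reason: OOMKilled in Last State"
        return f"runtime: terminated by {name}"
    return "application error: read the --previous logs for the stack trace"


if __name__ == "__main__":
    for code in (1, 127, 137, 139):
        print(code, "->", classify_exit_code(code))
```

Treat the result as a hint for which evidence to quote in the prompt, not as a diagnosis; a SIGKILL, for instance, can also come from a failed liveness probe.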

2. Why is my Pod stuck in Pending?

Pod has been Pending for minutes. Scheduler isn't placing it. You need to know if it's resources, taints, nodeSelector, PVC, or the cluster autoscaler.

Claude 4.7 Opus (2026-06) · GPT-5 (2026-05) · Gemini 2.5 Pro (2026-05)

Diagnose why this pod is stuck Pending.

Kubernetes version: <K8S_VERSION>
Managed or self-hosted: <EKS|AKS|GKE|KUBEADM|K3S|OTHER>

--- kubectl describe pod <POD> ---
<PASTE_DESCRIBE>

--- kubectl get nodes -o wide ---
<PASTE_NODES>

--- (optional) kubectl describe node <SUSPECT_NODE> ---
<PASTE_NODE_DESCRIBE_OR_LEAVE_EMPTY>

Walk the standard Pending checklist and tell me which step fails:
1. Are there schedulable nodes at all? (Ready, no SchedulingDisabled, no NotReady taints I don't tolerate)
2. Does the pod's requests.cpu / requests.memory fit any node's Allocatable? Show the math.
3. nodeSelector / nodeAffinity / required topology constraints — does any node satisfy them?
4. Taints I'm not tolerating? List the taint and the tolerations I'm missing.
5. PVC bound? If WaitForFirstConsumer, is the chosen zone schedulable?
6. Pod Topology Spread / anti-affinity blocking placement?
7. Cluster Autoscaler — would it scale up? If no, why (max-size hit, ASG mismatch, scale-down hold).

For the step that fails, give me the fix as a `kubectl patch` or a 5-line YAML diff. End with the one command that proves the fix worked.

Tip: If the FailedScheduling event already names the reason ("0/3 nodes are available: 3 Insufficient cpu"), you don't need this prompt. Lower the pod's requests or add node capacity. Use this prompt when the event message is vague or contradictory.
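
The "show the math" step in the checklist is simple enough to check by hand before asking a model. The following is a minimal sketch, assuming you have copied the node's Allocatable values and the summed requests from the "Allocated resources" section of `kubectl describe node`. The function names are hypothetical, and the quantity parsing covers only the common suffixes, not every form the Kubernetes quantity format allows.

```python
# Binary and decimal suffixes commonly seen in Kubernetes memory quantities.
_MEM_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def cpu_to_millicores(q: str) -> int:
    """Convert '500m' to 500 and '2' or '0.5' to 2000 or 500."""
    if q.endswith("m"):
        return int(q[:-1])
    return round(float(q) * 1000)


def mem_to_bytes(q: str) -> int:
    """Convert '512Mi' or '1G' to bytes; a bare number is already bytes."""
    for suffix, factor in _MEM_UNITS.items():
        if q.endswith(suffix):
            return int(float(q[: -len(suffix)]) * factor)
    return int(q)


def fits(pod_req: dict, allocatable: dict, already_requested: dict) -> bool:
    """True when the pod's requests fit into what the node has left."""
    free_cpu = cpu_to_millicores(allocatable["cpu"]) - cpu_to_millicores(already_requested["cpu"])
    free_mem = mem_to_bytes(allocatable["memory"]) - mem_to_bytes(already_requested["memory"])
    return (cpu_to_millicores(pod_req["cpu"]) <= free_cpu
            and mem_to_bytes(pod_req["memory"]) <= free_mem)


if __name__ == "__main__":
    node = {"cpu": "4", "memory": "16Gi"}          # Allocatable
    used = {"cpu": "3500m", "memory": "10Gi"}      # sum of existing requests
    print(fits({"cpu": "750m", "memory": "2Gi"}, node, used))
```

The scheduler compares requests, not live usage, so a node that looks idle in `kubectl top node` can still fail this check.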

3. Why can't traffic reach my Service / Pod?

Pod is Running and Ready but requests time out, or hit the wrong pod. Diagnose Service selector, EndpointSlice, NetworkPolicy, kube-proxy, or DNS.

Claude 4.7 Opus (2026-06) · GPT-5 (2026-05)

Traffic isn't reaching my Service. Diagnose where it breaks.

Kubernetes version: <K8S_VERSION>
CNI: <CALICO|CILIUM|FLANNEL|AWS_VPC_CNI|OTHER>
Is a NetworkPolicy in use in this namespace? <YES|NO|UNKNOWN>

--- kubectl get svc <SVC> -o yaml ---
<PASTE_SVC_YAML>

--- kubectl get endpointslice -l kubernetes.io/service-name=<SVC> -o yaml ---
<PASTE_EPS_YAML>

--- kubectl get pods -l <SAME_LABELS_AS_SVC_SELECTOR> -o wide ---
<PASTE_POD_LIST>

--- symptom (from client) ---
<PASTE_CLIENT_ERROR_OR_CURL_OUTPUT>

Walk this path top-down and identify the FIRST broken link:
1. Does Service.spec.selector match labels on at least one Ready pod?
2. EndpointSlice — are there `addresses` populated? Are they Ready? Does the port match Service.targetPort?
3. If headless (`clusterIP: None`): is the client resolving the per-pod A/AAAA records (or SRV records for named ports) correctly?
4. NetworkPolicy — is there an egress rule on the client and an ingress rule on the server allowing this Service?
5. kube-proxy mode (iptables / ipvs) — any obvious stale rules? (skip if you can't tell from the inputs)
6. If the client is cross-namespace — `<svc>.<ns>.svc.cluster.local` actually resolving?

Give me: (a) the one broken link, (b) the exact kubectl command to fix it, (c) the curl/nslookup I run to prove it's fixed.

Tip: Run `kubectl run debug --rm -it --image=nicolaka/netshoot -- bash` to get a shell with curl, dig, nslookup, and tcpdump in-cluster. Half the diagnosis is doing it from a pod, not from your laptop.
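
Step 1 of the path above (does the selector match?) is an exact key/value comparison, which is easy to misread by eye in long label lists. Here is a small sketch of that rule, assuming you have copied `spec.selector` and the pod's `metadata.labels` into plain dictionaries; the helper names are hypothetical.

```python
def selector_matches(selector: dict, pod_labels: dict) -> bool:
    """True when every selector key/value pair appears in the pod's labels.

    An empty selector returns False here, because a Service without a
    selector does not get endpoints created for it automatically.
    """
    return bool(selector) and all(pod_labels.get(k) == v for k, v in selector.items())


def mismatches(selector: dict, pod_labels: dict) -> dict:
    """Return {key: (wanted, actual)} for every selector key the pod fails."""
    return {
        k: (v, pod_labels.get(k))
        for k, v in selector.items()
        if pod_labels.get(k) != v
    }


if __name__ == "__main__":
    svc_selector = {"app": "web", "tier": "api"}
    pod = {"app": "web", "tier": "frontend", "version": "v2"}
    print(selector_matches(svc_selector, pod))
    print(mismatches(svc_selector, pod))
```

Extra labels on the pod never break a match; only a missing or different value for a selector key does.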

4. Diagnose OOMKilled / right-size resources

Container is being killed for memory. Decide: raise limits, fix a leak, or move to a different workload type.

Claude 4.7 Opus (2026-06) · GPT-5 (2026-05)

Container is being OOMKilled. Help me decide between raising limits, fixing a leak, or changing the deployment shape.

Language / runtime: <NODE|JAVA|PYTHON|GO|RUST|OTHER>
Workload type: <STATELESS_API|BATCH_JOB|STREAM_CONSUMER|CRON|DAEMONSET>

--- container resources block ---
<PASTE_RESOURCES>

--- kubectl top pod <POD> (over a few minutes if possible) ---
<PASTE_TOP>

--- last lines before kill (logs) ---
<PASTE_LOGS>

--- (optional) heap / GC config or runtime flags ---
<PASTE_RUNTIME_FLAGS_OR_LEAVE_EMPTY>

Do this:
1. Decide the category: **memory leak** / **undersized limit** / **JVM-or-runtime heap mis-tuned** / **batch with spiky peak** / **wrong workload shape**. One sentence justification from the inputs.
2. If leak — name the most likely culprit pattern for this runtime (unbounded cache, never-closed connection, growing in-memory queue, leaked goroutines) and how to confirm with a profiler.
3. If undersized — recommend a new requests/limits pair using the rule **requests = p95 usage, limits = ~1.5× requests** (show the math from `kubectl top`).
4. If JVM/Node/Python heap: give the specific flag or environment variable to set (e.g. `-XX:MaxRAMPercentage=75`, `--max-old-space-size`, `MALLOC_ARENA_MAX`) and why the default fights the cgroup limit.
5. If it's actually a spiky batch — recommend Job/VPA/burstable QoS or a queue, not bigger pods.

Avoid the lazy answer of "just double the limits" — name the category first.

Tip: JVM apps OOM-killed under 2 GiB are almost always a `MaxRAMPercentage` issue, not a memory leak. Node.js apps under 512 MiB are almost always `--max-old-space-size` defaults fighting the cgroup. Diagnose the runtime before raising limits.
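
The sizing rule in step 3 of the prompt (requests = p95 usage, limits = about 1.5× requests) can be worked out from a handful of `kubectl top pod` readings. This is a minimal sketch, assuming the samples are memory values in MiB that you noted down by hand. It uses the nearest-rank percentile, which is one of several common definitions, and `recommend` is a hypothetical helper name.

```python
import math


def p95(samples: list) -> int:
    """Nearest-rank 95th percentile: the value at rank ceil(0.95 * n)."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer form of ceil(0.95 * n)
    return ordered[rank - 1]


def recommend(samples_mib: list, limit_factor: float = 1.5) -> dict:
    """Apply requests = p95 and limits = limit_factor * requests."""
    requests = p95(samples_mib)
    limits = math.ceil(requests * limit_factor)
    return {"requests": f"{requests}Mi", "limits": f"{limits}Mi"}


if __name__ == "__main__":
    # Twenty readings in MiB, including one spike at 520.
    readings = [310, 305, 320, 315, 330, 325, 340, 335, 350, 345,
                360, 355, 370, 365, 380, 375, 390, 385, 400, 520]
    print(recommend(readings))  # {'requests': '400Mi', 'limits': '600Mi'}
```

With only a few minutes of samples the p95 says little about daily peaks, so treat the result as a starting point and compare it with longer-term metrics if you have them.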

5. Decode a wall of kubectl output

You ran `kubectl describe`, `kubectl get events --sort-by=.lastTimestamp`, or `kubectl logs` and got a page of dense output. You want the 3 lines that matter.

Claude 4.7 Opus (2026-06) · GPT-5 (2026-05) · Gemini 2.5 Pro (2026-05)

Read this kubectl output and surface only the diagnostically important lines.

Command that produced it: <KUBECTL_COMMAND>
What I'm trying to figure out: <ONE_SENTENCE_GOAL>

--- output ---
<PASTE_OUTPUT>

Do this:
1. Quote the **3 most diagnostic lines** verbatim (no paraphrasing). For each, 1 sentence on why it matters for my goal.
2. List anything you'd normally call out but which is **noise here** (default events like Scheduled / Pulled / Started, normal probe successes, deprecated-API warnings unrelated to my goal) — one bullet, no more than 5 items.
3. Name the next single command I should run to confirm the diagnosis. Exactly one command, with arguments.
4. If the output answers my question fully, say so explicitly and skip step 3.

No bullet-listing every event. No "in summary, your pod is having issues". Be ruthless about signal-to-noise.

Tip: This is the prompt to use when you're tired and the output is 200 lines. It also works on Helm output, `kubectl get events -A`, and audit logs.
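
If the paste is too long for the model's context, you can strip the routine events yourself first. The sketch below assumes the default column layout of `kubectl get events` (LAST SEEN, TYPE, REASON, OBJECT, MESSAGE) in a single namespace. `NOISE_REASONS` and `signal_lines` are hypothetical names, and the set of reasons treated as noise is a judgment call you may want to adjust.

```python
# Routine lifecycle events that rarely explain a failure on their own.
NOISE_REASONS = {"Scheduled", "Pulling", "Pulled", "Created", "Started"}


def signal_lines(events_text: str) -> list:
    """Keep event lines that are Warnings or non-routine Normal events."""
    kept = []
    for line in events_text.splitlines():
        fields = line.split(None, 4)
        # Skip the header row and anything that does not have five columns.
        if len(fields) < 5 or line.startswith("LAST SEEN"):
            continue
        _last_seen, event_type, reason, _obj, _message = fields
        if event_type == "Normal" and reason in NOISE_REASONS:
            continue
        kept.append(line)
    return kept


if __name__ == "__main__":
    sample = """LAST SEEN   TYPE      REASON      OBJECT      MESSAGE
5m          Normal    Scheduled   pod/web-1   Successfully assigned default/web-1 to node-a
4m          Normal    Pulled      pod/web-1   Container image already present on machine
2m          Warning   BackOff     pod/web-1   Back-off restarting failed container
"""
    for line in signal_lines(sample):
        print(line)
```

Filtering is lossy: if the diagnosis stalls, paste the unfiltered output, since the timing of a "Started" event can itself be evidence.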

6. Write the YAML for the CKA / CKAD task (no shortcuts)

Studying for CKA / CKAD. You need to write a Deployment / Service / Ingress / NetworkPolicy / RBAC role from scratch under exam time pressure.

Claude 4.7 Opus (2026-06) · GPT-5 (2026-05)

Coach me on writing this Kubernetes manifest from scratch the way I'd have to on the CKA / CKAD exam.

Task in plain English: <DESCRIBE_TASK_EXACTLY_AS_THE_EXAM_WOULD_E_G_DEPLOY_NGINX_2_REPLICAS_EXPOSE_PORT_80_IN_NAMESPACE_WEB>
Exam: <CKA|CKAD|CKS>
Kubernetes version: <K8S_VERSION>

Do this in order:
1. **The fastest `kubectl create` / `kubectl run` one-liner with `--dry-run=client -o yaml`** that gets me 80% of the manifest. Real exam-day muscle memory — no helm, no kustomize, no `apply -f https://...`.
2. The handful of fields I have to add by hand on top of step 1 (the things `create` doesn't generate — resources, probes, env from secret, securityContext, NetworkPolicy podSelector, RBAC verbs).
3. The **final manifest**, ready to `kubectl apply -f -`. Properly indented, no comments mid-YAML, kind + apiVersion correct for my K8s version.
4. The **single verification command** I run after applying to prove the task is done (e.g. `kubectl get deploy -n web` showing the right replicas).
5. The **1 exam trap** for this kind of question (e.g. forgetting `-n <namespace>` on the verify step, using `Deployment` apiVersion `extensions/v1beta1`, missing `selector.matchLabels` matching template labels).

No lectures on what a Deployment is. Treat me like I've passed the practice quizzes and I'm drilling speed.

Tip: Add `alias k=kubectl`, `export do='--dry-run=client -o yaml'`, and `export now='--grace-period=0 --force'` to your shell on exam day. Half the time saved on the CKA comes from muscle memory, not theory.

How to use these prompts

Each prompt has placeholders in <ANGLE_BRACKETS>. Fill them in before pasting, then copy the prompt into Claude, ChatGPT, Gemini, or any other chat-based LLM.
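
If you reuse a prompt often, filling the placeholders with a script avoids pasting one that still contains an empty slot. The following is a minimal sketch, not a feature of this page: `fill` is a hypothetical helper, and the pattern assumes placeholders are upper-case names (optionally with `|` between choices) wrapped in angle brackets, as in the prompts above.

```python
import re

# Matches <NAMESPACE> or <YES|NO|UNKNOWN>, but not "<70%" or "<svc>".
PLACEHOLDER = re.compile(r"<([A-Z][A-Z0-9_|]*)>")


def fill(prompt: str, values: dict) -> str:
    """Replace each <PLACEHOLDER> with its value; fail loudly if one is missing."""
    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"unfilled placeholder: <{key}>")
        return values[key]

    return PLACEHOLDER.sub(substitute, prompt)


if __name__ == "__main__":
    template = "Kubernetes version: <K8S_VERSION>\nNamespace: <NAMESPACE>"
    print(fill(template, {"K8S_VERSION": "1.30", "NAMESPACE": "web"}))
```

Raising on a missing key is a deliberate choice here: a prompt sent with a literal `<PASTE_DESCRIBE_OUTPUT>` tends to produce a confident answer about nothing.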

Why "model tested" dates matter

LLMs improve and regress with every release. A prompt that worked on Claude 3.5 may need rewriting for Claude 4. The dates show when each prompt was last verified — anything older than 6 months should be re-tested before depending on it.

Found a better prompt?

Hit contact and share — we keep prompts that beat ours.