{"slug": "crashloopbackoff-in-kubernetes-the-real-causes-and-how-we-fix-it", "title": "CrashLoopBackOff in Kubernetes: The Real Causes and How We Fix It", "summary": "CrashLoopBackOff in Kubernetes is a pod status indicating a container repeatedly crashes and restarts with exponential backoff, caused by application errors, OOM kills, failing probes, bad images, missing dependencies, or init container failures. The diagnostic loop uses kubectl commands to inspect logs and events, while Cast AI's Workload Autoscaler automates OOMKill fixes by adjusting memory limits.", "body_md": "## Key takeaways\n\n- **CrashLoopBackOff is a pod status** (`Waiting.Reason`), not an error code — the container starts, crashes, and Kubernetes keeps retrying with exponential backoff.\n- **Backoff sequence:** 10s → 20s → 40s → 80s → 160s → 300s (capped at 5 minutes), resets after 10 minutes of successful operation.\n- **Six root causes:** application/config errors, OOM kills, failing liveness probes, bad image or entrypoint, missing dependencies, and init container failures.\n- **The CrashLoopBackOff Diagnostic Loop:** `get pods` → `describe pod` → `logs --previous` → `get events` → fix.\n- **Exit code 137** = killed by SIGKILL (128 + signal 9), most often an OOM kill. Exit codes 1 or 2 = application or config error. Init container failures show status `Init:CrashLoopBackOff`.\n- **Cast AI Workload Autoscaler** detects OOMKill events and applies corrected memory limits immediately — up to 240 changes per hour, no manual intervention.\n\n## What is CrashLoopBackOff?\n\n**CrashLoopBackOff is a Kubernetes pod status (`Waiting.Reason: CrashLoopBackOff`) indicating that a container is repeatedly starting, crashing, and being restarted by the kubelet.** It is not an error code. Kubernetes applies exponential backoff between each restart attempt to avoid overwhelming the node. The sequence is:\n\n**10s → 20s → 40s → 80s → 160s → 300s**, capped at 5 minutes. 
After 10 consecutive minutes of successful operation, the counter resets to zero.\n\nThis differs from **ImagePullBackOff**, which happens before the container starts — Kubernetes cannot pull the image (wrong tag, bad credentials, network issue). With CrashLoopBackOff, the image pulled fine; the container exits with a non-zero code. Deleting and recreating the pod only resets the backoff timer. The same crash will recur until you fix the underlying cause.\n\n## The common causes\n\n| Cause | Signal | Exit code hint | kubectl clue | Fix section |\n|---|---|---|---|---|\n| Misconfig (env vars, entrypoint, permissions) | Container exits immediately on start | 1 or 2 | `logs --previous` shows config error | [Config errors](#application-or-config-errors) |\n| OOM kill | Container is killed for exceeding its memory limit | 137 | `describe pod` shows `Reason: OOMKilled` | [OOM restarts](#out-of-memory-restarts) |\n| Failing liveness probe | Kubelet restarts the container after failed probes | 143, or 137 if SIGTERM is ignored | `describe pod` shows `Liveness probe failed` | [Probe failures](#liveness-and-readiness-probe-failures) |\n| Bad image or entrypoint | Command cannot be executed | 127 (command not found) | `logs --previous` empty or exec error | [Config errors](#application-or-config-errors) |\n| Missing dependencies | Referenced ConfigMap, Secret, or volume is unavailable | | `describe pod` shows volume mount error or missing resource | [Config errors](#application-or-config-errors) |\n| Init container failure | Main container never starts | Non-zero | `Init:CrashLoopBackOff`; `describe pod` shows init container exit | [Init container failures](#init-container-failures) |\n\n### Init container failures\n\nInit container failures look subtly different in `kubectl get pods`. Instead of `CrashLoopBackOff`, the STATUS column shows `Init:CrashLoopBackOff` or `Init:Error`. The main application container never starts: Kubernetes runs init containers sequentially, and if any one exits with a non-zero code, the kubelet keeps restarting that init container (subject to the pod’s `restartPolicy`) instead of moving on to the next one.\n\nThe diagnostic commands are slightly different too, because `kubectl logs <pod>` defaults to the main container (which hasn’t run yet). 
You need to name the init container explicitly:\n\n```\n# List init container names for the pod\nkubectl get pod <pod-name> -o jsonpath='{.spec.initContainers[*].name}'\n\n# Read logs from a specific init container's last run\nkubectl logs <pod-name> -c <init-container-name> --previous\n```\n\nCommon init container failures break down into two categories. The first is waiting for an external dependency — a database readiness check, a service endpoint that isn’t up yet, or a network policy that blocks the init container’s probe. The second is a missing Secret or ConfigMap: the init container tries to mount or read a resource that doesn’t exist in the namespace, exits non-zero, and triggers the loop. Check both with `kubectl describe pod` — the Events section will show `FailedMount` or connection timeout errors before you even pull logs.\n\n## How to diagnose it step by step\n\nFollow **The CrashLoopBackOff Diagnostic Loop**: get → describe → logs → events → fix. Each step narrows the cause. Don’t skip steps — what looks like an OOM kill can be a liveness probe timeout, and the fixes are different.\n\n**Step 1: Confirm status and restart count**\n\n```\n# Check STATUS and RESTARTS columns\nkubectl get pods -n <namespace>\n# NAME                         READY   STATUS                  RESTARTS   AGE\n# payments-api-7d9f4b-xkj2p    0/1     CrashLoopBackOff        14         47m\n# db-migrations-9c3a1b-zzz9k   0/1     Init:CrashLoopBackOff   3          12m\n```\n\n**Step 2: Describe the pod — exit code and events**\n\n```\n# Shows exit codes, OOMKilled flag, probe failures, volume mount errors\nkubectl describe pod <pod-name> -n <namespace>\n# Key fields:\n#   Last State: Terminated\n#     Reason:   OOMKilled       <-- OOM kill\n#     Exit Code: 137\n#   Liveness probe failed: ...  <-- probe issue\n#   Warning  FailedMount ...    
<-- missing ConfigMap or Secret\n```\n\n**Init container callout:** If STATUS is `Init:CrashLoopBackOff`, the describe output’s `Init Containers` section shows each init container’s last exit code and reason. Look for `FailedMount` events or connection errors there — the main container’s `Last State` block will be empty since it never ran.\n\n**Step 3: Read the previous container’s logs**\n\n```\n# --previous retrieves logs from the last terminated instance\n# Without this flag you get the current (empty) container\nkubectl logs <pod-name> -n <namespace> --previous\n\n# Multi-container pods: specify the container\nkubectl logs <pod-name> -n <namespace> -c <container-name> --previous\n\n# Init containers require the -c flag (main container hasn't run)\nkubectl logs <pod-name> -n <namespace> -c <init-container-name> --previous\n```\n\n**Sidecars in multi-container pods:** Check each container individually. Sidecars such as Istio’s `istio-proxy` (Envoy) or a Datadog agent can consume enough memory to exhaust what the node has available, so the main container can be OOM-killed even if its own usage looks fine. Use `kubectl logs <pod> -c <sidecar-name> --previous` for each one.\n\n**Step 4: Check cluster events**\n\n```\n# Events sorted by time — look for OOMKill, probe failures, scheduling issues\nkubectl get events -n <namespace> --sort-by='.lastTimestamp'\n```\n\n**Step 5: Check live resource usage**\n\n```\n# Requires metrics-server\nkubectl top pod <pod-name> -n <namespace> --containers\n```\n\n**Step 6: Extract the exit code programmatically**\n\n```\n# Read exit code from lastState — 137=OOMKilled, 1/2=app error, 127=command not found\nkubectl get pod <pod-name> -n <namespace> \\\n  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'\n```\n\n**Prometheus alert for proactive OOM detection:** Track `container_memory_working_set_bytes` against memory limits. 
Alert at 80% — that’s your warning window before the kernel OOM killer fires.\n\n```\n# Containers using >80% of their memory limit\n# Aggregate both sides to shared labels so cAdvisor and kube-state-metrics series match\nsum by (namespace, pod, container) (container_memory_working_set_bytes{container!=\"\"})\n  / sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource=\"memory\", container!=\"\"})\n  > 0.80\n```\n\n## How to fix each cause\n\n### Application or config errors\n\nWhen `logs --previous` shows `Error: required environment variable DATABASE_URL is not set` or `exec: \"myapp\": executable file not found in $PATH`, you have a config or packaging problem.\n\n- **Missing env vars:** Confirm the referenced ConfigMap or Secret exists in the same namespace. A reference to a non-existent object keeps the container from starting.\n- **Wrong entrypoint:** Verify `spec.containers[].command` and `args`. If you’re overriding the image’s CMD, confirm the binary path inside the image.\n- **Permission errors:** Mounted volumes may not be writable by the container’s non-root UID. Check `fsGroup` and `runAsUser` in `securityContext`.\n\nFor live debugging, attach an ephemeral container without triggering another restart:\n\n```\n# Attach a debug container to inspect filesystem, env vars, and network\nkubectl debug -it <pod-name> -n <namespace> \\\n  --image=busybox \\\n  --target=<container-name>\n```\n\n### Out-of-memory restarts\n\nExit code 137 with `reason: OOMKilled` in `lastState` means the Linux kernel terminated the container for exceeding its memory limit. See our deep-dive: [OOMKilled in Kubernetes](https://cast.ai/blog/oomkilled-exit-code-137/).\n\nSetting limits correctly is harder than it looks. In our 2026 Kubernetes Optimization Report, clusters averaged around 20% memory utilization — heavily overprovisioned overall, yet individual pods were still undersized. In one representative cluster analyzed for the same report, we recorded 40–50 OOM kills per hour — spiking above 80 during peak load. 
After deploying automated rightsizing, the rate dropped to near zero.\n\n**VPA** helps automate this. Valid `updateMode` values: `Off` (recommendations only), `Initial` (set on pod creation), `Recreate` (apply via pod eviction), and `Auto` (currently equivalent to `Recreate`). Recent VPA releases add `InPlaceOrRecreate`, which tries an in-place resize first (this relies on the `InPlacePodVerticalScaling` feature gate, beta on K8s 1.33+) and falls back to pod eviction otherwise.\n\n**JVM workloads:** Set `-Xmx` to ~75% of the container memory limit. A 512Mi limit should have `-Xmx384m`. This leaves headroom for non-heap memory — metaspace, thread stacks, native libraries. Skipping this is one of the most common causes of Java OOM kills in Kubernetes.\n\n**Cast AI Workload Autoscaler** goes beyond VPA. Its OOM event handler detects a kill as it happens, generates a new recommendation with increased memory overhead, and applies it immediately — no manual intervention. It supports up to 240 changes per hour and uses in-place pod resizing on K8s 1.33+ (beta, `InPlacePodVerticalScaling` feature gate; 1.27–1.32 uses pod eviction instead).\n\n```\n# Check kernel OOM events on the node — do NOT use /var/log/syslog\n# -i also matches container-limit kills, logged as \"Memory cgroup out of memory\"\njournalctl -k | grep -i \"out of memory\"\n```\n\n**LimitRange objects can silently impose memory limits on pods that don’t set them explicitly.** If an operator deploys a pod without a `resources.limits.memory` field, a LimitRange default (say, 128Mi) applies automatically — the pod appears to be running without limits but gets OOM-killed at 128Mi. Check before you assume the pod has no limit:\n\n```\n# Check for LimitRange objects in the namespace\nkubectl get limitrange -n <namespace>\n\n# See the default limits applied\nkubectl describe limitrange -n <namespace>\n```\n\n### Liveness and readiness probe failures\n\nThese two probes do different things. **Liveness probe failure → kubelet restarts the container** (causes CrashLoopBackOff). **Readiness probe failure → pod removed from Service endpoints** (does NOT restart the container). 
Conflating them produces unnecessary restarts.\n\n**A note on exit codes for liveness-triggered restarts:** The kubelet sends SIGTERM first — the container’s exit code will typically be 143 (128 + signal 15) or whatever exit code the application returns in its signal handler. SIGKILL (exit code 137) only follows if the container is still running after `terminationGracePeriodSeconds` expires. If you’re seeing 137 on a liveness probe failure, your application is ignoring SIGTERM and being forcibly killed — that’s a separate problem worth fixing.\n\nFor slow-starting containers, use `startupProbe` (Kubernetes 1.18+). It disables liveness and readiness until it passes, giving the application its full startup window without forcing absurdly high `initialDelaySeconds` on the liveness probe.\n\n```\n# startupProbe for slow starters — failureThreshold * periodSeconds = startup grace window\n# 30 * 10s = 5 minutes before liveness/readiness activate\nstartupProbe:\n  httpGet:\n    path: /healthz\n    port: 8080\n  failureThreshold: 30   # 30 allowed failures\n  periodSeconds: 10      # checked every 10s\n\nlivenessProbe:\n  httpGet:\n    path: /healthz\n    port: 8080\n  periodSeconds: 10\n  failureThreshold: 3\n```\n\nTwo rules that eliminate most probe-related CrashLoopBackOff incidents: (1) never probe external dependencies in liveness checks — a 30-second database blip should not restart every pod in your deployment; (2) use `startupProbe` for any app that takes more than 10–15 seconds to initialize.\n\n## How to prevent crash loops\n\n**Set accurate resource requests and limits.** Requests drive scheduling; limits cap consumption. Profile under realistic load — don’t guess. 
An undersized memory limit produces OOM kills; no limit lets a leaking container consume the entire node.\n\n**Use startup probes for init-heavy workloads.** For Java applications on Kubernetes, implementing a `startupProbe` with `failureThreshold=30` and removing CPU limits allows the JVM class-loading phase to complete without being throttled or killed — this is the combination that eliminates init-phase CrashLoopBackOff for JVM workloads. CPU limits cause throttling that dramatically slows JVM class loading; rely on CPU requests for scheduling and monitor actual utilization instead of capping it hard.\n\n**Implement graceful shutdown.** Handle SIGTERM, drain in-flight requests, and set `terminationGracePeriodSeconds` to match. A pod that crashes on shutdown can corrupt state that causes the next startup to fail.\n\n**Rollback safety.** Before applying any memory limit or probe changes in production, make sure you can undo them quickly:\n\n```\n# Always check rollout history before making changes\nkubectl rollout history deployment/<name> -n <namespace>\n\n# Roll back if the change makes things worse\nkubectl rollout undo deployment/<name> -n <namespace>\n```\n\n**Automate memory rightsizing.** Manual limits drift. [Cast AI Workload Autoscaler](https://cast.ai/workload-optimization) tracks actual memory usage continuously and keeps limits calibrated. Its OOM event handler closes the feedback loop between a kill event and a corrected limit without requiring an engineer to notice, diagnose, and redeploy. At 240 changes per hour across a fleet, it handles scale that VPA alone cannot.\n\n**Use OpsPilot for real-time diagnosis.** When a CrashLoopBackOff alert fires at 2am, OpsPilot gives you root cause in seconds: *“payments-api has restarted 14 times in 2 hours. 
Root cause: OOMKilled — memory limit 256Mi, peak RSS 312Mi at v2.3.9 rollout.”* That’s the full diagnostic loop compressed into one response, with the specific version and memory figures you need to act.\n\n## FAQ\n\n### What is CrashLoopBackOff in Kubernetes?\n\nCrashLoopBackOff is a pod status (`Waiting.Reason: CrashLoopBackOff`) indicating a container is repeatedly starting and crashing. Kubernetes applies exponential backoff between restart attempts — starting at 10 seconds, capping at 5 minutes. It is not an error code; it describes the pod’s current waiting state. See the [Kubernetes pod lifecycle documentation](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-restarts) for the full specification.\n\n### How long does the backoff last?\n\nThe sequence is 10s, 20s, 40s, 80s, 160s, then 300s (5 minutes) where it caps. Each failed restart doubles the wait, up to 5 minutes. After 10 consecutive minutes of successful operation, the backoff counter resets to zero.\n\n### What is the difference between CrashLoopBackOff and ImagePullBackOff?\n\nImagePullBackOff means Kubernetes cannot pull the container image — wrong tag, missing credentials, or a network issue. The container never starts. CrashLoopBackOff means the image pulled successfully but the container exits with a non-zero code after starting. Both use exponential backoff, but they occur at different lifecycle stages and require different fixes.\n\n### What does exit code 137 mean in Kubernetes?\n\nExit code 137 means the container was killed by SIGKILL (128 + signal 9) — either the Linux kernel’s OOM killer or the kubelet after `terminationGracePeriodSeconds` expired. For OOM kills, `kubectl describe pod` shows `Reason: OOMKilled` in the container’s `Last State`. The fix is to increase the memory limit using profiling data or an automated rightsizing tool. 
For liveness probe restarts, note that kubelet sends SIGTERM first (exit code 143); 137 only appears if the container didn’t exit within the grace period.\n\n### How do I stop CrashLoopBackOff quickly?\n\nRun `kubectl logs <pod-name> --previous` to read the crashed container’s output, then `kubectl describe pod <pod-name>` to check exit codes and events. Exit code 137 + `Reason: OOMKilled`: increase memory limits. Exit codes 1 or 2 with config errors: fix env vars or entrypoint. Liveness probe failures in events: add a startupProbe. If STATUS shows `Init:CrashLoopBackOff`, use `kubectl logs <pod> -c <init-container-name> --previous` — the main container hasn’t run yet. Deleting the pod only resets the backoff timer — it does not fix the crash.\n\n### Can OOM kills cause CrashLoopBackOff?\n\nYes — OOM kills are one of the most common causes. The kernel terminates the container when it exceeds its memory limit (exit code 137). Kubernetes registers the crash and restarts the container. If the limit is still too low, the container hits it again, crashes again, and the loop continues. Cast AI Workload Autoscaler detects the OOMKill event and immediately applies a corrected memory limit, breaking the cycle without manual intervention.\n\n### What kubectl command shows CrashLoopBackOff?\n\n`kubectl get pods -n <namespace>` shows the STATUS column where CrashLoopBackOff appears alongside the RESTARTS count. For root cause, use `kubectl describe pod <name> -n <namespace>` for exit codes and events, and `kubectl logs <name> --previous` for the last container’s output. If STATUS shows `Init:CrashLoopBackOff`, first list init container names with `kubectl get pod <name> -o jsonpath='{.spec.initContainers[*].name}'` then pull their logs with `kubectl logs <name> -c <init-container-name> --previous`.\n\n### Does Cast AI help with CrashLoopBackOff?\n\nYes, in two ways. 
OpsPilot diagnoses CrashLoopBackOff incidents in seconds, surfacing root cause and restart count without manual log triage. The Cast AI Workload Autoscaler handles OOM-driven crash loops by detecting kill events, generating updated memory recommendations, and applying them immediately — on Kubernetes 1.33+ via in-place pod resizing (beta, `InPlacePodVerticalScaling` feature gate), or on 1.27–1.32 via pod eviction.", "url": "https://wpnews.pro/news/crashloopbackoff-in-kubernetes-the-real-causes-and-how-we-fix-it", "canonical_source": "https://cast.ai/blog/crashloopbackoff/", "published_at": "2026-06-26 07:55:25+00:00", "updated_at": "2026-06-29 09:31:09.070159+00:00", "lang": "en", "topics": ["ai-tools", "developer-tools"], "entities": ["Kubernetes", "Cast AI", "Workload Autoscaler"], "alternates": {"html": "https://wpnews.pro/news/crashloopbackoff-in-kubernetes-the-real-causes-and-how-we-fix-it", "markdown": "https://wpnews.pro/news/crashloopbackoff-in-kubernetes-the-real-causes-and-how-we-fix-it.md", "text": "https://wpnews.pro/news/crashloopbackoff-in-kubernetes-the-real-causes-and-how-we-fix-it.txt", "jsonld": "https://wpnews.pro/news/crashloopbackoff-in-kubernetes-the-real-causes-and-how-we-fix-it.jsonld"}}