[ARTICLE · art-43208] src=cast.ai ↗ pub= topic=ai-tools verified=true sentiment=· neutral

CrashLoopBackOff in Kubernetes: The Real Causes and How We Fix It

CrashLoopBackOff in Kubernetes is a pod status indicating a container repeatedly crashes and restarts with exponential backoff, caused by application errors, OOM kills, failing probes, bad images, missing dependencies, or init container failures. The diagnostic loop uses kubectl commands to inspect logs and events, while Cast AI's Workload Autoscaler automates OOMKill fixes by adjusting memory limits.

13 min read · published Jun 26, 2026

Key takeaways #

CrashLoopBackOff is a pod status (Waiting.Reason), not an error code: the container starts, crashes, and Kubernetes keeps retrying with exponential backoff.

Backoff sequence: 10s → 20s → 40s → 80s → 160s → 300s (capped at 5 minutes); it resets after 10 minutes of successful operation.

Six root causes: application/config errors, OOM kills, failing liveness probes, bad image or entrypoint, missing dependencies, and init container failures.

The CrashLoopBackOff Diagnostic Loop: get pods → describe pod → logs --previous → get events → fix.

Exit code 137 = OOMKilled (128 + SIGKILL, signal 9). Exit codes 1 or 2 = application or config error. Init container failures show status Init:CrashLoopBackOff.

Cast AI Workload Autoscaler detects OOMKill events and applies corrected memory limits immediately: up to 240 changes per hour, no manual intervention.

What is CrashLoopBackOff? #

CrashLoopBackOff is a Kubernetes pod status (Waiting.Reason: CrashLoopBackOff) indicating that a container is repeatedly starting, crashing, and being restarted by the kubelet. It is not an error code. Kubernetes applies exponential backoff between each restart attempt to avoid overwhelming the node. The sequence is:

10s → 20s → 40s → 80s → 160s → 300s, capped at 5 minutes. After 10 consecutive minutes of successful operation, the counter resets to zero.
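The doubling-with-a-cap schedule can be reproduced in a few lines. This is a minimal sketch of the arithmetic only, not the kubelet's implementation; the function name and its parameters are illustrative.

```python
def restart_backoff_delays(restarts, base=10, cap=300):
    """Return the wait in seconds before each of the first `restarts` restarts.

    Each delay doubles from `base` and is clamped at `cap` (5 minutes).
    """
    delays = []
    delay = base
    for _ in range(restarts):
        delays.append(min(delay, cap))
        delay *= 2
    return delays


print(restart_backoff_delays(8))
# [10, 20, 40, 80, 160, 300, 300, 300]
```

From the sixth restart onward, every further attempt waits the full 300 seconds until the container runs cleanly long enough for the counter to reset.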

This differs from ImagePullBackOff, which happens before the container starts — Kubernetes cannot pull the image (wrong tag, bad credentials, network issue). With CrashLoopBackOff, the image pulled fine; the container exits with a non-zero code. Deleting and recreating the pod only resets the backoff timer. The same crash will recur until you fix the underlying cause.

The common causes #

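Cause | Signal | Exit code hint | kubectl clue | Fix section
Misconfig (env vars, entrypoint, permissions) | Container exits immediately on start | 1 or 2 | logs --previous shows config error | Config errors
OOM kill | Kernel kills the container at its memory limit | 137 | describe pod shows OOMKilled: true | OOM restarts
Failing liveness probe | kubelet restarts the container | 143 (137 if SIGTERM is ignored) | describe pod shows Liveness probe failed | Probe failures
Bad image or entrypoint |  |  | logs --previous empty or exec error | Config errors
Missing dependencies |  |  | describe pod shows volume mount error or missing resource | Config errors
Init container failure | Main container never starts |  | STATUS shows Init:CrashLoopBackOff; describe pod shows init container exit | Init container failures

Init container failures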

Init container failures look subtly different in kubectl get pods. Instead of CrashLoopBackOff, the STATUS column shows Init:CrashLoopBackOff or Init:Error. The main application container never starts: Kubernetes runs init containers sequentially, and if any one exits with a non-zero code, the whole pod restarts.

The diagnostic commands are slightly different too, because kubectl logs <pod> defaults to the main container (which hasn't run yet). You need to name the init container explicitly:

kubectl get pod <pod-name> -o jsonpath='{.spec.initContainers[*].name}'

kubectl logs <pod-name> -c <init-container-name> --previous

Common init container failures break down into two categories. The first is waiting for an external dependency: a database readiness check, a service endpoint that isn't up yet, or a network policy that blocks the init container's probe. The second is a missing Secret or ConfigMap: the init container tries to mount or read a resource that doesn't exist in the namespace, exits non-zero, and triggers the loop. Check both with kubectl describe pod; the Events section will show FailedMount or connection timeout errors before you even pull logs.

How to diagnose it step by step #

Follow The CrashLoopBackOff Diagnostic Loop: get → describe → logs → events → fix. Each step narrows the cause. Don’t skip steps — what looks like an OOM kill can be a liveness probe timeout, and the fixes are different.

Step 1: Confirm status and restart count

kubectl get pods -n <namespace>

Step 2: Describe the pod — exit code and events

kubectl describe pod <pod-name> -n <namespace>

Init container callout: If STATUS is Init:CrashLoopBackOff, the describe output's Init Containers section shows each init container's last exit code and reason. Look for FailedMount events or connection errors there; the main container's Last State block will be empty since it never ran.

Step 3: Read the previous container’s logs

kubectl logs <pod-name> -n <namespace> --previous

kubectl logs <pod-name> -n <namespace> -c <container-name> --previous

kubectl logs <pod-name> -n <namespace> -c <init-container-name> --previous

Sidecars in multi-container pods: Check each container individually. Sidecars such as Istio's Envoy proxy or a Datadog agent can consume memory that pushes the pod's total usage past what the node can supply, causing the main container to be OOM-killed even if its own usage looks fine. Use kubectl logs <pod> -c <sidecar-name> --previous to check each container.

Step 4: Check cluster events

kubectl get events -n <namespace> --sort-by='.lastTimestamp'

Step 5: Check live resource usage

kubectl top pod <pod-name> -n <namespace> --containers

Step 6: Extract the exit code programmatically

kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
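The jsonpath above reads only the first entry in containerStatuses, so it misses sidecars and init containers. As a hedged sketch, the same fields can be pulled for every container from the output of kubectl get pod <pod-name> -n <namespace> -o json. The function name and the sample pod below are hypothetical; the field paths (status.containerStatuses, status.initContainerStatuses, lastState.terminated) are the ones the pod status exposes.

```python
def last_terminations(pod):
    """Collect the last terminated state of every init and app container.

    `pod` is the parsed JSON of `kubectl get pod ... -o json`.
    Containers that have never been terminated and restarted are skipped.
    """
    status = pod.get("status", {})
    results = []
    for key in ("initContainerStatuses", "containerStatuses"):
        for cs in status.get(key, []):
            terminated = cs.get("lastState", {}).get("terminated")
            if terminated:
                results.append({
                    "container": cs["name"],
                    "init": key == "initContainerStatuses",
                    "exit_code": terminated.get("exitCode"),
                    "reason": terminated.get("reason"),
                    "restarts": cs.get("restartCount", 0),
                })
    return results


# Hypothetical pod, trimmed to the fields the function reads.
sample_pod = {
    "status": {
        "initContainerStatuses": [
            {"name": "wait-for-db", "restartCount": 0, "lastState": {}},
        ],
        "containerStatuses": [
            {
                "name": "payments-api",
                "restartCount": 14,
                "lastState": {
                    "terminated": {"exitCode": 137, "reason": "OOMKilled"}
                },
            },
        ],
    }
}

for entry in last_terminations(sample_pod):
    print(entry)
```

In practice the dictionary would come from json.load on the kubectl output rather than a literal.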

Prometheus alert for proactive OOM detection: Track container_memory_working_set_bytes against memory limits. Alert at 80%; that is your warning window before the kernel OOM killer fires.

# Aggregate both sides to the same labels so the division matches series one-to-one
max by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
  / max by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory", container!=""})
  > 0.80
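The same 80% check can be done by hand against figures from kubectl top and the pod spec. The sketch below assumes a simplified quantity parser that handles only the common binary and decimal suffixes; the helper names and the sample figures are illustrative.

```python
# Multipliers for the most common Kubernetes memory suffixes (simplified).
UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def to_bytes(quantity):
    """Convert a quantity such as '256Mi' or '1G' to bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain byte count


def memory_pressure(working_set, limit, threshold=0.80):
    """Return (usage ratio, True if the ratio exceeds the threshold)."""
    ratio = to_bytes(working_set) / to_bytes(limit)
    return ratio, ratio > threshold


print(memory_pressure("210Mi", "256Mi"))  # ratio just above 0.82, alert
print(memory_pressure("128Mi", "256Mi"))  # (0.5, False)
```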

How to fix each cause #

Application or config errors

When logs --previous shows Error: required environment variable DATABASE_URL is not set or exec: "myapp": executable file not found in $PATH, you have a config or packaging problem.

Missing env vars: Confirm the referenced ConfigMap or Secret exists in the same namespace. A required reference to a non-existent object keeps the container from starting at all (the pod reports CreateContainerConfigError); a reference marked optional leaves the variable unset, and the application typically exits immediately.

Wrong entrypoint: Verify spec.containers[].command and args. If you're overriding the image's CMD, confirm the binary path inside the image.

Permission errors: Mounted volumes may not be writable by the container's non-root UID. Check fsGroup and runAsUser in securityContext.

For live debugging, attach an ephemeral container without triggering another restart:

kubectl debug -it <pod-name> -n <namespace> \
  --image=busybox \
  --target=<container-name>

Out-of-memory restarts

Exit code 137 with OOMKilled: true in lastState means the Linux kernel terminated the container for exceeding its memory limit. See our deep-dive: OOMKilled in Kubernetes.

Setting limits correctly is harder than it looks. In our 2026 Kubernetes Optimization Report, clusters averaged around 20% memory utilization: heavily overprovisioned overall, yet individual pods were still undersized. In one representative cluster analyzed for the report, we recorded 40–50 OOM kills per hour, spiking above 80 during peak load. After deploying automated rightsizing, the rate dropped to near zero.

VPA helps automate this. Valid updateMode values: Off (recommendations only), Initial (set on pod creation), Recreate (apply via pod eviction), Auto (in-place on K8s 1.33+, beta with the InPlacePodVerticalScaling feature gate; clusters running 1.27–1.32 still use pod eviction).

JVM workloads: Set -Xmx to ~75% of the container memory limit. A 512Mi limit should have -Xmx384m. This leaves headroom for non-heap memory: metaspace, thread stacks, native libraries. Skipping this is one of the most common causes of Java OOM kills in Kubernetes.
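The 75% rule is simple arithmetic. As a small sketch, assuming limits expressed in Mi or Gi only, a hypothetical helper can derive the flag from the limit:

```python
def xmx_flag(limit, heap_fraction=0.75):
    """Derive a -Xmx flag (in MiB) from a memory limit such as '512Mi' or '2Gi'."""
    mib_per_unit = {"Mi": 1, "Gi": 1024}
    for suffix, mib in mib_per_unit.items():
        if limit.endswith(suffix):
            limit_mib = float(limit[: -len(suffix)]) * mib
            return f"-Xmx{int(limit_mib * heap_fraction)}m"
    raise ValueError(f"unsupported memory quantity: {limit}")


print(xmx_flag("512Mi"))  # -Xmx384m
print(xmx_flag("2Gi"))    # -Xmx1536m
```

The remaining quarter of the limit is the headroom for metaspace, thread stacks, and native libraries described above.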

Cast AI Workload Autoscaler goes beyond VPA. Its OOM event handler detects a kill as it happens, generates a new recommendation with increased memory overhead, and applies it immediately, with no manual intervention. It supports up to 240 changes per hour and uses in-place pod resizing on K8s 1.33+ (beta, InPlacePodVerticalScaling feature gate; 1.27–1.32 uses pod eviction instead).

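# Run on the node to confirm the kill in the kernel log; -i also matches "Memory cgroup out of memory"
journalctl -k | grep -i "out of memory"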

LimitRange objects can silently impose memory limits on pods that don't set them explicitly. If an operator deploys a pod without a resources.limits.memory field, a LimitRange default (say, 128Mi) applies automatically: the pod appears to be running without limits but gets OOM-killed at 128Mi. Check before you assume the pod has no limit:

kubectl get limitrange -n <namespace>

kubectl describe limitrange -n <namespace>
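To make the defaulting rule concrete, here is a hedged sketch that resolves which memory limit a container from a manifest would end up with, given the namespace's LimitRange objects parsed from kubectl get limitrange -o json. The function name and sample objects are hypothetical, and the logic covers only Container-type defaults.

```python
def effective_memory_limit(container, limit_ranges):
    """Return (limit, source) for a container as written in a manifest.

    An explicit resources.limits.memory wins; otherwise the first
    Container-type LimitRange default for memory applies.
    """
    explicit = container.get("resources", {}).get("limits", {}).get("memory")
    if explicit:
        return explicit, "pod spec"
    for lr in limit_ranges:
        for item in lr.get("spec", {}).get("limits", []):
            if item.get("type") == "Container":
                default = item.get("default", {}).get("memory")
                if default:
                    return default, "LimitRange/" + lr["metadata"]["name"]
    return None, "none"


# Hypothetical objects for illustration.
limit_ranges = [{
    "metadata": {"name": "mem-defaults"},
    "spec": {"limits": [{"type": "Container", "default": {"memory": "128Mi"}}]},
}]
no_limit = {"name": "worker"}
with_limit = {"name": "api", "resources": {"limits": {"memory": "512Mi"}}}

print(effective_memory_limit(no_limit, limit_ranges))    # ('128Mi', 'LimitRange/mem-defaults')
print(effective_memory_limit(with_limit, limit_ranges))  # ('512Mi', 'pod spec')
```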

Liveness and readiness probe failures

These two probes do different things. Liveness probe failure → kubelet restarts the container (causes CrashLoopBackOff). Readiness probe failure → pod removed from Service endpoints (does NOT restart the container). Conflating them produces unnecessary restarts.

A note on exit codes for liveness-triggered restarts: The kubelet sends SIGTERM first, so the container's exit code will typically be 143 (128 + signal 15) or whatever exit code the application returns in its signal handler. SIGKILL (exit code 137) only follows if the container is still running after terminationGracePeriodSeconds expires. If you're seeing 137 on a liveness probe failure, your application is ignoring SIGTERM and being forcibly killed; that's a separate problem worth fixing.
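The 128 + signal convention behind 137 and 143 can be decoded mechanically. The following is a minimal sketch with a small, deliberately incomplete signal table; the function name is illustrative.

```python
# A few common Linux signal numbers; not an exhaustive table.
SIGNAL_NAMES = {2: "SIGINT", 6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def describe_exit_code(code):
    """Explain a container exit code using the 128 + signal convention."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        sig = code - 128
        return f"killed by signal {sig} ({SIGNAL_NAMES.get(sig, 'unknown')})"
    return "application exited with an error"


print(describe_exit_code(137))  # killed by signal 9 (SIGKILL)
print(describe_exit_code(143))  # killed by signal 15 (SIGTERM)
print(describe_exit_code(1))    # application exited with an error
```

A 137 alone does not distinguish an OOM kill from a grace-period kill; the OOMKilled reason in the container's last state is what separates the two.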

For slow-starting containers, use startupProbe (Kubernetes 1.18+). It disables liveness and readiness until it passes, giving the application its full startup window without forcing absurdly high initialDelaySeconds on the liveness probe.

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30   # 30 allowed failures
  periodSeconds: 10      # checked every 10s

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3

Two rules that eliminate most probe-related CrashLoopBackOff incidents: (1) never probe external dependencies in liveness checks, because a 30-second database blip should not restart every pod in your deployment; (2) use startupProbe for any app that takes more than 10–15 seconds to initialize.
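The probe settings shown earlier translate into time budgets by multiplying failureThreshold by periodSeconds. A quick sketch of that arithmetic follows; it is an approximation that ignores timeoutSeconds, and the function name is illustrative.

```python
def probe_budget_seconds(failure_threshold, period_seconds, initial_delay_seconds=0):
    """Approximate time before a probe gives up: delay + threshold * period."""
    return initial_delay_seconds + failure_threshold * period_seconds


startup_window = probe_budget_seconds(failure_threshold=30, period_seconds=10)
liveness_window = probe_budget_seconds(failure_threshold=3, period_seconds=10)

print(startup_window)   # 300 seconds to finish starting
print(liveness_window)  # 30 seconds of failures before a restart
```

With these values the application gets up to five minutes to start, while a hung container is still restarted after roughly thirty seconds once it is running.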

How to prevent crash loops #

Set accurate resource requests and limits. Requests drive scheduling; limits cap consumption. Profile under realistic load — don’t guess. An undersized memory limit produces OOM kills; no limit lets a leaking container consume the entire node.

Use startup probes for init-heavy workloads. For Java applications on Kubernetes, implementing a startupProbe with failureThreshold=30 and removing CPU limits allows the JVM class-loading phase to complete without being throttled or killed; this is the combination that eliminates init-phase CrashLoopBackOff for JVM workloads. CPU limits cause throttling that dramatically slows JVM class loading; rely on CPU requests for scheduling and monitor actual utilization instead of capping it hard.

Implement graceful shutdown. Handle SIGTERM, drain in-flight requests, and set terminationGracePeriodSeconds to match. A pod that crashes on shutdown can corrupt state that causes the next startup to fail.
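As an illustration of the SIGTERM handling described above, here is a minimal sketch of a worker loop that finishes its current unit of work and then stops when the signal arrives. The class and function names are hypothetical, and a real service would also close listeners and flush state before exiting.

```python
import signal
import time


class GracefulShutdown:
    """Records that SIGTERM was received so the main loop can stop cleanly."""

    def __init__(self):
        self.requested = False
        # Must be called from the main thread.
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        # Only set a flag here; the loop decides when to stop.
        self.requested = True


def serve(shutdown, handle_next, idle_seconds=0.1):
    """Process work until shutdown is requested; return the number handled.

    `handle_next` is a caller-supplied function that processes one unit of
    work and returns True, or returns False when nothing is waiting.
    """
    handled = 0
    while not shutdown.requested:
        if handle_next():
            handled += 1
        else:
            time.sleep(idle_seconds)
    return handled
```

Because the flag is checked only between units of work, an in-flight request completes before the loop exits, provided it finishes within terminationGracePeriodSeconds.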

Rollback safety. Before applying any memory limit or probe changes in production, make sure you can undo them quickly:

kubectl rollout history deployment/<name> -n <namespace>

kubectl rollout undo deployment/<name> -n <namespace>

Automate memory rightsizing. Manual limits drift. Cast AI Workload Autoscaler tracks actual memory usage continuously and keeps limits calibrated. Its OOM event handler closes the feedback loop between a kill event and a corrected limit without requiring an engineer to notice, diagnose, and redeploy. At 240 changes per hour across a fleet, it handles scale that VPA alone cannot.

Use OpsPilot for real-time diagnosis. When a CrashLoopBackOff alert fires at 2am, OpsPilot gives you root cause in seconds: “payments-api has restarted 14 times in 2 hours. Root cause: OOMKilled — memory limit 256Mi, peak RSS 312Mi at v2.3.9 rollout.” That’s the full diagnostic loop compressed into one response, with the specific version and memory figures you need to act.

FAQ #

What is CrashLoopBackOff in Kubernetes?

CrashLoopBackOff is a pod status (Waiting.Reason: CrashLoopBackOff) indicating a container is repeatedly starting and crashing. Kubernetes applies exponential backoff between restart attempts, starting at 10 seconds and capping at 5 minutes. It is not an error code; it describes the pod's current waiting state. See the Kubernetes pod lifecycle documentation for the full specification.

How long does the backoff last?

The sequence is 10s, 20s, 40s, 80s, 160s, then 300s (5 minutes) where it caps. Each failed restart doubles the wait, up to 5 minutes. After 10 consecutive minutes of successful operation, the backoff counter resets to zero.

What is the difference between CrashLoopBackOff and ImagePullBackOff?

ImagePullBackOff means Kubernetes cannot pull the container image — wrong tag, missing credentials, or a network issue. The container never starts. CrashLoopBackOff means the image pulled successfully but the container exits with a non-zero code after starting. Both use exponential backoff, but they occur at different lifecycle stages and require different fixes.

What does exit code 137 mean in Kubernetes?

Exit code 137 means the container was killed by SIGKILL (128 + signal 9), sent either by the Linux kernel's OOM killer or by the kubelet after terminationGracePeriodSeconds expired. For OOM kills, kubectl describe pod shows OOMKilled: true in the container's lastState. The fix is to increase the memory limit using profiling data or an automated rightsizing tool. For liveness probe restarts, note that kubelet sends SIGTERM first (exit code 143); 137 only appears if the container didn't exit within the grace period.

How do I stop CrashLoopBackOff quickly?

Run kubectl logs <pod-name> --previous to read the crashed container's output, then kubectl describe pod <pod-name> to check exit codes and events. Exit code 137 + OOMKilled: true: increase memory limits. Exit codes 1 or 2 with config errors: fix env vars or entrypoint. Liveness probe failures in events: add a startupProbe. If STATUS shows Init:CrashLoopBackOff, use kubectl logs <pod> -c <init-container-name> --previous, because the main container hasn't run yet. Deleting the pod only resets the backoff timer; it does not fix the crash.

Can OOM kills cause CrashLoopBackOff?

Yes — OOM kills are one of the most common causes. The kernel terminates the container when it exceeds its memory limit (exit code 137). Kubernetes registers the crash and restarts the container. If the limit is still too low, the container hits it again, crashes again, and the loop continues. Cast AI Workload Autoscaler detects the OOMKill event and immediately applies a corrected memory limit, breaking the cycle without manual intervention.

What kubectl command shows CrashLoopBackOff?

kubectl get pods -n <namespace> shows the STATUS column where CrashLoopBackOff appears alongside the RESTARTS count. For root cause, use kubectl describe pod <name> -n <namespace> for exit codes and events, and kubectl logs <name> --previous for the last container's output. If STATUS shows Init:CrashLoopBackOff, first list init container names with kubectl get pod <name> -o jsonpath='{.spec.initContainers[*].name}', then pull their logs with kubectl logs <name> -c <init-container-name> --previous.

Does Cast AI help with CrashLoopBackOff?

Yes, in two ways. OpsPilot diagnoses CrashLoopBackOff incidents in seconds, surfacing root cause and restart count without manual log triage. The Cast AI Workload Autoscaler handles OOM-driven crash loops by detecting kill events, generating updated memory recommendations, and applying them immediately: on Kubernetes 1.33+ via in-place pod resizing (beta, InPlacePodVerticalScaling feature gate), or on 1.27–1.32 via pod eviction.
