# Agent checkpointing is far from production-grade resiliency

> Source: <https://www.restate.dev/blog/why-checkpointing-is-not-production-grade-durable-execution>
> Published: 2026-06-15 14:33:46+00:00


Giselle van Dongen

*Agent frameworks now advertise "durable execution" and "resiliency," usually built on checkpointing. But the guarantees you get fall far short of what it takes to deploy long-running agents to production.*

Productionizing agents is hard. Agents are long-running, expensive, and fragile. A production agent calls a model that costs money for each invocation, fires tools with side effects — emails, payments, deployments — and might wait for days for a human to click *approve*. In the meantime, processes restart, connections drop, you hit rate-limits, and new deploys happen. So frameworks started adding resiliency features: checkpointing, resumable threads, "durable execution."

These features are useful, but they usually only solve a small part of the problem: recovering completed work. 80% of the hard work is still on the plate of the developer.

What agent SDKs mean by "durable"

Strip away the naming differences and the pattern across agent frameworks is the same: **periodically snapshot the agent's state to a database, and let the caller resume from the last snapshot.** Recovery means loading that state object and starting over from the boundary. The process has no record of where *inside* your code execution was: it just knows what the state looked like at the last save point.

That same model is used across the ecosystem: LangGraph does node-level checkpointing, Google ADK has event persistence, Mastra does step-level snapshots. Each of these boils down to saving state at boundaries, restarting coarsely, and losing the work in between.

The part of the iceberg below the checkpoint

Checkpoints help with recovery, but agents are distributed applications, and making them durable requires more than that.

Recovery should land on the failing step, not the nearest boundary.

A checkpoint can only take you back to a boundary: it doesn't *recover* execution, it *restarts* it. If a tool performs three calls (call an API, send an email, write to a database) and the process dies after the second, restarting from the checkpoint re-runs all three, because nothing ever recorded that the first two happened. A checkpoint describes state at a boundary, not how far your code got in between.

The same goes for parallel work. Imagine a planner agent that kicks off 10 parallel researcher agents, each doing an hour of work. Five have finished when the sixth fails. What gets recovered depends on the exact checkpointing implementation, but often some or all of the researchers redo their work.

**What we actually want:** Real durable execution. Persist every step the code executes (each LLM call, tool call, sleep, RPC). On recovery, the code re-runs, but completed steps return their journaled results instead of executing again, so execution fast-forwards to the exact step that didn't finish and continues live from there. The three-call tool recovers to call three.
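To make the journaling idea concrete, here is a minimal sketch in plain Python. It is not Restate's API: `Journal`, `Context.run`, and the step names are hypothetical, and the journal lives in memory where a real one would be persisted outside the process. It assumes each step has a unique name.

```python
from typing import Any, Callable

class Journal:
    """In-memory stand-in for a durable step journal (hypothetical)."""
    def __init__(self) -> None:
        self.entries: dict[str, Any] = {}

class Context:
    def __init__(self, journal: Journal) -> None:
        self.journal = journal

    def run(self, name: str, action: Callable[[], Any]) -> Any:
        # A completed step returns its recorded result and is not executed again.
        if name in self.journal.entries:
            return self.journal.entries[name]
        result = action()
        # Record the result before the code moves on to the next step.
        self.journal.entries[name] = result
        return result

executed: list[str] = []  # side effects that actually happened

def step(name: str, fail: bool = False) -> Callable[[], str]:
    def action() -> str:
        if fail:
            raise RuntimeError(f"process died during {name}")
        executed.append(name)
        return f"{name}: ok"
    return action

def tool(ctx: Context, crash_at_db: bool) -> list[str]:
    return [
        ctx.run("call_api", step("call_api")),
        ctx.run("send_email", step("send_email")),
        ctx.run("write_db", step("write_db", fail=crash_at_db)),
    ]

journal = Journal()
try:
    tool(Context(journal), crash_at_db=True)  # first attempt dies at call three
except RuntimeError:
    pass

# Recovery: the same code re-runs against the same journal.
result = tool(Context(journal), crash_at_db=False)
```

After the second attempt, `executed` contains each of the three steps exactly once: the API call and the email were replayed from the journal, and only the database write ran live.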

Someone has to notice the failure.

A library cannot supervise itself. It dies together with the process. With a checkpoint SDK, "recovery" means *something else* must detect that a run was ongoing and needs to be retried, and it must know how to restart it.

With checkpointing, this is often left up to you. You must either have your upstream client detect failures and retry, or put a queue in front of your agent service. You need to generate a stable ID for each run and, on a retry, deduce the right ID and dispatch the `agent.run(...)` task again. This is complex and error-prone, it adds infrastructure to deploy and operate, and it doesn’t contribute at all to your use case.

**What we actually want:** A durable orchestrator that detects failures immediately and automatically redispatches in milliseconds.
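One common way to detect a dead worker from the outside is a lease that the worker must keep renewing. The sketch below is an assumption for illustration, not how any particular orchestrator implements detection; `Run`, `heartbeat`, and `find_stalled` are hypothetical names, and the run table would live in durable storage rather than a Python list.

```python
from dataclasses import dataclass

@dataclass
class Run:
    run_id: str              # stable ID, so a redispatch resumes the same run
    lease_expires_at: float  # the worker must renew before this time passes
    done: bool = False

def heartbeat(run: Run, now: float, lease_seconds: float = 10.0) -> None:
    # Called periodically by the worker that is executing the run.
    run.lease_expires_at = now + lease_seconds

def find_stalled(runs: list[Run], now: float) -> list[str]:
    # Called by a supervisor that lives outside the worker process.
    return [r.run_id for r in runs if not r.done and r.lease_expires_at <= now]

runs = [Run("run-a", 10.0), Run("run-b", 10.0), Run("run-c", 10.0, done=True)]
heartbeat(runs[0], now=8.0)             # run-a's worker is alive and renews
stalled = find_stalled(runs, now=12.0)  # run-b's worker died silently
```

Here only `run-b` is reported as stalled. The supervisor, the lease store, and the redispatch path are exactly the extra pieces a checkpoint library leaves for you to build and operate.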

Retries have to survive the thing that's failing.

In-process retry state, like an attempt counter in a local variable or a backoff implemented as `sleep()`, dies with the process. Retry policies, attempt counts, and backoff schedules belong in the same durable substrate as the execution itself, including the end state.

**What we actually want:** Retry policies that are respected across process failures, restarts, and redeploys, that intelligently treat some errors as retryable and others as terminal, and that allow for fine-grained retry policy definition for different actions and side effects of our system.
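A minimal sketch of retry state that lives outside the process, assuming a key-value store that survives restarts (a plain `dict` stands in for it here). `RetryState`, `TerminalError`, `attempt`, and the backoff parameters are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable

class TerminalError(Exception):
    """Hypothetical marker for errors that must not be retried."""

@dataclass
class RetryState:
    attempts: int = 0
    next_attempt_at: float = 0.0  # absolute time, so it survives a restart
    status: str = "pending"       # pending | succeeded | failed

def backoff(attempts: int, base: float = 1.0, cap: float = 60.0) -> float:
    # Exponential backoff: 1s, 2s, 4s, ... capped at `cap`.
    return min(cap, base * 2 ** (attempts - 1))

def attempt(store: dict[str, RetryState], key: str, action: Callable[[], None],
            now: float, max_attempts: int = 5) -> RetryState:
    state = store.setdefault(key, RetryState())
    if state.status != "pending" or now < state.next_attempt_at:
        return state  # already finished, or the backoff has not elapsed yet
    state.attempts += 1
    try:
        action()
        state.status = "succeeded"
    except TerminalError:
        state.status = "failed"  # terminal: retrying cannot help
    except Exception:
        if state.attempts >= max_attempts:
            state.status = "failed"
        else:
            state.next_attempt_at = now + backoff(state.attempts)
    return state

def flaky() -> None:
    raise ConnectionError("rate limited")

def rejected() -> None:
    raise TerminalError("card declined")

store: dict[str, RetryState] = {}  # stand-in for durable storage
attempt(store, "send_email", flaky, now=0.0)         # attempt 1, retry at t=1.0
attempt(store, "send_email", flaky, now=0.5)         # too early: nothing happens
attempt(store, "send_email", flaky, now=1.0)         # attempt 2, retry at t=3.0
attempt(store, "send_email", lambda: None, now=3.0)  # attempt 3 succeeds
attempt(store, "charge_card", rejected, now=0.0)     # fails without retrying
```

Because the counter and the next attempt time live in `store` rather than in a local variable, a process that restarts between any two of these calls picks up the schedule where it left off.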

The same durability guarantees should cover the entire workflow.

Most agents end up as deterministic workflows, of which some steps are agentic and others are not (write to DB, send email, …).

To cover those non-agentic steps, the checkpointing solution has to extend beyond `agent.run()`. This scenario is so common that many agent SDKs are turning into workflow orchestrators to fill this gap.

But they don’t give the same guarantees as the workflow orchestrators we have already been using for years, for boring, business-critical logic: payments, order handling, infra automation. Building intuitive abstractions over the agent loop is an entirely different ballpark than resiliently orchestrating long-running workflows. Resilient workflow execution cannot be bolted on but is a product in itself, built from the ground up to support this. An agent SDK should run on top of / integrate with workflow orchestrators instead of extending to become one.

**What we actually want**: A flexible orchestrator that executes agentic and non-agentic steps, lets you swap agent SDKs or implement your own loop, supports dynamic control flow in normal Python/TS code — no DAG restrictions, no DSLs — and guarantees the code runs to completion no matter what.

Two writers must be impossible.

It's easy to end up with two agents writing concurrently to the same session state: a user clicks twice on the send button, or a zombie process that was presumed dead wakes up and keeps writing while its replacement runs. This has unpredictable results: lost updates, corrupt state, duplicate side effects, agents misunderstanding what they did, etc.

Resilient session management is a key part of productionizing agents, yet checkpointing does not cover this. You need to implement locking and fencing yourself.

**What we actually want**: An orchestrator that coordinates sessions and guarantees only a single agent executes on a single session, fences out zombie processes, and handles deduplication of duplicate requests. This requires consensus-based distributed-systems engineering: monotonic attempt epochs, a log that rejects appends from superseded attempts, single-writer-per-key execution.
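The fencing part can be illustrated with a single-process sketch. `SessionLog` and `FencedError` are hypothetical, and a real implementation would need the epoch and the log to be agreed on by a replicated, consensus-backed store rather than held in one Python object.

```python
class FencedError(Exception):
    """Raised when a superseded attempt tries to write."""

class SessionLog:
    def __init__(self) -> None:
        self.epoch = 0              # monotonic attempt epoch
        self.entries: list[str] = []

    def start_attempt(self) -> int:
        # Starting a new attempt supersedes every earlier one.
        self.epoch += 1
        return self.epoch

    def append(self, epoch: int, entry: str) -> None:
        # Only the current attempt may write; older epochs are fenced out.
        if epoch != self.epoch:
            raise FencedError(f"epoch {epoch} superseded by {self.epoch}")
        self.entries.append(entry)

log = SessionLog()
old = log.start_attempt()
log.append(old, "llm_call: plan")

new = log.start_attempt()  # the old process is presumed dead and replaced

zombie_rejected = False
try:
    log.append(old, "tool_call: send_email")  # the zombie wakes up and writes
except FencedError:
    zombie_rejected = True

log.append(new, "tool_call: send_email")
```

The zombie's append is rejected, so the log ends up with one plan and one email from the live attempt, not two competing histories.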

Waiting on humans, timers and other agents must be free and durable.

Agents wait a lot: for humans, for timers, for other agents. A durable wait needs a set of properties:

- The wakeup itself is durable (an `asyncio.sleep` is not). After a restart, the process remembers which approvals, remote agent calls, and timers it was waiting on.
- Timers remember how much time remained after restarts, instead of starting over.
- The process pauses during the wait without consuming resources. An approval that takes a month shouldn’t be consuming resources, especially on serverless.
- The code that resumes is the code that suspended.

Some SDKs allow pause/resume for human approvals, but that’s where it stops. You can’t race an approval against a failure-proof timer, or suspend an orchestrator agent while a remote agent does some work. There needs to be a process awake to track the timer, or hold the connection to the remote agent. Otherwise, after a failure, they start over.

**What we actually want**: An orchestrator that makes promises/futures durable and composable: timers, human approvals, remote agent calls. The orchestrator persists what an execution is waiting on and lets it resume once it’s able to make progress.
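As a rough illustration of a timer that survives restarts and can be raced against an approval, the sketch below persists an absolute deadline instead of "seconds left". `DurableWait`, `approve`, `poll`, and `remaining` are hypothetical names, and a JSON string stands in for the durable store.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class DurableWait:
    deadline: float                    # absolute time, not a countdown
    approved_by: Optional[str] = None  # set when the approval arrives

def approve(wait: DurableWait, who: str, now: float) -> None:
    # The approval only wins the race if it arrives before the deadline.
    if wait.approved_by is None and now < wait.deadline:
        wait.approved_by = who

def poll(wait: DurableWait, now: float) -> str:
    if wait.approved_by is not None:
        return "approved"
    if now >= wait.deadline:
        return "timed_out"
    return "suspended"  # nothing to do: no process needs to stay awake

def remaining(wait: DurableWait, now: float) -> float:
    return max(0.0, wait.deadline - now)

wait = DurableWait(deadline=1000.0 + 3600.0)  # one-hour timer started at t=1000
saved = json.dumps(asdict(wait))              # persisted; the process may exit

# A fresh process picks the wait up ten minutes later.
restored = DurableWait(**json.loads(saved))
left = remaining(restored, now=1600.0)        # 3000.0 seconds, not a fresh hour
state_before = poll(restored, now=1600.0)
approve(restored, "alice", now=2000.0)
state_after = poll(restored, now=2000.0)
```

Storing the deadline as an absolute time is what lets the restored process see 50 minutes remaining rather than starting the hour over.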

Safe upgrades are a requirement for durability.

When you deploy new agent logic with slightly different descriptions, instructions, tool implementations, or workflow steps, any ongoing executions might fail loudly or silently in unpredictable ways. Imagine updating a fraud risk tool:

```python
# V1
@function_tool
async def compute_risk_score(transaction: Transaction) -> float:
  """Returns a risk score from 0 to 10 where higher means more suspicious."""
  return await fraud_model.score(transaction)

# V2
@function_tool
async def compute_risk_score(transaction: Transaction) -> float:
  """Returns a risk score from 0 to 100 where higher means more suspicious."""
  return await improved_fraud_model.score(transaction)
```

An agent that executed this tool on the old code but continues on the new code would misinterpret a score of “10” as safe instead of fraudulent. This would happen silently, and you might never even notice.

This is especially relevant for systems that allow resuming execution days later (e.g. after a human approval), where the chance that a redeploy happened in the meantime is high. Checkpoints carry no application version, so they don’t guard against this, and their tight coupling to the code makes them even more likely to break across updates. To make this work, you either fork on version tags inside your code (messy) or implement smart routing yourself.

**What we actually want**: Version-aware routing. Executions always resume/retry on the code they started with. Executions are pinned to immutable, versioned deployments. New traffic goes to the new version, while old executions continue on their original version.
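A minimal sketch of version pinning, reusing the risk-score scenario. `VersionPinnedRouter`, the two `interpret_*` handlers, and the thresholds (7 on the 0 to 10 scale, 70 on the 0 to 100 scale) are assumptions made up for the example.

```python
from typing import Callable, Dict

Handler = Callable[[float], str]

class VersionPinnedRouter:
    def __init__(self) -> None:
        self.deployments: Dict[str, Handler] = {}  # immutable once deployed
        self.latest: str = ""
        self.pinned: Dict[str, str] = {}           # execution ID -> version

    def deploy(self, version: str, handler: Handler) -> None:
        self.deployments[version] = handler
        self.latest = version

    def start(self, execution_id: str) -> str:
        # New executions are pinned to whatever is latest when they start.
        self.pinned[execution_id] = self.latest
        return self.latest

    def resume(self, execution_id: str, score: float) -> str:
        # Resumes and retries always go to the version the execution started on.
        return self.deployments[self.pinned[execution_id]](score)

def interpret_v1(score: float) -> str:  # score is on the 0-10 scale
    return "fraudulent" if score >= 7 else "safe"

def interpret_v2(score: float) -> str:  # score is on the 0-100 scale
    return "fraudulent" if score >= 70 else "safe"

router = VersionPinnedRouter()
router.deploy("v1", interpret_v1)
router.start("exec-1")             # computes a score of 10 under V1, then waits
router.deploy("v2", interpret_v2)  # redeploy while exec-1 is suspended
router.start("exec-2")

pinned_result = router.resume("exec-1", 10.0)  # still handled by v1
unpinned_result = interpret_v2(10.0)           # what unpinned routing would do
```

With pinning, the old execution still reads its score of 10 as fraudulent. Routed to the new code, the same score would silently come back as safe.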

The orchestrator itself must not die.

If your agent's resilience is guarded by a single Postgres instance in a single region, that instance becomes your single point of failure. For high availability, each system that gets added to the picture (queue, DB, scheduler, …) needs to run as a distributed cluster, which quickly becomes a headache to operate.

**What we actually want**: Business-critical systems need high availability with leader failover for in-flight work, and in more extreme cases even geo-replication to survive a region loss without duplicating a request.

The stack upside down

There is nothing on this list that's agent-specific. It's the same list you'd write for payments, order workflows, or any other distributed system. Which is the point:

**Durable execution is a distributed-systems problem, and agent SDKs are application libraries.**

Your backend needs an all-encompassing durability story, consistent across all services and workflows, whether agentic or not. Instead of tying resiliency together with your agent framework, agents should be built on top of the technologies that were built from the ground up to own it.

```
┌----------------------------------------------┐
│  Agent SDK (your choice)                     │  prompts, tools, loops,
│  OpenAI Agents · Vercel AI · Pydantic AI ·   │  handoffs, memory patterns
│  Google ADK · LangChain · your own loop.     │
│----------------------------------------------│
│  Durability layer (Restate)                  │  journaled steps, automatic
│  replicated log · fencing · durable timers   │  retries, exactly-once RPC,
│  & promises · version pinning · failover     │  suspension, task control
│----------------------------------------------│
│  Your compute (containers, K8s, Lambda)      │  stateless, scale-to-zero
└----------------------------------------------┘
```

In this stack the SDK gets used for its useful abstractions, but runs on top of a durable substrate. A durable, highly available orchestrator receives the request, and from that moment it owns the end-to-end execution: every LLM call and tool result is journaled in a replicated log before execution proceeds, recovery is automatic, a session is a single-writer keyed object with built-in corruption protection, an approval wait is a durable promise that costs nothing for six weeks, and the execution that resumes is pinned to the code version it started on. The component that guarantees a process runs to completion has to see the whole process — it can't live inside the thing whose lifetime it's guaranteeing.

Restate is an open-source durable orchestrator that gives you all these guarantees. It is packaged as a lightweight, single binary (just one thing to deploy, no extra DB/dependencies). It can run as a single process or in a high-availability setup by deploying it multiple times and snapshotting to an object store. It integrates with popular agent SDKs (Pydantic AI, Google ADK, LangChain, Vercel AI SDK, etc.), but also gives you the option to write fully customized agents from scratch, using just Restate’s durable building blocks. Your agents run as normal Python/TS/Java processes in Docker containers or serverless functions that suspend while waiting. They use the Restate SDK to make steps durable.

Start building

Restate is used as the foundation in agent platforms, runtimes, and harnesses at large AI-native companies as well as startups. It extends beyond agents and powers business-critical workflows in Fortune 500 companies as well as startups, for things like payments, order processing, infrastructure automation, workflow runtimes, etc.

The fastest way to get started is with [Restate Cloud](https://restate.dev/cloud/). Deploy your agents on serverless functions for zero infra management. Try it for free and have a managed instance running in minutes. Or follow the [**quickstart guide**](https://docs.restate.dev/quickstart) to run Restate locally.

Questions? Join us on [Discord](https://discord.restate.dev/) or [**Slack**](https://slack.restate.dev/).

Restate is open, free, and available at [GitHub](https://github.com/restatedev/restate). Star the project if you like what we're doing and tell your friends about it!
