AI Agents Have a Reliability Problem Nobody Is Talking About

wpnews.pro

Software has always evolved by changing what systems are allowed to do. We moved from batch jobs to interactive applications, from monoliths to distributed systems, and from on-prem servers to elastic cloud infrastructure. Each shift didn’t just improve performance; it expanded what software could reliably accomplish.

We are now entering a new shift: from software that responds to software that acts.

AI agents are the first systems that don’t just compute outputs; they execute actions in the real world. They call APIs, move money, update databases, trigger workflows, and operate with a degree of autonomy that earlier software systems never had.

But this is where the transition breaks.

The intelligence layer has advanced rapidly: better models, better prompting, better tool use. Yet the infrastructure layer beneath agents has not caught up. These systems are being asked to operate continuously and autonomously on top of tools designed for stateless, best-effort execution.

That mismatch becomes visible only in failure: crashes that lose state, retries that duplicate side effects, and workflows that cannot safely resume. The same problems that distributed systems solved years ago through transactions, event logs, and durable execution reappear in a new form, but without the same guarantees.

This is the missing piece in the agentic future. Not smarter models, but reliable execution.

A customer asks an agent for a refund. The agent looks up the order, decides the refund is valid, and calls the payments API. The API processes the charge reversal. Then, in the few hundred milliseconds between the payment provider returning 200 OK and the agent recording that fact, the process running the agent is killed: an OOM kill, a deploy, a spot instance reclaimed, a Kubernetes pod evicted. Pick your cause; in production they all happen.

The orchestration layer notices the task didn't finish. It does the sensible thing: it retries. The agent starts again from the top, looks up the order, decides the refund is valid, and calls the payments API a second time.

The customer gets refunded twice.

Nobody wrote a bug. Every individual component behaved correctly. The model reasoned correctly both times. The payments API did exactly what it was told, twice. The retry logic did what retry logic is supposed to do. And yet the system as a whole produced a financially incorrect, externally visible, irreversible outcome.

This is not a prompting problem. It is not a model problem. It is an infrastructure problem, and it is the same class of infrastructure problem that distributed systems engineering spent the last two decades learning how to solve. The unsettling thing about the current generation of AI agents is how thoroughly that body of knowledge has been ignored.

The reason this problem is invisible is that agents look fine, better than fine, in the environment where almost all of them are evaluated. That environment is a single process, on a developer's machine or a notebook, running one task at a time, to completion, with no concurrency, no crashes, and a human watching the output stream by.

Consider the canonical agent loop. Stripped of framework-specific decoration, nearly every agent system in production today is some variant of this:

state = initial_context(task)
while not done(state):
    action = model.decide(state)        # LLM call: choose a tool + arguments
    result = execute(action)            # side-effecting call to the world
    state = state + action + result     # append to in-memory context
return finalize(state)

In a demo this loop is flawless. The state variable holds the entire history of the task. Each tool call happens, its result gets appended, the model sees the full trajectory, and the loop converges. You can watch it think. It feels like a system.

It is not a system. It is a function call that happens to take a long time and reach out to the network in the middle. And the moment you move it from a notebook into anything resembling production, three assumptions silently break.

The process is assumed to be immortal. The state variable lives in process memory. The loop assumes it will run from initial_context to finalize without interruption. But agent tasks are long (seconds to minutes, sometimes hours), and "long-running" and "in-memory" are a contradiction in any environment where processes restart. Deploys happen. Hosts die. Autoscalers scale in. The probability that a multi-minute task is interrupted at least once is not zero, and at scale it is not small. When the process dies, state is gone. Everything the agent did (every tool call, every result, every reasoning step) evaporates, including the knowledge of which side effects already happened.

Tool calls are assumed to be pure. The loop treats execute(action) as if it were a read: call it, get a value, no consequences. But the entire point of an agent, as opposed to a chatbot, is that its tool calls are not pure. They move money, write rows, send emails, provision infrastructure, file tickets, hit third-party APIs that themselves trigger downstream effects. The execute step is the part of the loop that touches the real world and cannot be taken back. Treating it like a pure function is exactly what turns a crash-and-retry into a double refund.

Execution is assumed to happen exactly once. There is no retry in the demo loop, because nothing fails in the demo. In production there is always retry: at the queue level, the orchestration level, the load balancer, the client SDK, or a human clicking "run again." Retry is not optional; it is how distributed systems achieve reliability in the presence of partial failure. But retry on top of impure, in-memory, non-replayable execution doesn't produce reliability. It produces duplicated side effects.
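The three broken assumptions can be reproduced in a few lines. The sketch below is a toy under stated assumptions: Payments, naive_agent, run_with_retry, and the simulated Crash are all hypothetical stand-ins, and the "crash" is just an exception raised after the side effect but before anything is recorded.

```python
class Crash(Exception):
    """Stands in for an OOM kill, a deploy, or a pod eviction."""


class Payments:
    """Hypothetical payments API with no idempotency support."""

    def __init__(self):
        self.refunds = []

    def refund(self, order_id):
        self.refunds.append(order_id)  # the irreversible side effect
        return "200 OK"


def naive_agent(payments, order_id, crash_after_refund):
    state = [("lookup_order", order_id)]  # lives only in process memory
    response = payments.refund(order_id)
    if crash_after_refund:
        # The process dies before the result is recorded anywhere durable.
        raise Crash("killed between side effect and record")
    state.append(("refund", response))
    return state


def run_with_retry(payments, order_id, crashes=1):
    """The orchestration layer: restart from the top whenever a run dies."""
    attempt = 0
    while True:
        try:
            return naive_agent(payments, order_id, crash_after_refund=attempt < crashes)
        except Crash:
            attempt += 1


payments = Payments()
final_state = run_with_retry(payments, "ord_1")
refund_count = len(payments.refunds)  # two refunds for one request
```

Every function here does what it was asked to do; the duplicate comes from the combination of an impure call, in-memory state, and a retry.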

These are not edge cases you can prompt your way out of. They are structural. The agent loop, as universally implemented, has no concept of durability, no concept of which actions have already been committed to the world, and no way to resume rather than restart. It works in the demo precisely because the demo removes every condition under which the missing infrastructure would have mattered.

It helps to be precise about how agents fail, because vague claims like "agents are unreliable" invite vague solutions like "use a better model." The failures are specific and they have well-understood names in systems engineering.

Duplicate side effects. A side-effecting operation is performed more than once because a retry replayed an action whose completion was never durably recorded. The double refund is the textbook case, but the general form is everywhere: two database rows where there should be one, an email sent twice, a server provisioned twice, a webhook delivered twice. This is the failure mode that most directly costs money and trust.

Lost state after crashes. The agent's working memory (its trajectory, its intermediate conclusions, its partial progress) exists only in process memory and is destroyed when the process dies. Because there is no durable log, the system cannot answer the most basic recovery question: what had already happened before the crash? Without that answer, the only options are to restart from scratch (risking duplicate side effects) or to give up (losing work and stranding the user mid-task).

Inconsistent execution. When two copies of an agent run concurrently (because a retry fired before the original finished, or a queue delivered the same message twice), they observe and mutate shared state with no coordination. One reads a value the other is about to change. Both believe they are the sole executor. The result is the same family of race conditions and write-write conflicts that distributed databases exist to prevent, except now they are being generated by a probabilistic decision-maker that may take different actions on each run.

Unrecoverable workflows. A multi-step agent task fails halfway through, leaving the world in a partially mutated state: the charge was reversed but the inventory was not restocked, the account was created but the welcome email never sent, three of five microservices were called. There is no record of how far it got and no safe way to continue or to unwind. The workflow is wedged, and a human has to reverse-engineer the partial state by hand.

Every one of these has a name, a literature, and a battle-tested solution in distributed systems. None of those solutions is new. What is new and strange is that an entire category of software is being built as if that literature does not exist.

Long before "agent" meant an LLM in a loop, the industry built systems whose entire job was to perform sequences of side-effecting operations, reliably, in the presence of crashes, retries, and concurrency. Payment processors, order-fulfillment pipelines, bank ledgers, provisioning systems, and workflow engines all live in exactly the regime where agents now find themselves. The techniques they converged on are not exotic. They are foundational, and they are directly applicable.

Event sourcing. Instead of storing only the current state, store the ordered, immutable log of events that produced it. The state is a projection of the log, not the source of truth. The log is the source of truth. This single inversion is the most important idea in reliable execution, because it means state can always be reconstructed: as long as you have the events, you can recover what happened, in what order, with full fidelity. A crash destroys the projection (in-memory state) but not the log. You rebuild and continue.
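As a minimal sketch of that inversion, assuming a JSON-lines file as the durable log and three made-up event types (task_received, refund_completed, task_finished), the state below is never stored; it is recomputed from the log on demand. The helper names are illustrative, not taken from any particular library.

```python
import json
import os
import tempfile


def append_event(path, event):
    """Append one event and force it to disk before returning."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
        f.flush()
        os.fsync(f.fileno())


def load_events(path):
    """Read the log back in the order it was written."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def project(events):
    """Fold the ordered log into the current state (a projection)."""
    state = {"order_id": None, "refunded": False, "done": False}
    for event in events:
        if event["type"] == "task_received":
            state["order_id"] = event["order_id"]
        elif event["type"] == "refund_completed":
            state["refunded"] = True
        elif event["type"] == "task_finished":
            state["done"] = True
    return state


# Hypothetical run: two events are written, then the process "dies".
log_path = os.path.join(tempfile.mkdtemp(), "agent-events.jsonl")
append_event(log_path, {"type": "task_received", "order_id": "ord_1"})
append_event(log_path, {"type": "refund_completed", "refund_id": "re_1"})

# A fresh process rebuilds state from the log alone.
recovered = project(load_events(log_path))
```

The recovered state knows the refund already happened, which is exactly the fact the in-memory loop loses.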

Replayable execution. If your event log captures not just business events but the inputs and outputs of every non-deterministic operation (every external call, every random choice, every clock read), then you can replay an execution deterministically. You feed the recorded results back in instead of re-performing the operations. This is the mechanism behind workflow engines like Temporal: workflow code is written as ordinary, sequential, imperative logic, but the runtime records the result of every external interaction so that after a crash it can re-run the code from the beginning, substituting recorded results for already-completed steps, and arrive at exactly the point of failure without re-executing anything that already happened. The programmer writes what looks like a normal function; the runtime makes it durable underneath.
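The record-and-substitute mechanism can be sketched in miniature. This is not Temporal's API; ReplayContext and its step method are hypothetical names, and the history is a plain list standing in for a durable log.

```python
import time


class ReplayContext:
    """Records the result of each non-deterministic step, or replays it."""

    def __init__(self, history=None):
        self.history = list(history or [])  # recorded results, in order
        self.cursor = 0
        self.executed = []  # steps actually performed in this run

    def step(self, name, fn):
        if self.cursor < len(self.history):
            recorded = self.history[self.cursor]
            if recorded["name"] != name:
                raise RuntimeError(
                    f"replay diverged: log has {recorded['name']!r}, code asked for {name!r}"
                )
            self.cursor += 1
            return recorded["result"]  # substitute the recorded result
        result = fn()  # genuinely new work
        self.history.append({"name": name, "result": result})
        self.cursor += 1
        self.executed.append(name)
        return result


def workflow(ctx):
    """Ordinary sequential code; every outside interaction goes through ctx.step."""
    started_at = ctx.step("clock", time.time)
    order = ctx.step("lookup_order", lambda: {"id": "ord_1", "amount_cents": 500})
    return started_at, order


first = ReplayContext()
original = workflow(first)

# After a crash, a new context replays from the saved history.
second = ReplayContext(first.history)
replayed = workflow(second)
```

On the second run nothing is re-executed, and even the clock read returns the originally recorded value, so the code follows the same path it took the first time.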

Durable queues. Work is not held in memory; it is enqueued in a persistent store with explicit delivery semantics, acknowledgments, and visibility timeouts. A task is not considered done until it is acknowledged. If a worker crashes before acknowledging, the task becomes visible again and another worker picks it up. The queue guarantees the work will be attempted until it succeeds, which is exactly why everything downstream of a queue must be built to tolerate being attempted more than once.
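A rough sketch of those delivery semantics, assuming SQLite as the persistent store and a single consumer (a production queue would also need an atomic claim across competing workers). The table layout and the function names are invented for illustration.

```python
import sqlite3
import time


def open_queue(path=":memory:"):
    """Pass a file path for real durability; ':memory:' is only for the demo."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        "id INTEGER PRIMARY KEY, body TEXT NOT NULL, visible_at REAL NOT NULL)"
    )
    db.commit()
    return db


def enqueue(db, body, now=None):
    now = time.time() if now is None else now
    cur = db.execute("INSERT INTO tasks (body, visible_at) VALUES (?, ?)", (body, now))
    db.commit()
    return cur.lastrowid


def receive(db, visibility_timeout=30.0, now=None):
    """Hand out the oldest visible task and hide it until the timeout expires."""
    now = time.time() if now is None else now
    row = db.execute(
        "SELECT id, body FROM tasks WHERE visible_at <= ? ORDER BY id LIMIT 1", (now,)
    ).fetchone()
    if row is None:
        return None
    db.execute("UPDATE tasks SET visible_at = ? WHERE id = ?", (now + visibility_timeout, row[0]))
    db.commit()
    return row


def ack(db, task_id):
    """Only an acknowledgment removes the task for good."""
    db.execute("DELETE FROM tasks WHERE id = ?", (task_id,))
    db.commit()


db = open_queue()
task_id = enqueue(db, "refund ord_1", now=0.0)
first_delivery = receive(db, now=0.0)    # a worker takes the task, then crashes
hidden = receive(db, now=10.0)           # still invisible to other workers
second_delivery = receive(db, now=31.0)  # timeout expired: delivered again
ack(db, task_id)
after_ack = receive(db, now=100.0)
```

The second delivery is the queue doing its job, and it is also the moment a non-idempotent consumer issues the second refund.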

Idempotency keys. Because at-least-once delivery is the achievable guarantee and exactly-once delivery generally is not, the standard defense is to make operations idempotent: performing them twice has the same effect as performing them once. The canonical implementation is the idempotency key: a unique identifier attached to a side-effecting request, stored by the receiver, so that a second request with the same key returns the result of the first instead of performing the action again. Stripe's API is the reference example: send the same Idempotency-Key twice and you get the original charge back, not a second charge. The double refund does not happen because the second call is recognized as a replay of the first.
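The receiver's half of that contract fits in a few lines. The class below is a hypothetical in-memory stand-in, not Stripe's implementation; a real receiver would persist the key-to-result mapping and would typically also check that a reused key arrives with the same parameters.

```python
class PaymentsAPI:
    """Hypothetical receiver that honors idempotency keys."""

    def __init__(self):
        self._results_by_key = {}
        self.refunds_issued = 0

    def refund(self, order_id, amount_cents, idempotency_key):
        # A key seen before means this request is a replay: return the
        # stored result instead of acting again.
        if idempotency_key in self._results_by_key:
            return self._results_by_key[idempotency_key]
        self.refunds_issued += 1
        result = {
            "refund_id": f"re_{self.refunds_issued}",
            "order_id": order_id,
            "amount_cents": amount_cents,
        }
        self._results_by_key[idempotency_key] = result
        return result


api = PaymentsAPI()
first = api.refund("ord_1", 500, idempotency_key="task-42/step-3")
retry = api.refund("ord_1", 500, idempotency_key="task-42/step-3")
```

The retry returns the original refund and the counter stays at one, which is the whole point: the caller may retry freely because the receiver absorbs the duplicate.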

Saga / compensation patterns. When a multi-step workflow cannot be made atomic (and across multiple external systems it usually cannot), you define a compensating action for each step (refund for charge, delete for create, restock for deduct). If the workflow fails partway, the engine runs the compensations for the steps that did complete, driving the system back toward a consistent state. This is how you get something approaching transactional behavior across systems that share no transaction.
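A bare-bones sketch of the pattern, assuming each step is supplied as a (name, action, compensate) triple. The run_saga helper and the step names are hypothetical, and a real engine would also persist its progress and retry compensations that themselves fail.

```python
def run_saga(steps):
    """Run steps in order; on failure, undo the completed ones in reverse."""
    completed = []
    try:
        for name, action, compensate in steps:
            action()
            completed.append((name, compensate))
    except Exception as exc:
        undone = []
        for name, compensate in reversed(completed):
            compensate()
            undone.append(name)
        return {"status": "compensated", "error": str(exc), "undone": undone}
    return {"status": "completed", "steps": [name for name, _ in completed]}


ledger = []


def send_email():
    raise RuntimeError("email service down")


outcome = run_saga([
    ("reverse_charge",
     lambda: ledger.append("charge reversed"),
     lambda: ledger.append("charge re-applied")),
    ("restock",
     lambda: ledger.append("item restocked"),
     lambda: ledger.append("stock deducted")),
    ("send_email", send_email, lambda: None),
])
```

The third step fails, so the two earlier steps are compensated newest-first, and the ledger ends with the world back where it started rather than half-mutated.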

Put these together and you get a runtime that can lose a process at any instant and recover to a correct state, that can retry freely without duplicating effects, and that can run the same logical task on different machines over time without confusion about what has already been done. This is solved engineering. The agent ecosystem has mostly reinvented the orchestration on top of it (the loop, the tool-routing, the planning) while leaving the durability underneath entirely unbuilt.

The most common response to agent unreliability is to wait for, or train, a better model. This is a category error, and it is worth being explicit about why, because it is the misconception that keeps the actual problem from getting attention.

A better model produces better decisions. It chooses more appropriate tools, makes fewer reasoning mistakes, follows instructions more faithfully, hallucinates less. All of that is real and valuable. None of it touches reliability, because the reliability failures occur in the gap between a correct decision and its durable, exactly-counted effect on the world.

Return to the double refund. The model's decision was correct both times: this refund is valid, call the payments API. A perfect model, an oracle that always decides correctly, produces the same double refund, because the duplication does not come from a bad decision. It comes from a crash between the side effect and the record of the side effect, followed by a retry. No quality of reasoning prevents a process from being killed mid-execution. No amount of intelligence tells a freshly-restarted process what the dead process had already done, because that information was never written down.

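The confusion stems from treating reliability as a property of decisions when it is a property of execution. The two separate cleanly: the model decides what should happen, and the runtime determines whether that decision takes effect once, twice, or not at all.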

A non-deterministic decision-maker arguably makes the runtime's job harder, not easier. A traditional workflow engine assumes the workflow code is deterministic on replay: same inputs, same path. An LLM is not deterministic; replay the same context and it may choose a different tool. This means agent runtimes cannot naively assume that re-running the logic reproduces the prior trajectory. They must treat the model's outputs themselves as events to be recorded and replayed, not as logic to be re-derived. The non-determinism of the decision layer makes durable, replayable execution more necessary, not less.

So the better-model narrative gets the direction of the problem exactly backwards. Smarter agents that take more consequential actions, more autonomously, over longer horizons, with less human oversight, do not reduce the need for reliable execution. They raise the stakes on every failure the current loop cannot prevent.

The missing layer has a name, borrowed directly from the systems that solved this before: durable execution. A durable execution runtime guarantees that a long-running, side-effecting process either runs to a correct completion or can be recovered to a correct, consistent state across crashes, restarts, retries, and concurrency, without duplicating effects or losing progress.

For agents specifically, the durable execution layer sits underneath the orchestration layer and treats the agent loop not as a function call but as a recoverable workflow. The conceptual shift is this:

  Without durable execution      |  With durable execution
  -------------------------------+---------------------------------------------
  loop runs in memory            |  loop runs against a durable log
  state = in-process variable    |  state = projection of the log
  crash = total loss             |  crash = resume from last event
  retry = re-execute             |  retry = replay, skip committed effects
  tool call = fire and hope      |  tool call = idempotent, logged, recoverable

Crucially, this is an infrastructure claim, not a prompting claim. It does not ask the model to be more careful. It changes the substrate the agent runs on so that the failure modes become structurally impossible or structurally recoverable, regardless of what the model decides. The agent author keeps writing what looks like a simple loop; the runtime underneath records every event, makes every tool call idempotent and replayable, persists progress continuously, and handles crash recovery transparently. It is the same trick workflow engines pulled for deterministic business logic, adapted for a non-deterministic decision-maker at the center.

This is a new category because the existing categories don't cover it. Agent frameworks own orchestration, prompting, and tool routing. Workflow engines own durable execution for deterministic code. Neither owns durable execution for agentic code: long-running, side-effecting, driven by a non-deterministic model, with tool calls as the unit of external effect. That intersection is the gap.

It is not enough to say "make it durable." A runtime that actually solves the failure modes above must have specific, nameable properties. These are not features to pick from; they are interlocking requirements, and removing any one reopens a class of failure.

The runtime must treat an append-only, ordered, durable event log as the single source of truth for an agent's execution. Every meaningful occurrence becomes an event: the task was received, the model was asked to decide, the model chose this tool with these arguments, the tool returned this result, the model concluded this, the task finished. The agent's working state is never the primary artifact; it is always a projection computed by folding the event log.

This is the precondition for everything else. You cannot recover what you did not record. You cannot replay what you did not log. You cannot detect a duplicate if you have no durable memory of the first attempt. Event sourcing is the foundation precisely because every other property is built on the existence of a complete, durable history.

A practical consequence: the model's own outputs must be events. Because the model is non-deterministic, you cannot reconstruct its decision by re-asking it; you must have recorded what it actually decided the first time. The decision is data, not logic.

Given the event log, the runtime must be able to reconstruct the exact state of an in-flight agent by replaying its events, feeding recorded results back in place of re-executing the operations that produced them. After a crash, recovery means: load the log, replay it to rebuild state up to the last recorded event, and continue from there. Steps that already completed are not performed again; their recorded results are returned instead.

Replayability is what makes "resume" possible instead of "restart." It is the difference between a crash costing you the remaining work and a crash costing you everything. And it is the property that, combined with idempotency, makes free retrying safe: a retry replays the committed prefix without re-executing it, and only the uncommitted suffix actually runs.

For agents, replay has a subtlety worth stating directly. You replay the recorded trajectory, not a freshly-generated one. You do not re-ask the model "what would you do here?" during recovery; you replay "here is what you did." The model is consulted only at the genuine frontier of execution, the point the prior run had not yet reached.
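Applied to the agent loop, that rule might look like the sketch below. It assumes the log is an ordered list of decision and tool_result events (a stand-in for a durable store), and run_agent, scripted_model, and tools are hypothetical names. Both the model's decisions and the tool results are read from the log when present and produced live only past its end.

```python
def run_agent(log, decide, execute, max_steps=10):
    """Agent loop that replays recorded events and only then does new work."""
    cursor = 0
    state = []

    def record_or_replay(kind, produce):
        nonlocal cursor
        if cursor < len(log):
            event = log[cursor]  # already happened: replay it
            if event["kind"] != kind:
                raise RuntimeError(f"log has {event['kind']!r}, expected {kind!r}")
        else:
            event = {"kind": kind, "value": produce()}  # the frontier: do it live
            log.append(event)
        cursor += 1
        return event["value"]

    for _ in range(max_steps):
        action = record_or_replay("decision", lambda: decide(state))
        if action == "finish":
            return state
        result = record_or_replay("tool_result", lambda: execute(action))
        state = state + [(action, result)]
    raise RuntimeError("step budget exhausted")


# Demo with hypothetical stand-ins for the model and the tools.
model_calls, tool_calls = [], []


def scripted_model(state):
    model_calls.append(len(state))
    return ["lookup_order", "refund", "finish"][len(state)]


def tools(action, crash_on=None):
    if action == crash_on:
        raise RuntimeError("process killed")
    tool_calls.append(action)
    return f"{action}: ok"


log = []
try:
    run_agent(log, scripted_model, lambda a: tools(a, crash_on="refund"))
except RuntimeError:
    pass  # the "process" died; only the log survives
events_before_recovery = len(log)

model_calls.clear()
tool_calls.clear()
final_state = run_agent(log, scripted_model, tools)
```

During recovery the model is asked exactly once, for the decision the first run never reached, and the only tool that actually executes is the one that had not completed.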

The runtime must guarantee that an agent interrupted at any point (including the worst possible point, between performing a side effect and recording that it happened) recovers to a consistent state. This is the property that directly defeats lost-state and unrecoverable-workflow failures.

Crash recovery has a hard requirement that is easy to get wrong: the boundary around each side effect must be designed so that recovery is unambiguous. The dangerous window is the gap between "the effect happened in the world" and "the effect is recorded in the log." If a crash lands in that window, recovery must not double-execute. This is where event sourcing and idempotency have to cooperate: the runtime records its intent to perform an effect (with an idempotency key) before performing it, performs it, then records completion. On recovery, an effect recorded as intended-but-not-completed is retried using the same idempotency key so the retry is recognized as a replay by the receiver and does not duplicate. The window does not disappear, but it stops being able to cause a duplicate.
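The intent, perform, complete sequence can be sketched as follows. The log is a plain list standing in for a durable store, Receiver is a hypothetical API that honors idempotency keys, and the crash is simulated by raising after the effect has landed but before completion is recorded.

```python
import uuid


def perform_effect(log, step_id, call):
    """Record intent (with a key), perform the effect, then record completion."""
    done = next((e for e in log if e["type"] == "completed" and e["step"] == step_id), None)
    if done is not None:
        return done["result"]  # already committed: never call again

    intent = next((e for e in log if e["type"] == "intended" and e["step"] == step_id), None)
    if intent is None:
        intent = {"type": "intended", "step": step_id, "key": str(uuid.uuid4())}
        log.append(intent)  # written before the effect is attempted

    result = call(intent["key"])  # on recovery this reuses the original key
    log.append({"type": "completed", "step": step_id, "result": result})
    return result


class Receiver:
    """Hypothetical external API that deduplicates by idempotency key."""

    def __init__(self):
        self.effects = 0
        self._seen = {}

    def refund(self, key):
        if key not in self._seen:
            self.effects += 1
            self._seen[key] = f"refund #{self.effects}"
        return self._seen[key]


receiver = Receiver()
log = []


def crash_after_effect(key):
    receiver.refund(key)  # the effect happens in the world...
    raise RuntimeError("killed before completion was recorded")


try:
    perform_effect(log, "refund:ord_1", crash_after_effect)
except RuntimeError:
    pass  # crash inside the dangerous window

recovered = perform_effect(log, "refund:ord_1", receiver.refund)
```

Recovery does call the receiver a second time, but with the key recorded in the intent event, so the receiver returns the first result and the effect count stays at one.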

Every side-effecting tool call must be idempotent, and the runtime must make it so by default rather than relying on each tool author to remember. The mechanism is the one Stripe made standard: the runtime generates a stable idempotency key for each logical tool invocation, derived from the agent's execution identity and the position in the event log so that a replay of the same logical step yields the same key. That key is passed to the underlying API. A retried or replayed call carries the original key; the receiver recognizes it and returns the prior result instead of acting again.
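One way to derive such a key, as a sketch: hash the execution identity together with the step's position in the log. Including the tool name in the hash is an added assumption here, not something the scheme requires, and the function name is illustrative.

```python
import hashlib


def idempotency_key(execution_id: str, log_position: int, tool_name: str) -> str:
    """Deterministic key: the same logical step always yields the same key."""
    material = f"{execution_id}:{log_position}:{tool_name}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()[:32]


original = idempotency_key("exec-7f3a", 4, "payments.refund")
replayed = idempotency_key("exec-7f3a", 4, "payments.refund")
next_step = idempotency_key("exec-7f3a", 5, "payments.refund")
```

Because the key is computed rather than randomly generated at call time, a replay on a different machine arrives at the same value without needing to have seen the first attempt.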

This is the property that directly defeats duplicate side effects. It is also the property most dependent on cooperation from the outside world, which is the right point to be honest about the limits.

A category-creation essay that overclaims is worse than useless, so it is worth stating plainly what durable execution can and cannot guarantee.

Exactly-once execution of an external side effect is impossible to guarantee from the agent side alone. This is not a limitation of any particular implementation; it is a consequence of the same impossibility results that underlie distributed systems generally. If the agent calls an external API and the connection drops before the response arrives, the agent cannot know whether the operation happened. The request may have been processed and the acknowledgment lost, or the request may never have arrived. From the caller's side these two cases are indistinguishable. No log, no replay, and no amount of cleverness on the agent side can disambiguate them.

What durable execution actually provides is at-least-once execution with idempotency, which composes into effectively-once behavior, but only when the receiving system participates. If the external API honors idempotency keys, then at-least-once-with-keys yields effectively-once: the duplicate call is absorbed by the receiver. If the external API does not support idempotency keys, the agent runtime cannot manufacture the guarantee. The best it can do is record its intent, retry safely where the operation is naturally idempotent, and surface the ambiguity for a compensating action or human review where it is not.
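That three-way split can be written down as an explicit policy for the case where a call's outcome is unknown. The enum values and the two capability flags below are hypothetical; how a runtime would learn a tool's capabilities is left open.

```python
from enum import Enum


class Recovery(Enum):
    RETRY_WITH_SAME_KEY = "retry_with_same_key"
    RETRY = "retry"
    ESCALATE = "escalate"


def recovery_for_unknown_outcome(honors_idempotency_keys: bool,
                                 naturally_idempotent: bool) -> Recovery:
    """Decide what to do when a call may or may not have taken effect."""
    if honors_idempotency_keys:
        # The receiver absorbs the duplicate: effectively-once.
        return Recovery.RETRY_WITH_SAME_KEY
    if naturally_idempotent:
        # e.g. "set status to closed": repeating it changes nothing.
        return Recovery.RETRY
    # The ambiguity cannot be resolved from the caller's side.
    return Recovery.ESCALATE


keyed_payment = recovery_for_unknown_outcome(True, False)
status_update = recovery_for_unknown_outcome(False, True)
legacy_email = recovery_for_unknown_outcome(False, False)
```

The last branch is the honest one: for a tool that is neither keyed nor naturally idempotent, the runtime's job is to make the ambiguity visible, not to guess.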

This is the same bargain every reliable distributed system makes. Exactly-once is a property of a system, achieved through the cooperation of sender and receiver, not a property the sender can assert unilaterally. The honest framing is: durable execution moves agents from "duplicates happen silently and unpredictably" to "duplicates are prevented wherever the receiver cooperates, and detectable everywhere else." That is an enormous improvement. It is not magic, and claiming otherwise would repeat exactly the kind of overclaiming the field needs less of.

A second honest limit: the model's non-determinism means that recovery preserves the trajectory that happened, not the best possible trajectory. If the original run made a poor decision before crashing, replay faithfully reproduces that poor decision; durability is about consistency and recoverability, not about decision quality. The two layers are genuinely separate, which is the whole point.

There is a recurring pattern in how new kinds of software grow up. First the capability appears and is demonstrated in a controlled setting, where it is astonishing. Then people try to run it in production, where it fails in ways that have nothing to do with the capability itself and everything to do with the missing infrastructure around it. Then the infrastructure gets built, usually by borrowing hard-won ideas from the previous generation of systems, and the capability becomes something you can actually depend on.

Web applications went through this: the leap from a CGI script to a fault-tolerant, horizontally-scaled service was almost entirely about infrastructure, not about HTML. Data pipelines went through it. Payments went through it; the difference between a script that calls a card network and Stripe is overwhelmingly reliability infrastructure. Each time, the durable, boring layer underneath is what turned a capability into a system.

AI agents are at the start of that arc. The capability, a model that can plan and act through tools, is real and improving fast. But the demos that showcase the capability also hide the gap, because they remove every condition under which durability matters. The moment agents take consequential, irreversible actions in production, at scale, over long horizons, with retries and crashes and concurrency, the gap stops being hidden and starts costing money, trust, and correctness.

The field is pouring its attention into the decision layer — better models, better prompting, better orchestration, better planning. That work matters. But it is solving the part of the problem that is already going well while ignoring the part that is structurally broken. You cannot prompt your way out of a process getting killed between a side effect and its record. You cannot fine-tune away a race condition between two retries. Those are execution problems, and execution problems are solved with execution infrastructure: event logs, replay, crash recovery, idempotency, compensation. The same primitives that turned every previous capability into a dependable system.

The agents that matter, the ones trusted to move money, change records, provision systems, and act without a human watching every step, will not be the smartest ones. They will be the ones running on infrastructure that guarantees a chosen action happens the right number of times and that progress survives failure. Intelligence is what lets an agent decide what to do. Infrastructure is what lets it be trusted to actually do it. The industry has spent its first era building the former. The reliability problem nobody is talking about is that almost nobody is yet building the latter.

source & further reading

dev.to (original article)
