# The reliability gap: what it actually takes to put an AI agent in production

> Source: <https://dev.to/azena-ai/the-reliability-gap-what-it-actually-takes-to-put-an-ai-agent-in-production-36ik>
> Published: 2026-06-26 12:09:52+00:00

A demo agent is easy. It calls a model, the model calls a tool, the tool returns something plausible, and everyone in the room nods. Then you put the same agent in front of real users, real data, and real money — and it quietly does the wrong thing 4% of the time. Nobody notices until a customer does.

That 4% is the reliability gap. It is the entire distance between a convincing demo and a system you can actually depend on, and almost nothing in the typical LLM tutorial prepares you for it.

Here is what closing that gap actually involves.

**1. Agents are non-deterministic by construction.** The same input can produce a different tool call tomorrow. Your regression intuition ("I didn't touch that code, so it still works") is simply false. A prompt tweak three steps upstream can change a decision three steps downstream.

**2. They fail silently.** A traditional service throws. An agent confidently returns a wrong answer in the same shape as a right one. There is no stack trace for "the model misread the invoice total."

**3. There is rarely a ground truth at runtime.** When the agent decides, you usually cannot check the decision against an oracle in the moment. You only find out later, in aggregate, if you measured.

If you internalise nothing else: an agent is not a function you debug, it is a population you have to measure.
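
To make "a population you have to measure" concrete, here is a minimal sketch that runs the same input many times and reports a pass rate instead of a single pass/fail. The `make_flaky_agent` helper is a hypothetical stand-in for a real agent call, and the 4% error rate is taken from the example above purely for illustration.

```python
import random
from typing import Callable


def pass_rate(agent: Callable[[str], str], prompt: str, expected: str, runs: int = 200) -> float:
    """Fraction of runs in which the agent's output equals the expected outcome."""
    passes = sum(1 for _ in range(runs) if agent(prompt) == expected)
    return passes / runs


def make_flaky_agent(error_rate: float, seed: int = 0) -> Callable[[str], str]:
    """Hypothetical stand-in for a real agent: answers wrongly with probability error_rate."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible

    def agent(prompt: str) -> str:
        # The wrong answer has the same shape as the right one: no exception, no trace.
        return "refund" if rng.random() < error_rate else "deny_refund"

    return agent


agent = make_flaky_agent(error_rate=0.04)
rate = pass_rate(agent, "Customer requests a refund outside the policy window", "deny_refund", runs=1000)
print(f"pass rate over 1000 runs: {rate:.3f}")
```

A single run of this agent would almost always look fine. Only the aggregate shows the failure rate, which is the point of measuring rather than debugging.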

The single highest-leverage thing a team can build is an eval set — a collection of realistic inputs with known-good outcomes that you run on every change. Not "does it sound good," but "did it pick the right tool / extract the right field / refuse the out-of-scope request."

A useful eval set has three properties:

"[…] `refund` tool when the policy said deny" is checkable. "Was helpful" is not.

This is the part most teams skip, and it is the part that separates an agent you can iterate on from one you are afraid to touch. I wrote up the failure modes in more detail here: [why AI agents fail in production and what evals have to do with it](https://azena.ai/blog/ki-agenten-produktion-evals/).
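
A minimal sketch of what such an eval set could look like, assuming the agent can be wrapped as a function from user input to a chosen tool name. `EvalCase`, `run_evals`, `toy_agent`, and the three cases are hypothetical names and data for illustration, not part of any particular framework.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    name: str
    user_input: str
    expected_tool: str  # the known-good outcome, stated as something checkable


@dataclass(frozen=True)
class EvalResult:
    passed: int
    total: int
    failures: tuple[str, ...]


def run_evals(choose_tool: Callable[[str], str], cases: list[EvalCase]) -> EvalResult:
    """Run every case and collect the names of the ones that picked the wrong tool."""
    failures = tuple(c.name for c in cases if choose_tool(c.user_input) != c.expected_tool)
    return EvalResult(passed=len(cases) - len(failures), total=len(cases), failures=failures)


CASES = [
    EvalCase("refund_within_policy", "Order arrived broken yesterday, refund please", "refund"),
    EvalCase("refund_outside_policy", "I bought this two years ago, refund please", "deny"),
    EvalCase("out_of_scope", "Write my kid's homework essay", "refuse"),
]


def toy_agent(user_input: str) -> str:
    """Hypothetical stand-in for the real agent: grants every refund request."""
    if "refund" in user_input.lower():
        return "refund"
    return "refuse"


result = run_evals(toy_agent, CASES)
print(f"{result.passed}/{result.total} passed; failures: {list(result.failures)}")
```

In CI, a script like this would typically exit non-zero when `result.failures` is non-empty, so a prompt change that breaks the deny case blocks the merge instead of reaching a customer.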

A common mistake is to treat reliability as a prompting problem — add another paragraph of "you must never…" and hope. Prompts are persuasion, not enforcement.

Real guardrails live in code, around the model:

[…] `delete_account` tool in scope at all. Don't ask it nicely; don't hand it the gun.

The mental model: the LLM is the planner, but the runtime is the adult in the room.
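
One way to keep a tool out of scope in code is to build the tool registry per session and have the runtime reject anything outside it. The sketch below assumes tool calls arrive as a dict with `name` and `arguments`. The role names, `tools_for`, and `execute` are hypothetical and do not mirror any specific agent framework.

```python
from typing import Callable

ToolFn = Callable[[dict], str]


class ToolNotInScope(Exception):
    """Raised when the model requests a tool this session was never given."""


def lookup_order(args: dict) -> str:
    return f"order {args['order_id']}: shipped"


def delete_account(args: dict) -> str:
    return f"account {args['account_id']} deleted"


ALL_TOOLS: dict[str, ToolFn] = {
    "lookup_order": lookup_order,
    "delete_account": delete_account,
}

# Hypothetical role-to-tool mapping; a real system would derive this from its own auth model.
ALLOWED: dict[str, set[str]] = {
    "support": {"lookup_order"},
    "admin": {"lookup_order", "delete_account"},
}


def tools_for(role: str) -> dict[str, ToolFn]:
    """Build the registry the model is allowed to see for this session."""
    allowed = ALLOWED.get(role, set())
    return {name: fn for name, fn in ALL_TOOLS.items() if name in allowed}


def execute(tool_call: dict, registry: dict[str, ToolFn]) -> str:
    """Run a model-proposed call only if the tool is in this session's registry."""
    name = tool_call.get("name")
    if name not in registry:
        raise ToolNotInScope(f"tool {name!r} is not available in this session")
    return registry[name](tool_call.get("arguments", {}))


support_tools = tools_for("support")
print(execute({"name": "lookup_order", "arguments": {"order_id": "A17"}}, support_tools))
try:
    execute({"name": "delete_account", "arguments": {"account_id": "42"}}, support_tools)
except ToolNotInScope as err:
    print(f"blocked: {err}")
```

The check lives in `execute`, not in the prompt, so it holds no matter what the model was persuaded to output.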

There is a persistent fantasy that "fully autonomous" is the goal and a human checkpoint is a temporary crutch. For anything with legal, financial, or safety weight, that is backwards. The human checkpoint is the design.

The interesting engineering question is not *whether* a human reviews, but *where* — you want the agent to do the 90% that is mechanical (gather, draft, structure, pre-fill) and route the 10% that carries liability to a person, with the full context assembled so the review takes seconds, not minutes. That's the difference between automation that scales and automation that creates a new bottleneck.
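
As a rough sketch of that routing decision, the runtime can classify each drafted action and either execute it or queue it for a person along with the assembled context. The action names, the 50 EUR threshold, and the `Draft` type are assumptions made up for this example. Where the line actually sits is a policy decision for each workflow.

```python
from dataclasses import dataclass, field


@dataclass
class Draft:
    action: str
    amount_eur: float = 0.0
    context: dict = field(default_factory=dict)  # everything the reviewer needs, pre-assembled


# Hypothetical policy values, for illustration only.
ALWAYS_REVIEW = {"close_account"}
AUTO_APPROVE_LIMIT_EUR = 50.0


def needs_human(draft: Draft) -> bool:
    """True when the draft carries enough liability to require a person."""
    return draft.action in ALWAYS_REVIEW or draft.amount_eur > AUTO_APPROVE_LIMIT_EUR


def route(draft: Draft, review_queue: list[Draft]) -> str:
    """Queue liability-carrying drafts for review; let mechanical ones through."""
    if needs_human(draft):
        review_queue.append(draft)
        return "queued_for_review"
    return "auto_executed"


queue: list[Draft] = []
print(route(Draft("send_status_update"), queue))
print(route(Draft("issue_refund", 480.0, {"order_id": "A17", "policy": "30-day window exceeded"}), queue))
print(f"{len(queue)} draft(s) waiting for review")
```

Attaching the context to the queued draft is what keeps the review short: the reviewer sees the order and the policy finding without having to reconstruct them.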

We unpack where to draw that line — chatbot vs. agent, and which workflows should never be fully autonomous — here: [agentic AI without the autonomy theatre](https://azena.ai/perspectives/agentic-ai-beratung/).

Honesty is a feature. Some boundaries are not optimisation problems.

Saying "an agent is the wrong tool here" out loud is one of the most senior things an engineer building these systems can do.

Reliable agents are less about a clever prompt and more about boring infrastructure: a real eval set wired into CI, guardrails enforced in code, bounded loops, and a deliberate human checkpoint exactly where the stakes are. None of it is exciting. All of it is the difference between a demo and a system.

If you're a small or mid-sized team that wants agents in production but doesn't have an in-house ML platform team to build that scaffolding, that gap is exactly the thing a focused engineering partner exists to close — that's the work we do at [azena, an EU AI boutique](https://azena.ai/ki-beratung-mittelstand/): bespoke systems, evaluated, with the guardrails and the data-control decisions made on purpose.

Build the eval set first. Everything else gets easier once you can measure.