The Reliability Problem That Forced Us to Rethink AI Agents

A developer building AI agents for client projects discovered that reliability issues emerged in production, not in demos, due to edge cases like malformed arguments and retry loops. The team redefined reliability as separate engineering concerns—determinism, graceful failure, safe resumption, and termination—and improved it by replacing flexible tool definitions with smaller, narrowly scoped tools with strict schemas. Schema validation of structured responses before triggering actions prevented numerous downstream failures.

A few months into building AI agents for client projects, we hit a pattern that should sound familiar to anyone shipping this technology beyond the demo stage: the agent worked beautifully in front of stakeholders, then quietly fell apart the moment real users got their hands on it. Not catastrophically. That would've been easier to catch.

A tool call would be made with a slightly malformed argument and get stuck in a retry loop. A multi-step task would drift away from its original objective halfway through execution. An agent would confidently report success while accomplishing nothing useful at all. Nothing crashed. Nobody got paged. The damage was a slow leak of trust.

That's the moment we stopped treating reliability as a property the model would eventually have enough of and started treating it as something we had to engineer for directly.

A demo is a curated path through a system. You ask the question you know it handles well, in the phrasing you know it understands, and you stop before it has the chance to wander. Production doesn't give you that courtesy. Users paraphrase. They contradict themselves halfway through a conversation. They paste malformed data. They ask for things that are three steps removed from anything in your evaluation set.

The uncomfortable realization for us was that an agent's reliability in the real world has very little to do with how impressive it looked across fifteen carefully selected examples. It has everything to do with how it behaves on the long tail: the situations nobody anticipated.

One workflow in particular forced us to rethink our assumptions. We had an agent responsible for collecting information from multiple sources and updating records in an external system. Most of the time it worked perfectly. Then we started noticing duplicate records appearing sporadically. After digging through logs, we found the culprit.
The external system had completed the update, but the request timed out before the response reached the agent. The agent interpreted the timeout as a failure and retried the action. Since the update had already succeeded, the retry created a duplicate.

The model didn't hallucinate. The reasoning wasn't wrong. The failure came from how the surrounding system handled uncertainty. That realization changed how we approached reliability.

For a long time, we treated reliability as a single fuzzy goal. The problem with that approach is that you can't improve what you can't define. Breaking reliability into separate concerns made it much easier to reason about:

- Determinism: Does the same input produce roughly the same behavior each time, or does the agent behave differently on every run?
- Graceful failure: When something goes wrong, does the system fail loudly and clearly, or does it generate a confident-sounding but incorrect answer?
- Safe resumption: If a workflow fails halfway through execution, can it resume safely, or does it need to start from scratch?
- Termination: Does the agent know when to stop, or can it continue calling tools indefinitely because it never reaches a satisfying conclusion?

Once we started treating these as separate engineering problems, reliability became much easier to improve.

Our early tool definitions tried to be flexible. A single tool might accept numerous optional parameters and support several different workflows. In theory, that made development easier. In practice, it increased ambiguity. The model had too many ways to call the same tool, and we had too many code paths to validate. We replaced these with smaller, narrowly scoped tools that performed one job well and enforced strict schemas. The reduction in malformed tool calls was immediate.

We also started treating model outputs as untrusted input, because that's exactly what they are. Every structured response now passes through schema validation before it can trigger a real action. Validation failures are treated as expected branches in the workflow rather than exceptional situations.
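
As a rough illustration of validating a structured response before it can act, here is a minimal Python sketch using only the standard library. Everything named in it is hypothetical: the `update_record` tool, its two fields, the allowed status values, and the helper names are assumptions made for the example, not the tools from our projects. A real system would more likely use a schema library, but the shape is the same: a narrow schema, unknown fields rejected, and a failed validation returned as an ordinary result instead of raised as an exception.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical schema for one narrowly scoped tool: field name -> required type.
UPDATE_RECORD_SCHEMA = {"record_id": str, "status": str}
ALLOWED_STATUSES = {"open", "closed"}  # illustrative values only


@dataclass
class Validation:
    ok: bool
    args: Optional[dict] = None
    errors: tuple = ()


def validate_update_record(raw: str) -> Validation:
    """Check a model-produced tool call before it can trigger a real action."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return Validation(ok=False, errors=(f"not valid JSON: {exc.msg}",))
    if not isinstance(data, dict):
        return Validation(ok=False, errors=("expected a JSON object",))

    errors = []
    for name, expected in UPDATE_RECORD_SCHEMA.items():
        if name not in data:
            errors.append(f"missing field: {name}")
        elif not isinstance(data[name], expected):
            errors.append(f"{name} must be {expected.__name__}")
    # Strict schema: unknown fields are rejected rather than silently ignored.
    for name in sorted(set(data) - set(UPDATE_RECORD_SCHEMA)):
        errors.append(f"unexpected field: {name}")
    if isinstance(data.get("status"), str) and data["status"] not in ALLOWED_STATUSES:
        errors.append(f"status must be one of {sorted(ALLOWED_STATUSES)}")

    if errors:
        return Validation(ok=False, errors=tuple(errors))
    return Validation(ok=True, args=data)


def handle_tool_call(raw: str) -> str:
    result = validate_update_record(raw)
    if not result.ok:
        # Expected branch: report the errors back to the model; no side effect runs.
        return "rejected: " + "; ".join(result.errors)
    # Placeholder for the real action, which only runs on validated arguments.
    return f"would update {result.args['record_id']} to {result.args['status']}"
```

The design choice worth noticing is that `handle_tool_call` never raises on bad model output. The rejection message is a normal return value, so the workflow can hand it back to the model as feedback and move on.
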
Validating outputs this way prevented numerous downstream failures.

Retries are useful until they aren't. Some of our strangest bugs came from retrying actions that had partially succeeded. We introduced idempotent operations wherever possible and capped retries with circuit breakers instead of allowing endless loops. When failures happen now, they fail cleanly and visibly.

For longer workflows, we persist state after each completed step. If a seven-step process fails at step four, the agent resumes from step four instead of repeating the first three actions. This reduced duplicate side effects and made recovery significantly more predictable.

Sending emails. Charging cards. Deleting records. Publishing content. These actions now pass through explicit approval gates rather than relying solely on the model's confidence. Confidence and correctness are not the same signal. Treating them as if they are creates unnecessary risk.

Most teams run evaluations before deployment. We started treating them as a permanent regression suite. Every time an agent failed in production, we captured the example and added it to our test set. That meant every future change had to prove it wasn't reintroducing an old failure. Some of our most promising "improvements" turned out to solve one problem while creating three new ones. Without regression testing, we never would've noticed.

Observability was the least glamorous improvement and probably the most valuable. We began tracing every reasoning step, tool call, validation check, and decision point. Debugging stopped feeling like archaeology. The majority of our mysterious failures became obvious once we could see the sequence of events that led to them.

That's the part worth emphasizing. None of these changes improved the model's reasoning ability. What they did was reduce the number of ways a reasoning mistake could become a production problem. They made failures visible. They made failures recoverable.
They reduced the blast radius when things inevitably went wrong.

That distinction changed how we scope projects today. We no longer start by asking: Can the model perform this task? Increasingly, the answer is yes. Instead, we ask: When this fails, and it will, what does failure look like, who sees it, and how do we recover? That question turns out to be far more important.

A few lessons we'd share with teams early in their journey: The happy path is rarely the hard part. The edge cases are where reliability is won or lost.

We're still finding new ways for agents to surprise us. That part probably never goes away. But the failures look different now. They're visible instead of silent. Bounded instead of endless. Recoverable instead of catastrophic. For production systems, that's most of the battle.

As models continue to improve, reliability will increasingly become an engineering challenge rather than a model-quality problem. The teams that recognize that shift early will build systems users can trust.

If you're working through similar challenges, we'd be especially interested in how you're approaching recoverability and state management in long-running agent workflows. It's one of the areas we're still actively refining.
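
To make the recoverability discussion a little more concrete, here is a minimal Python sketch of two ideas described earlier: retrying a write whose response was lost without creating a duplicate, and resuming a multi-step workflow from the step that failed. It is an illustration under stated assumptions, not a production implementation. `FakeRecordSystem`, `call_with_retries`, `run_workflow`, the idempotency key format, and the JSON checkpoint file are all hypothetical. A plain attempt cap stands in for a real circuit breaker, and the sketch assumes the external system can deduplicate on a caller-supplied key.

```python
import json
import os


class FakeRecordSystem:
    """Stand-in for the external system: applies each idempotency key at most once."""

    def __init__(self):
        self.records = []
        self.seen_keys = set()
        self.fail_next_response = False

    def create(self, key: str, payload: dict) -> None:
        if key not in self.seen_keys:  # a replayed key is a no-op
            self.seen_keys.add(key)
            self.records.append(payload)
        if self.fail_next_response:  # the write succeeded, but the reply is lost
            self.fail_next_response = False
            raise TimeoutError("response lost")


def call_with_retries(fn, max_attempts: int = 3):
    """Retry on timeout, but give up loudly once max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TimeoutError:
            if attempt == max_attempts:
                raise  # fail visibly instead of looping forever


def run_workflow(steps, checkpoint_path: str) -> list:
    """Run steps in order, persisting the count of completed steps after each one."""
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)["completed"]
    executed = []
    for index in range(done, len(steps)):
        steps[index]()  # if this raises, the checkpoint still points at this step
        executed.append(index)
        with open(checkpoint_path, "w") as f:
            json.dump({"completed": index + 1}, f)
    return executed


# Usage: the first attempt writes the record and then times out. The retry
# reuses the same key, so the external system ignores the second write.
system = FakeRecordSystem()
system.fail_next_response = True
call_with_retries(lambda: system.create("run-42:step-3", {"name": "Ada"}))
```

In this sketch the retry is only safe because the key is derived from the workflow run and step, not generated fresh on each attempt. The checkpoint is written only after a step completes, so a crash between the step and the write would repeat that one step, which is another reason the steps themselves should be idempotent.
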