What Actually Breaks Multi-Agent Systems: A Field Report From Real Production Failures

A developer investigating multi-agent system failures found that the most common production bugs are silent: tools that skip execution without errors, oscillation loops that never converge, and cascading patch failures where fixes reintroduce previously resolved issues. The developer's own detection tool initially missed a real failure because it only checked for explicit error keywords, not silent tool skips.

Here is a confirmed, filed bug in LangChain. An agent with a checkpointer attached has this conversation:

Turn 1: "When was Company A founded?"
  → tool fires, returns "2020"
  → agent: "Company A was founded in 2020." ✅ correct

Turn 2 (same conversation): "When was Company B founded?"
  → tool does NOT fire
  → agent: "I don't have information about Company B." ❌ wrong, and silently wrong

No exception. No error log. No timeout. The tool was simply never called.

The mechanism: the checkpointer stores Turn 1's tool result in the message history. When Turn 2 comes in, the LLM sees old tool output already sitting in context, assumes it has what it needs, and never issues a fresh tool call.

I'm building a tool that takes an agent's execution trace (paste it in directly, no SDK, no instrumentation) and returns a reliability score plus the root cause of any failure it finds. So naturally I rebuilt this exact trace and ran it through my own detector to see if it would catch it.

It scored 100 out of 100. Clean. No failures detected.

My own tool, built specifically to catch this class of problem, missed a real one on the first try. That's the moment this stopped being a side project and became an actual investigation into what these failures look like and why they hide so well.

Why My Own Tool Missed It

My detection logic was checking for explicit error keywords (words like "failed," "error," "not found") in tool outputs. In this trace, the tool simply returned nothing at all. There was no error word to match against, just an empty result and a status field marked "skipped" that nothing in my code was actually reading.

The fix took two changes: catch any tool step where the status is explicitly "skipped," and separately catch any tool step that returns empty content regardless of status. Neither existed before. After the fix, the same trace scored 65 out of 100 and correctly flagged "MISSING MANDATORY TOOL CALL" on the exact step where the tool went silent.
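
The two checks can be sketched roughly as follows. This is a minimal illustration, not the detector's actual code: it assumes a trace is a list of step dictionaries with hypothetical `type`, `status`, and `content` fields, since the real trace format isn't shown here.

```python
from typing import Any


def is_empty(content: Any) -> bool:
    """Treat None, whitespace-only strings, and empty containers as empty output."""
    if content is None:
        return True
    if isinstance(content, str):
        return not content.strip()
    if isinstance(content, (list, dict, tuple)):
        return len(content) == 0
    return False


def find_silent_tool_skips(trace: list[dict[str, Any]]) -> list[tuple[int, str]]:
    """Return (step_index, reason) for tool steps that failed quietly."""
    findings = []
    for index, step in enumerate(trace):
        if step.get("type") != "tool":  # hypothetical field name
            continue
        # Check 1: the status field explicitly says the tool was skipped.
        if str(step.get("status", "")).lower() == "skipped":
            findings.append((index, "status is 'skipped'"))
            continue
        # Check 2: the tool returned empty content, whatever the status says.
        if is_empty(step.get("content")):
            findings.append((index, "empty tool output"))
    return findings


# The two-turn conversation from above, in the assumed trace shape.
trace = [
    {"type": "user", "content": "When was Company A founded?"},
    {"type": "tool", "status": "success", "content": "2020"},
    {"type": "agent", "content": "Company A was founded in 2020."},
    {"type": "user", "content": "When was Company B founded?"},
    {"type": "tool", "status": "skipped", "content": ""},
    {"type": "agent", "content": "I don't have information about Company B."},
]

print(find_silent_tool_skips(trace))  # [(4, "status is 'skipped'")]
```

Keeping the two checks separate matters: a step can report "success" and still return nothing, and a keyword scan over tool output sees neither case.
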
That gap (only catching loud failures, missing quiet ones) turned out to be the whole story.

Going Looking for Real Failures

Over the next two days I did something simple: I searched GitHub issues, replied to developers describing agent problems on Twitter and LinkedIn, and asked direct questions instead of guessing.

The pattern that emerged was consistent and a little unsettling: almost none of these failures throw errors. They look like success.

Here are the patterns that came up repeatedly, all from real production systems, all anonymized.

The Silent Tool Skip

The LangChain bug above is the cleanest example. A tool that should run doesn't, and the agent proceeds as if it did, often producing a plausible-sounding wrong answer. The trace looks identical to a successful run unless you specifically check whether the tool was invoked.

The Oscillation Loop

A developer building an automated question generator described their validator/repair pipeline like this:

Round 1: validator flags issue A → repair "fixes" it
Round 2: validator flags issue A again → repair "fixes" it again
Round 3: validator flags issue A again → same fix applied again
...continues for 4+ rounds with no exit condition

No timeout, no error, just an indefinite back-and-forth that burns tokens without ever converging. They'd tried feeding the repair node a history of what was already attempted, hoping the model would stop repeating itself. It helped somewhat, but the loop still sometimes never resolved, and their eventual workaround was economic rather than technical: scrap the broken attempt and regenerate from scratch, because that turned out cheaper in tokens than trying to converge.

Fix Interference and Cascading Patch Failure

A more subtle variant: fixing problems B and C in one repair cycle reintroduces problem A, which was already fixed in a previous round.
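
Both the oscillation loop and the reintroduced-issue variant can be spotted by recording which issues the validator flags in each round. A minimal sketch, assuming each round is summarized as a set of issue labels; the labels, function names, and the `max_repeats` threshold are hypothetical and not taken from any pipeline described here.

```python
def find_oscillations(rounds: list[set[str]], max_repeats: int = 3) -> set[str]:
    """Issues flagged in `max_repeats` or more consecutive rounds."""
    streaks: dict[str, int] = {}
    stuck: set[str] = set()
    for flagged in rounds:
        # Extend the streak for issues flagged again; issues not flagged drop out.
        streaks = {issue: streaks.get(issue, 0) + 1 for issue in flagged}
        stuck.update(issue for issue, count in streaks.items() if count >= max_repeats)
    return stuck


def find_regressions(rounds: list[set[str]]) -> set[str]:
    """Issues that disappeared in one round and came back in a later one."""
    seen: set[str] = set()
    previous: set[str] = set()
    regressed: set[str] = set()
    for flagged in rounds:
        # Seen before, absent in the previous round, flagged now: reintroduced.
        regressed.update((flagged & seen) - previous)
        seen |= flagged
        previous = flagged
    return regressed


looping = [{"A"}, {"A"}, {"A"}, {"A"}]    # same issue flagged every round
interfering = [{"A"}, {"B", "C"}, {"A"}]  # fixing B and C brings A back

print(find_oscillations(looping))     # {'A'}
print(find_regressions(interfering))  # {'A'}
```

Either result can serve as an explicit exit condition, for example by abandoning the attempt and regenerating from scratch, which is the workaround the developer above settled on.
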
Each individual fix is locally reasonable, but the system has no model of how its own changes interact with each other, so it ends up chasing its tail across rounds.

The same developer's actual root cause turned out to be even narrower and stranger than expected: across four separate repair attempts, the model kept regenerating the exact same incorrect mathematical symbol (writing the Greek letter nu instead of the letter u in a recurrence formula), while confidently stating each time that the notation drift had been fixed. The repair node would change surrounding wording and formatting but never touch the one line that was actually broken. The fix was never in the model's hands to begin with; it needed a hard programmatic substitution after generation, not another prompt asking it to "be more careful."

Context Drift Across Hops

One developer running an autonomous agent for over 1,500 production sessions found that a state file and the live filesystem had quietly diverged: the state file said a queue had 13 items, the actual filesystem had 7. Six sessions of decisions were made on the wrong number before anyone noticed. In multi-step or multi-agent systems, this kind of drift compounds: one agent's stale output becomes the next agent's accepted input.

Tool Avoidance

The inverse of the silent skip: an agent answers a question that genuinely requires real-time or external data (a stock price, current weather, a live status check) without calling any tool at all. It simply generates a plausible-sounding number. There's no failure surfaced anywhere; the response just looks like a normal, confident answer.

Goal Abandonment

A multi-step task starts correctly, several tool calls succeed, and then the trace simply ends with the agent saying something like "I'll continue with the next step now," and nothing follows. No error, no timeout, just an unfinished task that reports nothing wrong.
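
Tool avoidance and goal abandonment both leave traces that report nothing wrong, so a check has to look at the shape of the trace instead of its status. A rough sketch under stated assumptions: steps are dictionaries with hypothetical `type` and `content` fields, and the two phrase lists are illustrative heuristics, not a complete detector.

```python
import re
from typing import Any

# Hypothetical phrases that announce more work to come.
UNFINISHED = re.compile(r"\b(i['’]ll|i will|let me|next step)\b", re.IGNORECASE)
# Hypothetical hints that a question needs real-time or external data.
LIVE_DATA = re.compile(r"\b(stock price|weather|right now|latest|today|live status)\b", re.IGNORECASE)


def ends_with_unkept_promise(trace: list[dict[str, Any]]) -> bool:
    """True if the final step is an agent message that announces further work."""
    if not trace:
        return False
    last = trace[-1]
    text = str(last.get("content", ""))
    return last.get("type") == "agent" and bool(UNFINISHED.search(text))


def answered_live_question_without_tool(trace: list[dict[str, Any]]) -> bool:
    """True if a question that looks like it needs live data got an answer with no tool step in between."""
    awaiting_tool = False
    for step in trace:
        kind = step.get("type")
        if kind == "user":
            awaiting_tool = bool(LIVE_DATA.search(str(step.get("content", ""))))
        elif kind == "tool":
            awaiting_tool = False  # a tool ran after the question
        elif kind == "agent" and awaiting_tool:
            return True
    return False


abandoned = [
    {"type": "user", "content": "Migrate the three reports."},
    {"type": "tool", "status": "success", "content": "report 1 migrated"},
    {"type": "agent", "content": "I'll continue with the next step now."},
]
avoided = [
    {"type": "user", "content": "What is the stock price of Company A?"},
    {"type": "agent", "content": "Company A is trading at $100."},
]

print(ends_with_unkept_promise(abandoned))           # True
print(answered_live_question_without_tool(avoided))  # True
```

Keyword heuristics like these will produce false positives and misses, so they are better treated as flags for a closer look than as verdicts.
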

Infinite Loops Disguised as Productivity

Closely related but distinct: the same developer running 1,500+ sessions found a 13-session stretch where their content agent was technically "working" the entire time but never made measurable progress, because the actual queue it should have been clearing stayed full. From the outside, every session looked active and successful.

The Pattern Behind the Patterns

Here is the specific, testable claim: in every single failure above, the trace's own status field said something positive ("success," "completed," "fixed") at the exact moment the system was actually broken. Not one of these failures set an error flag, threw an exception, or triggered an alert. They were all "status: success" failures.

That means any monitoring built around catching errors, exceptions, or non-200 status codes will see every one of these as a clean run. The only way to catch them is to check something more specific than "did it crash": did the tool actually get called, did two consecutive outputs come out suspiciously identical, does the final state match what the trace claims happened. That's a different category of check than most logging and monitoring setups are built to do by default, which is exactly why these failures tend to surface in production after hundreds of runs rather than in a demo or a handful of manual tests.

What I'm Doing With This

I've been building a tool that takes an agent execution trace (pasted directly, no instrumentation or SDK installation required) and runs it through a set of deterministic checks for patterns like the ones above, plus a semantic pass for the harder-to-catch cases. It's free to try if you want to paste in a trace of your own and see what it finds: https://6jovkucbyygcamzbeksa67.streamlit.app

But honestly, the more interesting outcome of the last two days wasn't the tool.
It was how consistent these failure shapes turned out to be across completely different systems: math question generators, ad operations agents, customer support bots, content pipelines. Different domains, same handful of root causes.

If you've hit something like this in your own agents, I'd genuinely like to hear about it.