I spent a month trying to predict multi-agent AI failures. It failed — here's what the failure taught me.

A developer spent a month attempting to build a system that could predict multi-agent AI failures before they occur, pre-registering all signals and metrics to avoid confirmation bias. The experiment failed cleanly, with the primary signal achieving an AUC of 0.46 against a pre-registered bar of 0.80, and further analysis revealed that the signal was actually measuring trace length rather than failure patterns. After correcting for length invariance, the result flipped to an AUC of 0.42 with a slight inverse direction, showing that "information slowing down" occurs during both failure and healthy convergence.

I had a hypothesis I was pretty excited about: that you could detect a multi-agent system going off the rails before it actually fails — early enough to stop it. If true, that's a product. If false, I wanted to know in a month, not a year. So I ran it as an actual experiment. Here's what happened, including the part where the result didn't just come back negative — it came back backwards. The setup I cared most about: not fooling myself The easy way to "validate" an idea like this is to build a signal, run it, and stop tuning the moment the numbers look good. That's also how you ship something built on a lie. So before touching results, I pre-registered everything: the signals, the dataset split, the success bar AUC ≥ 0.80 , how the threshold would be set. I separated the signal computation from the labels at the file level and wrote a leakage test that fails the build if the signal code so much as imports the labels. Every change after that was logged as a numbered amendment with a timestamp and a "decided before seeing results" flag. It felt like overkill. It wasn't — it's the only reason I can trust what came next. The signals Two of them, both computed from only the past prefix of a run never the future, never the label : Loop Pressure — structural cycle detection in the inter-agent message graph, weighted so that low-information loops count more. The intuition: teams that fail often start going in circles. Information Gain Decay — the rate at which each new step stops adding novel information embedding novelty, smoothed, then its downward slope . The intuition: right before a stall, new information dries up. The result AUC ≈ 0.46. The pre-registered bar was 0.80. Clean fail. Per-framework, it was ~0.5 across the board — not a dataset-mixing artifact. I could have stopped at "no signal, random performance." But the why turned out to be the interesting part. The twist: the signal was measuring length, and fixing that reversed it The per-trace score I was using max over the run correlated with trace length at Spearman 0.86. In other words, my "failure signal" was mostly measuring how long the run was. Longer run → more chances to hit a high value → higher score. The actual difference between failed and successful runs was not significant. The honest move here is not to call the hypothesis refuted — a broken instrument doesn't test anything. So I did one pre-registered re-aggregation with a length-invariant metric per-trace mean instead of max . Length correlation dropped from 0.86 to −0.27. The instrument was now clean. The clean result was worse — AUC 0.42 — and the direction had flipped. Successful runs showed more information-gain decay than failed ones small effect, but significant . That flip is the most useful thing I learned. "Information slowing down" isn't a failure signal, because it also happens during healthy convergence — a task wrapping up successfully also stops generating novel information. My signal couldn't tell "finishing well" from "stuck in a loop," because at the level I was measuring, they look the same. What this does and doesn't mean It does not mean "you can never predict multi-agent failure." The labels I used were LLM-judged with ~33% disagreement against the human-labeled subset, and they were task-level rather than execution-level. Loop Pressure I couldn't even test properly — only one of seven frameworks actually logged inter-agent edges which is its own uncomfortable finding about the state of multi-agent tooling . What it does mean: this specific signal, measured cleanly on this data, has no predictive power and a slight inverse direction. That's a real negative for this approach — not an eternal verdict on the idea. The part that surprised me: the instinct wasn't wrong, the implementation was Digging into the literature afterward, there's recent work IBM Research, "Unsupervised Cycle Detection in Agentic Applications" that tackles almost exactly the "productive cycle vs bad cycle" distinction. Their finding: structural detection alone scored F1 0.08, semantic-similarity alone 0.28 — but a cascade structure finds candidate loops, then semantic similarity confirms whether the repeated content is actually redundant hit F1 0.72. That's their result, not mine — but it's instructive. Which is to say: structure + semantics was the right instinct. My mistakes were 1 combining them as a weighted sum instead of a cascade, and 2 measuring a global novelty trend instead of local redundancy. Global decay can't separate convergence from stalling. "Did this step regenerate information we already had?" — a local, length-invariant question — can. What I actually believe now The recurring pain in all those traces was never really "predict the failure." It was simpler and more expensive: agents burning tokens on redundant work — re-running, replaying handoffs, looping on things they already solved. That's money, and most teams only notice on the invoice. So that's where I'm pointed now: less "predict every failure," more "catch and cut the wasteful, redundant cycles" — the one thing a length-invariant local-redundancy signal can actually do. If you run multi-agent systems in production: what's the failure that actually costs you — the model being wrong, or the system quietly wasting work? I'd genuinely like to know if this matches what you see. I'm building something narrow in this direction, so I'm biased — but mostly I want to know if I'm chasing a real problem or a ghost. Happy to share the full experiment, amendments and all.