{"slug": "i-spent-a-month-trying-to-predict-multi-agent-ai-failures-it-failed-here-s-what", "title": "I spent a month trying to predict multi-agent AI failures. It failed — here's what the failure taught me.", "summary": "A developer spent a month attempting to build a system that could predict multi-agent AI failures before they occur, pre-registering all signals and metrics to avoid confirmation bias. The experiment failed cleanly, with the primary signal achieving an AUC of 0.46 against a pre-registered bar of 0.80, and further analysis revealed that the signal was actually measuring trace length rather than failure patterns. After correcting for length invariance, the result flipped to an AUC of 0.42 with a slight inverse direction, showing that \"information slowing down\" occurs during both failure and healthy convergence.", "body_md": "I had a hypothesis I was pretty excited about: that you could detect a multi-agent system going off the rails before it actually fails — early enough to stop it. If true, that's a product. If false, I wanted to know in a month, not a year.\n\nSo I ran it as an actual experiment. Here's what happened, including the part where the result didn't just come back negative — it came back backwards.\n\nThe setup I cared most about: not fooling myself\n\nThe easy way to \"validate\" an idea like this is to build a signal, run it, and stop tuning the moment the numbers look good. That's also how you ship something built on a lie.\n\nSo before touching results, I pre-registered everything: the signals, the dataset split, the success bar (AUC ≥ 0.80), how the threshold would be set. I separated the signal computation from the labels at the file level and wrote a leakage test that fails the build if the signal code so much as imports the labels. Every change after that was logged as a numbered amendment with a timestamp and a \"decided before seeing results\" flag.\n\nIt felt like overkill. 
It wasn't — it's the only reason I can trust what came next.\n\nThe signals\n\nTwo of them, both computed from only the past prefix of a run (never the future, never the label):\n\nLoop Pressure — structural cycle detection in the inter-agent message graph, weighted so that low-information loops count more. The intuition: teams that fail often start going in circles.\n\nInformation Gain Decay — the rate at which each new step stops adding novel information (embedding novelty, smoothed, then its downward slope). The intuition: right before a stall, new information dries up.\n\nThe result\n\nAUC ≈ 0.46. The pre-registered bar was 0.80. Clean fail. Per-framework, it was ~0.5 across the board — not a dataset-mixing artifact.\n\nI could have stopped at \"no signal, random performance.\" But the why turned out to be the interesting part.\n\nThe twist: the signal was measuring length, and fixing that reversed it\n\nThe per-trace score I was using (max over the run) correlated with trace length at Spearman 0.86. In other words, my \"failure signal\" was mostly measuring how long the run was. Longer run → more chances to hit a high value → higher score. The actual difference between failed and successful runs was not significant.\n\nThe honest move here is not to call the hypothesis refuted — a broken instrument doesn't test anything. So I did one pre-registered re-aggregation with a length-invariant metric (per-trace mean instead of max). Length correlation dropped from 0.86 to −0.27. The instrument was now clean.\n\nThe clean result was worse — AUC 0.42 — and the direction had flipped. Successful runs showed more information-gain decay than failed ones (small effect, but significant).\n\nThat flip is the most useful thing I learned. \"Information slowing down\" isn't a failure signal, because it also happens during healthy convergence — a task wrapping up successfully also stops generating novel information. 
My signal couldn't tell \"finishing well\" from \"stuck in a loop,\" because at the level I was measuring, they look the same.\n\nWhat this does and doesn't mean\n\nIt does not mean \"you can never predict multi-agent failure.\" The labels I used were LLM-judged with ~33% disagreement against the human-labeled subset, and they were task-level rather than execution-level. Loop Pressure I couldn't even test properly — only one of seven frameworks actually logged inter-agent edges (which is its own uncomfortable finding about the state of multi-agent tooling).\n\nWhat it does mean: this specific signal, measured cleanly on this data, has no predictive power and a slight inverse direction. That's a real negative for this approach — not an eternal verdict on the idea.\n\nThe part that surprised me: the instinct wasn't wrong, the implementation was\n\nDigging into the literature afterward, I found recent work (IBM Research, \"Unsupervised Cycle Detection in Agentic Applications\") that tackles almost exactly the \"productive cycle vs bad cycle\" distinction. Their finding: structural detection alone scored F1 0.08, semantic-similarity alone 0.28 — but a cascade (structure finds candidate loops, then semantic similarity confirms whether the repeated content is actually redundant) hit F1 0.72. (That's their result, not mine — but it's instructive.)\n\nWhich is to say: structure + semantics was the right instinct. My mistakes were (1) combining them as a weighted sum instead of a cascade, and (2) measuring a global novelty trend instead of local redundancy. Global decay can't separate convergence from stalling. \"Did this step regenerate information we already had?\" — a local, length-invariant question — can.\n\nWhat I actually believe now\n\nThe recurring pain in all those traces was never really \"predict the failure.\" It was simpler and more expensive: agents burning tokens on redundant work — re-running, replaying handoffs, looping on things they already solved. 
That's money, and most teams only notice on the invoice.\n\nSo that's where I'm pointed now: less \"predict every failure,\" more \"catch and cut the wasteful, redundant cycles\" — the one thing a length-invariant local-redundancy signal can actually do.\n\nIf you run multi-agent systems in production: what's the failure that actually costs you — the model being wrong, or the system quietly wasting work? I'd genuinely like to know if this matches what you see. (I'm building something narrow in this direction, so I'm biased — but mostly I want to know if I'm chasing a real problem or a ghost. Happy to share the full experiment, amendments and all.)", "url": "https://wpnews.pro/news/i-spent-a-month-trying-to-predict-multi-agent-ai-failures-it-failed-here-s-what", "canonical_source": "https://dev.to/jeonsewon/i-spent-a-month-trying-to-predict-multi-agent-ai-failures-it-failed-heres-what-the-failure-2kl0", "published_at": "2026-06-05 02:46:01+00:00", "updated_at": "2026-06-05 03:11:12.311141+00:00", "lang": "en", "topics": ["ai-agents", "ai-research", "ai-safety", "machine-learning", "mlops"], "entities": [], "alternates": {"html": "https://wpnews.pro/news/i-spent-a-month-trying-to-predict-multi-agent-ai-failures-it-failed-here-s-what", "markdown": "https://wpnews.pro/news/i-spent-a-month-trying-to-predict-multi-agent-ai-failures-it-failed-here-s-what.md", "text": "https://wpnews.pro/news/i-spent-a-month-trying-to-predict-multi-agent-ai-failures-it-failed-here-s-what.txt", "jsonld": "https://wpnews.pro/news/i-spent-a-month-trying-to-predict-multi-agent-ai-failures-it-failed-here-s-what.jsonld"}}