{"slug": "measuring-agents-in-production", "title": "Measuring Agents in Production", "summary": "Based on the article \"Measuring Agents in Production,\" a survey of 306 practitioners and 20 case studies reveals that deployed AI agents are far simpler than hype suggests, with 80% using predefined workflows and 68% requiring human intervention within 10 steps. The study found that 70% of teams rely on prompting off-the-shelf models rather than fine-tuning, and 85% build custom infrastructure instead of using heavy frameworks. Ultimately, production agents remain basic, human-supervised tools, contrasting sharply with claims of fully autonomous systems.", "body_md": "When you are in the TPOT echo chamber, you would think fully autonomous AI agents are running the world. But this December 2025 paper, \"Measuring Agents in Production\", cuts through the hype to show the reality behind it. It surveys 306 practitioners and conducts 20 in-depth case studies across 26 domains to document what is actually running in live environments. The reality is far more basic, constrained, and human-dependent than TPOT suggests.\nSimplicity and Bounded Autonomy: 80% of case studies use predefined structured workflows rather than open-ended autonomous planning, and 68% execute fewer than 10 steps before requiring human intervention. Frankly, these systems sound to me less like an \"autonomous agent\" and more like a glorified state machine or multi-step RAG pipeline.\nPrompting Beats Fine-Tuning: Despite the academic obsession with reinforcement learning and fine-tuning, 70% of teams building production agents simply prompt off-the-shelf proprietary models. Custom-tuned models are often too brittle, and they break when foundation model providers update their models.\nTolerance for Latency: While in database systems and distributed systems we focus on shaving milliseconds and microseconds off response times, in the agent world 66% of deployed systems take minutes or even longer to respond. 
I am not comparing or criticizing, since the nature of the work is intrinsically different; I am just noting how vastly different the latency expectations are.\nCustom Infrastructure Over Heavy Frameworks: Though many developers experiment with frameworks like LangChain, 85% of the detailed production case studies ended up building their systems completely in-house using direct API calls. Teams actively migrate away from heavy frameworks to reduce dependency bloat and maintain the flexibility to integrate with their own proprietary enterprise infrastructure.\nBenchmarks are Abandoned: 75% of production teams skip formal benchmarking entirely. Because real-world tasks are incredibly messy and domain-specific, teams rely instead on A/B testing, production monitoring, and human-in-the-loop evaluation (which a massive 74% of systems use as their primary check for correctness).\nReliability (consistent correct behavior over time) remains the primary bottleneck and challenge. OK, this one was not really a surprising finding.\nSo the data says that the state of multi-agent systems in production is exaggerated. Everyone says they are doing it, but only a few actually are. And those who are doing it are keeping it basic.\nThis feels familiar.\nRemember 2018? IBM published a whitepaper stating that \"7 in 10 consumer industry executives expect to have a blockchain production network in 3 years\". They famously claimed blockchains would cure almost every business ailment, reducing 9 distinct frictions including \"inaccessible marketplaces\", \"restrictive regulations\", \"institutional inertia\", \"invisible threats\", and \"imperfect information\". Ha, \"invisible threats\", it cracks me up every time!\nFor blockchains, whenever I asked people about the killer application, they would mumble something like, \"It is trustless\", to which I would respond, \"That is not an application\", to which they would respond, \"But it is decentralized\". 
Today the AI agent narrative occasionally gives off a similar vibe. When pressed on the ultimate value of these systems, the default response is often a claim of \"productivity gains\". Unfortunately, there hasn't been much deep elaboration on what this actually entails at scale.\nBut comparing AI agents to blockchains is unfair. Agents actually have a couple of killer applications already, particularly in coding, data analysis, and customer care. They have successfully made it into deployment, albeit in a very basic and constrained manner. It is just that they aren't the fully autonomous hyper-intelligent multi-agent swarms that people claim they are. They remain basic, human-supervised, highly constrained tools.\nThis connects perfectly to the arguments I made in my Agentic AI and The Mythical Agent-Month post regarding the mathematical laws of scaling coordination. Throwing thousands of AI agents at a project does not magically bypass Brooks' Law. Agents can dramatically scale the volume of code generated, but they do not scale insight. In fact, due to their vast operational speed, agent coordination complexity will likely far exceed the $O(N^2)$ coordination complexity that Fred Brooks originally postulated for human teams. 
Until we solve the fundamental epistemic gap of distributed knowledge, adding more agents to a system will simply produce a faster, more expensive way to generate merge conflicts.", "url": "https://wpnews.pro/news/measuring-agents-in-production", "canonical_source": "https://muratbuffalo.blogspot.com/2026/03/measuring-agents-in-production.html", "published_at": "2026-03-17 11:10:00+00:00", "updated_at": "2026-05-22 12:46:54.275325+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "research", "enterprise-software"], "entities": ["TPOT"], "alternates": {"html": "https://wpnews.pro/news/measuring-agents-in-production", "markdown": "https://wpnews.pro/news/measuring-agents-in-production.md", "text": "https://wpnews.pro/news/measuring-agents-in-production.txt", "jsonld": "https://wpnews.pro/news/measuring-agents-in-production.jsonld"}}