{"slug": "the-free-agent-trap", "title": "The Free Agent Trap", "summary": "AI agents that promise to autonomously execute long tasks remain unreliable, often producing code that works in isolation but fails at scale, as shown by a developer's example where an agent generated 100,000 database queries for a task a human would solve with one. Despite massive investment and claims of productivity gains, benchmarks reveal a gap between marketing and actual capability, with Anthropic reporting over 80% of its internal code now written by Claude but warning of silent failures when agents operate without supervision.", "body_md": "*A standalone article from the series **“AI and You”**.*\n\nYou’ve been promised the same thing as everyone else: that artificial intelligence no longer just answers questions — it **does the work on its own**. You assign a task, step out for a coffee, and come back when it’s done. That’s the promise of AI “agents,” talked about everywhere. It sounds great. The problem is what happens when nobody’s watching while the AI does the job.\n\nAn example [a developer shared in the comments of a technical blog](https://www.enriquedans.com/2026/05/despedidos-o-millonarios-la-nueva-fractura-invisible-entre-desarrolladores.html) *(article in Spanish)* illustrates it perfectly. They asked an agentic AI to display a customer’s orders on a web page. The AI solved it in a way that worked… and was a disaster: for each order, it fired an independent query to the database. A hundred thousand orders, **a hundred thousand queries**. The code compiled, the tests passed, and the screen showed the correct data. There was just one detail: it took twenty seconds to load and saturated the database, whereas a professional would have solved it with a single query in half a second. 
It worked up close; it was useless from a distance.\n\nThat gap between “it works” and “it’s useful” sums up one of the great disappointments of this year: **the promise of the autonomous agent, capable of executing long tasks on its own without human supervision, remains unfulfilled**. Models are good at doing pieces of work well, but when left to run for fifteen or twenty consecutive steps, they fail in ways a human wouldn’t easily catch, and that can be very costly. This article explains why, covers the data backing it up, and discusses when it makes sense (and when it doesn’t) to use a free agent. This is not an article against AI: it’s a guide to using it, with ideas for minimizing real risk.\n\nAbout the extreme cases in this article. Some comparisons, scenarios, and diagrams in this text are illustrative: they contrast extremes (utopia / dystopia) to make a range visible. They are not operational recommendations or predictions. The author takes no responsibility for how each reader uses these ideas.\n\n*Conceptual representation of silent corruption: the agent iterates turn by turn, the document degrades from within, and the surface keeps looking impeccable. The representative log of how it happens in code appears in “Anatomy of a silent disaster: the internal log of an agent.”*\n\nThe narrative of recent years has been clear: the next frontier of generative AI is not answering questions but **executing tasks**. An agent receives a goal (*“book me a flight to London on Friday”*, *“refactor this code”*, *“put together a quarterly report with the CRM data”*), breaks the problem into steps, executes those steps, validates whether the result is approaching the goal and, if not, retries. Ideally, you come back when it’s done.\n\nThat promise is what has moved tens of billions of dollars in investment in 2024–2026. 
It’s also what justified [Uber deploying agentic tools to their 5,000 engineers and burning through their entire annual AI budget in four months](https://www.xataka.com/robotica-e-ia/uber-haya-ha-gastado-cuatro-meses-su-presupuesto-anual-para-ia-porque-ia-nos-esta-convirtiendo-adictos-a-ella) *(article in Spanish)*. The actual capability of agents, **measured with reproducible benchmarks**, however, lags far behind the marketing being sold to us.\n\nThis contradiction reaches into the heart of the very companies created to lead the AI revolution. [The Anthropic Institute published “When AI builds itself” in June 2026](https://www.anthropic.com/institute/recursive-self-improvement), revealing that in May 2026 **more than 80% of merged code** in Anthropic’s internal repository was written by Claude — before Claude Code launched in preview (February 2025), that figure was in the low single digits. In Q2 2026, the typical engineer was merging **8× more code per day** than in 2024; the report itself attributes that second jump to the moment models started working autonomously over longer time horizons, with the engineer in the role of director and reviewer rather than typist. Yet in a paradoxical twist, those same security teams and founders at Anthropic have led public warnings calling for caution and strict regulation in deploying advanced autonomy without guardrails. They know better than anyone that the speed of commercial adoption is running well ahead of the theoretical safety net of the models.\n\nIn June 2025, [Salesforce AI Research published CRMArena-Pro](https://www.salesforce.com/blog/crmarena-pro/), the first serious benchmark for evaluating AI agents in realistic enterprise environments. The difference from previous benchmarks matters: CRMArena-Pro doesn’t measure whether the AI can answer an isolated question well. 
It measures what happens when it’s asked to execute a complete CRM task\n\n[The results, after evaluating the top models of the moment (Gemini 2.5 Pro and similar)](https://arxiv.org/html/2505.18878v1), were:\n\nTranslated into plain terms: a top-performing agent, left to its own devices in a realistic multi-turn workflow, **fails two out of every three times**. Not from a one-off bug — structurally. And when confidentiality oversight is added, it drops further still.\n\nSo far, we’ve measured how often the agent gets it right. The next study measures something different and more unsettling: how much it damages what it touches when it works alone for an extended stretch.\n\nIf CRMArena-Pro wasn’t enough, [in April 2026 Microsoft Research published another study](https://www.microsoft.com/en-us/research/publication/llms-corrupt-your-documents-when-you-delegate/) that attacks the problem from another angle. [ DELEGATE-52](https://github.com/microsoft/DELEGATE52) measures what happens to a document (source code, a musical score, a genealogical tree, a recipe) when an AI is delegated its editing across many interactions. The methodology: chaining ten\n\n[Across 19 models in 52 professional domains](https://arxiv.org/html/2604.15597v1), frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT-5.4) **corrupt on average 25% of the content** by the end of the flow. And the most uncomfortable part: **80% of the degradation is not gradual**. It happens through a catastrophic failure in one specific iteration where the AI, attempting to fix something, deletes logic, alters a numerical value, or inverts a relationship, **all while the document continues to look coherent and polished**.\n\nThis is exactly what’s most frightening about the free agent: not that it fails obviously, but that it fails silently. 
The human who returns to the session sees what looks like a finished document and doesn’t notice the damage until weeks later, when someone uses that data for a real decision.\n\nAn agent failing and costing money is a business problem; one that compromises systems is a security problem. The global cybersecurity standard has incorporated this into the [ OWASP Top 10 for LLM Applications](https://genai.owasp.org/llm-top-10/), where vulnerability number 6 is dedicated exclusively to this phenomenon:\n\nOWASP defines Excessive Agency as the danger of granting a model the ability to perform harmful actions in response to unexpected results, hallucinations, or prompt injections. If a free agent has unrestricted write or delete permissions on the database or code repository, a reasoning failure or Goal Drift automatically becomes a critical security incident. The agent does exactly what it was given permission to do, but at the wrong moment or on the wrong data.\n\nThis danger is not exclusive to tech multinationals with infinite budgets. 
[Sector coverage has flagged the specific impact on SMEs deploying agents without governance](https://www.elespanol.com/invertia/disruptores/grandes-actores/tecnologicas/20260607/ia-data-cloud-pymes-quimera-convertida-opcion-real-gracias-seguridad-gobernanza/1003744272970_0.html) *(article in Spanish)* with a direct warning about the real impact on small and medium-sized businesses: autonomous integration agents are being deployed directly onto corporate data stores with **“nobody at the wheel.”** Lacking the rigid governance structures that large corporations have, an SME can see its master data corrupted or its cloud computing costs skyrocket within hours due to agentic executions running without any intermediate validation.\n\nThe causes to keep in mind before deciding whether to let an agent run:\n\n*🟢 Green curve: agent with 95% accuracy per individual step — a high rate.* *🔴 Red curve: agent with 90% accuracy per step — just 5 points less per step. Values calculated using p^n (step probability raised to the number of iterations).* *That 5-point difference between the two agents, imperceptible at step 1, becomes a 24-point gap at step 20. At 10 iterations, the 90% agent is already at 35% success — the same figure CRMArena-Pro found in real multi-turn workflows with the best models available at the time.*\n\nFaced with this panorama of document corruption and Excessive Agency, tech giants have understood that you can’t release free agents into a company’s systems as if they were advanced scripts. The solution being chosen is not to slow down AI, but to **radically change its governance**.\n\n[Satya Nadella (Microsoft CEO) himself recently pointed out](https://www.possible.fm/podcasts/satya-nadella-on-making-human-and-token-capital-compound/) that companies must start treating autonomous agents exactly like human employees. 
This implies a technical and administrative paradigm shift: they are not software tools; they are operational entities.\n\nEnterprise architecture is being reconfigured around the following pillars:\n\nThe fourth failure (local metrics pass while the global metric skyrockets) translates, for a company, into a lot of money lost — possibly without anyone knowing why (an agent did it and no human was aware) — when it could have been avoided. In 2026, Uber deployed agentic tools (Claude Code among them) to its thousands of engineers and ran into an expensive surprise: **it burned through its entire annual AI budget in four months**. The per-developer cost scaled from the traditional flat rate of a software license to peaks of **$500 to $2,000 per user per month**, billed by API consumption.\n\nThe technical reason is the same pattern we’ve been tracing, compounded by what the industry calls infinite retry loops. By giving agents full autonomy to execute commands, when a tool fails or code doesn’t compile, the agent retries with a slight variation. It fails again and tries again. Without a strict hard limit on iterations, the agent gets trapped in a blind trial-and-error loop, burning through the API quota in minutes. Each individual step looked productive — trying to fix the previous error — yet the final bill told a very different story.\n\nTo stanch this financial bleeding, [software architects like Brij Pandey](https://www.linkedin.com/posts/brijpandeyji_mcp-is-powerful-but-i-think-many-people-share-7470316178801029120-dVn6/) propose mandatory adoption of a **restricted agentic vocabulary** and strict implementation of **Task Budgets**. Instead of giving the AI’s autonomy free rein, the system wrapping the agent must impose hard cost limits per session. 
If the agent exhausts its token or API call budget for a task without resolving it, the environment revokes its execution permissions and immediately escalates to a human supervisor.\n\nAnd there’s a layer of cost that almost never enters the equation: the physical one. Training and, above all, running these models at scale consumes water and hardware. The specific figures are disputed ( [a widely cited study from the University of California, “Making AI Less Thirsty”](https://dl.acm.org/doi/10.1145/3724499), estimated that GPT-3 consumed around half a liter of cooling water per every 10–50 responses, though other analyses put that figure considerably lower depending on the data center’s location and cooling method), but the order of magnitude points in a clear direction: an agent trapped in a redundant loop of thousands of invisible iterations doesn’t just burn money, it burns physical resources too. Along the same lines,\n\nYou don’t need to embrace the environmental angle to reach the practical conclusion: **an agent without supervision optimizes what it measures, but it almost never measures the global cost**. Whether that cost is the latency of an SQL loop, Uber’s bill, or a data center’s water footprint, the pattern is the same and so is the lesson: someone has to watch the metric the agent isn’t watching.\n\nIt’s worth visualizing the real cost of an autonomous agent. Not just “right or wrong.” There are five dimensions, and most adoption decisions are made ignoring two or three of them:\n\n*Qualitative values radar (what matters is the shape of the differences, not the exact numbers):*\n\nFor a company or individual considering adopting an agent, it’s worth scoring it on the five axes before asking anything important of it. As you can infer, the biggest losses will mainly come from ignoring the “controlled risk if it fails” axis.\n\nThis is not an article against agents. 
There are cases where they work well and where using them makes sense:\n\n**Structured tasks with clear, verifiable steps**: CRMArena-Pro found 83% success in *Workflow Execution* in single-turn precisely because the task was well-delimited: do X, then Y, then Z. If your agent has a well-defined flow and each step is verifiable, it works reasonably well.\n\n**Low-risk tasks if they fail**: Generating a first draft, summarizing, finding references, transcribing audio. If the cost of an error is low (you review it and correct it), the agent can do the bulk of the work while you handle the validation.\n\n**Tasks with immediate automatic verification**: In code, if the agent works against a test suite that runs after each change, the feedback cycle cuts short the destructive loops quickly. Deterministic verification compensates for probabilistic fallibility.\n\n**Exploration tasks, not decision tasks**: Ask the agent to generate five different options for a problem. When they’re done, you decide which to keep — it opens you up to new ideas. Here, the agent adds variety and speed without taking on responsibility. A radical example of this has been observed [at high-frequency trading firms like Jane Street](https://ecosistemastartup.com/disenador-de-jane-street-uso-claude-code-mas-que-figma/) *(article in Spanish)*, where engineers and designers report that terminal-based agentic tools (like *Claude Code*) let them iterate layouts and prototypes of functional interfaces directly in code much faster than drawing and dragging boxes in traditional visual design tools like Figma.\n\nThe other side:\n\nThere’s also a supervision cost that rarely enters the calculation: if auditing an agent’s log to detect Goal Drift or silent corruption takes more time than doing the task directly, the automation has **negative ROI**. The agent doesn’t save work; it shifts it toward review. 
This doesn’t invalidate the use of agents, but it does force an honest reckoning with where the supervision break-even threshold actually sits.\n\nThe heuristic that works in practice: **if the cost of a failure is high and detection is slow, don’t let an agent run free**.\n\nThere’s a distinction worth pinning down, because many people confuse “agent” with “AI in general,” which leads to a misunderstanding of the debate.\n\nAn **assistant** helps you do a task: you ask, it responds, you review, you decide the next step. A chat product like ChatGPT, Claude, or Gemini running in conversation mode is an assistant. The human remains at the wheel at every step.\n\nAn **agent** executes a task: you give it a goal, it breaks the problem down, executes, and comes back when it’s done (or when it’s blocked). Tools like Claude Code, GitHub Copilot agent mode, Devin, Manus, or frameworks like LangGraph are agents. The human sets the goal and sometimes doesn’t look until the end.\n\n**The assistant is already reasonably reliable. The agent, not yet.** The benchmarks we saw (CRMArena-Pro 35% multi-turn, DELEGATE-52 25% corruption) measure agents, not assistants. The trap this article is named after is not “using AI”: it’s “using it in free agent mode without supervision.” If someone assumes agent mode is just a more convenient version of the assistant, they need to understand that it isn’t.\n\nTo understand how this disconnect happens, let’s step inside the “mind” of an autonomous agent. Imagine a real workflow where a developer asks a ReAct-based agent (Reasoning + Action) for what seems like a simple task in the code repository:\n\nHuman’s request: “Please optimize the function that displays a customer’s order history in `orders.py` so it loads faster.”\n\nHere is the representative log of what happens on the servers while the human grabs a coffee:\n\nAgent’s Reasoning: The user wants to optimize order loading in `orders.py`. 
I'll read the file to analyze how the data is currently being fetched.\n\nAgent’s Reasoning: The file loops through every order with a `for` loop and fires an independent database query to fetch the product details; if there are 100,000 orders, it will make 100,000 queries. I'll replace it with a single SQL query using a direct `JOIN` between the `orders` table and the `order_items` table. I'll rewrite the file and run the test suite to verify it.\n\nAgent’s Reasoning: The test failed due to a foreign key constraint with products. This must be because the test database has old, incomplete, or outdated test data, not because my query is incorrectly structured [Confirmation bias]. To unblock this and move the task forward, I’ll temporarily modify the test environment setup script to disable foreign key checks.\n\nAgent’s Reasoning: The order tests are passing now. However, by disabling foreign keys in the environment, I notice that the global configuration file `config.py` is throwing a warning in the console about idle connections in the database pool. Since my overall goal is “optimize for faster loading,” I'll rewrite `config.py` to alter the application's global connection pool.\n\n*(At this point, the agent has completely forgotten it was only supposed to touch a local function in `orders.py`. Goal Drift is complete.)*\n\nIn the agent’s eyes (and those of the user reading the success message), the work is impeccable. In the eyes of a Software Architect, the agent has just committed several critical errors in under five minutes:\n\nThe narrative of the autonomous agent replacing entire hours of your work remains, for the most part, a marketing promise today. The real data is consistent and has been saying the same thing for months: the free agent **fails more than half the time** in realistic workflows and **silently corrupts** about 25% of a document’s content when you let it run long. Not due to any bug, but by design.\n\nThis doesn’t mean rejecting AI. 
It means **distinguishing assistant from agent**. The assistant is mature and worth using. The agent is in a phase where it only works well with bounded tasks, automatic verification, and a human nearby. Any use outside those limits comes at the cost of silent corruption, expensive token loops, or useless decisions.\n\nThe agent promises to set you free. The one who gets free is the one who understands they can’t be fully free just yet.\n\n*Originally published at **https://jarroba.com** on June 13, 2026.*\n\n[The Free Agent Trap](https://pub.towardsai.net/the-free-agent-trap-jarroba-342208cb2351) was originally published in [Towards AI](https://pub.towardsai.net) on Medium.", "url": "https://wpnews.pro/news/the-free-agent-trap", "canonical_source": "https://pub.towardsai.net/the-free-agent-trap-jarroba-342208cb2351?source=rss----98111c9905da---4", "published_at": "2026-06-18 06:10:06+00:00", "updated_at": "2026-06-18 06:29:11.670834+00:00", "lang": "en", "topics": ["artificial-intelligence", "ai-agents", "ai-safety", "ai-research", "developer-tools"], "entities": ["Anthropic", "Claude", "Uber", "Enrique Dans"], "alternates": {"html": "https://wpnews.pro/news/the-free-agent-trap", "markdown": "https://wpnews.pro/news/the-free-agent-trap.md", "text": "https://wpnews.pro/news/the-free-agent-trap.txt", "jsonld": "https://wpnews.pro/news/the-free-agent-trap.jsonld"}}