{"slug": "your-ai-agent-doesnt-need-100-tools-it-needs-the-right-one", "title": "Your AI Agent Doesn’t Need 100 Tools. It Needs the Right One.", "summary": "Researchers R.S. Babu and L.G. Iyer introduced Causal Minimal Tool Filtering (CMTF) to solve ToolChoiceConfusion, where LLM agents fail when given too many tools. CMTF limits visible tools to only those necessary for the next step, reducing errors and costs. In tests with 100 tools and 102 tasks, the method improved reliability across multiple LLM backends.", "body_md": "If you’ve built an LLM agent that connects to more than a handful of tools, you’ve felt the pain even if you couldn’t name it. You wire up your email, calendar, file system, and a few internal APIs. Demos look great. Then you add the twentieth tool, then the fiftieth, and the agent starts doing strange things: it calls send_email before it has drafted anything, it deletes the wrong file, it picks a tool that sounds right but does the wrong job. Latency creeps up. Token bills balloon. Your “reliable” agent becomes a coin flip.\n\nA recent paper, **ToolChoiceConfusion: Causal Minimal Tool Filtering for Reliable LLM Agents** by *R.S. Babu and L.G. Iyer*, gives this failure mode a name and, more importantly, a clean fix. The core idea is almost embarrassingly simple, and that’s exactly why it’s useful for teams.\n\n**The mistake everyone makes: confusing relevant with necessary**\n\nMost tool-selection systems today treat the problem as semantic relevance. Given a user request, you retrieve the tools whose names and descriptions best match the query, whether through keyword overlap, embeddings, or a RAG pipeline over your tool registry. Show the agent the “most related” tools and let it figure out the rest.\n\nThe paper’s central argument is that relevance is not the same as necessity. Consider the task: “Find an email and draft a reply.” Which tools are relevant? search_email, read_email, create_draft, send_email, archive_message. All of them. 
They’re all about email. They’ll all score highly on any semantic match.\n\nBut at the first step, only search_email is actually useful. You can’t read a message you haven’t found. You can’t draft a reply to a message ID you don’t have. And exposing send_email this early is downright dangerous. It’s a high-risk action with no business being on the menu yet. The authors call the resulting degradation ToolChoiceConfusion: agents misbehave not because they’re bad at calling tools, but because we hand them a menu full of plausible-but-premature options. More choices, more ways to be wrong.\n\n**The fix: think like a planner, filter like a minimalist**\n\nThe solution, Causal Minimal Tool Filtering (CMTF), borrows an old idea from classical AI planning (STRIPS, PDDL) and applies it as a lightweight runtime filter, with no model training required.\n\nEach tool gets a small contract. The preconditions describe what state variables it needs to run (for example, update_event requires an event_id). The effects describe what it produces (for example, search_events produces event_id). Optionally, the contract also carries a risk level and a cost. Then, at every step, CMTF does three things. First, it builds a dependency graph of state transitions from the tool contracts. Second, it finds the minimal causal path from your current state to the goal state. Third, it exposes only the next tool on that path, the so-called causal frontier. That’s it. The LLM still does the actual reasoning and argument-filling. CMTF just controls which tools are visible for each local decision.\n\nWalk through the running example, “Move tomorrow’s dentist appointment to 4 PM.” The initial state is date, event_description, and new_time. The goal state is event_updated. CMTF finds the path search_events followed by update_event. At step zero, the only visible tool is search_events, because only it produces the missing event_id. At step one, once event_id exists, the only visible tool is update_event. 
create_event and delete_event are relevant to calendars. They are never necessary for this task. So the agent never sees them. It can’t pick the wrong tool if the wrong tool isn’t on the table.\n\n**The numbers that should get your attention**\n\nThe authors built a controlled benchmark: 102 tasks across calendar, email, and file domains, a registry of 100 tools stuffed with deliberate distractors (near-duplicates, risky actions, cross-domain noise), four LLM backends, and 2,448 total runs. Tool outputs were mocked and deterministic so that any failure could be blamed on tool selection, not flaky APIs.\n\n**Here’s the comparison that matters:**\n\n**Success rate by tool-exposure method**\n\nTask success rate across the six tool-exposure methods. CMTF and full causal path reach 0.99, while the naive “show fewer tools” baselines fall below all-tools exposure. (*Source: project repository,* [https://github.com/R-Suresh/ToolChoiceConfusion/blob/main/figures/success_by_method.png](https://github.com/R-Suresh/ToolChoiceConfusion/blob/main/figures/success_by_method.png))\n\nLook at the CMTF result again. CMTF takes you from 100 visible tools per step down to 1, cuts token usage by roughly 90 percent, drives wrong-tool calls to essentially zero, eliminates premature actions, and does all this while raising the success rate from 0.83 to 0.99.\n\nThe counterintuitive part: the naive “just show fewer tools” approaches (keyword top-5, top-10, state-aware filtering) actually performed worse than showing everything. Showing fewer tools alone doesn’t help. Showing fewer, causally correct tools is the win.\n\nAnd the gains were largest for weaker models. Claude 3.5 Haiku went from 0.48 success (with 2.62 wrong-tool calls per task) under all-tools exposure to 0.94 success with 0.06 wrong-tool calls under CMTF. 
Strong models like Claude Sonnet 4 were already near-perfect, but CMTF still slashed their token cost, from 24,858 down to 1,819.\n\n**Why this matters to the field**\n\nStep back from the engineering specifics and CMTF lands on a larger point the agent research community is only beginning to absorb: reliability is not solely a property you train into a model; it’s something you can architect around it. For the past two years the dominant levers in agent research have been scale and post-training: bigger models, better RLHF, smarter prompting. CMTF demonstrates a third, orthogonal axis: shaping the action space the model sees at each decision point. That reframing matters because it decouples agent safety and reliability from model capability. A method that lifts a weak model from 0.48 to 0.94 success without touching a single weight suggests that a meaningful fraction of what we currently attribute to “model isn’t smart enough” is actually “interface gave the model too much rope.” For a field racing toward autonomous, long-horizon agents, the idea that least-privilege action exposure can be derived automatically from tool contracts, rather than from hand-coded guardrails, is a foundational primitive, not a one-off trick. It connects modern LLM agents back to decades of classical planning theory, and it gives researchers a clean, measurable target: minimize the causal frontier, not just the token count.\n\n**Why this matters for engineering teams, not just researchers**\n\nThis is where the paper stops being an academic curiosity and starts being a practical architecture pattern. Here’s how I’d translate it for a team shipping agentic systems.\n\nFirst, tool exposure is a control surface you can own. The paper makes a framing shift I think every platform team should internalize: the visible tool menu is a runtime control surface, not a static config. 
You’re not just deciding what your agent can do; you’re deciding what it can do right now, at this step, given this state. That means tool exposure becomes something you engineer deliberately, something testable, tunable, and observable, rather than a giant tools array you dump into every prompt and pray over.\n\nSecond, high-risk actions get gated by construction. Every team with a production agent worries about the same nightmare: the agent sends the email, deletes the record, or charges the card prematurely. CMTF gates risky tools by their preconditions. send_email literally cannot appear on the menu until a draft exists and the causal path justifies it. Safety stops being a hopeful system prompt (“please don’t delete things”) and becomes a structural property of the interface. For teams, this is huge, because it’s an auditable guarantee. You can inspect why a tool was exposed, because a precondition-effect relation put it on the minimal path, independent of the model’s opaque internal reasoning.\n\nThird, cheaper models become viable. A 90 percent token reduction isn’t just a cost line item. It changes what’s economically possible. If causal filtering lets a smaller, cheaper, lower-latency model hit 0.94 success where it used to flounder at 0.48, you can deploy agents in places where the big-model bill was a non-starter. For teams managing cost, latency, and rate limits across many concurrent agent sessions, that’s the difference between “prototype” and “product.”\n\nFourth, your new bottleneck is tool contracts. Here’s the honest catch, and the paper is upfront about it. CMTF is only as good as your tool contracts. Garbage preconditions and effects mean garbage filtering. The authors explicitly call out tool-contract design as a systems problem. Contracts can be hand-written for safety-critical tools, derived from API schemas, or inferred from execution traces, and they should be versioned and monitored like any other production artifact. 
For a team, this is a good problem to have, because it’s a familiar one. You already know how to maintain schemas, write tests, and version interfaces. The paper essentially says: treat your agent’s tools the way you treat a typed API, with explicit contracts about what they need and what they produce. That discipline pays off in both reliability and auditability.\n\nFifth, don’t be dogmatic about minimality. The authors are refreshingly practical here too. Strict “expose exactly one tool” minimality assumes your state tracking and contracts are trustworthy. In the real world, tools fail, state estimates drift, and sometimes no causal path is found. Their suggested extension, uncertainty-aware filtering, exposes the minimal frontier by default but widens to a small recovery or diagnostic set when execution fails or the state tracker is unsure. This pairs naturally with self-healing orchestration approaches and is the kind of pragmatic hedge a production team will want from day one.\n\n**The honest limitations**\n\nIt would be a disservice to oversell this. The benchmark is synthetic, built on mocked, deterministic tool outputs that strip away real-world messiness like latency, auth failures, partial results, and ambiguous observations. The strict “one gold chain per task” success metric may penalize legitimately valid alternative trajectories. CMTF assumes you can map user requests to goal states and track task state accurately, which fits search-read-update and retrieve-summarize workflows far better than open-ended, creative, or multi-objective tasks.\n\nSo this isn’t a silver bullet for all agents. It’s a sharp tool for a specific, extremely common shape of problem: multi-step workflows over a large tool registry where each step has identifiable state. 
Which, if you look around at enterprise copilots and workflow automation, describes an enormous slice of what teams are actually building.\n\n**The takeaway**\n\nFor most of the last 10 months the agent-reliability conversation has been dominated by “use a better model” and “write a better prompt.” This paper points at a third lever that teams have far more direct control over: the action interface itself. As the authors put it, reliable tool use is not only a model capability; it’s an interface-design problem between the model, the task state, and the available tools.\n\nThe practical playbook for a team is clear. Give every tool a precondition-effect contract. Track task state and goal state explicitly. At each step, expose only the next causally-necessary tool, or a small recovery set under uncertainty. Gate risky actions structurally, not hopefully. And treat tool contracts as versioned, tested, first-class engineering artifacts. Your agent doesn’t need to see all hundred tools. It needs to see the one that moves the task forward. Build the filter that makes that the only choice it has, and ToolChoiceConfusion stops being a thing that happens to you.\n\n*Paper: ToolChoiceConfusion: Causal Minimal Tool Filtering for Reliable LLM Agents (arXiv:2606.06284v1), Rahul Suresh Babu and Laxmipriya Ganesh Iyer.* https://arxiv.org/html/2606.06284v1\n\nCode & reproducibility artifacts: *https://github.com/R-Suresh/ToolChoiceConfusion*\n\n*Laxmipriya Ganesh Iyer researches reliable and safe tool use in LLM agents, with a focus on tool-menu construction, causal filtering, and least-privilege action exposure.*\n\n[Your AI Agent Doesn’t Need 100 Tools. 
It Needs the Right One.](https://pub.towardsai.net/your-ai-agent-doesnt-need-100-tools-it-needs-the-right-one-b1baf370a82b) was originally published in [Towards AI](https://pub.towardsai.net) on Medium.", "url": "https://wpnews.pro/news/your-ai-agent-doesnt-need-100-tools-it-needs-the-right-one", "canonical_source": "https://pub.towardsai.net/your-ai-agent-doesnt-need-100-tools-it-needs-the-right-one-b1baf370a82b?source=rss----98111c9905da---4", "published_at": "2026-06-15 15:31:01+00:00", "updated_at": "2026-06-15 15:46:10.538623+00:00", "lang": "en", "topics": ["large-language-models", "ai-agents", "ai-research", "ai-tools"], "entities": ["R.S. Babu", "L.G. Iyer", "CMTF"], "alternates": {"html": "https://wpnews.pro/news/your-ai-agent-doesnt-need-100-tools-it-needs-the-right-one", "markdown": "https://wpnews.pro/news/your-ai-agent-doesnt-need-100-tools-it-needs-the-right-one.md", "text": "https://wpnews.pro/news/your-ai-agent-doesnt-need-100-tools-it-needs-the-right-one.txt", "jsonld": "https://wpnews.pro/news/your-ai-agent-doesnt-need-100-tools-it-needs-the-right-one.jsonld"}}