{"slug": "when-your-agent-calls-the-wrong-tool-making-function-calling-reliable-enough-to", "title": "When Your Agent Calls the Wrong Tool: Making Function-Calling Reliable Enough to Ship", "summary": "A developer describes how to make function-calling in AI agents reliable enough for production, based on experience with an agent that called the wrong tool. Key techniques include keeping tool sets small and distinct, validating all inputs server-side, separating read and write tools with extra gates for writes, using idempotency keys to prevent duplicate actions, returning specific error messages, and logging every tool invocation for debugging.", "body_md": "The first time we put an agent in front of real tools, it did something instructive. Asked to \"refund the customer's last order,\" it called `cancel_subscription` instead of `issue_refund`. Both tools existed. Both were plausibly related to an unhappy customer. The model picked the wrong one and executed it with complete confidence.\n\nThis is the part of agent engineering that the demos skip. Letting a model generate text is easy. Letting it take actions in your systems - call functions, hit APIs, change state - is where reliability either exists or it does not. Here is how we make function-calling trustworthy enough to put in production.\n\nThe more tools you hand an agent, the more chances it has to choose wrong. We have watched accuracy fall off a cliff once a single agent has twenty-plus tools with overlapping purposes. Two fixes: keep the tool set small and distinct, and name and describe each tool so precisely that confusion is hard. `refund_order` with a description that says \"issues a monetary refund for a completed order; does NOT cancel subscriptions\" beats a vague `handle_order` every time. 
The description is not documentation - it is the instruction the model actually reads when deciding.\n\nEven when the agent picks the right tool, it can pass nonsense: a refund amount larger than the order, a date in the wrong format, a customer ID that does not exist, a negative quantity. The model is generating plausible-looking arguments, not verified ones. So every tool we expose validates its inputs hard, on the server side, before doing anything - type checks, range checks, existence checks, business-rule checks. An invalid call returns a clear error the agent can read and correct, rather than corrupting data. Treat agent-supplied arguments exactly as you would treat input from an anonymous user on the internet: never trusted.\n\nReading data is low-risk; changing data is not. We split tools into those two classes and treat them very differently. Read tools an agent can call freely. Write tools - anything that moves money, sends a message, deletes a record, changes an order - go through extra gates: stricter validation, rate limits, and for the highest-stakes actions, a human approval step. The agent prepares the action; a person confirms it. As trust in a specific workflow grows, you can loosen the gate. You do not start there.\n\nAgents retry. A tool call times out, the agent assumes failure and calls again - but the first call actually went through. Now you have refunded twice. The defense is an idempotency key on every state-changing tool: a deterministic identifier the server checks so a repeated call returns the original result instead of acting again. And wherever the business allows, prefer reversible actions - a \"draft\" or \"pending\" state a human can release - over irreversible ones the agent commits instantly.\n\nWhen a tool fails, what you return matters enormously. Return a generic \"error\" and the agent flails - retries blindly, or invents a success message to the user. 
Return a specific, readable message - \"refund failed: order 4471 is already fully refunded\" - and a capable model will reason about it correctly, explain it to the user, or choose a different path. We treat tool error messages as a first-class part of the design, written for a reader who has to decide what to do next.\n\nWhen an agent does something surprising in production, the only way to understand it is to see exactly which tools it called, with which arguments, in what order, and what came back. We log every tool invocation as a structured record. This is not optional - it is the difference between \"we fixed it in an hour\" and \"we have no idea what happened.\" It is also what lets you build the evaluation set that catches the next regression before it ships.\n\nMost teams test whether the agent *says* the right thing. Far fewer test whether it *does* the right thing. We build a suite of scenarios - \"customer asks for a refund on an already-refunded order,\" \"user requests a cancellation but means a pause\" - and assert on the actual tool calls the agent makes, not the words it produces. That is where the real bugs hide, and it is the only way to ship action-taking agents with a straight face.\n\nThe shift from a chatbot to an agent is the shift from generating words to taking actions, and actions have consequences. Reliable function-calling is not one trick - it is a stack of small disciplines: a tight tool set, hard validation, gated writes, idempotency, honest errors, and tests that check behavior. Put them in place and an agent becomes genuinely useful. Skip them and you have shipped an unpredictable hand on your production systems.\n\n**About Shanti Infosoft:** Shanti Infosoft is a CMMI Level 5 AI development company that has delivered 700+ projects across 16+ industries. 
We help teams move from AI ideas to dependable, production-grade software - [shantiinfosoft.com](https://www.shantiinfosoft.com) | [AI development services](https://www.shantiinfosoft.com/services/ai-development-company/).\n\nIf your agent calls the wrong tool often enough to make you nervous, we can harden its function-calling layer until it is reliable enough to ship. [Talk to our team](https://www.shantiinfosoft.com/contact-us/).\n\nRelated reading: [AI Writes 4x the Code. Here's the QA Layer That Stops 4x the Bugs](https://www.shantiinfosoft.com/blog/ai-writes-4x-code-qa-layer/)\n\n*Rishabh Jain is a Director at Shanti Infosoft, where the team builds AI agents and automation for real business operations.*", "url": "https://wpnews.pro/news/when-your-agent-calls-the-wrong-tool-making-function-calling-reliable-enough-to", "canonical_source": "https://dev.to/rishabh_jain_7087a66dbf50/when-your-agent-calls-the-wrong-tool-making-function-calling-reliable-enough-to-ship-4fn8", "published_at": "2026-06-18 06:30:18+00:00", "updated_at": "2026-06-18 06:51:47.121970+00:00", "lang": "en", "topics": ["ai-agents", "ai-safety", "ai-products", "developer-tools"], "entities": [], "alternates": {"html": "https://wpnews.pro/news/when-your-agent-calls-the-wrong-tool-making-function-calling-reliable-enough-to", "markdown": "https://wpnews.pro/news/when-your-agent-calls-the-wrong-tool-making-function-calling-reliable-enough-to.md", "text": "https://wpnews.pro/news/when-your-agent-calls-the-wrong-tool-making-function-calling-reliable-enough-to.txt", "jsonld": "https://wpnews.pro/news/when-your-agent-calls-the-wrong-tool-making-function-calling-reliable-enough-to.jsonld"}}