The first time we put an agent in front of real tools, it did something instructive. Asked to "refund the customer's last order," it called `cancel_subscription` instead of `issue_refund`. Both tools existed. Both were plausibly related to an unhappy customer. The model picked the wrong one and executed it with complete confidence.
This is the part of agent engineering that the demos skip. Letting a model generate text is easy. Letting it take actions in your systems - call functions, hit APIs, change state - is where reliability either exists or it does not. Here is how we make function-calling trustworthy enough to put in production.
The more tools you hand an agent, the more chances it has to choose wrong. We have watched accuracy fall off a cliff once a single agent has twenty-plus tools with overlapping purposes. Two fixes: keep the tool set small and distinct, and name and describe each tool so precisely that confusion is hard. `refund_order` with a description that says "issues a monetary refund for a completed order; does NOT cancel subscriptions" beats a vague `handle_order` every time. The description is not documentation - it is the instruction the model actually reads when deciding.
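As a rough illustration, here is what a pair of distinct tool definitions might look like. The name / description / JSON-Schema-style `parameters` layout is a common convention rather than any specific vendor's API, and the `find_vague_tools` check and its word threshold are hypothetical, a crude lint rather than a real measure of description quality.

```python
# Hypothetical tool definitions. Field names follow the common
# "name / description / parameters" shape, not a specific vendor API.
REFUND_ORDER = {
    "name": "refund_order",
    "description": (
        "Issues a monetary refund for a completed order. "
        "Does NOT cancel subscriptions; use cancel_subscription for that."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string"},
            "amount_cents": {"type": "integer", "minimum": 1},
        },
        "required": ["order_id", "amount_cents"],
    },
}

CANCEL_SUBSCRIPTION = {
    "name": "cancel_subscription",
    "description": (
        "Stops future billing for an active subscription. "
        "Does NOT refund past payments; use refund_order for that."
    ),
    "parameters": {
        "type": "object",
        "properties": {"subscription_id": {"type": "string"}},
        "required": ["subscription_id"],
    },
}


def find_vague_tools(tools, min_words=8):
    """Return names of tools whose description is too short to set them apart."""
    return [t["name"] for t in tools if len(t["description"].split()) < min_words]
```

Running a check like this in CI will not prove a description is good, but it can flag the `handle_order`-style entries that are obviously too thin.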
Even when the agent picks the right tool, it can pass nonsense: a refund amount larger than the order, a date in the wrong format, a customer ID that does not exist, a negative quantity. The model is generating plausible-looking arguments, not verified ones. So every tool we expose validates its inputs hard, on the server side, before doing anything - type checks, range checks, existence checks, business-rule checks. An invalid call returns a clear error the agent can read and correct, rather than corrupting data. Treat agent-supplied arguments exactly as you would treat input from an anonymous user on the internet: never trusted.
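A minimal sketch of that layering, assuming a hypothetical in-memory `ORDERS` store in place of a real database and a made-up `ToolInputError` type. The point is the order of checks (type, range, existence, business rule), not the specific names.

```python
class ToolInputError(ValueError):
    """Raised when agent-supplied arguments fail validation."""


# Hypothetical order store standing in for a real database lookup.
ORDERS = {"4471": {"total_cents": 2500, "refunded_cents": 0}}


def validate_refund(args: dict) -> tuple:
    """Validate refund arguments and return (order_id, amount_cents)."""
    order_id = args.get("order_id")
    amount = args.get("amount_cents")

    # Type checks. bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(order_id, str):
        raise ToolInputError("order_id must be a string")
    if isinstance(amount, bool) or not isinstance(amount, int):
        raise ToolInputError("amount_cents must be an integer")

    # Range check.
    if amount <= 0:
        raise ToolInputError("amount_cents must be greater than zero")

    # Existence check.
    order = ORDERS.get(order_id)
    if order is None:
        raise ToolInputError(f"order {order_id} does not exist")

    # Business-rule check: never refund more than what is still refundable.
    refundable = order["total_cents"] - order["refunded_cents"]
    if amount > refundable:
        raise ToolInputError(
            f"amount_cents {amount} exceeds refundable balance {refundable}"
        )
    return order_id, amount
```

Each failure carries a message specific enough for the agent to correct the call on the next attempt.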
Reading data is low-risk; changing data is not. We split tools into those two classes and treat them very differently. Read tools an agent can call freely. Write tools - anything that moves money, sends a message, deletes a record, changes an order - go through extra gates: stricter validation, rate limits, and for the highest-stakes actions, a human approval step. The agent prepares the action; a person confirms it. As trust in a specific workflow grows, you can loosen the gate. You do not start there.
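One way to sketch the gate is a small dispatcher that knows each tool's class. The tool registry, the `HIGH_STAKES` set, and the `approved_by` parameter below are all hypothetical; rate limiting is left out to keep the example short.

```python
READ, WRITE = "read", "write"

# Hypothetical registry: every exposed tool is explicitly classified.
TOOL_CLASS = {
    "get_order": READ,
    "update_shipping_note": WRITE,
    "refund_order": WRITE,
}

# Write tools that additionally need a human to confirm them.
HIGH_STAKES = {"refund_order"}


def dispatch(tool_name, args, approved_by=None):
    """Decide whether a tool call runs now, waits for approval, or is rejected."""
    kind = TOOL_CLASS.get(tool_name)
    if kind is None:
        # Unclassified tools are denied by default.
        return {"status": "rejected", "reason": f"unknown tool {tool_name}"}
    if kind == WRITE and tool_name in HIGH_STAKES and approved_by is None:
        # The agent has prepared the action; a person still has to confirm it.
        return {"status": "pending_approval", "tool": tool_name, "args": args}
    return {"status": "executed", "tool": tool_name, "args": args}
```

Loosening the gate for a workflow you trust then becomes a one-line change: remove the tool from `HIGH_STAKES`.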
Agents retry. A tool call times out, the agent assumes failure and calls again - but the first call actually went through. Now you have refunded twice. The defense is an idempotency key on every state-changing tool: a deterministic identifier the server checks so a repeated call returns the original result instead of acting again. And wherever the business allows, prefer reversible actions - a "draft" or "pending" state a human can release - over irreversible ones the agent commits instantly.
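A minimal sketch of the idea, with a hypothetical `RefundService` that keeps results in memory (a real system would use a durable store shared across servers). Deriving the key from the conversation ID, tool name, and arguments is one assumption among several possible schemes; it means an intentional second refund with identical arguments in the same conversation would also be deduplicated.

```python
import hashlib
import json


def make_key(conversation_id, tool_name, args):
    """Build a deterministic idempotency key; argument order does not matter."""
    payload = json.dumps([conversation_id, tool_name, args], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class RefundService:
    """Hypothetical state-changing tool that honors idempotency keys."""

    def __init__(self):
        self._results = {}       # idempotency key -> stored result
        self.refunds_issued = 0  # counts real side effects, for illustration

    def refund(self, idempotency_key, order_id, amount_cents):
        # A repeated call returns the original result instead of acting again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]

        self.refunds_issued += 1  # stands in for actually moving money
        result = {
            "ok": True,
            "order_id": order_id,
            "refunded_cents": amount_cents,
        }
        self._results[idempotency_key] = result
        return result
```

If the agent times out and retries with the same key, the second call is a lookup, not a second refund.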
When a tool fails, what you return matters enormously. Return a generic "error" and the agent flails - retries blindly, or invents a success message to the user. Return a specific, readable message - "refund failed: order 4471 is already fully refunded" - and a capable model will reason about it correctly, explain it to the user, or choose a different path. We treat tool error messages as a first-class part of the design, written for a reader who has to decide what to do next.
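To make that concrete, here is a sketch of an error shape an agent can act on. The `code`, `retryable`, and `hint` fields are hypothetical choices, not a standard; what matters is that the result says what failed, whether retrying can help, and what to do instead.

```python
def tool_error(code, message, retryable, hint=None):
    """Build an error result written for a reader who must decide what to do next."""
    return {
        "ok": False,
        "error": {
            "code": code,
            "message": message,
            "retryable": retryable,
            "hint": hint,
        },
    }


def refund_order(order, amount_cents):
    """Hypothetical refund tool that explains its failures instead of hiding them."""
    remaining = order["total_cents"] - order["refunded_cents"]
    if remaining == 0:
        return tool_error(
            code="already_refunded",
            message=f"refund failed: order {order['id']} is already fully refunded",
            retryable=False,
            hint="Do not retry. Tell the customer no further refund is due.",
        )
    if amount_cents > remaining:
        return tool_error(
            code="amount_too_large",
            message=f"refund failed: only {remaining} cents remain refundable",
            retryable=True,
            hint=f"Retry with amount_cents <= {remaining}.",
        )
    return {"ok": True, "order_id": order["id"], "refunded_cents": amount_cents}
```

The `retryable` flag is there to discourage blind retries: the agent is told outright when calling again cannot succeed.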
When an agent does something surprising in production, the only way to understand it is to see exactly which tools it called, with which arguments, in what order, and what came back. We log every tool invocation as a structured record. This is not optional - it is the difference between "we fixed it in an hour" and "we have no idea what happened." It is also what lets you build the evaluation set that catches the next regression before it ships.
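A structured record can be as simple as one JSON line per call. The field names below (`trace_id`, `step`, and so on) are assumptions for illustration, and the `sink` is a plain list standing in for whatever log pipeline you actually use.

```python
import json
import time


def log_tool_call(sink, trace_id, step, tool, args, result):
    """Append one JSON line describing a single tool invocation."""
    record = {
        "trace_id": trace_id,  # groups every call made in one agent run
        "step": step,          # preserves the order of calls within the run
        "tool": tool,
        "args": args,
        "result": result,
        "ts": time.time(),
    }
    sink.append(json.dumps(record, sort_keys=True))
    return record


def replay(sink, trace_id):
    """Return the (tool, args) sequence for one run, in call order."""
    records = [json.loads(line) for line in sink]
    run = sorted(
        (r for r in records if r["trace_id"] == trace_id),
        key=lambda r: r["step"],
    )
    return [(r["tool"], r["args"]) for r in run]
```

The `replay` helper is the part that pays off later: a surprising production trace can be turned into a scenario for the evaluation set.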
Most teams test whether the agent says the right thing. Far fewer test whether it does the right thing. We build a suite of scenarios - "customer asks for a refund on an already-refunded order," "user requests a cancellation but means a " - and assert on the actual tool calls the agent makes, not the words it produces. That is where the real bugs hide, and it is the only way to ship action-taking agents with a straight face.
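A sketch of what such a test might look like. The `stub_agent` function is a hypothetical stand-in that returns the list of tool calls it made; in a real suite that list would come from running your actual agent against fake tools and recording its calls.

```python
def stub_agent(user_message, tools):
    """Stand-in for a real agent run. Returns the tool calls it made, in order."""
    calls = []
    if "refund" in user_message.lower():
        calls.append({"tool": "get_order", "args": {"order_id": "4471"}})
        order = tools["get_order"](order_id="4471")
        # Only issue a refund if something is still refundable.
        remaining = order["total_cents"] - order["refunded_cents"]
        if remaining > 0:
            calls.append({
                "tool": "refund_order",
                "args": {"order_id": "4471", "amount_cents": remaining},
            })
    return calls


def tool_names(calls):
    return [c["tool"] for c in calls]


def test_already_refunded_order_is_not_refunded_again():
    tools = {"get_order": lambda order_id: {"total_cents": 2500, "refunded_cents": 2500}}
    calls = stub_agent("Please refund my last order", tools)
    # Assert on behavior: which tools were called, not what the agent said.
    assert tool_names(calls) == ["get_order"]


def test_refund_request_never_cancels_subscription():
    tools = {"get_order": lambda order_id: {"total_cents": 2500, "refunded_cents": 0}}
    calls = stub_agent("Please refund my last order", tools)
    assert tool_names(calls) == ["get_order", "refund_order"]
    assert "cancel_subscription" not in tool_names(calls)
```

Both tests would pass even if the agent's reply to the user were badly worded, and both would fail if it called the wrong tool, which is the property we care about here.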
The shift from a chatbot to an agent is the shift from generating words to taking actions, and actions have consequences. Reliable function-calling is not one trick - it is a stack of small disciplines: a tight tool set, hard validation, gated writes, idempotency, honest errors, and tests that check behavior. Put them in place and an agent becomes genuinely useful. Skip them and you have shipped an unpredictable hand on your production systems.
About Shanti Infosoft: Shanti Infosoft is a CMMI Level 5 AI development company that has delivered 700+ projects across 16+ industries. We help teams move from AI ideas to dependable, production-grade software - shantiinfosoft.com | AI development services.
If your agent calls the wrong tool often enough to make you nervous, we can harden its function-calling layer until it is reliable enough to ship. [Talk to our team](https://www.shantiinfosoft.com/contact-us/).
Rishabh Jain is a Director at Shanti Infosoft, where the team builds AI agents and automation for real business operations.