Source: dev.to · topic: ai-agents

Your AI agent calls the wrong tool — and your JSON schema is usually why

An engineer warns that AI agents calling the right tool 95% of the time still fail on eight-step tasks about 34% of the time due to compounding errors. The root cause is often poorly written JSON schemas, with four common issues: vague descriptions, untyped parameters, mismatched required fields, and free-text fields that should be enums. The fix is to treat schema descriptions as the model's only instructions and to encode constraints explicitly.

5 min read · published Jun 28, 2026

Here's the number that should worry you more than it does: an agent that calls the right tool with the right arguments 95% of the time completes an eight-step task correctly only about 66% of the time. Reliability doesn't fail in one dramatic crash. It leaks. Every step is a coin that lands heads 19 times out of 20, and you're flipping it eight times in a row.

The good news is that most of that leak isn't the model being dumb. It traces to two things you control completely: the JSON schema you hand the model, and whether you let it guess when it shouldn't. Fix those two and the per-call rate climbs — and because it compounds, small gains pay off hugely.
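The compounding arithmetic is easy to check. Here is a minimal sketch, assuming every step succeeds independently with the same probability (a simplification; the function name is made up for illustration):

```python
def task_success_rate(per_call_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_call_accuracy ** steps


if __name__ == "__main__":
    # Compare a few per-call accuracies over the same eight-step task.
    for accuracy in (0.95, 0.97, 0.99):
        rate = task_success_rate(accuracy, steps=8)
        print(f"{accuracy:.0%} per call -> {rate:.1%} over 8 steps")
```

Under that assumption, moving from 95% to 99% per call lifts eight-step success from roughly 66% to roughly 92%.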

This is the reframe that fixes everything downstream. When you define a tool, the description fields aren't docs for your teammates. They are the only instructions the model gets about when and how to use that tool. The model never sees your implementation. It sees the schema. That's it.

So a schema like this is not "good enough":

{
  "name": "send_email",
  "description": "Sends an email",
  "parameters": {
    "type": "object",
    "properties": {
      "to": { "type": "string" },
      "body": { "type": "string" }
    }
  }
}

Read it the way the model does. When should it send an email versus draft one? Is "to" an address or a contact name? Can "body" be HTML? Is anything required? You know the answers. The model is guessing, and guessing is exactly where the 5% comes from.
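For contrast, here is one way the same send_email tool could answer those questions up front. It is a sketch, written as a Python dict so it can be checked and serialized. The specific answers (an email address rather than a contact name, plain text only, both fields required, a separate drafting tool) are assumptions made for illustration, not facts about any real tool.

```python
import json

# Hypothetical tightened version of the send_email schema shown above.
SEND_EMAIL_TOOL = {
    "name": "send_email",
    "description": (
        "Sends an email immediately to one recipient. Use only when the user "
        "has explicitly asked to send. To prepare a message for review, use "
        "a drafting tool instead (assumed to exist)."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "to": {
                "type": "string",
                "format": "email",
                "description": "Recipient email address, not a contact name.",
            },
            "body": {
                "type": "string",
                "description": "Plain-text message body. HTML is not supported.",
            },
        },
        # Every name listed here must also appear in "properties".
        "required": ["to", "body"],
        "additionalProperties": False,
    },
}

if __name__ == "__main__":
    print(json.dumps(SEND_EMAIL_TOOL, indent=2))
```

Whether a given provider enforces keywords such as format is not guaranteed, so the descriptions carry the same constraints in words.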

After staring at a lot of broken tool definitions, the same four problems keep showing up:

1. Vague or missing descriptions. "Sends an email," "Gets data," "Handles the request." When two tools have thin descriptions, the model can't tell them apart, so it picks the wrong one. The fix is to write the description like you're explaining the tool to a new hire who will be fired for using it at the wrong time: when to call it, when not to, and what each argument means.

2. Untyped or loosely typed params. A string where you meant an ISO date. A string where you meant one of four statuses. If the type doesn't constrain the value, the model invents a plausible-looking one ("next Tuesday", "done-ish") and your executor chokes. Use enum for fixed sets. Use format and explicit types. Every constraint you encode is one the model no longer has to guess and one your validator can check.

3. The silent killer: required naming a property that doesn't exist. This one is brutal because nothing yells at you. Your required array lists "recipient", but the property in properties is called "to". The schema is still valid JSON. The model now thinks a field is mandatory that it has no slot to fill, which silently breaks every call. Make sure every name in required actually exists in properties.

4. Free-text where you meant a choice. "priority": { "type": "string" } invites "high", "High", "urgent", "P0", and "pretty important tbh". Make it "enum": ["low", "medium", "high"] and the ambiguity is gone before the model can create it.
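Three of these four problems can be caught mechanically. The check below is an illustrative sketch only: it assumes the layout used in the send_email example (a top-level parameters object), uses description length as a crude stand-in for quality, and does not try to detect free-text fields that should be enums, since that takes judgment. The function name and the length threshold are made up.

```python
def lint_tool(tool: dict, min_description_length: int = 20) -> list[str]:
    """Return human-readable warnings for one tool definition."""
    warnings = []
    name = tool.get("name", "<unnamed>")
    params = tool.get("parameters", {})
    properties = params.get("properties", {})

    # 1. Vague or missing description (length is only a rough proxy).
    if len(tool.get("description", "")) < min_description_length:
        warnings.append(f"{name}: description is missing or very short")

    # 2. Parameters with neither a declared type nor an enum.
    for prop_name, spec in properties.items():
        if "type" not in spec and "enum" not in spec:
            warnings.append(f"{name}.{prop_name}: no type declared")

    # 3. Names in "required" with no matching entry in "properties".
    for required_name in params.get("required", []):
        if required_name not in properties:
            warnings.append(
                f"{name}: required field '{required_name}' is not in properties"
            )

    return warnings


if __name__ == "__main__":
    broken = {
        "name": "send_email",
        "description": "Sends an email",
        "parameters": {
            "type": "object",
            "properties": {"to": {"type": "string"}, "body": {}},
            "required": ["recipient"],
        },
    }
    for warning in lint_tool(broken):
        print(warning)
```

Run on the broken definition, it reports the short description, the untyped body field, and the required name that has no matching property.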

The single most common production failure isn't a malformed call — it's the model confidently filling in a blank it should have asked about. User says "schedule a meeting with Sarah next week." Which Sarah? Which timezone? Which 30-minute slot on which day? A model optimizing to be helpful will pick one. Sometimes it's right. Sometimes it books a 7 a.m. call with the wrong Sarah.

The rule I'd tattoo on a junior agent: if a missing field affects money, publishing, deletion, or customer communication, ask — don't guess. A clarifying question costs one turn. A wrong write operation costs a refund, a deleted record, or an apology email. Don't optimize for fewer turns at the price of wrong actions.

You can encode a lot of this in the schema itself: don't mark fields required that the model can't reasonably infer, and say so in the description, for example "If the user has not specified a timezone, ask; do not assume." The schema is where you set the defaults for the model's judgment.
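The same rule can also be backed up in code, outside the schema. The sketch below assumes a hand-maintained table of fields that must come from the user for high-stakes tools. The tool names, field names, and wording of the question are all hypothetical.

```python
from typing import Optional

# Hypothetical policy table: fields that must come from the user, per tool.
MUST_ASK_FIELDS = {
    "schedule_meeting": ["attendee_email", "timezone", "start_time"],
    "issue_refund": ["order_id", "amount"],
}


def clarifying_question(tool_name: str, args: dict) -> Optional[str]:
    """Return a question for the user if a must-ask field is missing, else None."""
    missing = [
        field
        for field in MUST_ASK_FIELDS.get(tool_name, [])
        if args.get(field) in (None, "")
    ]
    if not missing:
        return None
    return f"Before I run {tool_name}, I need: {', '.join(missing)}."


if __name__ == "__main__":
    # The model supplied an attendee but left timezone and start_time unset.
    print(clarifying_question("schedule_meeting", {"attendee_email": "sarah@example.com"}))
```

This pairs with leaving such fields optional in the schema: the model can omit what it does not know, and the gate turns the omission into a question instead of a guess.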

Even when the provider guarantees well-formed JSON, well-formed is not the same as correct. Structured-output modes stop the model from emitting broken JSON; they do nothing to stop it from passing a valid-looking but wrong argument. So validate on your side, every time, before you execute: check the values against your real constraints (does this user ID exist? is this amount within range?), and on failure, return a clear error the model can read and recover from rather than crashing the run. Model output is input. You wouldn't trust raw input from a form field. Don't trust this one either.
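A minimal sketch of that validate-then-execute step for a made-up refund tool follows. The user ID set, the amount limit, and every function and field name are assumptions standing in for real lookups and business rules.

```python
KNOWN_USER_IDS = {"u_100", "u_101"}  # hypothetical stand-in for a real lookup
MAX_REFUND_CENTS = 50_000            # hypothetical business limit


def validate_refund_args(args: dict) -> list[str]:
    """Check model-supplied arguments against real constraints; return error messages."""
    errors = []
    user_id = args.get("user_id")
    amount = args.get("amount_cents")

    if not isinstance(user_id, str) or user_id not in KNOWN_USER_IDS:
        errors.append(f"user_id {user_id!r} does not exist")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(amount, int) or isinstance(amount, bool):
        errors.append("amount_cents must be an integer")
    elif not 0 < amount <= MAX_REFUND_CENTS:
        errors.append(f"amount_cents must be between 1 and {MAX_REFUND_CENTS}")
    return errors


def run_tool_call(args: dict) -> dict:
    """Execute only if validation passes; otherwise return an error the model can read."""
    errors = validate_refund_args(args)
    if errors:
        return {"ok": False, "error": "; ".join(errors)}
    # The real side effect would happen here; this sketch only echoes the request.
    return {"ok": True, "refunded_cents": args["amount_cents"]}


if __name__ == "__main__":
    print(run_tool_call({"user_id": "u_999", "amount_cents": 10**9}))
```

Returning the error as data, rather than raising, gives the model something it can read and correct on the next turn.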

Reading your own schemas for these bugs is hard; the required-references-a-missing-property one in particular is invisible until it's breaking every call in prod. So I wrote a tiny zero-dependency linter for exactly this: tool-schema-lint (npx tool-schema-lint your-tools.json). It flags vague descriptions, untyped params, free-text-where-you-meant-enum, and the silent required/properties mismatch, for both Anthropic and OpenAI tool formats. It's free and MIT-licensed; point it at your tool definitions and see what falls out.

If you want the bigger picture (the tool patterns that keep multi-step agents on the rails, plus a runnable eval rubric for scoring "did it call the right tool with the right args in the right number of steps"), that's the Agent Builder's Toolkit. And if you're earlier on the curve, the

Tool-calling reliability compounds: 95% per call is ~66% over eight steps, so small per-call gains matter enormously. Most misses come from two controllable things. First, the schema: it's the only instruction the model gets, so write real descriptions, type and enum your params, and make sure every name in required actually exists in properties (that last bug silently breaks every call). Second, guessing: if a missing field touches money, publishing, deletion, or customer communication, make the agent ask instead of inventing a value. Then validate the model's output as untrusted input before you execute. Schema plus judgment, not a smarter model, is where the reliability lives.

What's the worst wrong-tool call you've shipped? Reply and tell me — I collect these.
