{"slug": "tool-calling-is-not-an-api-call-what-engineers-keep-getting-wrong", "title": "Tool Calling Is Not an API Call: What Engineers Keep Getting Wrong", "summary": "Engineers at MasTec building tool-driven agent systems repeatedly make five critical mistakes when implementing LLM tool calling, treating it like a REST API call instead of a nondeterministic contract. The errors include ambiguous schema descriptions, lack of argument validation, and failure to handle unexpected tool outputs, which can silently corrupt production workflows. Clear schema design and Pydantic validation at the function boundary are essential fixes.", "body_md": "I’ve been building tool-driven agent systems at MasTec for a while now, orchestrating enterprise APIs, operational databases, and internal services through LLM agents in production environments. And the pattern I keep seeing is the same: engineers treat tool calling as if they’re writing a REST client. Clean schema, right endpoint, valid payload, ship it.\n\nThat mental model works for about five minutes in production. Then reality shows up.\n\nTool calling in an agentic system is a fundamentally different contract than an API call. The caller isn’t deterministic. It doesn’t guarantee argument structure. It doesn’t always know when *not* to call. And it doesn’t recover gracefully when the tool returns something unexpected. Understanding that gap between how engineers expect tool calling to work and how it actually behaves under real load is what separates agents that hold up from agents that quietly corrupt your workflows.\n\nHere are the five mistakes I’ve seen repeatedly. All of them are fixable. None of them show up in the tutorials.\n\nWhen engineers design a tool schema, they tend to write it the same way they’d write API documentation, clear enough for a developer to understand. That’s the wrong audience.\n\nThe model reads your schema at inference time and decides how to call the tool based entirely on what you wrote. 
If your description is ambiguous, the model fills the gap with a guess. If two of your tools have overlapping purposes, the model arbitrarily picks one. If your parameter names are terse and unexplained, the model infers meaning and is often wrong.\n\nI’ve watched agents call a `get_record` tool when they should have called `search_records` because both descriptions mentioned \"retrieving data.\" The fix wasn't changing the routing logic; it was rewriting the schema descriptions to make the behavioral boundary explicit.\n\nA good tool schema description should answer three questions unambiguously: what this tool does, what it explicitly does *not* do, and under what conditions it should be called. Write it like you’re training a junior engineer who has never seen your codebase.\n\n```python\n# Weak schema description\n{\n    \"name\": \"get_customer\",\n    \"description\": \"Gets customer data\",\n}\n\n# Production-grade schema description\n{\n    \"name\": \"get_customer_by_id\",\n    \"description\": (\n        \"Retrieves a single customer record using an exact customer ID. \"\n        \"Use this ONLY when you have a confirmed customer_id. \"\n        \"Do NOT use this for name-based lookups or search - use search_customers instead.\"\n    ),\n    \"parameters\": {\n        \"customer_id\": {\n            \"type\": \"string\",\n            \"description\": \"The exact customer UUID. Format: 'cust_XXXXXXXXXX'\",\n        }\n    },\n}\n```\n\nThe investment in schema clarity pays back every time the model routes correctly without needing a retry.\n\nThe model sends a tool call. Your code receives it. What happens next?\n\nIn most early implementations I’ve reviewed, the arguments get passed directly to the underlying function. No validation. No type checking. No boundary checks. The assumption is that the model populated the arguments correctly.\n\nThat assumption is wrong often enough to matter.\n\nLLMs hallucinate tool arguments. Not dramatically, not `{\"customer_id\": \"I made this up\"}`, but subtly. A string field gets an integer. 
A required parameter comes through as null. An enum field receives a value that isn't in the allowed set. These failures don't throw loud errors. They propagate silently into your database, your downstream services, and your audit logs.\n\nThe fix is schema validation at the MCP layer or at the function boundary, before anything touches your actual systems. I enforce this with Pydantic on every tool handler we run in production:\n\n```python\nfrom pydantic import BaseModel, validator  # Pydantic v1-style API; v2 prefers field_validator\n\n\nclass GetCustomerInput(BaseModel):\n    customer_id: str\n\n    @validator(\"customer_id\")\n    def must_be_valid_format(cls, v):\n        if not v.startswith(\"cust_\"):\n            raise ValueError(f\"Invalid customer_id format: {v}\")\n        return v\n\n\n# `tool` (framework decorator) and `db` (data layer) are defined elsewhere.\n@tool\ndef get_customer_by_id(raw_input: dict) -> dict:\n    validated = GetCustomerInput(**raw_input)  # raises before any DB call\n    return db.fetch_customer(validated.customer_id)\n```\n\nSchema validation at the tool boundary is one of the highest-ROI reliability patterns in agent systems. It costs almost nothing to implement, and it catches a significant percentage of hallucinated arguments before they touch anything real.\n\nAn API goes down. A database query times out. A tool returns a 500. What does your agent do?\n\nIn a naive implementation, it stops. Or worse, it retries the same call with the same arguments indefinitely until you hit a rate limit or someone looks at the logs.\n\nThis is where the difference between a prototype and a production agent shows up most clearly. 
Production agents need structured error handling baked into the tool layer, not as an afterthought but as part of the tool’s contract with the orchestrator.\n\nEvery tool I ship has a return envelope that distinguishes recoverable failures from terminal ones:\n\n```python\n# TransientError, InvalidInputError, and execute_tool are defined elsewhere.\ndef call_tool(name: str, args: dict) -> dict:\n    try:\n        result = execute_tool(name, args)\n        return {\"status\": \"success\", \"data\": result}\n    except TransientError as e:\n        # Recoverable: the orchestrator may retry after a delay\n        return {\"status\": \"retry\", \"reason\": str(e), \"retry_after\": 2}\n    except InvalidInputError as e:\n        # Recoverable by the model: send the reason back for a corrected call\n        return {\"status\": \"invalid_args\", \"reason\": str(e)}\n    except Exception:\n        # Terminal: return a fixed message instead of the raw exception\n        return {\"status\": \"error\", \"reason\": \"Tool failed. Do not retry.\"}\n```\n\nThe agent’s orchestration layer (LangGraph, in my case) reads that status field and routes accordingly. A `retry` status triggers exponential backoff with jitter. An `invalid_args` status routes back to the model with the error message so it can attempt a corrected call. An `error` status escalates or gracefully terminates that branch of execution.\n\nWithout this structure, your agent has no way to distinguish “try again” from “stop, something is fundamentally wrong.” It guesses. And its guesses at error recovery are usually bad.\n\nThis one surprises engineers every time. You’d think more tools mean more capability. In practice, it often means worse performance.\n\nWhen you register fifteen tools with an agent, every one of those tool schemas enters the model’s context window. The model now has to reason about fifteen possible actions on every step. That increases token usage, slows down routing decisions, and, critically, raises the probability of the model calling the wrong tool. The more overlapping or similar your tools look from a description standpoint, the worse this gets.\n\nAmazon Prime Video hit this directly in production. 
Centralizing all tool access through a single MCP server loaded enough tool definitions to consume a meaningful chunk of the context window before the agent processed a single user message.\n\nThe fix I’ve landed on: scope tools to the agent’s role. Not every agent needs access to every tool. A customer lookup agent doesn’t need write access to anything. An order status agent doesn’t need access to account management APIs. Define the minimum viable toolset for each agent’s function and enforce it at the MCP permission layer, not just at the prompt level.\n\nIf a human reviewing your system can’t immediately say which tool should handle a given scenario, the model can’t either. That ambiguity is a configuration problem, not a model problem.\n\nWhen something goes wrong, and it will, can you reconstruct exactly what happened? Which tool was called, with what arguments, and what did it return?\n\nMost teams can’t. Tool calls happen inside the agent loop, and unless you’ve explicitly wired in tracing, they’re invisible. You see the final output. You don’t see the three intermediate tool calls that produced it.\n\nI trace every tool invocation (inputs, outputs, latency, and status) using structured logging tied to a trace ID that spans the full agent run:\n\n```python\nimport time\n\nimport structlog\n\nlog = structlog.get_logger()\n\n\ndef traced_tool_call(trace_id: str, tool_name: str, args: dict):\n    log.info(\"tool_call_start\", trace_id=trace_id, tool=tool_name, args=args)\n    start = time.perf_counter()  # monotonic clock for measuring latency\n    result = call_tool(tool_name, args)\n    log.info(\n        \"tool_call_end\",\n        trace_id=trace_id,\n        tool=tool_name,\n        status=result[\"status\"],\n        latency_ms=round((time.perf_counter() - start) * 1000),\n    )\n    return result\n```\n\nThis gives you the ability to replay any agent run, identify exactly where it went wrong, and determine whether the failure was a bad tool call, a bad model decision, or a downstream service issue. 
Without this, debugging agentic systems is guesswork, and guesswork is expensive when the system is touching live enterprise data.\n\nTool calling looks simple from the outside. You define a function, register it, and let the model decide when to invoke it. The complexity is in everything that surrounds that decision: schema clarity, input validation, error classification, toolset scoping, and execution observability.\n\nEvery one of these is an engineering discipline, not a model capability. The model will do its job. The question is whether you’ve built the infrastructure that makes its job possible.\n\nThe teams shipping agents to production reliably are the ones who’ve stopped treating tool calling as a convenience feature and started treating it as a first-class engineering surface. The rest are debugging production incidents and wondering why their demo worked.\n\n[Tool Calling Is Not an API Call: What Engineers Keep Getting Wrong](https://pub.towardsai.net/tool-calling-is-not-an-api-call-what-engineers-keep-getting-wrong-4100b33b45f9) was originally published in [Towards AI](https://pub.towardsai.net) on Medium, where people are continuing the conversation by highlighting and responding to this story.", "url": "https://wpnews.pro/news/tool-calling-is-not-an-api-call-what-engineers-keep-getting-wrong", "canonical_source": "https://pub.towardsai.net/tool-calling-is-not-an-api-call-what-engineers-keep-getting-wrong-4100b33b45f9?source=rss----98111c9905da---4", "published_at": "2026-06-24 11:31:00+00:00", "updated_at": "2026-06-24 11:48:58.401128+00:00", "lang": "en", "topics": ["large-language-models", "ai-agents", "ai-tools", "ai-infrastructure"], "entities": ["MasTec", "LLM", "Pydantic", "MCP"], "alternates": {"html": "https://wpnews.pro/news/tool-calling-is-not-an-api-call-what-engineers-keep-getting-wrong", "markdown": "https://wpnews.pro/news/tool-calling-is-not-an-api-call-what-engineers-keep-getting-wrong.md", "text": 
"https://wpnews.pro/news/tool-calling-is-not-an-api-call-what-engineers-keep-getting-wrong.txt", "jsonld": "https://wpnews.pro/news/tool-calling-is-not-an-api-call-what-engineers-keep-getting-wrong.jsonld"}}