[Day 5/100] Tool Use and Function Calling, Explained from Scratch

A developer explains that large language model function calling is actually trained, constrained text generation, not real code execution, and walks through the end-to-end cycle of tool calls in agentic AI systems.

In Day 1 (https://medium.com/towards-artificial-intelligence/day-1-100-what-is-agentic-ai-beyond-chatbots-and-copilots-63bf6cbec971) we built a weather agent that called a tool. We have skipped over the mechanism that makes the whole thing work: how does the model actually call a function? Today we open that black box.

By the end you will know exactly what happens between "the user asks a question" and "Python executes your code". You will be able to write tool schemas the model picks correctly, debug tool calls when they go wrong, and reason about why parallel tool calls are sometimes faster.

Function calling sounds magical: "the model calls a function." It does not. The model generates text, the same way it always has. Function calling is a convention layered on top.

When you give the OpenAI API a `tools` argument, three things happen behind the scenes.

First, the API serializes your tool schemas into a special part of the prompt the model has been trained on. Anthropic, OpenAI, and Google all do this slightly differently, but the principle is the same: your `{"name": "get_weather", ...}` JSON gets turned into text the model can read.

Second, the model is biased, through training and on some providers through grammar-constrained sampling, to either generate a normal text response or generate a structured tool call in a specific format.

Third, the API parses that structured output back out and hands it to you as a Python object with `tool_calls` on it.

Function calling is trained, constrained text generation, parsed back into structured data. The model never executes anything. Your code calls the function; the model just requests the call.

Why does this matter? Because every quirk of function calling makes sense once you remember it is text generation. The model invents tool names that do not exist? It generated text that looked like a tool call. The arguments are malformed JSON? It generated text that almost matched the JSON grammar. Tool descriptions matter a lot? Of course they do.
They are the only thing the model sees about what each tool does.

Here is what happens end to end on a single function call.

```
1. User input ─────────────────────────► You
2. Build messages + tool schemas ──────► OpenAI API
3. Model generates response ───────────► OpenAI API
4. API parses response ────────────────► You receive tool_calls
5. You execute the function ───────────► Your Python runs
6. You append the result to messages ──► Conversation grows
7. Send back to model ─────────────────► OpenAI API
8. Model generates final answer ───────► You
9. You return text to user ────────────► Done
```

Every loop in an agent is this cycle, repeated. Step 5 is the one place real-world effects happen. Everything else is text in, text out.

In code it looks like this. Pay attention to the message types.

```python
from openai import OpenAI
import json

client = OpenAI()

def get_weather(city: str) -> str:
    return f"Sunny in {city}, 22°C"

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    "description": "City name, e.g. 'Paris' or 'Tokyo'."
                }
            },
            "required": ["city"]
        }
    }
}]

messages = [
    {"role": "system", "content": "You are a weather assistant."},
    {"role": "user", "content": "What is the weather in Paris?"},
]

# Step 1: Model decides to call the tool.
resp = client.chat.completions.create(
    model="gpt-4o", messages=messages, tools=tools, temperature=0
)
assistant_msg = resp.choices[0].message
messages.append(assistant_msg)

# Step 2: We execute the tool.
# tool_calls is None when the model answers directly, hence the `or []`.
for call in assistant_msg.tool_calls or []:
    args = json.loads(call.function.arguments)
    result = get_weather(**args)

    # Step 3: Append the tool result so the model can see it.
    messages.append({
        "role": "tool",
        "tool_call_id": call.id,
        "content": result,
    })

# Step 4: Model produces the final answer.
final = client.chat.completions.create(
    model="gpt-4o", messages=messages, tools=tools, temperature=0
)
print(final.choices[0].message.content)
```

Notice the four message roles in play: system, user, assistant (containing the tool calls), and tool (containing the result). The `tool_call_id` connects a result back to the call that produced it. Get this wrong and the model will be confused about which tool returned what.

A tool schema gives the model three things: the name, the description, and the parameter schemas. All three matter, and most beginners only think about the first.

Names should be short, clear verb phrases: `get_weather`, `search_orders`, `send_email`. Not `weather` (a noun, ambiguous), not `do_weather_lookup_for_a_city` (long, awkward), not `get` (no information). The model reads names as part of its decision about which tool to pick.

Descriptions are where most schemas fail. Compare:

Bad: `"description": "Gets weather"`

Good: `"description": "Get the current weather in a given city. Use this whenever the user asks about weather, temperature, or what to wear. Returns a one-line summary including temperature and conditions."`

The good description tells the model when to use the tool, what it returns, and what counts as a relevant question. The model uses this text to decide between competing tools. If two tools have similar descriptions, the model's choice between them becomes close to arbitrary.

For each parameter, include the type (string, integer, number, boolean, array, or object), a description with an example, an enum if the parameter only accepts a fixed set of values, and the required array for parameters that must be present.

```python
"parameters": {
    "type": "object",
    "properties": {
        "city": {
            "type": "string",
            "description": "City name in English, e.g. 'Paris' or 'Tokyo'."
        },
        "units": {
            "type": "string",
            "enum": ["celsius", "fahrenheit"],
            "description": "Temperature units."
        }
    },
    "required": ["city"]
}
```

The enum is doing real work. Without it, the model will sometimes pass "C", sometimes "metric", sometimes "Celsius". With it, you reliably get one of two values.

A few mistakes you will make exactly once.

Tools that overlap. If `search_users` and `find_users` both exist, the model will guess. Pick one.

Tools that are too coarse. A single `do_everything(action, params)` tool destroys the model's ability to reason about which action to take. Split it.

Tools that are too fine. A separate tool for `get_user_first_name`, `get_user_last_name`, and `get_user_email` is exhausting. The model will call all three when one would do. Combine them into `get_user_profile`.

Tools that hide important information from the model. If your tool returns a giant nested JSON, the model may pick the wrong fields. Pre-format important results into a short summary string when you can.

Tools that fail silently. Returning an empty string or `None` when something went wrong tells the model nothing. Return an explicit error message. "Tool error: customer_id 12345 not found" is something the model can act on.

A modern feature worth knowing: when the model is confident that two tool calls are independent, it can request both in the same response.

```python
# The model returns two tool calls in one assistant_msg.
for call in assistant_msg.tool_calls:
    # Run them in parallel with asyncio or threads.
    ...
```

For "What is the weather in Paris and Tokyo?" a modern model with a good schema will emit two parallel `get_weather` calls. Run them concurrently and the time spent executing tools drops roughly in half.

This only works if your tools are truly independent. If `tool_b` depends on the output of `tool_a`, the model has to do them sequentially across two model turns. Design your tools to be independent when you can. Future you and your latency budget will be glad.

A tool can fail. Network is down. Database returned no rows. The user does not have permission.
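
As a rough illustration of the last two ideas, here is a minimal sketch that runs independent tool calls concurrently and turns every failure into an explicit error string instead of an exception. The `TOOLS` registry, the `run_tool` and `run_tool_calls` helpers, and the plain-dict shape of each call are assumptions made for this example; they are not part of the OpenAI SDK, and `get_weather` is a stub that fails for one city on purpose.

```python
import asyncio
import json

def get_weather(city: str) -> str:
    # Stub tool for illustration; it fails for one city to show the error path.
    if city == "Mars":
        raise ValueError("city 'Mars' not found")
    return f"Sunny in {city}, 22°C"

# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"get_weather": get_weather}

def run_tool(name: str, arguments: str) -> str:
    """Run one tool call and always return a string the model can read."""
    if name not in TOOLS:
        return f"Tool error: unknown tool '{name}'. Available: {sorted(TOOLS)}"
    try:
        args = json.loads(arguments)
        return TOOLS[name](**args)
    except json.JSONDecodeError as e:
        return f"Tool error: arguments were not valid JSON ({e.msg})."
    except Exception as e:
        return f"Tool error in {name}: {e}"

async def run_tool_calls(calls: list[dict]) -> list[dict]:
    """Run independent calls in worker threads; return one tool message per call."""
    results = await asyncio.gather(
        *(asyncio.to_thread(run_tool, c["name"], c["arguments"]) for c in calls)
    )
    return [
        {"role": "tool", "tool_call_id": c["id"], "content": r}
        for c, r in zip(calls, results)
    ]

# Assumed shape: the id, name, and raw JSON arguments taken from each tool call.
calls = [
    {"id": "call_1", "name": "get_weather", "arguments": '{"city": "Paris"}'},
    {"id": "call_2", "name": "get_weather", "arguments": '{"city": "Mars"}'},
]
tool_messages = asyncio.run(run_tool_calls(calls))
for msg in tool_messages:
    print(msg["tool_call_id"], "->", msg["content"])
```

Each result keeps the id of the call that produced it, so the messages can be appended to the conversation in any order without losing the link between call and result. Threads are used here because the tools are ordinary blocking functions; tools written as coroutines could be gathered directly.
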
How you communicate that failure back to the model determines how the agent recovers. Three rules.

Return errors as tool results, not exceptions. Catch the exception inside the tool, format it as a string, and return it.

```python
import requests

def get_weather(city: str) -> str:
    try:
        r = requests.get(f"https://wttr.in/{city}?format=3", timeout=5)
        r.raise_for_status()
        return r.text
    except Exception as e:
        # Hand the failure back as text so the model can read and react to it.
        return f"Error fetching weather for {city}: {e}"
```

Be specific. "Error: HTTP 404" is less useful than "Error: city 'Mars' not found. Try a real Earth city." The model can act on the second message and cannot act on the first.

Distinguish recoverable from terminal errors. If the agent can fix the call by trying again with different arguments, say so. "Try a shorter time range" or "Use the customer ID, not the customer name" turns a dead end into a retry.

OpenAI, Anthropic, and Google all support function calling, but the message formats differ. The mechanics are identical. Only the JSON shape differs. Frameworks like LangChain and LiteLLM exist mostly to paper over this so you can swap providers without rewriting your agent. We will cover this in detail in Phase 2.

Three things, in order. Watching the model react to a well-described error is the moment tool design becomes a craft.

After today, every tool you write should have a clear name, a teaching description, typed parameters with examples, and a thoughtful error contract.

See you tomorrow for Day 6/100: The ReAct Pattern: Reasoning Plus Acting in a Loop.

This is Day 5 of the 100 Days of Learning Agentic AI series. See the full 100-day roadmap for everything we will cover. Follow along to build production-grade agents from scratch with LangChain, LangGraph, Langfuse, RAG, local models, and ten end-to-end capstone projects.
Day 5/100 Tool Use and Function Calling, Explained from Scratch https://pub.towardsai.net/day-5-100-tool-use-and-function-calling-explained-from-scratch-5d3ac2a6b9b2 was originally published in Towards AI https://pub.towardsai.net on Medium, where people are continuing the conversation by highlighting and responding to this story.