How LLM Function Calling Actually Works: From Tokens to Tool Orchestration

An engineer explains how LLM function calling works under the hood, contrasting plain text, JSON mode, and schema-constrained function calling. The post details how constrained decoding enforces schema compliance at the token level, enabling reliable structured output and multi-tool orchestration.

Originally published on my blog. Cross-posted here with a canonical link.

When you ask an LLM "Compare the weather in Tokyo and Berlin," what actually happens? The model can't browse the internet, but it can decide to call a weather API. Twice. In the same turn.

This article covers how function calling works, how the LLM returns structured data despite generating tokens one by one, and what happens when the model needs to orchestrate multiple tool calls to answer a single question.

"Function calling" is one of several ways to get output from an LLM API. Here are the three main methods.

The first is a plain text response:

```python
# `client` is an OpenAI-compatible chat completions client
response = client.chat.completions.create(
    model="gemini-2.5-flash-lite",
    messages=[
        {"role": "user", "content": "Classify this email: ..."}
    ],
)
text = response.choices[0].message.content
# "This is a job search email because..."
```

You get back free-form text. Then you have to parse it yourself, maybe with a regex, or by hoping the LLM follows instructions like "respond with JSON only". This is fragile because the LLM might say:

"I think this email is in the job search category because..."

...and now your regex breaks.

The second is JSON mode:

```python
import json

response = client.chat.completions.create(
    model="...",
    messages=[...],
    response_format={"type": "json_object"},  # force JSON output
)
data = json.loads(response.choices[0].message.content)
```

The LLM is forced to output valid JSON, but you still have no guarantee of the schema: it might return `{"cat": "job"}` instead of `{"category": "job search"}`.

The third is function calling:

```python
response = client.chat.completions.create(
    model="...",
    messages=[...],
    tools=[{
        "type": "function",
        "function": {
            "name": "classify_email",
            "parameters": {
                "type": "object",
                "properties": {
                    "category": {
                        "type": "string",
                        "enum": ["job search", "spam", "newsletter", ...]
                    },
                    "confidence": {"type": "number"},
                    "summary": {"type": "string"},
                    "reasoning": {"type": "string"}
                },
                "required": ["category", "confidence"]
            }
        }
    }],
    # Force the model to call this specific function
    tool_choice={
        "type": "function",
        "function": {"name": "classify_email"}
    },
)
```

You define the exact schema you want: field names, types, enums, required fields. The API forces the LLM to fill in that schema. The result comes back as:

```json
{
  "category": "job search",
  "confidence": 0.95,
  "summary": "LinkedIn job alert for Senior Python Developer",
  "reasoning": "Sender is LinkedIn, contains job listings..."
}
```

This is the most reliable way to get structured output. When the provider enforces the schema during decoding (OpenAI, for example, does this when the function is defined with `"strict": true`), the LLM cannot deviate from it. Without that enforcement, treat the schema as a strong hint and validate the result.

| Method | Output | Schema Guarantee | Reliability |
|---|---|---|---|
| Plain text | Free-form string | None | Low: requires manual parsing |
| JSON mode | Valid JSON | No schema enforcement | Medium: valid JSON but unpredictable keys |
| Function calling | Schema-constrained JSON | Full schema + types + enums | High: enforced at token generation level |

The LLM still generates tokens one by one. It doesn't "natively" return a Python dictionary. Here's what actually happens under the hood.

```
Tokens:  {  "category"  :  "job search"  ,  "confidence"  :  0.95  }
         ↑      ↑       ↑       ↑        ↑       ↑        ↑    ↑    ↑
       token  token   token    ...       still just text tokens
```

The LLM is still producing text; it's just text that happens to be valid JSON. The API layer does the magic.

When you use function calling, the API applies constrained decoding (also called "guided generation"):

- The `tools` schema definition determines which tokens are valid at each position.
- Where the schema says `"enum": ["job search", "spam", ...]`, the LLM can only produce one of those values.
- Where it says `"type": "number"`, only numeric tokens are valid at that position.

This is fundamentally different from just asking "please reply in JSON" in the prompt. The constraints are enforced at the token-generation level, not via prompt instructions.
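To make token-level constraints concrete, here is a toy sketch. It is not how any provider implements constrained decoding; it assumes a character-level "model" whose scores come from a hypothetical `fake_scores` function, and a greedy decoder that masks out every character that would take the output off the path to one of the enum values.

```python
ENUM_VALUES = ["job search", "spam", "newsletter"]


def allowed_next_chars(prefix, values):
    """Characters that keep `prefix` a valid prefix of at least one enum value."""
    return {
        v[len(prefix)]
        for v in values
        if v.startswith(prefix) and len(v) > len(prefix)
    }


def fake_scores(prefix):
    """Hypothetical stand-in for model logits: this 'model' wants to write 'junk'."""
    preferred = "junk"
    scores = {c: 0.1 for c in "abcdefghijklmnopqrstuvwxyz "}
    if len(prefix) < len(preferred):
        scores[preferred[len(prefix)]] = 5.0
    return scores


def constrained_decode(values):
    """Greedy decoding where disallowed characters are masked out at every step."""
    out = ""
    # Toy stopping rule: stop as soon as the output equals an enum value.
    while out not in values:
        allowed = allowed_next_chars(out, values)
        scores = fake_scores(out)
        # Mask: only characters permitted by the schema compete for selection.
        out += max(sorted(allowed), key=lambda c: scores.get(c, float("-inf")))
    return out


result = constrained_decode(ENUM_VALUES)
print(result)  # job search
```

The unconstrained "model" would have produced `junk`; after the first `j`, the mask leaves `o` as the only legal character, so the output is steered onto `job search`. Real implementations apply the same idea to token logits with a grammar derived from the full JSON schema.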
```
LLM brain
    │
    ▼  generates tokens, constrained by schema
'{"category":"job search","confidence":0.95,"summary":"..."}'
    │
    ▼  API parses & validates
response.choices[0].message.tool_calls[0].function.arguments
    │  this is still a STRING
    ▼  json.loads(arguments)
    │
    ▼  now it's a Python dict
{"category": "job search", "confidence": 0.95, "summary": "..."}
```

```python
# The API returns tool_calls as part of the response
tool_call = response.choices[0].message.tool_calls[0]

# .arguments is a STRING containing JSON
raw = tool_call.function.arguments
# '{"category":"job search","confidence":0.95,...}'

# We parse it into a Python dict
data = json.loads(raw)
# {"category": "job search", "confidence": 0.95, ...}
```

TL;DR: The LLM is still generating text/tokens. Function calling constrains which tokens it can generate (they must match your schema), and the API wraps the result in a structured format. We then `json.loads()` that string into a Python dict.

So far we've seen one function called once. But what happens when the user's question requires multiple tool calls?

```python
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "units": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "Temperature units"
                }
            },
            "required": ["city"]
        }
    }
}]
```

We have one tool definition: `get_weather`. Now watch what happens when the user asks a question that requires it twice.

```python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Compare the weather in Tokyo and Berlin right now"
    }],
    tools=tools,
)
```

The LLM doesn't return a text answer.
Instead, it returns two tool calls in a single response:

```python
message = response.choices[0].message

# message.content is None (no text response)
# message.tool_calls has TWO entries:
for tc in message.tool_calls:
    print(f"ID: {tc.id}")
    print(f"Function: {tc.function.name}")
    print(f"Args: {tc.function.arguments}")
    print()
```

Output:

```
ID: call_abc123
Function: get_weather
Args: {"city": "Tokyo", "units": "celsius"}

ID: call_def456
Function: get_weather
Args: {"city": "Berlin", "units": "celsius"}
```

The LLM decided on its own to call `get_weather` twice, once for each city, and filled in the arguments for each call.

Now you run both calls and feed the results back:

```python
import json

# Execute each tool call
tool_results = []
for tc in message.tool_calls:
    args = json.loads(tc.function.arguments)

    # Call your actual weather API
    weather = get_weather_from_api(args["city"], args.get("units", "celsius"))

    tool_results.append({
        "role": "tool",
        "tool_call_id": tc.id,  # must match the ID from the LLM
        "content": json.dumps(weather),
    })

# Send results back to the LLM
messages = [
    {"role": "user", "content": "Compare the weather in Tokyo and Berlin right now"},
    message,        # the assistant's tool_calls response
    *tool_results,  # both tool results
]

final = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools,
)
print(final.choices[0].message.content)
```

The LLM now has both weather results and generates a natural comparison:

"Right now, Tokyo is 22°C with partly cloudy skies, while Berlin is 8°C and raining. Tokyo is 14 degrees warmer. If you're choosing between the two today, Tokyo has the better weather."

```
User: "Compare weather in Tokyo and Berlin"
  │
  ▼
LLM (Turn 1): I need weather for both cities
  │
  ├─→ tool_call: get_weather(city="Tokyo")  ──→ Your code calls API ──→ {"temp": 22, ...}
  │
  └─→ tool_call: get_weather(city="Berlin") ──→ Your code calls API ──→ {"temp": 8, ...}
  │
  ▼
LLM (Turn 2): Now I have both results
  │
  └─→ "Tokyo is 22°C, Berlin is 8°C. Tokyo is 14 degrees warmer..."
```

Parallel (what happened above): The LLM returns multiple tool calls in a single response.
Both calls are independent, so your code can execute them concurrently:

```python
import asyncio

async def execute_tools_parallel(tool_calls):
    # execute_single_tool is your own coroutine that runs one tool call
    tasks = [execute_single_tool(tc) for tc in tool_calls]
    return await asyncio.gather(*tasks)
```

Sequential: Sometimes the LLM needs the result of one call before making the next. For example: "What's the weather in the capital of France?"

```
Turn 1: LLM calls get_capital(country="France") → Your code returns "Paris"
Turn 2: LLM calls get_weather(city="Paris")     → Your code returns weather data
Turn 3: LLM generates final answer
```

The LLM decides which pattern to use based on whether the calls depend on each other.

The LLM can also call different tools in the same turn. Suppose you define two tools:

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"}
                },
                "required": ["city"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_exchange_rate",
            "description": "Get currency exchange rate",
            "parameters": {
                "type": "object",
                "properties": {
                    "from_currency": {"type": "string"},
                    "to_currency": {"type": "string"}
                },
                "required": ["from_currency", "to_currency"]
            }
        }
    }
]
```

If the user asks: "I'm traveling from NYC to Tokyo next week. What's the weather like and how much is 1 USD in Yen?"

The LLM returns two different tool calls in one turn:

```python
# tool_calls[0]
{"name": "get_weather", "arguments": '{"city": "Tokyo"}'}
# tool_calls[1]
{"name": "get_exchange_rate", "arguments": '{"from_currency": "USD", "to_currency": "JPY"}'}
```

Your code routes each call to the right function:

```python
tool_handlers = {
    "get_weather": handle_weather,
    "get_exchange_rate": handle_exchange_rate,
}

for tc in message.tool_calls:
    handler = tool_handlers[tc.function.name]
    args = json.loads(tc.function.arguments)
    result = handler(**args)  # unpack the parsed arguments as keyword arguments
    # ... send result back
```

This is essentially the registry pattern: a dictionary maps function names to handlers. No if/else chains needed.
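The routing loop above assumes every call names a known tool and carries parseable arguments. A slightly more defensive sketch is shown here, under a few assumptions: tool calls are plain dicts rather than SDK objects, the two handlers are hypothetical stubs, and the exchange rate is a made-up number. Instead of raising, it reports failures back to the model as the tool result.

```python
import json


def handle_weather(city):
    # Hypothetical stub; a real handler would call a weather API.
    return {"city": city, "temp_c": 22}


def handle_exchange_rate(from_currency, to_currency):
    # Hypothetical stub with a made-up rate, for illustration only.
    return {"pair": f"{from_currency}/{to_currency}", "rate": 150.0}


TOOL_HANDLERS = {
    "get_weather": handle_weather,
    "get_exchange_rate": handle_exchange_rate,
}


def dispatch(tool_call):
    """Run one tool call (a plain dict here) and build the `tool` message for it."""
    name = tool_call["function"]["name"]
    handler = TOOL_HANDLERS.get(name)
    try:
        if handler is None:
            raise ValueError(f"unknown tool: {name}")
        # json.JSONDecodeError is a subclass of ValueError.
        args = json.loads(tool_call["function"]["arguments"])
        # TypeError covers missing or unexpected argument names.
        result = handler(**args)
    except (ValueError, TypeError) as exc:
        # Report the failure to the model instead of crashing the loop.
        result = {"error": str(exc)}
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(result),
    }


calls = [
    {"id": "call_1", "function": {"name": "get_weather", "arguments": '{"city": "Tokyo"}'}},
    {"id": "call_2", "function": {"name": "get_stock_price", "arguments": "{}"}},
]
tool_messages = [dispatch(c) for c in calls]
```

Returning an error payload gives the model a chance to retry with corrected arguments or explain the failure to the user, which is usually preferable to an unhandled exception mid-conversation.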
The `tool_choice` parameter controls whether and how the LLM uses tools:

| `tool_choice` | Behavior | Use Case |
|---|---|---|
| `"auto"` | LLM decides whether to call tools or respond with text | General-purpose agents |
| `"required"` | LLM must call at least one tool | When you always need structured output |
| `{"type": "function", "function": {"name": "..."}}` | LLM must call this specific function | Email classification (always classify) |
| `"none"` | LLM cannot call any tools | Force a text-only response |

For the weather comparison, we use `"auto"`: the LLM decides on its own that it needs to call `get_weather` twice.

- Function calling > JSON mode > plain text for getting structured data from LLMs. Function calling enforces your schema at the token generation level, not just via prompt instructions.
- LLMs still generate tokens; they don't natively return dicts. The API layer applies constrained decoding to ensure the token output matches your schema, then you `json.loads()` the resulting string.
- One question can trigger multiple tool calls. The LLM decides whether to call the same tool with different arguments (Tokyo + Berlin) or different tools entirely (weather + exchange rate), all in a single turn.
- Parallel vs sequential is decided by the LLM. Independent calls (two cities) come back in one turn. Dependent calls (`get_capital` → `get_weather`) happen across multiple turns.
- Route tool calls with a registry, not if/else. A dictionary mapping function names to handlers keeps your code clean and extensible.
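The sequential pattern boils down to a loop: call the model, run any tool calls it returns, append the results, and repeat until it answers in text. The sketch below illustrates that loop under stated assumptions: `scripted_model` is a hypothetical stand-in that replays the `get_capital` → `get_weather` example so the code runs without an API key, the tool-call dicts are flattened compared with real SDK objects, and the weather value is a placeholder.

```python
import json


def get_capital(country):
    # Hypothetical stub covering only the example country.
    return {"capital": "Paris"} if country == "France" else {"capital": None}


def get_weather(city):
    return {"city": city, "temp_c": 8}  # placeholder value


HANDLERS = {"get_capital": get_capital, "get_weather": get_weather}


def scripted_model(messages):
    """Hypothetical stand-in for the LLM: picks the next step from the history."""
    tool_msgs = [m for m in messages if m["role"] == "tool"]
    if len(tool_msgs) == 0:
        return {"role": "assistant", "content": None, "tool_calls": [
            {"id": "call_1", "name": "get_capital",
             "arguments": json.dumps({"country": "France"})}]}
    if len(tool_msgs) == 1:
        capital = json.loads(tool_msgs[0]["content"])["capital"]
        return {"role": "assistant", "content": None, "tool_calls": [
            {"id": "call_2", "name": "get_weather",
             "arguments": json.dumps({"city": capital})}]}
    weather = json.loads(tool_msgs[1]["content"])
    return {"role": "assistant", "tool_calls": None,
            "content": f"It is {weather['temp_c']} degrees C in {weather['city']}."}


def run_agent(question, model, max_turns=5):
    """Loop until the model replies with text instead of tool calls."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        reply = model(messages)
        messages.append(reply)
        if not reply["tool_calls"]:
            return reply["content"], messages
        for tc in reply["tool_calls"]:
            result = HANDLERS[tc["name"]](**json.loads(tc["arguments"]))
            messages.append({"role": "tool", "tool_call_id": tc["id"],
                             "content": json.dumps(result)})
    raise RuntimeError("model did not finish within max_turns")


answer, history = run_agent(
    "What's the weather in the capital of France?", scripted_model
)
```

The `max_turns` cap is a design choice rather than anything an API requires: it keeps a model that never stops requesting tools from looping forever.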