{"slug": "how-llm-function-calling-actually-works-from-tokens-to-tool-orchestration", "title": "How LLM Function Calling Actually Works — From Tokens to Tool Orchestration", "summary": "An engineer explains how LLM function calling works under the hood, contrasting plain text, JSON mode, and schema-constrained function calling. The post details how constrained decoding enforces schema compliance at the token level, enabling reliable structured output and multi-tool orchestration.", "body_md": "Originally published on [my blog]. Cross-posted here with a canonical link.\n\nWhen you ask an LLM \"Compare the weather in Tokyo and Berlin,\" what actually happens? The model can't browse the internet, but it *can* decide to call a weather API. Twice. In the same turn.\n\nThis article covers how function calling works, how the LLM returns structured data despite generating tokens one by one, and what happens when the model needs to orchestrate multiple tool calls to answer a single question.\n\n\"Function calling\" is one of several ways to get output from an LLM API. There are three main methods: plain text, JSON mode, and function calling. Plain text is the default:\n\n```\nresponse = client.chat.completions.create(\n    model=\"gemini-2.5-flash-lite\",\n    messages=[{\"role\": \"user\", \"content\": \"Classify this email: ...\"}],\n)\ntext = response.choices[0].message.content\n# \"This is a job_search email because...\"\n```\n\nYou get back **free-form text**, which you then have to parse yourself, maybe with a regex, or by hoping the LLM follows an instruction like \"respond with JSON only\". 
This is fragile because the LLM might say:\n\n\"I think this email is in the job_search category because...\"\n\n...and now your regex breaks.\n\nJSON mode removes the parsing problem:\n\n```\nimport json\n\nresponse = client.chat.completions.create(\n    model=\"...\",\n    messages=[...],\n    response_format={\"type\": \"json_object\"},  # force JSON output\n)\ndata = json.loads(response.choices[0].message.content)\n```\n\nThe LLM is **forced to output valid JSON**, but you still have no guarantee of the schema: it might return `{\"cat\": \"job\"}` instead of `{\"category\": \"job_search\"}`.\n\nFunction calling closes that gap:\n\n```\nresponse = client.chat.completions.create(\n    model=\"...\",\n    messages=[...],\n    tools=[{\n        \"type\": \"function\",\n        \"function\": {\n            \"name\": \"classify_email\",\n            \"parameters\": {\n                \"type\": \"object\",  # the root of the parameter schema is an object\n                \"properties\": {\n                    \"category\": {\n                        \"type\": \"string\",\n                        \"enum\": [\"job_search\", \"spam\", \"newsletter\", ...]\n                    },\n                    \"confidence\": {\"type\": \"number\"},\n                    \"summary\": {\"type\": \"string\"},\n                    \"reasoning\": {\"type\": \"string\"}\n                },\n                \"required\": [\"category\", \"confidence\"]\n            }\n        }\n    }],\n    tool_choice={\n        \"type\": \"function\",\n        \"function\": {\"name\": \"classify_email\"}\n    },\n)\n```\n\nYou define the **exact schema** you want: field names, types, enums, required fields. The API forces the LLM to fill in that schema. The result comes back as:\n\n```\n{\n    \"category\": \"job_search\",\n    \"confidence\": 0.95,\n    \"summary\": \"LinkedIn job alert for Senior Python Developer\",\n    \"reasoning\": \"Sender is LinkedIn, contains job listings...\"\n}\n```\n\nThis is the most reliable way to get structured output. 
With strict schema enforcement enabled on the provider side (for example, `strict: true` in an OpenAI function definition), the LLM **cannot** deviate from the schema. Without it, adherence is best-effort, so validate the parsed arguments before using them.\n\n| Method | Output | Schema Guarantee | Reliability |\n|---|---|---|---|\n| Plain text | Free-form string | None | Low: requires manual parsing |\n| JSON mode | Valid JSON | No schema enforcement | Medium: valid JSON but unpredictable keys |\n| Function calling | Schema-constrained JSON | Full schema + types + enums | High: enforced at token generation level |\n\n**The LLM still generates tokens one by one.** It doesn't \"natively\" return a Python dictionary. Here's what actually happens under the hood.\n\n```\nTokens:  {  \"  category  \"  :  \"  job  _  search  \"  ,  \"  confidence  \"  :  0  .  95  }\n         ↑    ↑       ↑   ↑  ↑    ↑   ↑   ↑      ↑   ↑       ↑       ↑  ↑   ↑   ↑\n       token token  token ... (still just text tokens)\n```\n\nThe LLM is still producing **text**; it's just text that happens to be valid JSON. The API layer does the rest.\n\nWhen you use function calling, the API applies **constrained decoding** (also called \"guided generation\"):\n\n- The constraints come from your `tools` schema definition.\n- For `\"enum\": [\"job_search\", \"spam\", ...]`, the LLM can only emit one of the listed values.\n- For `\"type\": \"number\"`, only numeric tokens are valid at that position.\n\nThis is fundamentally different from just asking \"please reply in JSON\" in the prompt. 
The constraints are enforced at the **token-generation level**, not via prompt instructions.\n\n```\nLLM brain\n  │\n  ▼ (generates tokens, constrained by schema)\n'{\"category\":\"job_search\",\"confidence\":0.95,\"summary\":\"...\"}'\n  │\n  ▼ (API parses & validates)\nresponse.choices[0].message.tool_calls[0].function.arguments\n  │  (this is still a STRING)\n  ▼\njson.loads(arguments)\n  │\n  ▼ (now it's a Python dict)\n{\"category\": \"job_search\", \"confidence\": 0.95, \"summary\": \"...\"}\n```\n\nIn code:\n\n```\nimport json\n\n# The API returns tool_calls as part of the response\ntool_call = response.choices[0].message.tool_calls[0]\n\n# .arguments is a STRING containing JSON\nraw = tool_call.function.arguments\n# '{\"category\":\"job_search\",\"confidence\":0.95,...}'\n\n# We parse it into a Python dict\ndata = json.loads(raw)\n# {\"category\": \"job_search\", \"confidence\": 0.95, ...}\n```\n\n**TL;DR**: The LLM is still generating text/tokens. Function calling constrains *which* tokens it can generate (must match your schema), and the API wraps the result in a structured format. We then `json.loads()` that string into a Python dict.\n\nSo far we've seen one function called once. But what happens when the user's question requires **multiple tool calls**?\n\n```\ntools = [{\n    \"type\": \"function\",\n    \"function\": {\n        \"name\": \"get_weather\",\n        \"description\": \"Get current weather for a city\",\n        \"parameters\": {\n            \"type\": \"object\",\n            \"properties\": {\n                \"city\": {\"type\": \"string\", \"description\": \"City name\"},\n                \"units\": {\n                    \"type\": \"string\",\n                    \"enum\": [\"celsius\", \"fahrenheit\"],\n                    \"description\": \"Temperature units\"\n                }\n            },\n            \"required\": [\"city\"]\n        }\n    }\n}]\n```\n\nWe have **one** tool definition, `get_weather`. 
Now watch what happens when the user asks a question that requires it twice.\n\n```\nresponse = client.chat.completions.create(\n    model=\"gpt-4o\",\n    messages=[{\n        \"role\": \"user\",\n        \"content\": \"Compare the weather in Tokyo and Berlin right now\"\n    }],\n    tools=tools,\n)\n```\n\nThe LLM doesn't return a text answer. Instead, it returns **two** tool calls in a single response:\n\n```\nmessage = response.choices[0].message\n\n# message.content is None (no text response)\n# message.tool_calls has TWO entries:\n\nfor tc in message.tool_calls:\n    print(f\"ID: {tc.id}\")\n    print(f\"Function: {tc.function.name}\")\n    print(f\"Args: {tc.function.arguments}\")\n    print()\n```\n\nOutput:\n\n```\nID: call_abc123\nFunction: get_weather\nArgs: {\"city\": \"Tokyo\", \"units\": \"celsius\"}\n\nID: call_def456\nFunction: get_weather\nArgs: {\"city\": \"Berlin\", \"units\": \"celsius\"}\n```\n\nThe LLM **decided on its own** to call `get_weather` twice, once for Tokyo and once for Berlin, and filled in the arguments for each call.\n\nNow you run both calls and feed the results back:\n\n```python\nimport json\n\n# Execute each tool call\ntool_results = []\nfor tc in message.tool_calls:\n    args = json.loads(tc.function.arguments)\n    # Call your actual weather API\n    weather = get_weather_from_api(args[\"city\"], args.get(\"units\", \"celsius\"))\n    tool_results.append({\n        \"role\": \"tool\",\n        \"tool_call_id\": tc.id,     # must match the ID from the LLM\n        \"content\": json.dumps(weather)\n    })\n\n# Send results back to the LLM\nmessages = [\n    {\"role\": \"user\", \"content\": \"Compare the weather in Tokyo and Berlin right now\"},\n    message,           # the assistant's tool_calls response\n    *tool_results,     # both tool results\n]\n\nfinal = client.chat.completions.create(\n    model=\"gpt-4o\",\n    messages=messages,\n    tools=tools,\n)\n\nprint(final.choices[0].message.content)\n```\n\nThe LLM now has both weather results and generates a natural comparison:\n\n\"Right now, Tokyo is 22°C with partly 
cloudy skies, while Berlin is 8°C and raining. Tokyo is 14 degrees warmer. If you're choosing between the two today, Tokyo has the better weather.\"\n\n```\nUser: \"Compare weather in Tokyo and Berlin\"\n  │\n  ▼\nLLM (Turn 1): I need weather for both cities\n  │\n  ├─→ tool_call: get_weather(city=\"Tokyo\")    ──→ Your code calls API ──→ {\"temp\": 22, ...}\n  │\n  └─→ tool_call: get_weather(city=\"Berlin\")   ──→ Your code calls API ──→ {\"temp\": 8, ...}\n  │\n  ▼\nLLM (Turn 2): Now I have both results\n  │\n  └─→ \"Tokyo is 22°C, Berlin is 8°C. Tokyo is 14 degrees warmer...\"\n```\n\n**Parallel** (what happened above): The LLM returns multiple `tool_calls` in a single response. Both calls are independent, so your code can execute them concurrently:\n\n```python\nimport asyncio\n\nasync def execute_tools_parallel(tool_calls):\n    # execute_single_tool is an async helper you define to run one tool call\n    tasks = [execute_single_tool(tc) for tc in tool_calls]\n    return await asyncio.gather(*tasks)\n```\n\n**Sequential**: Sometimes the LLM needs the result of one call before making the next. For example: \"What's the weather in the capital of France?\"\n\n```\nTurn 1: LLM calls get_capital(country=\"France\")\n  → Your code returns \"Paris\"\nTurn 2: LLM calls get_weather(city=\"Paris\")\n  → Your code returns weather data\nTurn 3: LLM generates final answer\n```\n\nThe LLM decides which pattern to use based on whether the calls depend on each other.\n\nThe LLM can also call **different** tools in the same turn. 
Suppose you define two tools:\n\n```\ntools = [\n    {\n        \"type\": \"function\",\n        \"function\": {\n            \"name\": \"get_weather\",\n            \"description\": \"Get current weather for a city\",\n            \"parameters\": {\n                \"type\": \"object\",\n                \"properties\": {\n                    \"city\": {\"type\": \"string\"}\n                },\n                \"required\": [\"city\"]\n            }\n        }\n    },\n    {\n        \"type\": \"function\",\n        \"function\": {\n            \"name\": \"get_exchange_rate\",\n            \"description\": \"Get currency exchange rate\",\n            \"parameters\": {\n                \"type\": \"object\",\n                \"properties\": {\n                    \"from_currency\": {\"type\": \"string\"},\n                    \"to_currency\": {\"type\": \"string\"}\n                },\n                \"required\": [\"from_currency\", \"to_currency\"]\n            }\n        }\n    }\n]\n```\n\nIf the user asks: *\"I'm traveling from NYC to Tokyo next week. What's the weather like and how much is 1 USD in Yen?\"*\n\nThe LLM returns **two different tool calls** in one turn:\n\n```\n# tool_calls[0]\n{\"name\": \"get_weather\", \"arguments\": '{\"city\": \"Tokyo\"}'}\n\n# tool_calls[1]\n{\"name\": \"get_exchange_rate\", \"arguments\": '{\"from_currency\": \"USD\", \"to_currency\": \"JPY\"}'}\n```\n\nYour code routes each call to the right function:\n\n```\ntool_handlers = {\n    \"get_weather\": handle_weather,\n    \"get_exchange_rate\": handle_exchange_rate,\n}\n\nfor tc in message.tool_calls:\n    handler = tool_handlers[tc.function.name]\n    args = json.loads(tc.function.arguments)\n    result = handler(**args)\n    # ... send result back\n```\n\nThis is essentially the **registry pattern** — a dictionary maps function names to handlers. 
No if/else chains needed.\n\nThe `tool_choice` parameter controls whether and how the LLM uses tools:\n\n| `tool_choice` | Behavior | Use Case |\n|---|---|---|\n| `\"auto\"` | LLM decides whether to call tools or respond with text | General-purpose agents |\n| `\"required\"` | LLM must call at least one tool | When you always need structured output |\n| `{\"type\": \"function\", \"function\": {\"name\": \"...\"}}` | LLM must call this specific function | Email classification (always classify) |\n| `\"none\"` | LLM cannot call any tools | Force a text-only response |\n\nFor the weather comparison, we use `\"auto\"` (the default when `tools` is supplied): the LLM decides on its own that it needs to call `get_weather` twice.\n\n**Function calling > JSON mode > plain text** for getting structured data from LLMs. Function calling can enforce your schema at the token generation level, not just via prompt instructions.\n\n**LLMs still generate tokens**; they don't natively return dicts. The API layer applies constrained decoding to ensure the token output matches your schema, then you `json.loads()` the resulting string.\n\n**One question can trigger multiple tool calls.** The LLM decides whether to call the same tool with different arguments (Tokyo + Berlin) or different tools entirely (weather + exchange rate), all in a single turn.\n\n**Parallel vs sequential is decided by the LLM.** Independent calls (two cities) come back in one turn. 
Dependent calls (get capital → get weather) happen across multiple turns.\n\n**Route tool calls with a registry, not if/else.** A dictionary mapping function names to handlers keeps your code clean and extensible.", "url": "https://wpnews.pro/news/how-llm-function-calling-actually-works-from-tokens-to-tool-orchestration", "canonical_source": "https://dev.to/vahid_aghajani_60ce9dbec9/how-llm-function-calling-actually-works-from-tokens-to-tool-orchestration-27fb", "published_at": "2026-07-04 17:42:25+00:00", "updated_at": "2026-07-04 17:56:20.015406+00:00", "lang": "en", "topics": ["large-language-models", "developer-tools"], "entities": ["Gemini"], "alternates": {"html": "https://wpnews.pro/news/how-llm-function-calling-actually-works-from-tokens-to-tool-orchestration", "markdown": "https://wpnews.pro/news/how-llm-function-calling-actually-works-from-tokens-to-tool-orchestration.md", "text": "https://wpnews.pro/news/how-llm-function-calling-actually-works-from-tokens-to-tool-orchestration.txt", "jsonld": "https://wpnews.pro/news/how-llm-function-calling-actually-works-from-tokens-to-tool-orchestration.jsonld"}}