Using LLMs for Dialogue Management

wpnews.pro

Dialogue management is the process of tracking conversational state and deciding what an agent should say or do next. Classical systems split this into isolated modules: natural language understanding, dialogue state tracking, a policy engine, and response generation. Large language models can collapse these boundaries into a single inference step, but doing so reliably requires careful architecture choices. This article examines practical patterns for using LLMs as dialogue managers, with a focus on structured reasoning, tool use, and cost-efficient inference.

An LLM-based dialogue manager treats conversation as a partially observable decision process in which the model itself reasons over history, user intent, and available actions. Instead of relying on hand-written rules or separate slot taggers, the model receives the full transcript, a system prompt defining the task, and optionally a schema of tools it can invoke. It then emits either natural language or structured JSON representing the next system action. This approach works best in open-domain settings or rapidly changing domains where maintaining a rigid ontology is impractical.

Most production implementations fall into one of four patterns. The right choice depends on how much control you need over state transitions and how willing you are to trade complexity for flexibility.

End-to-end generation. The LLM receives the full chat history and outputs the next response. It works well for unstructured chit-chat but can hallucinate state or ignore business rules without additional guardrails.

Structured state extraction. The LLM is prompted to output a JSON object representing dialogue state, such as slots, user intent, and confirmed facts. A lightweight policy layer reads this state to decide whether to ask a question, call an API, or close the task. This separates reasoning from control and makes debugging easier.
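The split between extraction and control can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the state layout (`slots`, `confirmed`), the slot names, and the action names are all assumptions chosen for the example, and the state dict is assumed to be the parsed JSON the model produced.

```python
# Hypothetical slot names; a real task would define its own.
REQUIRED_SLOTS = ("product", "quantity")

def next_action(state: dict) -> dict:
    """Map an extracted dialogue state to the next system action."""
    slots = state.get("slots", {})
    # A slot counts as missing only when it is absent or None,
    # so a legitimate value such as 0 is not treated as missing.
    missing = [s for s in REQUIRED_SLOTS if slots.get(s) is None]
    if missing:
        return {"action": "ask", "slot": missing[0]}
    if not state.get("confirmed", False):
        return {"action": "confirm", "slots": dict(slots)}
    return {"action": "call_api", "args": dict(slots)}
```

Because the policy is plain code, each decision can be unit tested and logged independently of the model that produced the state.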

Tool-augmented manager. The LLM uses function calling to interact with external APIs. The dialogue manager is the LLM loop itself: the user sends a message, the model decides whether to ask for clarification or invoke a tool, the tool result is appended to the history, and the model generates the final response. This is typically the most robust pattern for task-oriented dialogue.

Hybrid classifier-LLM. For high-stakes domains, a traditional intent classifier routes the user to a specific LLM prompt specialized for that workflow. This reduces variance in behavior but adds the latency of an extra classification step.
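A routing layer of this kind might look like the sketch below. The keyword matcher is only a stand-in for a trained intent classifier, and the intent names, trigger phrases, and prompt texts are hypothetical.

```python
# Hypothetical workflow prompts, one per intent.
PROMPTS = {
    "refund": "You handle refund requests. Collect the order ID before acting.",
    "order_status": "You answer order status questions. Collect the order ID.",
    "fallback": "You are a general support agent. Escalate if unsure.",
}

# Stand-in for a trained classifier: simple phrase matching.
KEYWORDS = {
    "refund": ("refund", "money back"),
    "order_status": ("where is my order", "tracking", "order status"),
}

def classify_intent(text: str) -> str:
    """Return the first intent whose trigger phrase appears in the text."""
    lowered = text.lower()
    for intent, phrases in KEYWORDS.items():
        if any(p in lowered for p in phrases):
            return intent
    return "fallback"

def route(text: str) -> list:
    """Build the message list for the workflow-specific prompt."""
    intent = classify_intent(text)
    return [
        {"role": "system", "content": PROMPTS[intent]},
        {"role": "user", "content": text},
    ]
```

The returned list can be passed as the messages argument of a chat completion call, so each workflow only ever sees its own specialized prompt.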

The system prompt should act as a specification. Include the agent's role and constraints, a schema of required slots or facts to collect, rules for escalation or handoff, and output format instructions. When available, JSON mode guarantees syntactically valid JSON from the model, though not necessarily conformance to your schema, so the parsed output should still be validated.
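One way to keep the prompt and the validation in sync is to generate both from the same specification. The sketch below assumes a hypothetical output contract with the keys `intent`, `slots`, and `reply`; the function names and prompt layout are illustrative only.

```python
import json

# Assumed output contract for this example.
REQUIRED_KEYS = ("intent", "slots", "reply")

def build_system_prompt(role, constraints, slots, escalation_rules):
    """Render a specification-style system prompt from structured parts."""
    lines = [f"Role: {role}", "", "Constraints:"]
    lines += [f"- {c}" for c in constraints]
    lines += ["", "Required slots:"]
    lines += [f"- {name} ({typ})" for name, typ in slots.items()]
    lines += ["", "Escalation rules:"]
    lines += [f"- {r}" for r in escalation_rules]
    keys = ", ".join(REQUIRED_KEYS)
    lines += ["", f"Output format: a single JSON object with keys {keys}."]
    return "\n".join(lines)

def parse_state(raw: str) -> dict:
    """Parse model output and check it against the assumed contract."""
    data = json.loads(raw)  # raises a ValueError subclass on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [k for k in REQUIRED_KEYS if k not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data
```

A failed `parse_state` call is a natural point to retry the request or fall back to a clarifying question.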

For multi-turn tracking, prepend a structured summary of the conversation so far. Few-shot examples of edge cases, such as user corrections or topic switches, improve robustness without adding external logic. If the dialogue spans many turns, inject a compressed memory block into the system prompt rather than sending the entire raw transcript.
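The memory-block idea can be sketched as follows. The `[MEMORY]` delimiters, the function names, and the fact keys are assumptions made for illustration; the summary text is assumed to come from an earlier summarization step.

```python
def render_memory(facts: dict, summary: str) -> str:
    """Render a compact, deterministic memory block for the system prompt."""
    fact_lines = "\n".join(f"- {k}: {v}" for k, v in sorted(facts.items()))
    return (
        "[MEMORY]\n"
        f"Summary: {summary}\n"
        f"Known facts:\n{fact_lines}\n"
        "[/MEMORY]"
    )

def build_messages(base_prompt, facts, summary, recent_turns):
    """Combine the base prompt, the memory block, and only the recent turns."""
    system = base_prompt + "\n\n" + render_memory(facts, summary)
    return [{"role": "system", "content": system}] + list(recent_turns)
```

Sorting the facts keeps the block stable between turns, which makes prompt diffs easier to read when debugging.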

The following example uses the OpenAI Python SDK pointed at Oxlo.ai. It implements a simple e-commerce support agent that collects a product name and quantity before checking inventory.

import os
import json
import openai

client = openai.OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=os.environ["OXLO_API_KEY"]
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_inventory",
            "description": "Check whether a product is available in a given quantity",
            "parameters": {
                "type": "object",
                "properties": {
                    "product": {"type": "string"},
                    "quantity": {"type": "integer"}
                },
                "required": ["product", "quantity"]
            }
        }
    }
]

system_prompt = """You are a dialogue manager for an e-commerce support agent.
Track the user's intent and required slots: product name and quantity.
Ask clarifying questions if slots are missing.
Only call search_inventory when both slots are confirmed."""

history = [{"role": "system", "content": system_prompt}]

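def search_inventory(product: str, quantity: int):
    # Stub for illustration: always reports the item as available.
    # Replace with a real inventory lookup.
    return {"available": True, "product": product, "quantity": quantity}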

def dialogue_turn(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})

    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=history,
        tools=tools,
        tool_choice="auto"
    )

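    msg = response.choices[0].message
    # Append the assistant turn, including any tool calls, so the
    # follow-up request sees it.
    history.append(msg)

    if msg.tool_calls:
        for tc in msg.tool_calls:
            if tc.function.name == "search_inventory":
                args = json.loads(tc.function.arguments)
                result = search_inventory(args["product"], args["quantity"])
            else:
                # Every tool call needs a matching tool message, even an
                # unexpected one, or the next request is rejected.
                result = {"error": f"unknown tool: {tc.function.name}"}
            history.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": json.dumps(result)
            })

        # Second call: let the model turn the tool results into a reply.
        final = client.chat.completions.create(
            model="llama-3.3-70b",
            messages=history,
            tools=tools
        )
        history.append(final.choices[0].message)
        # content can be None if the model requests another tool call.
        return final.choices[0].message.content or ""

    return msg.content or ""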

This loop maintains the full conversation history in history, appends tool results, and lets the model decide when it has enough information to act. Because Oxlo.ai is fully OpenAI SDK compatible, the same code runs without client changes.

As conversations grow, each turn appends more tokens to the prompt. On token-based platforms, the cost of each new turn grows roughly linearly with the length of the history, so the total cost of a conversation grows faster than the number of turns. For production systems with thousands of concurrent sessions of varying length, spend becomes hard to predict.
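A rough back-of-the-envelope model makes the growth concrete. The sketch assumes every exchange adds a fixed number of tokens and that the full history is resent on each turn; the token counts are made-up illustrative values, not measurements.

```python
def prompt_sizes(turns, tokens_per_exchange, system_tokens):
    """Prompt size on each turn when the whole history is resent."""
    return [system_tokens + k * tokens_per_exchange for k in range(1, turns + 1)]

def cumulative_prompt_tokens(turns, tokens_per_exchange, system_tokens):
    """Total prompt tokens billed over the whole conversation."""
    return sum(prompt_sizes(turns, tokens_per_exchange, system_tokens))

# Illustrative numbers: a 200-token system prompt, 100 tokens per exchange.
ten = cumulative_prompt_tokens(10, 100, 200)     # 7500
twenty = cumulative_prompt_tokens(20, 100, 200)  # 25000
```

Under these assumptions, doubling the conversation from 10 to 20 turns more than triples the prompt tokens billed, because the per-turn prompt keeps growing.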

There are three standard mitigation strategies. First, summarization: periodically replace the oldest turns with a condensed summary generated by the model. Second, sliding window: retain only the last N turns plus a persistent user profile. Third, external memory: extract key facts into a key-value store or vector database and inject them into the system prompt.
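The sliding-window strategy can be sketched as below. It assumes the history is a list of plain dicts with a `role` key (the message objects returned by an SDK would first need converting), and it treats a turn as a user message plus everything that follows it, so tool results are never separated from the call that produced them. The function name and the profile format are hypothetical.

```python
def sliding_window(history, max_turns, profile=None):
    """Keep system messages plus the last max_turns user-initiated turns."""
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    system = [m for m in history if m["role"] == "system"]
    dialogue = [m for m in history if m["role"] != "system"]
    # Cut only at user messages so assistant tool calls and their
    # tool results always stay together.
    user_idx = [i for i, m in enumerate(dialogue) if m["role"] == "user"]
    if len(user_idx) > max_turns:
        dialogue = dialogue[user_idx[-max_turns]:]
    if profile:
        system.append({"role": "system", "content": f"User profile: {profile}"})
    return system + dialogue
```

Summarization fits the same shape: instead of discarding the turns before the cut, pass them to the model once and store the result as part of the persistent profile or memory block.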

From an infrastructure perspective, the billing model matters. Oxlo.ai uses request-based pricing, so the cost per turn remains flat regardless of how much history is included in the prompt. For long-running dialogues or agentic loops that carry extensive context, this avoids the linear cost growth typical of token-based billing. See the pricing page for details.

source & further reading

dev.to (original article)