{"slug": "running-a-langgraph-react-agent-in-production-openai-compatible-api-multi-model", "title": "Running a LangGraph ReAct Agent in Production: OpenAI-Compatible API + Multi-Model Gateway + One-Line Tracing", "summary": "A developer built a production-ready LangGraph ReAct agent that exposes an OpenAI-compatible API, supports multi-model switching via a gateway, and includes one-line tracing with Langfuse. The deployment uses a FastAPI router, a LangGraph StateGraph with a ReAct loop, and a Qdrant vector store for RAG, all in roughly 150 lines of Python. The agent can be driven by any OpenAI-compatible client like Open WebUI or LibreChat without adapter code.", "body_md": "Most LangGraph content stops at the notebook. You build a cute ReAct loop, it answers one question, and the article ends before the hard part: *how do you actually serve this thing, swap models without a rewrite, and see what it's doing when it misbehaves?*\n\nThis post walks through a small but **production-shaped** LangGraph deployment: a RAG ReAct agent that\n\n- exposes an OpenAI-compatible API, so any OpenAI client (Open WebUI, the `openai` SDK, LibreChat) can talk to it unchanged,\n- switches models through a gateway instead of a code change,\n- traces every request in Langfuse with one line of wiring.\n\nEvery snippet below is real code from a working service. Roughly 150 lines of Python is all it takes.\n\n```\nOpenAI client (Open WebUI, openai SDK)\n        │  POST /v1/chat/completions\n        ▼\nFastAPI router ──► LangGraph StateGraph ──► LLM Gateway ──► model (hosted API today, vLLM tomorrow)\n        │                   │\n        │                   └──► ToolNode ──► Qdrant (RAG)\n        │\n        └──► Langfuse callback (one trace per request)\n```\n\nThe contract with the outside world is **just the OpenAI API**. Everything interesting (the graph, RAG, tracing) lives behind that boundary. 
That single decision is what lets an off-the-shelf chat UI drive a custom agent with zero adapter code.\n\nThe graph is deliberately tiny: one `agent` node that reasons, one `tools` node that retrieves, and a conditional edge that loops between them until the model stops asking for tools.\n\n``` python\n# app/graph/builder.py\nfrom langgraph.graph import END, StateGraph\nfrom langgraph.prebuilt import ToolNode, tools_condition\n\ndef build_graph():\n    g = StateGraph(AgentState)\n    g.add_node(\"agent\", agent_node)\n    g.set_entry_point(\"agent\")\n\n    # ReAct: if the model emits tool_calls, go to `tools`; otherwise END.\n    g.add_node(\"tools\", ToolNode(TOOLS))\n    g.add_conditional_edges(\"agent\", tools_condition)\n    g.add_edge(\"tools\", \"agent\")\n    return g.compile()\n```\n\n`tools_condition` and `ToolNode` are LangGraph prebuilts that do the unglamorous work: inspect the last message for `tool_calls`, route accordingly, execute the tools, and append `ToolMessage`s back into state. You wire the loop; they run it.\n\nState is a single shared message log with a reducer that *appends* rather than replaces:\n\n``` python\n# app/graph/state.py\nfrom typing import Annotated, TypedDict\nfrom langchain_core.messages import BaseMessage\nfrom langgraph.graph.message import add_messages\n\nclass AgentState(TypedDict, total=False):\n    messages: Annotated[list[BaseMessage], add_messages]\n```\n\n`add_messages` is the reducer. Every node returns `{\"messages\": [...]}` and LangGraph merges it into the running log: no manual list-shuffling, and it's what makes the agent⇄tools loop accumulate context correctly.\n\nThe agent node binds the tools and calls the model. 
Note that `bind_tools` is conditional, so with RAG switched off the exact same node degrades to a plain single-shot chat call:\n\n``` python\n# app/graph/nodes/agent.py\nfrom langchain_core.messages import SystemMessage\n\nasync def agent_node(state: AgentState) -> dict:\n    llm = get_llm()\n    if get_settings().rag_enabled:\n        llm = llm.bind_tools(TOOLS)\n    messages = [SystemMessage(content=SYSTEM_PROMPT), *state[\"messages\"]]\n    response = await llm.ainvoke(messages)\n    return {\"messages\": [response]}\n```\n\nAnd the tool itself is an ordinary `@tool`-decorated function. The docstring is not documentation; it's the prompt the model reads to decide *when* to call it:\n\n``` python\n# app/graph/tools.py\nfrom langchain_core.tools import tool\n\n@tool\ndef search_docs(query: str) -> str:\n    \"\"\"Search internal docs for content relevant to the question.\n    When the user asks about the project/system/docs, call this first.\"\"\"\n    hits = get_vector_store().similarity_search(query, k=get_settings().rag_top_k)\n    blocks = [\n        f\"[{i}] (source: {doc.metadata.get('source', 'unknown')})\\n{doc.page_content.strip()}\"\n        for i, doc in enumerate(hits, 1)\n    ]\n    return \"\\n\\n\".join(blocks) or \"No relevant documents found.\"\n```\n\nReturning a `[1] (source: ...)` structure isn't cosmetic: it's how the model can cite sources in its final answer, which is the difference between a demo and something people trust.\n\nHere's the lever that makes everything else cheap: the agent speaks OpenAI's wire format. 
The router turns an incoming `/v1/chat/completions` request into graph input and the graph's output back into an OpenAI response.\n\n``` python\n# app/api/router.py\n@router.post(\"/v1/chat/completions\")\nasync def chat_completions(req: ChatCompletionRequest):\n    graph = get_graph()\n    inputs = {\"messages\": to_langchain_messages(req.messages)}\n    config: dict = {}\n\n    if not req.stream:\n        result = await graph.ainvoke(inputs, config=config)\n        text = extract_final_text(result.get(\"messages\", []))\n        return make_completion(text, settings.served_model_name)\n\n    return StreamingResponse(\n        graph_to_openai_sse(graph, inputs, settings.served_model_name, config=config),\n        media_type=\"text/event-stream\",\n    )\n```\n\nBecause the response matches OpenAI's schema (including SSE streaming chunks), **Open WebUI thinks it's talking to OpenAI**. You point its `openaiBaseUrl` at this service and your custom RAG agent shows up as a selectable model. No frontend work.\n\nLangGraph nodes never name a provider. They call one factory:\n\n``` python\n# app/llm/client.py\nfrom langchain_openai import ChatOpenAI\n\ndef get_llm(model=None, temperature=None, streaming=True) -> ChatOpenAI:\n    s = get_settings()\n    return ChatOpenAI(\n        base_url=f\"{s.litellm_url}/v1\",   # gateway, not a provider\n        api_key=s.litellm_key,\n        model=model or s.default_model,\n        temperature=s.default_temperature if temperature is None else temperature,\n        streaming=streaming,\n    )\n```\n\nThe `base_url` points at a **LiteLLM gateway**, not at any specific vendor. LiteLLM exposes an OpenAI-compatible endpoint and fans out to whatever its `model_list` says: a hosted API today, self-hosted vLLM tomorrow. 
Migrating off a paid API to an in-cluster GPU model becomes a *gateway config edit*; this Python file never changes.\n\nThere's one deliberate escape hatch: when the gateway is down locally, point straight at Ollama's OpenAI-compatible endpoint:\n\n``` python\n    if s.chat_provider.lower() == \"ollama\":\n        return ChatOpenAI(base_url=f\"{s.ollama_url}/v1\", api_key=\"ollama\",\n                          model=model or s.ollama_chat_model, ...)\n```\n\nSame `ChatOpenAI` class, different `base_url`. The OpenAI-compatible interface shows up *three* times in this architecture (inbound API, gateway, and local fallback), and that consistency is the whole trick.\n\nA multi-node graph with a tool loop is opaque when it goes wrong. Did the model skip the tool? Retrieve garbage? Loop twice? Langfuse's LangChain callback captures the entire run (every node transition, tool call, and LLM call) as a single nested trace.\n\nThe integration is genuinely one object:\n\n``` python\n# app/obs/langfuse.py\nfrom functools import lru_cache\n\n@lru_cache\ndef get_langfuse_handler():\n    s = get_settings()\n    if not (s.langfuse_public_key and s.langfuse_secret_key):\n        return None  # no keys → tracing silently disabled (safe for local/POC)\n    from langfuse.langchain import CallbackHandler\n    return CallbackHandler()\n```\n\nHeads-up for the SDK version churn: on Langfuse SDK v3+ the import is `from langfuse.langchain import CallbackHandler`, and the handler reads `LANGFUSE_PUBLIC_KEY` / `LANGFUSE_SECRET_KEY` / `LANGFUSE_HOST` from the environment; you don't pass keys to the constructor anymore. 
This tripped up a lot of v2 tutorials.\n\nThen attach it per request via the graph `config`, which is also where you stamp user/session metadata so traces are filterable in the Langfuse UI:\n\n``` python\n# app/api/router.py\nhandler = get_langfuse_handler()\nif handler is not None:\n    config[\"callbacks\"] = [handler]\n    config[\"metadata\"] = {\n        \"langfuse_user_id\": req.user or \"anonymous\",\n        \"langfuse_session_id\": getattr(req, \"chat_id\", None) or \"no-session\",\n        \"langfuse_tags\": [\"my-agent\", settings.served_model_name],\n    }\n```\n\nPassing the handler through `config[\"callbacks\"]` (rather than baking it into the LLM client) means it propagates down the *entire* graph automatically. One request → one trace → every step visible.\n\n| Concern | How it's handled | Why it scales |\n|---|---|---|\n| Frontend integration | OpenAI-compatible API | Any OpenAI client works unchanged |\n| Model choice | LiteLLM gateway behind `ChatOpenAI` | Swap providers via config, not code |\n| Agent logic | LangGraph `StateGraph` + prebuilts | ReAct loop in ~10 lines, extensible to multi-agent |\n| Observability | Langfuse callback via graph `config` | One trace per request, zero per-node wiring |\n| Local dev | Ollama fallback through same interface | No gateway needed to hack offline |\n\nNone of these pieces is exotic. The point is the **seams**: an OpenAI boundary on the outside, a gateway boundary on the model side, and a callback boundary for observability. Get the seams right and the agent in the middle stays small and swappable.\n\nThe same skeleton extends cleanly to a supervisor/worker multi-agent graph, a Postgres checkpointer for persistent threads, and an in-cluster vLLM model; each is an additive change behind one of those seams. But that's a follow-up post.\n\n*Built with LangGraph, LangChain, LiteLLM, Qdrant, and Langfuse. 
If you're running LangGraph in production and want to compare notes on deployment patterns, reach out.*", "url": "https://wpnews.pro/news/running-a-langgraph-react-agent-in-production-openai-compatible-api-multi-model", "canonical_source": "https://dev.to/javaking1129/running-a-langgraph-react-agent-in-production-openai-compatible-api-multi-model-gateway--emi", "published_at": "2026-06-23 22:57:54+00:00", "updated_at": "2026-06-23 23:48:30.982779+00:00", "lang": "en", "topics": ["large-language-models", "developer-tools", "ai-agents", "natural-language-processing", "machine-learning"], "entities": ["LangGraph", "OpenAI", "FastAPI", "Qdrant", "Langfuse", "LibreChat", "Open WebUI", "vLLM"], "alternates": {"html": "https://wpnews.pro/news/running-a-langgraph-react-agent-in-production-openai-compatible-api-multi-model", "markdown": "https://wpnews.pro/news/running-a-langgraph-react-agent-in-production-openai-compatible-api-multi-model.md", "text": "https://wpnews.pro/news/running-a-langgraph-react-agent-in-production-openai-compatible-api-multi-model.txt", "jsonld": "https://wpnews.pro/news/running-a-langgraph-react-agent-in-production-openai-compatible-api-multi-model.jsonld"}}