Running a LangGraph ReAct Agent in Production: OpenAI-Compatible API + Multi-Model Gateway + One-Line Tracing

A developer built a production-ready LangGraph ReAct agent that exposes an OpenAI-compatible API, supports multi-model switching via a gateway, and includes one-line tracing with Langfuse. The deployment uses a FastAPI router, a LangGraph StateGraph with a ReAct loop, and a Qdrant vector store for RAG, all in roughly 150 lines of Python. The agent can be driven by any OpenAI-compatible client like Open WebUI or LibreChat without adapter code.

Most LangGraph content stops at the notebook. You build a cute ReAct loop, it answers one question, and the article ends before the hard part: how do you actually serve this thing, swap models without a rewrite, and see what it's doing when it misbehaves? This post walks through a small but production-shaped LangGraph deployment: a RAG ReAct agent that openai SDK, LibreChat can talk to it unchanged,Every snippet below is real code from a working service. Roughly 150 lines of Python is all it takes. OpenAI client Open WebUI, openai SDK │ POST /v1/chat/completions ▼ FastAPI router ──► LangGraph StateGraph ──► LLM Gateway ──► model hosted API today, vLLM tomorrow │ │ │ └──► ToolNode ──► Qdrant RAG │ └──► Langfuse callback one trace per request The contract with the outside world is just the OpenAI API . Everything interesting — the graph, RAG, tracing — lives behind that boundary. That single decision is what lets an off-the-shelf chat UI drive a custom agent with zero adapter code. The graph is deliberately tiny: one agent node that reasons, one tools node that retrieves, and a conditional edge that loops between them until the model stops asking for tools. python app/graph/builder.py from langgraph.graph import END, StateGraph from langgraph.prebuilt import ToolNode, tools condition def build graph : g = StateGraph AgentState g.add node "agent", agent node g.set entry point "agent" ReAct: if the model emits tool calls, go to tools ; otherwise END. g.add node "tools", ToolNode TOOLS g.add conditional edges "agent", tools condition g.add edge "tools", "agent" return g.compile tools condition and ToolNode are LangGraph prebuilts that do the unglamorous work: inspect the last message for tool calls , route accordingly, execute the tools, and append ToolMessage s back into state. You wire the loop; they run it. State is a single shared message log with a reducer that appends rather than replaces: python app/graph/state.py from typing import Annotated, TypedDict from langchain core.messages import BaseMessage from langgraph.graph.message import add messages class AgentState TypedDict, total=False : messages: Annotated list BaseMessage , add messages add messages is the reducer. Every node returns {"messages": ... } and LangGraph merges it into the running log — no manual list-shuffling, and it's what makes the agent⇄tools loop accumulate context correctly. The agent node binds the tools and calls the model. Note bind tools is conditional — flip RAG off and the exact same node degrades to a plain single-shot chat call: php app/graph/nodes/agent.py async def agent node state: AgentState - dict: llm = get llm if get settings .rag enabled: llm = llm.bind tools TOOLS messages = SystemMessage content=SYSTEM PROMPT , state "messages" response = await llm.ainvoke messages return {"messages": response } And the tool itself is an ordinary @tool -decorated function. The docstring is not documentation — it's the prompt the model reads to decide when to call it: php app/graph/tools.py @tool def search docs query: str - str: """Search internal docs for content relevant to the question. When the user asks about the project/system/docs, call this first.""" hits = get vector store .similarity search query, k=get settings .rag top k blocks = f" {i} source: {doc.metadata.get 'source', 'unknown' } \n{doc.page content.strip }" for i, doc in enumerate hits, 1 return "\n\n".join blocks or "No relevant documents found." Returning a 1 source: ... structure isn't cosmetic — it's how the model can cite sources in its final answer, which is the difference between a demo and something people trust. Here's the lever that makes everything else cheap: the agent speaks OpenAI's wire format. The router turns an incoming /v1/chat/completions request into graph input and the graph's output back into an OpenAI response. app/api/router.py @router.post "/v1/chat/completions" async def chat completions req: ChatCompletionRequest : graph = get graph inputs = {"messages": to langchain messages req.messages } config: dict = {} if not req.stream: result = await graph.ainvoke inputs, config=config text = extract final text result.get "messages", return make completion text, settings.served model name return StreamingResponse graph to openai sse graph, inputs, settings.served model name, config=config , media type="text/event-stream", Because the response matches OpenAI's schema including SSE streaming chunks , Open WebUI thinks it's talking to OpenAI . You point its openaiBaseUrl at this service and your custom RAG agent shows up as a selectable model. No frontend work. LangGraph nodes never name a provider. They call one factory: python app/llm/client.py from langchain openai import ChatOpenAI def get llm model=None, temperature=None, streaming=True - ChatOpenAI: s = get settings return ChatOpenAI base url=f"{s.litellm url}/v1", gateway, not a provider api key=s.litellm key, model=model or s.default model, temperature=s.default temperature if temperature is None else temperature, streaming=streaming, The base url points at a LiteLLM gateway , not at any specific vendor. LiteLLM exposes an OpenAI-compatible endpoint and fans out to whatever its model list says — a hosted API today, self-hosted vLLM tomorrow. Migrating off a paid API to an in-cluster GPU model becomes a gateway config edit ; this Python file never changes. There's one deliberate escape hatch — when the gateway is down locally, point straight at Ollama's OpenAI-compatible endpoint: if s.chat provider.lower == "ollama": return ChatOpenAI base url=f"{s.ollama url}/v1", api key="ollama", model=model or s.ollama chat model, ... Same ChatOpenAI class, different base url . The OpenAI-compatible interface shows up three times in this architecture — inbound API, gateway, and local fallback — and that consistency is the whole trick. A multi-node graph with a tool loop is opaque when it goes wrong. Did the model skip the tool? Retrieve garbage? Loop twice? Langfuse's LangChain callback captures the entire run — every node transition, tool call, and LLM call — as a single nested trace. The integration is genuinely one object: python app/obs/langfuse.py from functools import lru cache @lru cache def get langfuse handler : s = get settings if not s.langfuse public key and s.langfuse secret key : return None no keys → tracing silently disabled safe for local/POC from langfuse.langchain import CallbackHandler return CallbackHandler Heads-up for the SDK version churn: on Langfuse SDK v3+the import is from langfuse.langchain import CallbackHandler , and the handler reads LANGFUSE PUBLIC KEY / LANGFUSE SECRET KEY / LANGFUSE HOST from the environment — you don't pass keys to the constructor anymore. This tripped up a lot of v2 tutorials. Then attach it per request via the graph config — which is also where you stamp user/session metadata so traces are filterable in the Langfuse UI: app/api/router.py handler = get langfuse handler if handler is not None: config "callbacks" = handler config "metadata" = { "langfuse user id": req.user or "anonymous", "langfuse session id": getattr req, "chat id", None or "no-session", "langfuse tags": "my-agent", settings.served model name , } Passing the handler through config "callbacks" rather than baking it into the LLM client means it propagates down the entire graph automatically. One request → one trace → every step visible. | Concern | How it's handled | Why it scales | |---|---|---| | Frontend integration | OpenAI-compatible API | Any OpenAI client works unchanged | | Model choice | LiteLLM gateway behind ChatOpenAI | Swap providers via config, not code | | Agent logic | LangGraph StateGraph + prebuilts | ReAct loop in ~10 lines, extensible to multi-agent | | Observability | Langfuse callback via graph config | One trace per request, zero per-node wiring | | Local dev | Ollama fallback through same interface | No gateway needed to hack offline | None of these pieces is exotic. The point is the seams : an OpenAI boundary on the outside, a gateway boundary on the model side, and a callback boundary for observability. Get the seams right and the agent in the middle stays small and swappable. The same skeleton extends cleanly to a supervisor/worker multi-agent graph, a Postgres checkpointer for persistent threads, and an in-cluster vLLM model — each is an additive change behind one of those seams. But that's a follow-up post. Built with LangGraph, LangChain, LiteLLM, Qdrant, and Langfuse. If you're running LangGraph in production and want to compare notes on deployment patterns, reach out.