cd /news/large-language-models/running-a-langgraph-react-agent-in-p… Β· home β€Ί topics β€Ί large-language-models β€Ί article
[ARTICLE Β· art-36427] src=dev.to β†— pub= topic=large-language-models verified=true sentiment=↑ positive

Running a LangGraph ReAct Agent in Production: OpenAI-Compatible API + Multi-Model Gateway + One-Line Tracing

A developer built a production-ready LangGraph ReAct agent that exposes an OpenAI-compatible API, supports multi-model switching via a gateway, and includes one-line tracing with Langfuse. The deployment uses a FastAPI router, a LangGraph StateGraph with a ReAct loop, and a Qdrant vector store for RAG, all in roughly 150 lines of Python. The agent can be driven by any OpenAI-compatible client like Open WebUI or LibreChat without adapter code.

read6 min views8 publishedJun 23, 2026

Most LangGraph content stops at the notebook. You build a cute ReAct loop, it answers one question, and the article ends before the hard part: how do you actually serve this thing, swap models without a rewrite, and see what it's doing when it misbehaves?

This post walks through a small but production-shaped LangGraph deployment: a RAG ReAct agent that

openai

SDK, LibreChat) can talk to it unchanged,Every snippet below is real code from a working service. Roughly 150 lines of Python is all it takes.

OpenAI client (Open WebUI, openai SDK)
        β”‚  POST /v1/chat/completions
        β–Ό
FastAPI router ──► LangGraph StateGraph ──► LLM Gateway ──► model (hosted API today, vLLM tomorrow)
        β”‚                   β”‚
        β”‚                   └──► ToolNode ──► Qdrant (RAG)
        β”‚
        └──► Langfuse callback (one trace per request)

The contract with the outside world is just the OpenAI API. Everything interesting β€” the graph, RAG, tracing β€” lives behind that boundary. That single decision is what lets an off-the-shelf chat UI drive a custom agent with zero adapter code.

The graph is deliberately tiny: one agent

node that reasons, one tools

node that retrieves, and a conditional edge that loops between them until the model stops asking for tools.

from langgraph.graph import END, StateGraph
from langgraph.prebuilt import ToolNode, tools_condition

def build_graph():
    g = StateGraph(AgentState)
    g.add_node("agent", agent_node)
    g.set_entry_point("agent")

    g.add_node("tools", ToolNode(TOOLS))
    g.add_conditional_edges("agent", tools_condition)
    g.add_edge("tools", "agent")
    return g.compile()

tools_condition

and ToolNode

are LangGraph prebuilts that do the unglamorous work: inspect the last message for tool_calls

, route accordingly, execute the tools, and append ToolMessage

s back into state. You wire the loop; they run it.

State is a single shared message log with a reducer that appends rather than replaces:

from typing import Annotated, TypedDict
from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages

class AgentState(TypedDict, total=False):
    messages: Annotated[list[BaseMessage], add_messages]

add_messages

is the reducer. Every node returns {"messages": [...]}

and LangGraph merges it into the running log β€” no manual list-shuffling, and it's what makes the agent⇄tools loop accumulate context correctly.

The agent node binds the tools and calls the model. Note bind_tools

is conditional β€” flip RAG off and the exact same node degrades to a plain single-shot chat call:

async def agent_node(state: AgentState) -> dict:
    llm = get_llm()
    if get_settings().rag_enabled:
        llm = llm.bind_tools(TOOLS)
    messages = [SystemMessage(content=SYSTEM_PROMPT), *state["messages"]]
    response = await llm.ainvoke(messages)
    return {"messages": [response]}

And the tool itself is an ordinary @tool

-decorated function. The docstring is not documentation β€” it's the prompt the model reads to decide when to call it:

@tool
def search_docs(query: str) -> str:
    """Search internal docs for content relevant to the question.
    When the user asks about the project/system/docs, call this first."""
    hits = get_vector_store().similarity_search(query, k=get_settings().rag_top_k)
    blocks = [
        f"[{i}] (source: {doc.metadata.get('source', 'unknown')})\n{doc.page_content.strip()}"
        for i, doc in enumerate(hits, 1)
    ]
    return "\n\n".join(blocks) or "No relevant documents found."

Returning a [1] (source: ...)

structure isn't cosmetic β€” it's how the model can cite sources in its final answer, which is the difference between a demo and something people trust.

Here's the lever that makes everything else cheap: the agent speaks OpenAI's wire format. The router turns an incoming /v1/chat/completions

request into graph input and the graph's output back into an OpenAI response.

@router.post("/v1/chat/completions")
async def chat_completions(req: ChatCompletionRequest):
    graph = get_graph()
    inputs = {"messages": to_langchain_messages(req.messages)}
    config: dict = {}

    if not req.stream:
        result = await graph.ainvoke(inputs, config=config)
        text = extract_final_text(result.get("messages", []))
        return make_completion(text, settings.served_model_name)

    return StreamingResponse(
        graph_to_openai_sse(graph, inputs, settings.served_model_name, config=config),
        media_type="text/event-stream",
    )

Because the response matches OpenAI's schema (including SSE streaming chunks), Open WebUI thinks it's talking to OpenAI. You point its openaiBaseUrl

at this service and your custom RAG agent shows up as a selectable model. No frontend work.

LangGraph nodes never name a provider. They call one factory:

from langchain_openai import ChatOpenAI

def get_llm(model=None, temperature=None, streaming=True) -> ChatOpenAI:
    s = get_settings()
    return ChatOpenAI(
        base_url=f"{s.litellm_url}/v1",   # gateway, not a provider
        api_key=s.litellm_key,
        model=model or s.default_model,
        temperature=s.default_temperature if temperature is None else temperature,
        streaming=streaming,
    )

The base_url

points at a LiteLLM gateway, not at any specific vendor. LiteLLM exposes an OpenAI-compatible endpoint and fans out to whatever its model_list

says β€” a hosted API today, self-hosted vLLM tomorrow. Migrating off a paid API to an in-cluster GPU model becomes a gateway config edit; this Python file never changes.

There's one deliberate escape hatch β€” when the gateway is down locally, point straight at Ollama's OpenAI-compatible endpoint:

    if s.chat_provider.lower() == "ollama":
        return ChatOpenAI(base_url=f"{s.ollama_url}/v1", api_key="ollama",
                          model=model or s.ollama_chat_model, ...)

Same ChatOpenAI

class, different base_url

. The OpenAI-compatible interface shows up three times in this architecture β€” inbound API, gateway, and local fallback β€” and that consistency is the whole trick.

A multi-node graph with a tool loop is opaque when it goes wrong. Did the model skip the tool? Retrieve garbage? Loop twice? Langfuse's LangChain callback captures the entire run β€” every node transition, tool call, and LLM call β€” as a single nested trace.

The integration is genuinely one object:

from functools import lru_cache

@lru_cache
def get_langfuse_handler():
    s = get_settings()
    if not (s.langfuse_public_key and s.langfuse_secret_key):
        return None  # no keys β†’ tracing silently disabled (safe for local/POC)
    from langfuse.langchain import CallbackHandler
    return CallbackHandler()

Heads-up for the SDK version churn: on

Langfuse SDK v3+the import isfrom langfuse.langchain import CallbackHandler

, and the handler readsLANGFUSE_PUBLIC_KEY

/LANGFUSE_SECRET_KEY

/LANGFUSE_HOST

from the environment β€” you don't pass keys to the constructor anymore. This tripped up a lot of v2 tutorials.

Then attach it per request via the graph config

β€” which is also where you stamp user/session metadata so traces are filterable in the Langfuse UI:

handler = get_langfuse_handler()
if handler is not None:
    config["callbacks"] = [handler]
    config["metadata"] = {
        "langfuse_user_id": req.user or "anonymous",
        "langfuse_session_id": getattr(req, "chat_id", None) or "no-session",
        "langfuse_tags": ["my-agent", settings.served_model_name],
    }

Passing the handler through config["callbacks"]

(rather than baking it into the LLM client) means it propagates down the entire graph automatically. One request β†’ one trace β†’ every step visible.

Concern How it's handled Why it scales
Frontend integration OpenAI-compatible API Any OpenAI client works unchanged
Model choice LiteLLM gateway behind ChatOpenAI
Swap providers via config, not code
Agent logic LangGraph StateGraph + prebuilts
ReAct loop in ~10 lines, extensible to multi-agent
Observability Langfuse callback via graph config
One trace per request, zero per-node wiring
Local dev Ollama fallback through same interface No gateway needed to hack offline

None of these pieces is exotic. The point is the seams: an OpenAI boundary on the outside, a gateway boundary on the model side, and a callback boundary for observability. Get the seams right and the agent in the middle stays small and swappable.

The same skeleton extends cleanly to a supervisor/worker multi-agent graph, a Postgres checkpointer for persistent threads, and an in-cluster vLLM model β€” each is an additive change behind one of those seams. But that's a follow-up post.

Built with LangGraph, LangChain, LiteLLM, Qdrant, and Langfuse. If you're running LangGraph in production and want to compare notes on deployment patterns, reach out.

── more in #large-language-models 4 stories Β· sorted by recency
── more on @langgraph 3 stories trending now
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain β€” perfect for shipping the agent you just read about.

$git push zahid main
β†’ Live at https://your-agent.zahid.host βœ“
Get free account β†’ Pricing
from €0/mo Β· no card required
LIVE [news/running-a-langgraph-…] indexed:0 read:6min 2026-06-23 Β· β€”