# Running a LangGraph ReAct Agent in Production: OpenAI-Compatible API + Multi-Model Gateway + One-Line Tracing

> Source: <https://dev.to/javaking1129/running-a-langgraph-react-agent-in-production-openai-compatible-api-multi-model-gateway--emi>
> Published: 2026-06-23 22:57:54+00:00

Most LangGraph content stops at the notebook. You build a cute ReAct loop, it answers one question, and the article ends before the hard part: *how do you actually serve this thing, swap models without a rewrite, and see what it's doing when it misbehaves?*

This post walks through a small but **production-shaped** LangGraph deployment: a RAG ReAct agent that

`openai`

SDK, LibreChat) can talk to it unchanged,Every snippet below is real code from a working service. Roughly 150 lines of Python is all it takes.

```
OpenAI client (Open WebUI, openai SDK)
        │  POST /v1/chat/completions
        ▼
FastAPI router ──► LangGraph StateGraph ──► LLM Gateway ──► model (hosted API today, vLLM tomorrow)
        │                   │
        │                   └──► ToolNode ──► Qdrant (RAG)
        │
        └──► Langfuse callback (one trace per request)
```

The contract with the outside world is **just the OpenAI API**. Everything interesting — the graph, RAG, tracing — lives behind that boundary. That single decision is what lets an off-the-shelf chat UI drive a custom agent with zero adapter code.

The graph is deliberately tiny: one `agent`

node that reasons, one `tools`

node that retrieves, and a conditional edge that loops between them until the model stops asking for tools.

``` python
# app/graph/builder.py
from langgraph.graph import END, StateGraph
from langgraph.prebuilt import ToolNode, tools_condition

def build_graph():
    g = StateGraph(AgentState)
    g.add_node("agent", agent_node)
    g.set_entry_point("agent")

    # ReAct: if the model emits tool_calls, go to `tools`; otherwise END.
    g.add_node("tools", ToolNode(TOOLS))
    g.add_conditional_edges("agent", tools_condition)
    g.add_edge("tools", "agent")
    return g.compile()
```

`tools_condition`

and `ToolNode`

are LangGraph prebuilts that do the unglamorous work: inspect the last message for `tool_calls`

, route accordingly, execute the tools, and append `ToolMessage`

s back into state. You wire the loop; they run it.

State is a single shared message log with a reducer that *appends* rather than replaces:

``` python
# app/graph/state.py
from typing import Annotated, TypedDict
from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages

class AgentState(TypedDict, total=False):
    messages: Annotated[list[BaseMessage], add_messages]
```

`add_messages`

is the reducer. Every node returns `{"messages": [...]}`

and LangGraph merges it into the running log — no manual list-shuffling, and it's what makes the agent⇄tools loop accumulate context correctly.

The agent node binds the tools and calls the model. Note `bind_tools`

is conditional — flip RAG off and the exact same node degrades to a plain single-shot chat call:

``` php
# app/graph/nodes/agent.py
async def agent_node(state: AgentState) -> dict:
    llm = get_llm()
    if get_settings().rag_enabled:
        llm = llm.bind_tools(TOOLS)
    messages = [SystemMessage(content=SYSTEM_PROMPT), *state["messages"]]
    response = await llm.ainvoke(messages)
    return {"messages": [response]}
```

And the tool itself is an ordinary `@tool`

-decorated function. The docstring is not documentation — it's the prompt the model reads to decide *when* to call it:

``` php
# app/graph/tools.py
@tool
def search_docs(query: str) -> str:
    """Search internal docs for content relevant to the question.
    When the user asks about the project/system/docs, call this first."""
    hits = get_vector_store().similarity_search(query, k=get_settings().rag_top_k)
    blocks = [
        f"[{i}] (source: {doc.metadata.get('source', 'unknown')})\n{doc.page_content.strip()}"
        for i, doc in enumerate(hits, 1)
    ]
    return "\n\n".join(blocks) or "No relevant documents found."
```

Returning a `[1] (source: ...)`

structure isn't cosmetic — it's how the model can cite sources in its final answer, which is the difference between a demo and something people trust.

Here's the lever that makes everything else cheap: the agent speaks OpenAI's wire format. The router turns an incoming `/v1/chat/completions`

request into graph input and the graph's output back into an OpenAI response.

```
# app/api/router.py
@router.post("/v1/chat/completions")
async def chat_completions(req: ChatCompletionRequest):
    graph = get_graph()
    inputs = {"messages": to_langchain_messages(req.messages)}
    config: dict = {}

    if not req.stream:
        result = await graph.ainvoke(inputs, config=config)
        text = extract_final_text(result.get("messages", []))
        return make_completion(text, settings.served_model_name)

    return StreamingResponse(
        graph_to_openai_sse(graph, inputs, settings.served_model_name, config=config),
        media_type="text/event-stream",
    )
```

Because the response matches OpenAI's schema (including SSE streaming chunks), **Open WebUI thinks it's talking to OpenAI**. You point its `openaiBaseUrl`

at this service and your custom RAG agent shows up as a selectable model. No frontend work.

LangGraph nodes never name a provider. They call one factory:

``` python
# app/llm/client.py
from langchain_openai import ChatOpenAI

def get_llm(model=None, temperature=None, streaming=True) -> ChatOpenAI:
    s = get_settings()
    return ChatOpenAI(
        base_url=f"{s.litellm_url}/v1",   # gateway, not a provider
        api_key=s.litellm_key,
        model=model or s.default_model,
        temperature=s.default_temperature if temperature is None else temperature,
        streaming=streaming,
    )
```

The `base_url`

points at a **LiteLLM gateway**, not at any specific vendor. LiteLLM exposes an OpenAI-compatible endpoint and fans out to whatever its `model_list`

says — a hosted API today, self-hosted vLLM tomorrow. Migrating off a paid API to an in-cluster GPU model becomes a *gateway config edit*; this Python file never changes.

There's one deliberate escape hatch — when the gateway is down locally, point straight at Ollama's OpenAI-compatible endpoint:

```
    if s.chat_provider.lower() == "ollama":
        return ChatOpenAI(base_url=f"{s.ollama_url}/v1", api_key="ollama",
                          model=model or s.ollama_chat_model, ...)
```

Same `ChatOpenAI`

class, different `base_url`

. The OpenAI-compatible interface shows up *three* times in this architecture — inbound API, gateway, and local fallback — and that consistency is the whole trick.

A multi-node graph with a tool loop is opaque when it goes wrong. Did the model skip the tool? Retrieve garbage? Loop twice? Langfuse's LangChain callback captures the entire run — every node transition, tool call, and LLM call — as a single nested trace.

The integration is genuinely one object:

``` python
# app/obs/langfuse.py
from functools import lru_cache

@lru_cache
def get_langfuse_handler():
    s = get_settings()
    if not (s.langfuse_public_key and s.langfuse_secret_key):
        return None  # no keys → tracing silently disabled (safe for local/POC)
    from langfuse.langchain import CallbackHandler
    return CallbackHandler()
```

Heads-up for the SDK version churn: on

Langfuse SDK v3+the import is`from langfuse.langchain import CallbackHandler`

, and the handler reads`LANGFUSE_PUBLIC_KEY`

/`LANGFUSE_SECRET_KEY`

/`LANGFUSE_HOST`

from the environment — you don't pass keys to the constructor anymore. This tripped up a lot of v2 tutorials.

Then attach it per request via the graph `config`

— which is also where you stamp user/session metadata so traces are filterable in the Langfuse UI:

```
# app/api/router.py
handler = get_langfuse_handler()
if handler is not None:
    config["callbacks"] = [handler]
    config["metadata"] = {
        "langfuse_user_id": req.user or "anonymous",
        "langfuse_session_id": getattr(req, "chat_id", None) or "no-session",
        "langfuse_tags": ["my-agent", settings.served_model_name],
    }
```

Passing the handler through `config["callbacks"]`

(rather than baking it into the LLM client) means it propagates down the *entire* graph automatically. One request → one trace → every step visible.

| Concern | How it's handled | Why it scales |
|---|---|---|
| Frontend integration | OpenAI-compatible API | Any OpenAI client works unchanged |
| Model choice | LiteLLM gateway behind `ChatOpenAI`
|
Swap providers via config, not code |
| Agent logic | LangGraph `StateGraph` + prebuilts |
ReAct loop in ~10 lines, extensible to multi-agent |
| Observability | Langfuse callback via graph `config`
|
One trace per request, zero per-node wiring |
| Local dev | Ollama fallback through same interface | No gateway needed to hack offline |

None of these pieces is exotic. The point is the **seams**: an OpenAI boundary on the outside, a gateway boundary on the model side, and a callback boundary for observability. Get the seams right and the agent in the middle stays small and swappable.

The same skeleton extends cleanly to a supervisor/worker multi-agent graph, a Postgres checkpointer for persistent threads, and an in-cluster vLLM model — each is an additive change behind one of those seams. But that's a follow-up post.

*Built with LangGraph, LangChain, LiteLLM, Qdrant, and Langfuse. If you're running LangGraph in production and want to compare notes on deployment patterns, reach out.*
