Most LangGraph content stops at the notebook. You build a cute ReAct loop, it answers one question, and the article ends before the hard part: how do you actually serve this thing, swap models without a rewrite, and see what it's doing when it misbehaves?
This post walks through a small but production-shaped LangGraph deployment: a RAG ReAct agent that
openai
SDK, LibreChat) can talk to it unchanged,Every snippet below is real code from a working service. Roughly 150 lines of Python is all it takes.
OpenAI client (Open WebUI, openai SDK)
β POST /v1/chat/completions
βΌ
FastAPI router βββΊ LangGraph StateGraph βββΊ LLM Gateway βββΊ model (hosted API today, vLLM tomorrow)
β β
β ββββΊ ToolNode βββΊ Qdrant (RAG)
β
ββββΊ Langfuse callback (one trace per request)
The contract with the outside world is just the OpenAI API. Everything interesting β the graph, RAG, tracing β lives behind that boundary. That single decision is what lets an off-the-shelf chat UI drive a custom agent with zero adapter code.
The graph is deliberately tiny: one agent
node that reasons, one tools
node that retrieves, and a conditional edge that loops between them until the model stops asking for tools.
from langgraph.graph import END, StateGraph
from langgraph.prebuilt import ToolNode, tools_condition
def build_graph():
g = StateGraph(AgentState)
g.add_node("agent", agent_node)
g.set_entry_point("agent")
g.add_node("tools", ToolNode(TOOLS))
g.add_conditional_edges("agent", tools_condition)
g.add_edge("tools", "agent")
return g.compile()
tools_condition
and ToolNode
are LangGraph prebuilts that do the unglamorous work: inspect the last message for tool_calls
, route accordingly, execute the tools, and append ToolMessage
s back into state. You wire the loop; they run it.
State is a single shared message log with a reducer that appends rather than replaces:
from typing import Annotated, TypedDict
from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages
class AgentState(TypedDict, total=False):
messages: Annotated[list[BaseMessage], add_messages]
add_messages
is the reducer. Every node returns {"messages": [...]}
and LangGraph merges it into the running log β no manual list-shuffling, and it's what makes the agentβtools loop accumulate context correctly.
The agent node binds the tools and calls the model. Note bind_tools
is conditional β flip RAG off and the exact same node degrades to a plain single-shot chat call:
async def agent_node(state: AgentState) -> dict:
llm = get_llm()
if get_settings().rag_enabled:
llm = llm.bind_tools(TOOLS)
messages = [SystemMessage(content=SYSTEM_PROMPT), *state["messages"]]
response = await llm.ainvoke(messages)
return {"messages": [response]}
And the tool itself is an ordinary @tool
-decorated function. The docstring is not documentation β it's the prompt the model reads to decide when to call it:
@tool
def search_docs(query: str) -> str:
"""Search internal docs for content relevant to the question.
When the user asks about the project/system/docs, call this first."""
hits = get_vector_store().similarity_search(query, k=get_settings().rag_top_k)
blocks = [
f"[{i}] (source: {doc.metadata.get('source', 'unknown')})\n{doc.page_content.strip()}"
for i, doc in enumerate(hits, 1)
]
return "\n\n".join(blocks) or "No relevant documents found."
Returning a [1] (source: ...)
structure isn't cosmetic β it's how the model can cite sources in its final answer, which is the difference between a demo and something people trust.
Here's the lever that makes everything else cheap: the agent speaks OpenAI's wire format. The router turns an incoming /v1/chat/completions
request into graph input and the graph's output back into an OpenAI response.
@router.post("/v1/chat/completions")
async def chat_completions(req: ChatCompletionRequest):
graph = get_graph()
inputs = {"messages": to_langchain_messages(req.messages)}
config: dict = {}
if not req.stream:
result = await graph.ainvoke(inputs, config=config)
text = extract_final_text(result.get("messages", []))
return make_completion(text, settings.served_model_name)
return StreamingResponse(
graph_to_openai_sse(graph, inputs, settings.served_model_name, config=config),
media_type="text/event-stream",
)
Because the response matches OpenAI's schema (including SSE streaming chunks), Open WebUI thinks it's talking to OpenAI. You point its openaiBaseUrl
at this service and your custom RAG agent shows up as a selectable model. No frontend work.
LangGraph nodes never name a provider. They call one factory:
from langchain_openai import ChatOpenAI
def get_llm(model=None, temperature=None, streaming=True) -> ChatOpenAI:
s = get_settings()
return ChatOpenAI(
base_url=f"{s.litellm_url}/v1", # gateway, not a provider
api_key=s.litellm_key,
model=model or s.default_model,
temperature=s.default_temperature if temperature is None else temperature,
streaming=streaming,
)
The base_url
points at a LiteLLM gateway, not at any specific vendor. LiteLLM exposes an OpenAI-compatible endpoint and fans out to whatever its model_list
says β a hosted API today, self-hosted vLLM tomorrow. Migrating off a paid API to an in-cluster GPU model becomes a gateway config edit; this Python file never changes.
There's one deliberate escape hatch β when the gateway is down locally, point straight at Ollama's OpenAI-compatible endpoint:
if s.chat_provider.lower() == "ollama":
return ChatOpenAI(base_url=f"{s.ollama_url}/v1", api_key="ollama",
model=model or s.ollama_chat_model, ...)
Same ChatOpenAI
class, different base_url
. The OpenAI-compatible interface shows up three times in this architecture β inbound API, gateway, and local fallback β and that consistency is the whole trick.
A multi-node graph with a tool loop is opaque when it goes wrong. Did the model skip the tool? Retrieve garbage? Loop twice? Langfuse's LangChain callback captures the entire run β every node transition, tool call, and LLM call β as a single nested trace.
The integration is genuinely one object:
from functools import lru_cache
@lru_cache
def get_langfuse_handler():
s = get_settings()
if not (s.langfuse_public_key and s.langfuse_secret_key):
return None # no keys β tracing silently disabled (safe for local/POC)
from langfuse.langchain import CallbackHandler
return CallbackHandler()
Heads-up for the SDK version churn: on
Langfuse SDK v3+the import isfrom langfuse.langchain import CallbackHandler
, and the handler readsLANGFUSE_PUBLIC_KEY
/LANGFUSE_SECRET_KEY
/LANGFUSE_HOST
from the environment β you don't pass keys to the constructor anymore. This tripped up a lot of v2 tutorials.
Then attach it per request via the graph config
β which is also where you stamp user/session metadata so traces are filterable in the Langfuse UI:
handler = get_langfuse_handler()
if handler is not None:
config["callbacks"] = [handler]
config["metadata"] = {
"langfuse_user_id": req.user or "anonymous",
"langfuse_session_id": getattr(req, "chat_id", None) or "no-session",
"langfuse_tags": ["my-agent", settings.served_model_name],
}
Passing the handler through config["callbacks"]
(rather than baking it into the LLM client) means it propagates down the entire graph automatically. One request β one trace β every step visible.
| Concern | How it's handled | Why it scales |
|---|---|---|
| Frontend integration | OpenAI-compatible API | Any OpenAI client works unchanged |
| Model choice | LiteLLM gateway behind ChatOpenAI |
|
| Swap providers via config, not code | ||
| Agent logic | LangGraph StateGraph + prebuilts |
|
| ReAct loop in ~10 lines, extensible to multi-agent | ||
| Observability | Langfuse callback via graph config |
|
| One trace per request, zero per-node wiring | ||
| Local dev | Ollama fallback through same interface | No gateway needed to hack offline |
None of these pieces is exotic. The point is the seams: an OpenAI boundary on the outside, a gateway boundary on the model side, and a callback boundary for observability. Get the seams right and the agent in the middle stays small and swappable.
The same skeleton extends cleanly to a supervisor/worker multi-agent graph, a Postgres checkpointer for persistent threads, and an in-cluster vLLM model β each is an additive change behind one of those seams. But that's a follow-up post.
Built with LangGraph, LangChain, LiteLLM, Qdrant, and Langfuse. If you're running LangGraph in production and want to compare notes on deployment patterns, reach out.