{"slug": "streaming-a-langgraph-agent-as-openai-compatible-sse-with-a-thinking-panel", "title": "Streaming a LangGraph Agent as OpenAI-Compatible SSE (with a Thinking Panel)", "summary": "A developer built an adapter that converts LangGraph agent event streams into OpenAI-compatible Server-Sent Events, enabling tools like Open WebUI to display a 'thinking' panel showing the agent's tool calls in real time. The 90-line solution handles the strict OpenAI chunk format, including role, content, finish reason, and the required [DONE] sentinel, while also wrapping tool activity in <think> tags for a collapsible reasoning display.", "body_md": "In [Part 1](https://dev.to/javaking1129/running-a-langgraph-react-agent-in-production-openai-compatible-api-multi-model-gateway--emi) I built a LangGraph ReAct agent behind an OpenAI-compatible API and waved at one line:\n\n```\nreturn StreamingResponse(graph_to_openai_sse(graph, inputs, model_name, config=config),\n                         media_type=\"text/event-stream\")\n```\n\nThat `graph_to_openai_sse` is where the real work hides. An OpenAI client like Open WebUI doesn't want \"a LangGraph run\"; it wants a very specific stream of `chat.completion.chunk` JSON objects over Server-Sent Events, terminated by a `[DONE]` sentinel. LangGraph, meanwhile, emits its *own* rich event stream. 
This post is the adapter between the two: about 90 lines that also give you a free \"thinking\" panel showing the agent's tool calls as they happen.\n\n**What the client expects.** Each token arrives as an SSE line, `data: {json}\\n\\n`, where the JSON is an OpenAI chunk:\n\n``` python\n# app/api/openai_compat.py\nimport time\n\n\ndef make_chunk(delta, model_name, completion_id, finish_reason=None):\n    return {\n        \"id\": completion_id,                       # \"chatcmpl-...\"\n        \"object\": \"chat.completion.chunk\",\n        \"created\": int(time.time()),\n        \"model\": model_name,\n        \"choices\": [{\"index\": 0, \"delta\": delta, \"finish_reason\": finish_reason}],\n    }\n```\n\nThe stream has a strict shape:\n\n1. a first chunk with `delta = {\"role\": \"assistant\"}`,\n2. content chunks with `delta = {\"content\": \"...\"}`, one per token,\n3. a final chunk with `finish_reason = \"stop\"`,\n4. the literal line `data: [DONE]\\n\\n`.\n\nMiss the `[DONE]` and the client spins forever. Skip the role chunk and some clients drop the first token. The contract is small but unforgiving.\n\n**What LangGraph emits.** `astream_events` is a single async stream of *typed* events for everything happening inside the graph: model tokens, tool calls, node transitions. 
We subscribe once and translate each event we care about into chunks.\n\n``` python\n# app/api/streaming.py\nfrom langchain_core.messages import AIMessageChunk\n\n\nasync def graph_to_openai_sse(graph, inputs, model_name, config=None):\n    completion_id = new_completion_id()\n    yield _sse(make_chunk({\"role\": \"assistant\"}, model_name, completion_id))  # (1) role\n\n    def emit(text):\n        return _sse(make_chunk({\"content\": text}, model_name, completion_id))\n\n    async for event in graph.astream_events(inputs, config=config, version=\"v2\"):\n        kind = event.get(\"event\")\n\n        if kind == \"on_chat_model_stream\":\n            chunk = event[\"data\"][\"chunk\"]\n            if isinstance(chunk, AIMessageChunk) and isinstance(chunk.content, str):\n                yield emit(chunk.content)                                     # (2) tokens\n\n    yield _sse(make_chunk({}, model_name, completion_id, finish_reason=\"stop\"))  # (3) stop\n    yield b\"data: [DONE]\\n\\n\"                                                     # (4) done\n```\n\nThree things to notice:\n\n- `version=\"v2\"`: pin the event schema version so the `metadata.langgraph_node` and `data.chunk` keys don't silently move under you.\n- `on_chat_model_stream`: `data.chunk` is an `AIMessageChunk`, but only when the LLM is actually streaming. Guarding with `isinstance(...)` avoids crashing on the non-streaming events that also flow through.\n- `completion_id`: create it once and reuse it for the whole response.\n\n`_sse` is just the wire framing. Note `ensure_ascii=False`, which matters the moment your tokens are Korean, Japanese, or emoji:\n\n``` python\nimport json\n\n\ndef _sse(payload):\n    return f\"data: {json.dumps(payload, ensure_ascii=False)}\\n\\n\".encode(\"utf-8\")\n```\n\nStreaming the final answer is table stakes. The interesting part of a ReAct agent is *what it did before answering*: which document it searched, what came back. Open WebUI renders any text wrapped in `<think>...</think>` as a collapsible reasoning panel. 
So we narrate tool activity into that panel.\n\nFirst, label the nodes worth announcing:\n\n```\nNODE_LABELS = {\n    \"tools\": \"🔍 Searching the docs…\",\n}\n```\n\nThen open a `<think>` block, and on the relevant events, emit human-readable progress instead of raw tokens:\n\n```\n    show_thinking = bool(NODE_LABELS)\n    think_open = False\n    prev_node = None\n\n    if show_thinking:\n        yield emit(\"<think>\\n\")\n        think_open = True\n\n    async for event in graph.astream_events(inputs, config=config, version=\"v2\"):\n        kind = event.get(\"event\")\n        node = (event.get(\"metadata\") or {}).get(\"langgraph_node\", \"\")\n\n        # node entry → status line\n        if node and node != prev_node and node in NODE_LABELS:\n            yield emit(f\"\\n{NODE_LABELS[node]}\\n\")\n            prev_node = node\n\n        if kind == \"on_tool_start\":\n            yield emit(f\"  • `{event.get('name', 'tool')}` running…\")\n            continue\n\n        if kind == \"on_tool_end\":\n            output = event.get(\"data\", {}).get(\"output\")\n            text = output.content if hasattr(output, \"content\") else str(output)\n            snippet = \" \".join(str(text).split())[:90]          # collapse whitespace, clip\n            yield emit(f\" ✓ `{snippet}…`\\n\" if snippet else \" ✓\\n\")\n            continue\n        # ... on_chat_model_stream handled as before\n```\n\nThe `on_tool_end` output is a `ToolMessage`, so its text lives on `.content`; hence the `hasattr(output, \"content\")` check before falling back to `str()`. 
Collapsing whitespace and clipping to ~90 chars keeps the panel readable instead of dumping a wall of retrieved text.\n\nClosing the panel has to happen no matter how the stream ends (success, exception, or early return), so it goes in a `finally`:\n\n```\n    finally:\n        if think_open:\n            yield _sse(make_chunk({\"content\": \"\\n</think>\\n\"}, model_name, completion_id))\n```\n\nThe result in the UI: a collapsible **\"🔍 Searching the docs… ✓\"** panel, then the streamed answer below it. The user sees the agent reach for RAG in real time.\n\n**1. Errors belong in the stream, not in a 500.** Once you've started streaming, the HTTP status is already `200` and headers are flushed, so you can't switch to an error response. So catch inside the generator and emit the error as content:\n\n```\n    except Exception as exc:\n        log.exception(\"stream failed\")\n        yield _sse(make_chunk({\"content\": f\"\\n[error] {exc}\"}, model_name, completion_id))\n```\n\nThe user sees `[error] ...` in the chat instead of a frozen, half-rendered message.\n\n**2. Not every model streams.** Some gateways/models return a single batched response with no `on_chat_model_stream` events at all. If you only ever forwarded tokens, those models would yield an *empty* answer. Track whether any token was seen, and if not, fall back to a plain `ainvoke`:\n\n```\n    if not saw_token:\n        result = await graph.ainvoke(inputs, config=config)\n        final = extract_final_text(result.get(\"messages\", []))\n        yield emit(final)\n```\n\n`extract_final_text` walks the message log backwards for the last non-empty `AIMessage`, handling both plain-string content and the list-of-blocks shape some providers return. 
This one guard is the difference between \"streaming works on my dev model\" and \"works on every model behind the gateway.\"\n\n```\ngraph.astream_events(version=\"v2\")\n        │\n        ├─ on_chat_model_stream → emit({\"content\": token})\n        ├─ node entry           → emit(\"🔍 status line\")   ┐\n        ├─ on_tool_start        → emit(\"• tool running…\")  ├─ inside <think>…</think>\n        ├─ on_tool_end          → emit(\"✓ snippet…\")       ┘\n        └─ (exception)          → emit(\"[error] …\")\n        ▼\n first chunk {role}  →  …content chunks…  →  {finish_reason: stop}  →  data: [DONE]\n```\n\nThe payoff from Part 1 compounds here: because the boundary is *just* OpenAI SSE, this thinking-panel UX shows up in **any** OpenAI-compatible client with zero client code. You wrote a translator, and every frontend in that ecosystem speaks it for free.\n\nNext up: persisting conversation threads with a checkpointer so the agent remembers across requests — and what that does to the streaming loop.\n\n*Built with LangGraph, LangChain, and FastAPI. 
Part 2 of a series on running LangGraph in production — Part 1 here.*", "url": "https://wpnews.pro/news/streaming-a-langgraph-agent-as-openai-compatible-sse-with-a-thinking-panel", "canonical_source": "https://dev.to/javaking1129/streaming-a-langgraph-agent-as-openai-compatible-sse-with-a-thinking-panel-2928", "published_at": "2026-06-24 01:00:27+00:00", "updated_at": "2026-06-24 01:14:38.401443+00:00", "lang": "en", "topics": ["developer-tools", "large-language-models", "ai-agents"], "entities": ["LangGraph", "OpenAI", "Open WebUI", "ReAct"], "alternates": {"html": "https://wpnews.pro/news/streaming-a-langgraph-agent-as-openai-compatible-sse-with-a-thinking-panel", "markdown": "https://wpnews.pro/news/streaming-a-langgraph-agent-as-openai-compatible-sse-with-a-thinking-panel.md", "text": "https://wpnews.pro/news/streaming-a-langgraph-agent-as-openai-compatible-sse-with-a-thinking-panel.txt", "jsonld": "https://wpnews.pro/news/streaming-a-langgraph-agent-as-openai-compatible-sse-with-a-thinking-panel.jsonld"}}