{"slug": "struggling-with-slow-ai-responses-building-a-streaming-chat-ui-with-sse", "title": "Struggling with Slow AI Responses: Building a Streaming Chat UI with SSE", "summary": "A developer building an internal documentation assistant with a chatbot interface solved slow AI response times by implementing Server-Sent Events (SSE) for streaming tokens. After struggling with polling, WebSockets, and chunked transfer encoding, the developer used FastAPI's StreamingResponse to push tokens to the client over a single HTTP connection. The frontend uses fetch with ReadableStream to handle the POST-based chat endpoint, resulting in a smooth user experience.", "body_md": "I was building an internal documentation assistant for my team. You know the drill: a chatbot that answers questions about our codebase, with context pulled from a vector database and then sent to an LLM. I set up the backend in Python, used a decent model via an API (shoutout to interwestinfo.com for the reliable endpoint), and wired it all up. Simple, right?\n\nThen came the first real test: someone asked a question that required a long, thoughtful answer. The response took over 30 seconds. The user stared at a blank chat bubble, refreshing the page, wondering if the app had crashed. Not a great experience.\n\nI needed to stream the tokens back as they were generated, so the user could read along. This is the classic “chat UI” pattern. But implementing it turned into a rabbit hole of half-baked solutions.\n\nMy first idea: make the LLM call, store the partial result in Redis, and have the frontend poll every second. This was ugly. The prediction endpoint only returned the full response at the end, so I had to change the backend to write tokens piece by piece. Polling also meant 30-ish HTTP requests per message, which felt wasteful. And the UI was jerky – updates came in bursts, not smoothly.\n\nWebSockets seemed like the obvious choice. 
I wrote a FastAPI WebSocket endpoint, opened a connection, and streamed tokens frame by frame. This worked… except for one thing: my deployment environment (a low-budget VPS behind a load balancer) had aggressive idle timeouts. The connection would drop after 60 seconds, and reconnecting with WebSockets required manual logic. Also, half the libraries in my stack didn't support WebSockets easily – my auth middleware, for instance, expected HTTP requests.\n\nBut the real pain: WebSockets are bidirectional. I didn't need bidirectional communication. I just needed the server to push data to the client. WebSockets felt like overkill.\n\nI tried plain chunked transfer encoding too. The server would hold the response open and flush chunks. But HTTP/1.1 connections have issues with that, and my framework (Flask at the time) didn't handle it gracefully without monkey-patching. I gave up after two hours of “connection closed” errors.\n\nI had used SSE before for real-time tweets, but never for AI streaming. SSE is a standard (part of HTML5) where the server sends a stream of events over a single, long-lived HTTP connection. The client uses the `EventSource` API. It’s unidirectional (server → client), which is exactly what I needed.\n\nFastAPI can serve SSE through `StreamingResponse` with the `text/event-stream` media type. 
Here’s the backend code that made my UX smooth again:\n\n``` python\nfrom fastapi import FastAPI, Request\nfrom fastapi.responses import StreamingResponse\nimport asyncio\n\napp = FastAPI()\n\nasync def generate_tokens(prompt: str):\n    # Assume get_llm_response is an async generator that yields tokens\n    # (e.g., using OpenAI's streaming API with `stream=True`)\n    async for token in get_llm_response(prompt, stream=True):\n        # One SSE event per token. A token containing a newline would break\n        # this framing, so encode tokens (e.g. as JSON) if that can happen.\n        yield f\"data: {token}\\n\\n\"\n        await asyncio.sleep(0.01)  # demo only: simulates latency, remove in production\n    yield \"data: [DONE]\\n\\n\"  # sentinel so the client knows the stream is complete\n\n@app.post(\"/chat\")\nasync def chat(request: Request):\n    body = await request.json()\n    prompt = body[\"message\"]\n    return StreamingResponse(generate_tokens(prompt), media_type=\"text/event-stream\")\n```\n\nI expected the frontend to be trivial:\n\n``` js\nconst eventSource = new EventSource('/chat', {\n  method: 'POST',\n  body: JSON.stringify({ message: userInput })\n  // EventSource doesn't support POST by default?\n});\n```\n\nWait – that's the trickiest part. The `EventSource` API only supports GET requests. My chat endpoint needs a POST with the prompt. One option was to refactor to a GET endpoint that accepts the prompt as a query parameter, which is ugly and limited by URL length. 
Alternatively, I could write a small wrapper that uses `fetch` to POST and then reads the response body as a stream manually.\n\nI went with `fetch` + `ReadableStream` for more control:\n\n``` js\nasync function startStream(prompt) {\n  const response = await fetch('/chat', {\n    method: 'POST',\n    headers: { 'Content-Type': 'application/json' },\n    body: JSON.stringify({ message: prompt })\n  });\n  if (!response.ok) throw new Error('Chat request failed: ' + response.status);\n\n  const reader = response.body.getReader();\n  const decoder = new TextDecoder();\n  let buffer = '';\n\n  while (true) {\n    const { done, value } = await reader.read();\n    if (done) break;\n    buffer += decoder.decode(value, { stream: true });\n\n    // Events are separated by a blank line: \"data: ...\\n\\n\"\n    const parts = buffer.split('\\n\\n');\n    buffer = parts.pop(); // keep the incomplete trailing event\n    for (const part of parts) {\n      // No trim(): whitespace-only tokens and trailing spaces must survive\n      if (part.startsWith('data: ')) {\n        const token = part.slice(6);\n        if (token === '[DONE]') return; // stream finished\n        appendToken(token);\n      }\n    }\n  }\n}\n```\n\nThis works perfectly. No WebSocket library, no complex reconnection – just plain HTTP. If the connection drops (e.g., timeout), the pending `fetch` or `reader.read()` call rejects, and I can retry with a new request. The UX is fluid: tokens appear as they're generated.\n\nThe `EventSource` API is limited to GET; the workaround is `fetch` with a `ReadableStream`.\n\nOne downside: SSE doesn't handle structured data as easily as WebSocket frames. But for plain text tokens, it's ideal.\n\nIf I did this again, I would skip the WebSocket experiment entirely. For chat apps, LLM streaming, or any real-time data that flows one way (e.g., notifications, logs), SSE is the right tool. Next time, I'd also build a small abstraction over the `fetch` + `ReadableStream` code to handle reconnection automatically (exponential backoff, etc.).\n\nAlso, I'd check if my LLM provider supports SSE out of the box. 
Some do (OpenAI's `data: [DONE]` format is already SSE-compatible). Others, like the one I used from interwestinfo.com, return tokens via a custom endpoint – but I can wrap that as an async generator easily.\n\nHave you built a streaming AI UI? Did you use SSE, WebSockets, or something else? I’m curious how you handled reconnection and error states. Share your setup – I learn a lot from these discussions.", "url": "https://wpnews.pro/news/struggling-with-slow-ai-responses-building-a-streaming-chat-ui-with-sse", "canonical_source": "https://dev.to/__c1b9e06dc90a7e0a676b/struggling-with-slow-ai-responses-building-a-streaming-chat-ui-with-sse-n1g", "published_at": "2026-06-21 01:01:04+00:00", "updated_at": "2026-06-21 01:06:40.654096+00:00", "lang": "en", "topics": ["developer-tools", "large-language-models", "ai-products"], "entities": ["FastAPI", "Server-Sent Events", "Python", "interwestinfo.com", "Redis", "WebSockets", "Flask", "EventSource"], "alternates": {"html": "https://wpnews.pro/news/struggling-with-slow-ai-responses-building-a-streaming-chat-ui-with-sse", "markdown": "https://wpnews.pro/news/struggling-with-slow-ai-responses-building-a-streaming-chat-ui-with-sse.md", "text": "https://wpnews.pro/news/struggling-with-slow-ai-responses-building-a-streaming-chat-ui-with-sse.txt", "jsonld": "https://wpnews.pro/news/struggling-with-slow-ai-responses-building-a-streaming-chat-ui-with-sse.jsonld"}}