Streaming LLM responses to the browser in Go (Server-Sent Events)

A developer has demonstrated how to implement token-by-token streaming from an LLM API to the browser using Server-Sent Events (SSE) in Go Fiber, reducing user wait time from 4–8 seconds to under one second for the first word. The approach uses the `text/event-stream` content type and the `EventSource` API to push individual tokens as they are generated, rather than buffering the complete response. The implementation includes proper SSE headers, context cancellation when the client disconnects, and a streaming client from the OpenAI Go SDK.

The biggest UX mistake in LLM-powered web apps is waiting for the complete response before sending anything. On a 400-token answer at typical generation speeds, that's 4–8 seconds of staring at a spinner. With streaming, the user sees the first word in under a second and reads along as the model generates. This tutorial shows you exactly how to implement token-by-token streaming from an LLM API to the browser using Server-Sent Events (SSE) in Go Fiber.

WebSockets are bidirectional. For LLM streaming, you don't need that: you send one request, the server pushes tokens back. SSE is plain HTTP: the `text/event-stream` content type on the server and the `EventSource` API in the browser.

The wire format is dead simple:

```
data: {"token": "Hello"}\n\n
data: {"token": " world"}\n\n
data: [DONE]\n\n
```

Each event is `data: <payload>\n\n`. The double newline is the event terminator.

Here's what not to do:

```go
// BAD: collects the full LLM response, then sends it
func badHandler(c *fiber.Ctx) error {
	fullResponse := callLLMAndWaitForCompletion(c.Query("q"))
	return c.JSON(fiber.Map{"response": fullResponse})
	// User waits 6 seconds. Sees response instantly. Still worse UX.
}
```

Even if you send it "instantly" after receiving it, the user waited the full generation time. Buffering eliminates the perceived speed advantage of fast models.

```sh
go get github.com/gofiber/fiber/v2
go get github.com/openai/openai-go  # or any OpenAI-compatible SDK
```

```go
// handlers/stream.go
package handlers

import (
	"bufio"
	"context"
	"encoding/json"
	"fmt"
	"log"
	"strings"
	"time"

	"github.com/gofiber/fiber/v2"
	openai "github.com/openai/openai-go"
	"github.com/openai/openai-go/option"
)

type StreamHandler struct {
	llmClient *openai.Client
	model     string
}

func NewStreamHandler(apiKey, baseURL, model string) *StreamHandler {
	client := openai.NewClient(
		option.WithAPIKey(apiKey),
		option.WithBaseURL(baseURL),
	)
	return &StreamHandler{llmClient: client, model: model}
}

// sseEvent writes a single SSE event and flushes it to the client.
func sseEvent(w *bufio.Writer, data string) error {
	if _, err := fmt.Fprintf(w, "data: %s\n\n", data); err != nil {
		return err
	}
	return w.Flush()
}

func (h *StreamHandler) StreamCompletion(c *fiber.Ctx) error {
	query := strings.TrimSpace(c.Query("q", ""))
	if query == "" {
		return c.Status(fiber.StatusBadRequest).JSON(fiber.Map{
			"error": "query parameter 'q' is required",
		})
	}
	if len([]rune(query)) > 1000 {
		return c.Status(fiber.StatusBadRequest).JSON(fiber.Map{
			"error": "query too long (max 1000 characters)",
		})
	}
	// Fiber reuses request buffers, so copy the query before it is used
	// after this handler returns.
	query = strings.Clone(query)

	// Set SSE headers before writing any body
	c.Set("Content-Type", "text/event-stream")
	c.Set("Cache-Control", "no-cache")
	c.Set("Connection", "keep-alive")
	c.Set("X-Accel-Buffering", "no") // Critical for Nginx: disables proxy buffering

	// Plain body writes are buffered until the handler returns, so stream
	// through a body stream writer and flush after every event.
	c.Context().SetBodyStreamWriter(func(w *bufio.Writer) {
		// Bound the stream's lifetime. A client disconnect shows up as a
		// failed write below; returning then runs cancel and stops the LLM call.
		ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
		defer cancel()

		// openai.F wraps fields in the early openai-go API used here;
		// newer SDK releases may expect plain values instead.
		stream := h.llmClient.Chat.Completions.NewStreaming(ctx, openai.ChatCompletionNewParams{
			Model: openai.F(h.model),
			Messages: openai.F([]openai.ChatCompletionMessageParamUnion{
				openai.SystemMessage("You are a helpful technical assistant. Be concise and accurate."),
				openai.UserMessage(query),
			}),
			MaxTokens:   openai.Int(800),
			Temperature: openai.Float(0.3),
		})
		defer stream.Close()

		tokenCount := 0
		for stream.Next() {
			chunk := stream.Current()
			if len(chunk.Choices) == 0 {
				continue
			}
			token := chunk.Choices[0].Delta.Content
			if token == "" {
				continue
			}
			tokenCount++

			payload, err := json.Marshal(map[string]string{"token": token})
			if err != nil {
				continue
			}
			if err := sseEvent(w, string(payload)); err != nil {
				// Client disconnected: stop generating
				log.Printf("Client disconnected after %d tokens", tokenCount)
				return
			}
		}

		if err := stream.Err(); err != nil {
			// Send error event so the client knows what happened
			errPayload, _ := json.Marshal(map[string]string{
				"error": "stream interrupted: " + err.Error(),
			})
			_ = sseEvent(w, string(errPayload))
			log.Printf("Stream error after %d tokens: %v", tokenCount, err)
			return
		}

		// Signal clean completion
		_ = sseEvent(w, "[DONE]")
		log.Printf("Stream complete: %d tokens for query: %q", tokenCount, query)
	})

	return nil
}
```

```go
// main.go
package main

import (
	"log"
	"os"
	"time"

	"github.com/gofiber/fiber/v2"
	"github.com/gofiber/fiber/v2/middleware/cors"
	"github.com/gofiber/fiber/v2/middleware/limiter"

	"stream-api/handlers"
)

func main() {
	apiKey := os.Getenv("LLM_API_KEY")
	baseURL := os.Getenv("LLM_BASE_URL") // e.g. "https://api.openai.com/v1"
	model := os.Getenv("LLM_MODEL")      // e.g. "gpt-4o-mini"

	streamHandler := handlers.NewStreamHandler(apiKey, baseURL, model)

	app := fiber.New(fiber.Config{
		// Streams incoming request bodies. Response streaming itself is
		// done by the body stream writer in the handler.
		StreamRequestBody: true,
	})

	app.Use(cors.New())

	// Rate limit: 10 requests per minute per IP
	app.Use("/api/stream", limiter.New(limiter.Config{
		Max:        10,
		Expiration: 60 * time.Second, // a bare 60 would mean 60 nanoseconds
	}))

	app.Get("/api/stream", streamHandler.StreamCompletion)

	log.Fatal(app.Listen(":4001"))
}
```

This is the complete frontend implementation. No libraries needed: the browser's native EventSource API handles reconnection automatically.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>LLM Stream Demo</title>
  <style>
    body { font-family: monospace; max-width: 800px; margin: 40px auto; padding: 0 20px; }
    #output { white-space: pre-wrap; background: #f5f5f5; padding: 16px; border-radius: 4px; min-height: 60px; }
    #status { color: #888; font-size: 0.85em; margin-top: 8px; }
    button { margin-top: 12px; padding: 8px 16px; cursor: pointer; }
    button:disabled { opacity: 0.5; cursor: not-allowed; }
  </style>
</head>
<body>
  <h2>LLM Streaming Demo</h2>
  <input type="text" id="query" placeholder="Ask something..." style="width:100%;padding:8px">
  <button id="btn" onclick="startStream()">Ask</button>
  <button id="stop-btn" onclick="stopStream()" disabled>Stop</button>
  <div id="output"></div>
  <div id="status"></div>

  <script>
    let currentSource = null;

    function startStream() {
      const query = document.getElementById('query').value.trim();
      if (!query) return;

      // Clean up any existing stream
      stopStream();

      const output = document.getElementById('output');
      const status = document.getElementById('status');
      const btn = document.getElementById('btn');
      const stopBtn = document.getElementById('stop-btn');

      output.textContent = '';
      status.textContent = 'Connecting...';
      btn.disabled = true;
      stopBtn.disabled = false;

      const url = `/api/stream?q=${encodeURIComponent(query)}`;
      currentSource = new EventSource(url);

      let tokenCount = 0;
      const startTime = Date.now();

      currentSource.onmessage = function (event) {
        if (event.data === '[DONE]') {
          const elapsed = ((Date.now() - startTime) / 1000).toFixed(1);
          status.textContent = `Done: ${tokenCount} tokens in ${elapsed}s`;
          cleanup();
          return;
        }

        try {
          const parsed = JSON.parse(event.data);
          if (parsed.error) {
            status.textContent = `Error: ${parsed.error}`;
            cleanup();
            return;
          }
          if (parsed.token) {
            output.textContent += parsed.token;
            tokenCount++;
            status.textContent = `Generating... ${tokenCount} tokens`;
            // Auto-scroll
            output.scrollTop = output.scrollHeight;
          }
        } catch (e) {
          console.error('Parse error:', e, 'Raw:', event.data);
        }
      };

      currentSource.onerror = function (event) {
        // EventSource fires onerror on clean close too, so check readyState
        if (!currentSource || currentSource.readyState === EventSource.CLOSED) {
          return; // normal closure, already handled by [DONE]
        }
        status.textContent = 'Connection error. Retrying...';
        // EventSource reconnects automatically after ~3s
        // If you don't want auto-retry, call cleanup() here
      };

      currentSource.onopen = function () {
        status.textContent = 'Connected, waiting for first token...';
      };
    }

    function stopStream() {
      cleanup();
    }

    function cleanup() {
      // Close explicitly: an EventSource left open reconnects and
      // re-sends the query once the server ends the response.
      if (currentSource) {
        currentSource.close();
        currentSource = null;
      }
      document.getElementById('btn').disabled = false;
      document.getElementById('stop-btn').disabled = true;
    }
  </script>
</body>
</html>
```

Add this to your Nginx server block. Without `proxy_buffering off`, Nginx will buffer the entire SSE stream and the user sees nothing until the response ends.

```nginx
location /api/stream {
    proxy_pass http://127.0.0.1:4001;
    proxy_http_version 1.1;
    proxy_set_header Connection "";   # clear the Connection header passed upstream
    proxy_buffering off;              # CRITICAL for SSE
    proxy_cache off;
    proxy_read_timeout 90s;           # longer than your max stream duration
    proxy_set_header X-Real-IP $remote_addr;
}
```

The `X-Accel-Buffering: no` header in the Go handler achieves the same effect when Nginx honors it, but setting `proxy_buffering off` in Nginx config is the belt-and-suspenders approach.

This is where SSE gets subtle. Once you've started writing the response body with `text/event-stream`, you cannot send an HTTP 500 status: the status line is already sent. Your error handling must happen in-band via a data event:

```go
// In the Go handler, if the LLM call fails after the stream starts:
errPayload, _ := json.Marshal(map[string]string{
	"error":   "rate limit exceeded",
	"message": "Please try again in a moment.",
})
_ = sseEvent(w, string(errPayload))
// Then return: the HTTP layer doesn't know an error occurred
```

On the client side, check every event for an `error` field and handle it in `onmessage`, not just `onerror`. The `onerror` handler fires for connection errors (network drop, server restart), not application-level errors embedded in the stream.

At 1,000 concurrent users each holding an SSE connection, you're holding 1,000 goroutines open. Go goroutines are cheap (each starts with a stack of about 2KB that grows on demand), so this is fine up to tens of thousands of connections on a modest server. The bottleneck will be your LLM API rate limits, not the SSE infrastructure.

Use `context.WithTimeout` and its `cancel` function to ensure goroutines don't leak if the LLM API hangs. The `defer cancel()` in the handler guarantees cleanup even if the client disconnects before `[DONE]`.

This pattern (SSE in Fiber, EventSource in the browser, no-buffer Nginx config) is production-ready and requires zero additional dependencies beyond what a standard Go web API already uses.