# Streaming LLM responses to the browser in Go (Server-Sent Events)

> Published: 2026-05-25 22:00:00+00:00

The biggest UX mistake in LLM-powered web apps is waiting for the complete response before sending anything. On a 400-token answer at typical generation speeds, that's 4–8 seconds of staring at a spinner. With streaming, the user sees the first word in under a second and reads along as the model generates.

This tutorial shows how to implement token-by-token streaming from an LLM API to the browser using Server-Sent Events (SSE) in Go with Fiber.

WebSockets are bidirectional. For LLM streaming, you don't need that: you send one request, and the server pushes tokens back. SSE is plain HTTP: the server responds with the `text/event-stream` content type, and the browser reads the stream with the native `EventSource` API.

The wire format is dead simple:

```
data: {"token": "Hello"}\n\n
data: {"token": " world"}\n\n
data: [DONE]\n\n
```

Each event is `data: <payload>\n\n`. The double newline is the event terminator.

Here's what not to do:

```
// BAD: collects the full LLM response, then sends it
func badHandler(c *fiber.Ctx) error {
	fullResponse := callLLMAndWaitForCompletion(c.Query("q"))
	// User waits 6 seconds, then sees the whole response at once. Still worse UX.
	return c.JSON(fiber.Map{"response": fullResponse})
}
```

Even if you send it "instantly" after receiving it, the user has already waited the full generation time. Buffering eliminates the perceived speed advantage of fast models.
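To make the wire format described above concrete, here is a minimal sketch of an event encoder. The name `formatSSE` is hypothetical and not part of Fiber or any SDK. It also covers one case the single-line examples do not show: in the SSE format a payload that contains newlines is sent as one `data:` line per payload line, so a raw newline inside the payload cannot terminate the event early.

```go
package main

import (
	"fmt"
	"strings"
)

// formatSSE is a hypothetical helper that frames payload as one SSE event.
// Every line of the payload gets its own "data: " prefix, and a trailing
// blank line terminates the event.
func formatSSE(payload string) string {
	var b strings.Builder
	for _, line := range strings.Split(payload, "\n") {
		b.WriteString("data: ")
		b.WriteString(line)
		b.WriteString("\n")
	}
	b.WriteString("\n") // the blank line is the event terminator
	return b.String()
}

func main() {
	// %q prints the newlines as escapes so the framing is visible.
	fmt.Printf("%q\n", formatSSE(`{"token": "Hello"}`))
	fmt.Printf("%q\n", formatSSE("line one\nline two"))
}
```

JSON-encoding each token, as this tutorial does, already escapes newlines inside the payload, so the multi-line branch mostly matters if you ever send plain text.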
```
go get github.com/gofiber/fiber/v2
go get github.com/openai/openai-go  # or any OpenAI-compatible SDK
```

```
// handlers/stream.go
package handlers

import (
	"bufio"
	"context"
	"encoding/json"
	"fmt"
	"log"
	"strings"
	"time"

	"github.com/gofiber/fiber/v2"
	openai "github.com/openai/openai-go"
	"github.com/openai/openai-go/option"
)

type StreamHandler struct {
	llmClient *openai.Client
	model     string
}

func NewStreamHandler(apiKey, baseURL, model string) *StreamHandler {
	client := openai.NewClient(
		option.WithAPIKey(apiKey),
		option.WithBaseURL(baseURL),
	)
	return &StreamHandler{llmClient: client, model: model}
}

// sseEvent writes a single SSE event and flushes it to the client immediately.
func sseEvent(w *bufio.Writer, data string) error {
	if _, err := fmt.Fprintf(w, "data: %s\n\n", data); err != nil {
		return err
	}
	return w.Flush()
}

func (h *StreamHandler) StreamCompletion(c *fiber.Ctx) error {
	query := strings.TrimSpace(c.Query("q", ""))
	if query == "" {
		return c.Status(fiber.StatusBadRequest).JSON(fiber.Map{
			"error": "query parameter 'q' is required",
		})
	}
	if len([]rune(query)) > 1000 {
		return c.Status(fiber.StatusBadRequest).JSON(fiber.Map{
			"error": "query too long (max 1000 characters)",
		})
	}
	// Fiber reuses request buffers, so copy the string before using it
	// after the handler returns.
	query = strings.Clone(query)

	// Set SSE headers before writing any body
	c.Set("Content-Type", "text/event-stream")
	c.Set("Cache-Control", "no-cache")
	c.Set("Connection", "keep-alive")
	c.Set("X-Accel-Buffering", "no") // Critical for Nginx: disables proxy buffering

	// Fiber (fasthttp) buffers the body until the handler returns, so events
	// are written from a stream writer and flushed one at a time. The writer
	// runs after this handler returns: do not touch c inside it.
	c.Context().SetBodyStreamWriter(func(w *bufio.Writer) {
		// Bound the upstream call. A client disconnect surfaces as a write
		// error below, and the deferred cancel then aborts generation.
		ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
		defer cancel()

		stream := h.llmClient.Chat.Completions.NewStreaming(ctx,
			openai.ChatCompletionNewParams{
				// openai.F is the field wrapper used by early releases of
				// the SDK; newer releases may set these fields differently.
				Model: openai.F(h.model),
				Messages: openai.F([]openai.ChatCompletionMessageParamUnion{
					openai.SystemMessage("You are a helpful technical assistant. Be concise and accurate."),
					openai.UserMessage(query),
				}),
				MaxTokens:   openai.Int(800),
				Temperature: openai.Float(0.3),
			},
		)
		defer stream.Close()

		tokenCount := 0
		for stream.Next() {
			chunk := stream.Current()
			if len(chunk.Choices) == 0 {
				continue
			}
			token := chunk.Choices[0].Delta.Content
			if token == "" {
				continue
			}
			tokenCount++

			payload, err := json.Marshal(map[string]string{"token": token})
			if err != nil {
				continue
			}
			if err := sseEvent(w, string(payload)); err != nil {
				// Client disconnected: stop generating
				log.Printf("Client disconnected after %d tokens", tokenCount)
				return
			}
		}

		if err := stream.Err(); err != nil {
			// Send an error event so the client knows what happened
			errPayload, _ := json.Marshal(map[string]string{
				"error": "stream interrupted: " + err.Error(),
			})
			_ = sseEvent(w, string(errPayload))
			log.Printf("Stream error after %d tokens: %v", tokenCount, err)
			return
		}

		// Signal clean completion
		_ = sseEvent(w, "[DONE]")
		log.Printf("Stream complete: %d tokens for query: %q", tokenCount, query)
	})
	return nil
}
```

```
// main.go
package main

import (
	"log"
	"os"
	"time"

	"github.com/gofiber/fiber/v2"
	"github.com/gofiber/fiber/v2/middleware/cors"
	"github.com/gofiber/fiber/v2/middleware/limiter"

	"stream-api/handlers"
)

func main() {
	apiKey := os.Getenv("LLM_API_KEY")
	baseURL := os.Getenv("LLM_BASE_URL") // e.g. "https://api.openai.com/v1"
	model := os.Getenv("LLM_MODEL")      // e.g. "gpt-4o-mini"

	streamHandler := handlers.NewStreamHandler(apiKey, baseURL, model)

	// No special Fiber config is needed: the handler streams the response
	// itself via SetBodyStreamWriter.
	app := fiber.New()
	app.Use(cors.New())

	// Rate limit: 10 requests per minute per IP
	app.Use("/api/stream", limiter.New(limiter.Config{
		Max:        10,
		Expiration: 60 * time.Second, // Expiration is a time.Duration
	}))

	app.Get("/api/stream", streamHandler.StreamCompletion)
	log.Fatal(app.Listen(":4001"))
}
```

The frontend, a single page titled "LLM Stream Demo", needs no libraries: the browser's native `EventSource` API connects to `/api/stream` and handles reconnection automatically. Because `EventSource` also reconnects when the server closes the connection, the page should call `close()` when it receives `[DONE]`; otherwise the browser re-issues the query.

Add this to your Nginx server block. Without `proxy_buffering off`, Nginx will buffer the entire SSE stream and the user sees nothing until the response ends.

```
location /api/stream {
    proxy_pass http://127.0.0.1:4001;
    proxy_http_version 1.1;
    proxy_set_header Connection "";   # clear the default "Connection: close" sent upstream
    proxy_buffering off;              # CRITICAL for SSE
    proxy_cache off;
    proxy_read_timeout 90s;           # longer than your max stream duration
    proxy_set_header X-Real-IP $remote_addr;
}
```

The `X-Accel-Buffering: no` header in the Go handler achieves the same effect when Nginx honors it, but setting `proxy_buffering off` in the Nginx config is the belt-and-suspenders approach.

This is where SSE gets subtle. Once you've started writing the response body with `text/event-stream`, you cannot send an HTTP 500 status: the status line is already sent. Your error handling must happen in-band via a data event:

```
// In the Go handler, if the LLM call fails after the stream starts:
errPayload, _ := json.Marshal(map[string]string{
	"error":   "rate_limit_exceeded",
	"message": "Please try again in a moment.",
})
_ = sseEvent(w, string(errPayload))
// Then return: the HTTP layer doesn't know an error occurred
```

On the client side, check every event for an `error` field and handle it in `onmessage`, not just `onerror`. The `onerror` handler fires for connection errors (network drop, server restart), not application-level errors embedded in the stream.

At 1,000 concurrent users each holding an SSE connection, you're holding 1,000 goroutines open. Go goroutines are cheap (the initial stack is about 2 KB and grows on demand), so this is fine up to tens of thousands of connections on a modest server. The bottleneck will be your LLM API rate limits, not the SSE infrastructure.

Wrap the upstream call in `context.WithTimeout` so goroutines don't leak if the LLM API hangs. The `defer cancel()` in the handler guarantees cleanup even if the client disconnects before `[DONE]`.
This pattern — SSE in Fiber, EventSource in the browser, no-buffer Nginx config — is production-ready and requires zero additional dependencies beyond what a standard Go web API already uses.