The biggest UX mistake in LLM-powered web apps is waiting for the complete response before sending anything. On a 400-token answer at typical generation speeds, that's 4–8 seconds of staring at a spinner. With streaming, the user sees the first word in under a second and reads along as the model generates. This tutorial shows you exactly how to implement token-by-token streaming from an LLM API to the browser using Server-Sent Events (SSE) in Go Fiber.
WebSockets are bidirectional. For LLM streaming, you don't need that: you send one request and the server pushes tokens back. SSE is plain HTTP with the text/event-stream content type, and browsers support it natively through the EventSource API.
The wire format is dead simple:
data: {"token": "Hello"}\n\n
data: {"token": " world"}\n\n
data: [DONE]\n\n
Each event is data: <payload>\n\n. The double newline is the event terminator.
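The framing rule above can be sketched in a few lines of Go. This is only an illustration: `formatSSE` is a hypothetical helper, not part of Fiber or any SDK. It also shows what the SSE format does with a payload that contains a newline (one `data:` line per payload line), which is one reason the handler later in this tutorial JSON-encodes each token so the payload always stays on a single line.

```go
package main

import (
	"fmt"
	"strings"
)

// formatSSE frames a payload as one SSE event. A payload containing newlines
// is split into several "data:" lines; the client joins them back with "\n".
func formatSSE(payload string) string {
	var b strings.Builder
	for _, line := range strings.Split(payload, "\n") {
		b.WriteString("data: ")
		b.WriteString(line)
		b.WriteString("\n")
	}
	b.WriteString("\n") // the blank line terminates the event
	return b.String()
}

func main() {
	// %q makes the newlines visible in the printed output.
	fmt.Printf("%q\n", formatSSE(`{"token":"Hello"}`))
	fmt.Printf("%q\n", formatSSE("line one\nline two"))
}
```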
Here's what not to do:
// BAD: collects the full LLM response, then sends it
func badHandler(c *fiber.Ctx) error {
	// The user waits ~6 seconds for generation, then sees everything at once.
	// Still worse UX.
	fullResponse := callLLMAndWaitForCompletion(c.Query("q"))
	return c.JSON(fiber.Map{"response": fullResponse})
}
Even if you send it "instantly" after receiving it, the user waited the full generation time. Buffering eliminates the perceived speed advantage of fast models.
go get github.com/gofiber/fiber/v2
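go get github.com/openai/openai-go # or any OpenAI-compatible SDK. The code below uses the openai.F(...) field wrappers from early releases of this SDK; pin one of those if a newer release fails to compile.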
// handlers/stream.go
package handlers
import (
	"bufio"
	"context"
	"encoding/json"
	"fmt"
	"log"
	"strings"
	"time"

	"github.com/gofiber/fiber/v2"
	openai "github.com/openai/openai-go"
	"github.com/openai/openai-go/option"
)

type StreamHandler struct {
	llmClient *openai.Client
	model     string
}

func NewStreamHandler(apiKey, baseURL, model string) *StreamHandler {
	client := openai.NewClient(
		option.WithAPIKey(apiKey),
		option.WithBaseURL(baseURL),
	)
	return &StreamHandler{llmClient: client, model: model}
}

// sseEvent writes a single SSE event and flushes it so the token leaves the
// server immediately. A failed flush means the client has disconnected.
func sseEvent(w *bufio.Writer, data string) error {
	if _, err := fmt.Fprintf(w, "data: %s\n\n", data); err != nil {
		return err
	}
	return w.Flush()
}

func (h *StreamHandler) StreamCompletion(c *fiber.Ctx) error {
	// Clone the query: Fiber strings may alias request buffers, and the stream
	// writer below runs after this handler has returned.
	query := strings.Clone(strings.TrimSpace(c.Query("q", "")))
	if query == "" {
		return c.Status(fiber.StatusBadRequest).JSON(fiber.Map{
			"error": "query parameter 'q' is required",
		})
	}
	if len([]rune(query)) > 1000 {
		return c.Status(fiber.StatusBadRequest).JSON(fiber.Map{
			"error": "query too long (max 1000 characters)",
		})
	}

	// Set SSE headers before writing any body
	c.Set("Content-Type", "text/event-stream")
	c.Set("Cache-Control", "no-cache")
	c.Set("Connection", "keep-alive")
	c.Set("X-Accel-Buffering", "no") // Critical for Nginx: disables proxy buffering

	// Writing to c.Response().BodyWriter() would buffer the whole body until the
	// handler returns. SetBodyStreamWriter provides a writer we can flush per event.
	c.Context().SetBodyStreamWriter(func(w *bufio.Writer) {
		// Bound the whole stream so a hung LLM call cannot hold the connection
		// forever. Client disconnects are detected below, when a flush fails.
		ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
		defer cancel()

		stream := h.llmClient.Chat.Completions.NewStreaming(ctx,
			openai.ChatCompletionNewParams{
				Model: openai.F(h.model),
				Messages: openai.F([]openai.ChatCompletionMessageParamUnion{
					openai.SystemMessage("You are a helpful technical assistant. Be concise and accurate."),
					openai.UserMessage(query),
				}),
				MaxTokens:   openai.Int(800),
				Temperature: openai.Float(0.3),
			},
		)
		defer stream.Close()

		tokenCount := 0
		for stream.Next() {
			chunk := stream.Current()
			if len(chunk.Choices) == 0 {
				continue
			}
			token := chunk.Choices[0].Delta.Content
			if token == "" {
				continue
			}
			tokenCount++
			payload, err := json.Marshal(map[string]string{"token": token})
			if err != nil {
				continue
			}
			if err := sseEvent(w, string(payload)); err != nil {
				// Client disconnected: stop generating. The deferred cancel()
				// aborts the upstream request.
				log.Printf("Client disconnected after %d tokens", tokenCount)
				return
			}
		}
		if err := stream.Err(); err != nil {
			// Send an error event so the client knows what happened
			errPayload, _ := json.Marshal(map[string]string{
				"error": "stream interrupted: " + err.Error(),
			})
			_ = sseEvent(w, string(errPayload))
			log.Printf("Stream error after %d tokens: %v", tokenCount, err)
			return
		}
		// Signal clean completion
		_ = sseEvent(w, "[DONE]")
		log.Printf("Stream complete: %d tokens for query: %q", tokenCount, query)
	})
	return nil
}
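To check that events really arrive one at a time, it can help to read the stream outside the browser. The following is a minimal sketch of an SSE reader in Go, assuming the payload format used in this tutorial. `readSSE` and `collectSSE` are hypothetical names introduced for illustration, and the example parses an in-memory sample so it runs without a server.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// readSSE scans an SSE stream and calls onData with the payload of each
// "data:" line as it arrives. It stops at "[DONE]" or when the stream ends.
func readSSE(r io.Reader, onData func(payload string)) error {
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.HasPrefix(line, "data:") {
			continue // blank separator lines and other fields are skipped
		}
		// Strip the field name and the single optional space after the colon.
		payload := strings.TrimPrefix(strings.TrimPrefix(line, "data:"), " ")
		if payload == "[DONE]" {
			return nil
		}
		onData(payload)
	}
	return scanner.Err()
}

// collectSSE is a convenience wrapper that parses an in-memory stream.
func collectSSE(raw string) []string {
	var got []string
	_ = readSSE(strings.NewReader(raw), func(p string) { got = append(got, p) })
	return got
}

func main() {
	sample := "data: {\"token\":\"Hello\"}\n\ndata: {\"token\":\" world\"}\n\ndata: [DONE]\n\n"
	for _, payload := range collectSSE(sample) {
		fmt.Println(payload)
	}
}
```

To point the reader at a running server, pass it the `Body` of an `http.Get` response for the `/api/stream` endpoint instead of the in-memory sample. If payloads print in one burst at the end rather than gradually, something between the handler and the client is buffering.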
// main.go
package main
import (
	"log"
	"os"
	"time"

	"github.com/gofiber/fiber/v2"
	"github.com/gofiber/fiber/v2/middleware/cors"
	"github.com/gofiber/fiber/v2/middleware/limiter"

	"stream-api/handlers"
)

func main() {
	apiKey := os.Getenv("LLM_API_KEY")
	baseURL := os.Getenv("LLM_BASE_URL") // e.g. "https://api.openai.com/v1"
	model := os.Getenv("LLM_MODEL")      // e.g. "gpt-4o-mini"
	streamHandler := handlers.NewStreamHandler(apiKey, baseURL, model)

	// StreamRequestBody only streams incoming request bodies; it does not
	// control response buffering, so it is omitted here.
	app := fiber.New()
	app.Use(cors.New())

	// Rate limit: 10 requests per minute per IP.
	// Expiration is a time.Duration; a bare 60 would mean 60 nanoseconds.
	// Behind a reverse proxy, every request appears to come from the proxy's IP
	// unless fiber.Config.ProxyHeader is set.
	app.Use("/api/stream", limiter.New(limiter.Config{
		Max:        10,
		Expiration: 1 * time.Minute,
	}))
	app.Get("/api/stream", streamHandler.StreamCompletion)
	log.Fatal(app.Listen(":4001"))
}
This is the complete frontend implementation. No libraries needed: the browser's native EventSource API handles reconnection automatically.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>LLM Stream Demo</title>
<style>
body { font-family: monospace; max-width: 800px; margin: 40px auto; padding: 0 20px; }
#output { white-space: pre-wrap; background: #f5f5f5; padding: 16px;
border-radius: 4px; min-height: 60px; }
#status { color: #888; font-size: 0.85em; margin-top: 8px; }
button { margin-top: 12px; padding: 8px 16px; cursor: pointer; }
button:disabled { opacity: 0.5; cursor: not-allowed; }
</style>
</head>
<body>
<h2>LLM Streaming Demo</h2>
<input type="text" id="query" placeholder="Ask something..." style="width:100%;padding:8px">
<button id="btn" onclick="startStream()">Ask</button>
<button id="stop-btn" onclick="stopStream()" disabled>Stop</button>
<div id="output"></div>
<div id="status"></div>
<script>
let currentSource = null;

function startStream() {
  const query = document.getElementById('query').value.trim();
  if (!query) return;
  // Clean up any existing stream
  stopStream();
  const output = document.getElementById('output');
  const status = document.getElementById('status');
  const btn = document.getElementById('btn');
  const stopBtn = document.getElementById('stop-btn');
  output.textContent = '';
  status.textContent = 'Connecting...';
  btn.disabled = true;
  stopBtn.disabled = false;
  const url = `/api/stream?q=${encodeURIComponent(query)}`;
  // Keep a local reference so the handlers never touch a nulled global.
  const source = new EventSource(url);
  currentSource = source;
  let tokenCount = 0;
  const startTime = Date.now();

  source.onmessage = function(event) {
    if (event.data === '[DONE]') {
      const elapsed = ((Date.now() - startTime) / 1000).toFixed(1);
      status.textContent = `Done: ${tokenCount} tokens in ${elapsed}s`;
      cleanup();
      return;
    }
    try {
      const parsed = JSON.parse(event.data);
      if (parsed.error) {
        status.textContent = `Error: ${parsed.error}`;
        cleanup();
        return;
      }
      if (parsed.token) {
        output.textContent += parsed.token;
        tokenCount++;
        status.textContent = `Generating... (${tokenCount} tokens)`;
        // Auto-scroll
        output.scrollTop = output.scrollHeight;
      }
    } catch (e) {
      console.error('Parse error:', e, 'Raw:', event.data);
    }
  };

  source.onerror = function() {
    // No events fire after close(), so reaching this handler means the
    // connection failed or dropped before [DONE] arrived.
    if (source.readyState === EventSource.CLOSED) {
      // The browser gave up (for example, on a non-200 response); no retry follows.
      status.textContent = 'Connection failed.';
      cleanup();
      return;
    }
    status.textContent = 'Connection error. Retrying...';
    // EventSource reconnects automatically after ~3s, which re-sends the query
    // and restarts generation. If you don't want auto-retry, call cleanup() here.
  };

  source.onopen = function() {
    // A reconnect restarts generation from scratch, so clear partial output.
    output.textContent = '';
    tokenCount = 0;
    status.textContent = 'Connected, waiting for first token...';
  };
}

function stopStream() {
  cleanup();
}

function cleanup() {
  // Always close the EventSource. Otherwise the browser reconnects when the
  // server ends the response, and the query runs again.
  if (currentSource) {
    currentSource.close();
    currentSource = null;
  }
  document.getElementById('btn').disabled = false;
  document.getElementById('stop-btn').disabled = true;
}
</script>
</body>
</html>
Add this to your Nginx server block. Without proxy_buffering off, Nginx will buffer the SSE stream and the user may see nothing until the response ends.
location /api/stream {
    proxy_pass http://127.0.0.1:4001;
    proxy_http_version 1.1;
    proxy_set_header Connection "";   # clear the header so Nginx does not send "Connection: close" upstream
    proxy_buffering off;              # CRITICAL for SSE
    proxy_cache off;
    proxy_read_timeout 90s;           # longer than your max stream duration
    proxy_set_header X-Real-IP $remote_addr;
}
The X-Accel-Buffering: no header in the Go handler achieves the same effect when Nginx honors it, but setting proxy_buffering off in the Nginx config is the belt-and-suspenders approach.
This is where SSE gets subtle. Once you've started writing the response body with text/event-stream, you cannot send an HTTP 500 status: the status line has already been sent. Your error handling must happen in-band via a data event:
// In the Go handler — if LLM call fails after stream starts:
errPayload, _ := json.Marshal(map[string]string{
"error": "rate_limit_exceeded",
"message": "Please try again in a moment.",
})
_ = sseEvent(c, string(errPayload))
// Then return nil — the HTTP layer doesn't know an error occurred
On the client side, check every event for an error field and handle it in onmessage, not just onerror. The onerror handler fires for connection errors (network drop, server restart), not application-level errors embedded in the stream.
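The same rule applies to any non-browser consumer, such as an integration test written in Go. Below is a minimal sketch of decoding one data payload into a token, an in-band error, or the end-of-stream marker. The `streamEvent` type, `errStreamDone`, and `decodeEvent` are hypothetical names that mirror the JSON shapes shown in this tutorial; they are not part of any library.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// streamEvent mirrors the tutorial's payloads: a token event carries "token",
// an in-band failure carries "error" and optionally "message".
type streamEvent struct {
	Token   string `json:"token"`
	Error   string `json:"error"`
	Message string `json:"message"`
}

// errStreamDone signals the [DONE] marker, which is not a failure.
var errStreamDone = errors.New("stream done")

// decodeEvent turns one SSE data payload into a token or an error.
func decodeEvent(payload string) (string, error) {
	if payload == "[DONE]" {
		return "", errStreamDone
	}
	var ev streamEvent
	if err := json.Unmarshal([]byte(payload), &ev); err != nil {
		return "", fmt.Errorf("malformed event: %w", err)
	}
	if ev.Error != "" {
		// Application-level failure delivered inside a normal 200 response.
		return "", fmt.Errorf("server reported %s: %s", ev.Error, ev.Message)
	}
	return ev.Token, nil
}

func main() {
	payloads := []string{
		`{"token":"Hello"}`,
		`{"error":"rate_limit_exceeded","message":"Please try again in a moment."}`,
		"[DONE]",
	}
	for _, p := range payloads {
		token, err := decodeEvent(p)
		fmt.Printf("token=%q err=%v\n", token, err)
	}
}
```

Using a sentinel error for `[DONE]` keeps the caller's loop simple: any non-nil error ends the loop, and only errors other than `errStreamDone` need to be reported.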
At 1,000 concurrent users each holding an SSE connection, you're holding 1,000 goroutines open. Goroutines are cheap (each starts with a stack of about 2 KB that grows on demand), so this is fine up to tens of thousands of connections on a modest server. The bottleneck will be your LLM API rate limits, not the SSE infrastructure.
Use context.WithTimeout so goroutines don't leak if the LLM API hangs. The defer cancel() in the handler guarantees cleanup even if the client disconnects before [DONE].
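The effect of the timeout can be seen in isolation with a simulated upstream. The sketch below is an illustration only: `hangingUpstream`, `consume`, and `demo` are hypothetical names, the channel stands in for the LLM stream, and the 50 ms timeout is chosen just to keep the example fast.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// hangingUpstream simulates an API that sends two tokens and then stalls.
// Its goroutine exits as soon as ctx is cancelled, so nothing leaks.
func hangingUpstream(ctx context.Context) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for _, t := range []string{"Hello", " world"} {
			select {
			case out <- t:
			case <-ctx.Done():
				return
			}
		}
		<-ctx.Done() // stalled: nothing more arrives until cancellation
	}()
	return out
}

// consume counts tokens until the channel closes or ctx is done.
func consume(ctx context.Context, tokens <-chan string) (int, error) {
	n := 0
	for {
		select {
		case <-ctx.Done():
			return n, ctx.Err()
		case _, ok := <-tokens:
			if !ok {
				// ctx.Err() is nil on a normal close, non-nil after a timeout.
				return n, ctx.Err()
			}
			n++
		}
	}
}

// demo runs the consumer against the stalled upstream with a short timeout.
func demo() (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel() // releases the timer and stops the upstream goroutine
	return consume(ctx, hangingUpstream(ctx))
}

func main() {
	n, err := demo()
	fmt.Printf("received %d tokens before giving up: %v\n", n, err)
}
```

Without the timeout, `consume` would block forever on the stalled channel, which is the leak the handler's 60-second limit is meant to prevent.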
This pattern — SSE in Fiber, EventSource in the browser, no-buffer Nginx config — is production-ready and requires zero additional dependencies beyond what a standard Go web API already uses.