Streaming LLM responses to the browser in Go (Server-Sent Events)

wpnews.pro

The biggest UX mistake in LLM-powered web apps is waiting for the complete response before sending anything. On a 400-token answer at typical generation speeds, that's 4–8 seconds of staring at a spinner. With streaming, the user sees the first word in under a second and reads along as the model generates. This tutorial shows you exactly how to implement token-by-token streaming from an LLM API to the browser using Server-Sent Events (SSE) in Go Fiber.

WebSockets are bidirectional. For LLM streaming, you don't need that: you send one request and the server pushes tokens back. SSE is plain HTTP with the text/event-stream content type, and the browser reads it through the native EventSource API.

The wire format is dead simple:

data: {"token": "Hello"}\n\n
data: {"token": " world"}\n\n
data: [DONE]\n\n

Each event is data: <payload>\n\n. The double newline is the event terminator.
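One consequence of this format is that a raw newline inside a payload would end the event early. The sketch below is a minimal illustration of the framing rule, not part of the handler that follows: the helper name formatSSE is hypothetical, and it splits a multi-line payload into one data: field per line. JSON-encoding each token, as this tutorial does, sidesteps the problem because json.Marshal escapes newlines.

```go
package main

import (
	"fmt"
	"strings"
)

// formatSSE renders payload as one SSE event. Every line of the payload gets
// its own "data:" field, and a blank line terminates the event.
func formatSSE(payload string) string {
	var b strings.Builder
	for _, line := range strings.Split(payload, "\n") {
		b.WriteString("data: ")
		b.WriteString(line)
		b.WriteString("\n")
	}
	b.WriteString("\n") // event terminator
	return b.String()
}

func main() {
	fmt.Printf("%q\n", formatSSE(`{"token":"Hello"}`))
	fmt.Printf("%q\n", formatSSE("line one\nline two"))
}
```

A browser's EventSource joins the data: fields of one event back together with newlines, so the second call round-trips as a two-line message.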

Here's what not to do:

// BAD: collects full LLM response then sends it
func badHandler(c *fiber.Ctx) error {
    fullResponse := callLLMAndWaitForCompletion(c.Query("q"))
    return c.JSON(fiber.Map{"response": fullResponse})
    // User waits 6 seconds. Sees response instantly. Still worse UX.
}

Even if you send it "instantly" after receiving it, the user waited the full generation time. Buffering eliminates the perceived speed advantage of fast models.

go get github.com/gofiber/fiber/v2
go get github.com/openai/openai-go  # or any OpenAI-compatible SDK

// handlers/stream.go
package handlers

import (
    "bufio"
    "context"
    "encoding/json"
    "fmt"
    "log"
    "strings"
    "time"

    "github.com/gofiber/fiber/v2"
    openai "github.com/openai/openai-go"
    "github.com/openai/openai-go/option"
)

type StreamHandler struct {
    llmClient *openai.Client
    model     string
}

// NewStreamHandler builds the LLM client once and reuses it for every request.
// Note: the openai.F field wrappers used below belong to the early pre-1.0
// releases of openai-go; later releases changed the parameter style, so adjust
// the params to the version you install.
func NewStreamHandler(apiKey, baseURL, model string) *StreamHandler {
    client := openai.NewClient(
        option.WithAPIKey(apiKey),
        option.WithBaseURL(baseURL),
    )
    return &StreamHandler{llmClient: client, model: model}
}

// sseEvent writes a single SSE event and flushes it, so the client receives
// the event immediately instead of when the response ends.
func sseEvent(w *bufio.Writer, data string) error {
    if _, err := fmt.Fprintf(w, "data: %s\n\n", data); err != nil {
        return err
    }
    return w.Flush()
}

func (h *StreamHandler) StreamCompletion(c *fiber.Ctx) error {
    // Clone the query: Fiber reuses request buffers once the handler returns,
    // and the stream writer below runs after that point.
    query := strings.Clone(strings.TrimSpace(c.Query("q", "")))
    if query == "" {
        return c.Status(fiber.StatusBadRequest).JSON(fiber.Map{
            "error": "query parameter 'q' is required",
        })
    }
    if len([]rune(query)) > 1000 {
        return c.Status(fiber.StatusBadRequest).JSON(fiber.Map{
            "error": "query too long (max 1000 characters)",
        })
    }

    // Set SSE headers before writing any body
    c.Set("Content-Type", "text/event-stream")
    c.Set("Cache-Control", "no-cache")
    c.Set("Connection", "keep-alive")
    c.Set("X-Accel-Buffering", "no") // Critical for Nginx: disables proxy buffering

    // fasthttp buffers the whole response body by default. SetBodyStreamWriter
    // lets us write and flush each event as it is produced. The callback runs
    // after this handler returns, so it must not touch c.
    c.Context().SetBodyStreamWriter(func(w *bufio.Writer) {
        // Bound the upstream call so a hung LLM API cannot hold the connection open forever
        ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
        defer cancel()

        stream := h.llmClient.Chat.Completions.NewStreaming(ctx,
            openai.ChatCompletionNewParams{
                Model: openai.F(h.model),
                Messages: openai.F([]openai.ChatCompletionMessageParamUnion{
                    openai.SystemMessage("You are a helpful technical assistant. Be concise and accurate."),
                    openai.UserMessage(query),
                }),
                MaxTokens:   openai.Int(800),
                Temperature: openai.Float(0.3),
            },
        )
        defer stream.Close()

        tokenCount := 0
        for stream.Next() {
            chunk := stream.Current()
            if len(chunk.Choices) == 0 {
                continue
            }

            token := chunk.Choices[0].Delta.Content
            if token == "" {
                continue
            }

            tokenCount++
            payload, err := json.Marshal(map[string]string{"token": token})
            if err != nil {
                continue
            }

            if err := sseEvent(w, string(payload)); err != nil {
                // A failed flush means the client disconnected: stop generating
                log.Printf("Client disconnected after %d tokens", tokenCount)
                return
            }
        }

        if err := stream.Err(); err != nil {
            // Send error event so the client knows what happened
            errPayload, _ := json.Marshal(map[string]string{
                "error": "stream interrupted: " + err.Error(),
            })
            _ = sseEvent(w, string(errPayload))
            log.Printf("Stream error after %d tokens: %v", tokenCount, err)
            return
        }

        // Signal clean completion
        _ = sseEvent(w, "[DONE]")
        log.Printf("Stream complete: %d tokens for query: %q", tokenCount, query)
    })

    return nil
}
// main.go
package main

import (
    "log"
    "os"
    "time"

    "github.com/gofiber/fiber/v2"
    "github.com/gofiber/fiber/v2/middleware/cors"
    "github.com/gofiber/fiber/v2/middleware/limiter"
    "stream-api/handlers"
)

func main() {
    apiKey  := os.Getenv("LLM_API_KEY")
    baseURL := os.Getenv("LLM_BASE_URL") // e.g. "https://api.openai.com/v1"
    model   := os.Getenv("LLM_MODEL")    // e.g. "gpt-4o-mini"

    streamHandler := handlers.NewStreamHandler(apiKey, baseURL, model)

    // StreamRequestBody only affects incoming request bodies,
    // so SSE needs no special app config here.
    app := fiber.New()

    app.Use(cors.New())

    // Rate limit: 10 requests per minute per IP
    app.Use("/api/stream", limiter.New(limiter.Config{
        Max:        10,
        Expiration: 1 * time.Minute, // a time.Duration; a bare 60 would mean 60ns
    }))

    app.Get("/api/stream", streamHandler.StreamCompletion)

    log.Fatal(app.Listen(":4001"))
}

This is the complete frontend implementation. No libraries needed: the browser's native EventSource API handles reconnection automatically.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>LLM Stream Demo</title>
    <style>
        body { font-family: monospace; max-width: 800px; margin: 40px auto; padding: 0 20px; }
        #output { white-space: pre-wrap; background: #f5f5f5; padding: 16px;
                  border-radius: 4px; min-height: 60px; }
        #status { color: #888; font-size: 0.85em; margin-top: 8px; }
        button { margin-top: 12px; padding: 8px 16px; cursor: pointer; }
        button:disabled { opacity: 0.5; cursor: not-allowed; }
    </style>
</head>
<body>
    <h2>LLM Streaming Demo</h2>
    <input type="text" id="query" placeholder="Ask something..." style="width:100%;padding:8px">
    <button id="btn" onclick="startStream()">Ask</button>
    <button id="stop-btn" onclick="stopStream()" disabled>Stop</button>
    <div id="output"></div>
    <div id="status"></div>

<script>
let currentSource = null;

function startStream() {
    const query = document.getElementById('query').value.trim();
    if (!query) return;

    // Clean up any existing stream
    stopStream();

    const output = document.getElementById('output');
    const status = document.getElementById('status');
    const btn = document.getElementById('btn');
    const stopBtn = document.getElementById('stop-btn');

    output.textContent = '';
    status.textContent = 'Connecting...';
    btn.disabled = true;
    stopBtn.disabled = false;

    const url = `/api/stream?q=${encodeURIComponent(query)}`;
    currentSource = new EventSource(url);

    let tokenCount = 0;
    const startTime = Date.now();

    currentSource.onmessage = function(event) {
        if (event.data === '[DONE]') {
            const elapsed = ((Date.now() - startTime) / 1000).toFixed(1);
            status.textContent = `Done — ${tokenCount} tokens in ${elapsed}s`;
            cleanup();
            return;
        }

        try {
            const parsed = JSON.parse(event.data);

            if (parsed.error) {
                status.textContent = `Error: ${parsed.error}`;
                cleanup();
                return;
            }

            if (parsed.token) {
                output.textContent += parsed.token;
                tokenCount++;
                status.textContent = `Generating... (${tokenCount} tokens)`;
                // Auto-scroll
                output.scrollTop = output.scrollHeight;
            }
        } catch (e) {
            console.error('Parse error:', e, 'Raw:', event.data);
        }
    };

    currentSource.onerror = function(event) {
        // Ignore errors from a source that was already closed or cleaned up
        if (!currentSource || currentSource.readyState === EventSource.CLOSED) {
            return;
        }
        status.textContent = 'Connection error. Retrying...';
        // EventSource reconnects automatically after ~3s, and every reconnect
        // re-sends the query. If you don't want auto-retry, call cleanup() here.
    };

    currentSource.onopen = function() {
        status.textContent = 'Connected, waiting for first token...';
    };
}

function stopStream() {
    cleanup();
}

function cleanup() {
    // Close the connection explicitly. Otherwise EventSource reconnects when
    // the server ends the response and re-sends the query.
    if (currentSource) {
        currentSource.close();
        currentSource = null;
    }
    document.getElementById('btn').disabled = false;
    document.getElementById('stop-btn').disabled = true;
}
</script>
</body>
</html>

Add this to your Nginx server block. Without proxy_buffering off, Nginx will buffer the entire SSE stream and the user sees nothing until the response ends.

location /api/stream {
    proxy_pass         http://127.0.0.1:4001;
    proxy_http_version 1.1;
    proxy_set_header   Connection "";        # clear the "Connection: close" Nginx sends upstream by default
    proxy_buffering    off;                  # CRITICAL for SSE
    proxy_cache        off;
    proxy_read_timeout 90s;                  # longer than your max stream duration
    proxy_set_header   X-Real-IP $remote_addr;
}

The X-Accel-Buffering: no header in the Go handler achieves the same effect when Nginx honors it, but setting proxy_buffering off in Nginx config is the belt-and-suspenders approach.

This is where SSE gets subtle. Once you've started writing the response body with text/event-stream, you cannot send an HTTP 500 status because the status line is already sent. Your error handling must happen in-band via a data event:

// In the Go handler, if the LLM call fails after the stream starts
// (w is the writer the handler streams to):
errPayload, _ := json.Marshal(map[string]string{
    "error": "rate_limit_exceeded",
    "message": "Please try again in a moment.",
})
_ = sseEvent(w, string(errPayload))
// Then return without an error: the HTTP layer doesn't know anything went wrong

On the client side, check every event for an error field and handle it in onmessage, not just onerror. The onerror handler fires for connection errors (network drop, server restart), not application-level errors embedded in the stream.

At 1,000 concurrent users each holding an SSE connection, you're holding 1,000 goroutines open. Goroutines are cheap (each starts with a stack of only a few kilobytes that grows on demand), so this is fine up to tens of thousands of connections on a modest server. The bottleneck will be your LLM API rate limits, not the SSE infrastructure.
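Since the upstream API is the likely bottleneck, one option is to cap the number of in-flight LLM calls and reject the excess before the stream starts, while an ordinary HTTP status can still be sent. The sketch below is one possible approach, not something the handler above does. The type streamLimiter and its methods are hypothetical names, and the capacity is an arbitrary example value.

```go
package main

import "fmt"

// streamLimiter caps how many upstream LLM calls run at the same time.
// It is a counting semaphore built on a buffered channel.
type streamLimiter struct {
	slots chan struct{}
}

func newStreamLimiter(max int) *streamLimiter {
	return &streamLimiter{slots: make(chan struct{}, max)}
}

// tryAcquire takes a slot without blocking and reports whether it succeeded.
func (l *streamLimiter) tryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a slot taken by a successful tryAcquire.
func (l *streamLimiter) release() {
	<-l.slots
}

func main() {
	lim := newStreamLimiter(2) // example capacity

	fmt.Println(lim.tryAcquire()) // true
	fmt.Println(lim.tryAcquire()) // true
	fmt.Println(lim.tryAcquire()) // false: both slots are busy

	lim.release()
	fmt.Println(lim.tryAcquire()) // true again after a release
}
```

In a handler, a failed tryAcquire would translate into a 429 or 503 JSON response returned before any SSE headers are set, and release would be deferred at the point where the stream ends.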

Use the context.WithTimeout cancel to ensure goroutines don't leak if the LLM API hangs. The defer cancel() in the handler guarantees cleanup even if the client disconnects before [DONE].

This pattern — SSE in Fiber, EventSource in the browser, no-buffer Nginx config — is production-ready and requires zero additional dependencies beyond what a standard Go web API already uses.

Source: dev.to (original article)