Streaming Claude to the Browser With Backpressure That Actually Works

Source: dev.to · Published: 2026-06-24T15:05Z · Topic: large-language-models

A developer detailed a production-grade setup for streaming Claude's LLM tokens to a browser using Server-Sent Events in a Next.js route handler, emphasizing the critical `X-Accel-Buffering: no` header to prevent nginx buffering and the need to wire the request's abort signal to stop token generation on client disconnect, avoiding unnecessary costs.

4 min read · Published Jun 24, 2026

Streaming LLM tokens to a browser is easy to get 80% right and surprisingly easy to get the last 20% wrong. The naive version works on your machine and falls apart under a flaky connection or a fast model. Here is the production-grade setup I use, including the part most tutorials skip: what happens when the client cannot keep up with the stream.

In a Next.js route handler, you return a `ReadableStream` that pipes Claude's stream events out as Server-Sent Events:

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

export async function POST(request: Request) {
  const { prompt } = await request.json();

  const stream = new ReadableStream({
    async start(controller) {
      const encoder = new TextEncoder();
      const llm = client.messages.stream({
        model: "claude-opus-4-8",
        max_tokens: 64000, // streaming, so give it room
        thinking: { type: "adaptive" },
        messages: [{ role: "user", content: prompt }],
      });

      try {
        for await (const event of llm) {
          if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
            controller.enqueue(encoder.encode(`data: ${JSON.stringify({ text: event.delta.text })}\n\n`));
          }
        }
        controller.enqueue(encoder.encode(`data: ${JSON.stringify({ done: true })}\n\n`));
      } catch (err) {
        const message = err instanceof Error ? err.message : "stream failed";
        controller.enqueue(encoder.encode(`data: ${JSON.stringify({ error: message })}\n\n`));
      } finally {
        controller.close();
      }
    },
  });

  return new Response(stream, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
      Connection: "keep-alive",
      "X-Accel-Buffering": "no", // stop nginx from buffering the stream
    },
  });
}

The `X-Accel-Buffering: no` header is the one people forget. Without it, nginx buffers the proxied response by default, so the user sees nothing until a buffer fills or the whole response is done, which defeats the entire point of streaming.
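The handler above builds each SSE frame inline. A small helper can make the framing explicit and easier to test. This is only a minimal sketch, and `sseFrame` is a hypothetical name rather than anything from the original code. It leans on the fact that `JSON.stringify` escapes newlines inside strings, so token text containing line breaks cannot end a frame early.

```typescript
// Hypothetical helper (an assumption, not part of the original handler):
// format one Server-Sent Events frame as "data: <json>" plus a blank line.
function sseFrame(payload: unknown): string {
  // JSON.stringify keeps the payload on a single line, because newlines
  // inside string values are emitted as the two characters "\" and "n".
  return `data: ${JSON.stringify(payload)}\n\n`;
}

// Example frames matching the shapes the handler sends.
const tokenFrame = sseFrame({ text: "Hello" });
const multilineFrame = sseFrame({ text: "line one\nline two" });
const doneFrame = sseFrame({ done: true });
```

In the route handler, each `controller.enqueue(encoder.encode(...))` call could then take `sseFrame({ text: event.delta.text })` as its input.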

Here is the failure mode that does not show up in a demo. The user navigates away, or closes the tab, or their connection drops, while the model is still generating. On the server, your `for await` loop keeps pulling tokens from Claude, paying for output you are throwing into a closed pipe.

The fix is to wire the request's abort signal through to the Claude stream so that when the client disconnects, you stop generating:

export async function POST(request: Request) {
  const { prompt } = await request.json();

  const stream = new ReadableStream({
    async start(controller) {
      const llm = client.messages.stream(
        {
          model: "claude-opus-4-8",
          max_tokens: 64000,
          messages: [{ role: "user", content: prompt }],
        },
        { signal: request.signal }, // abort the SDK stream when the request aborts
      );

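      request.signal.addEventListener(
        "abort",
        () => llm.abort(), // stop pulling tokens
        { once: true },
      );

      // ... same encoder and loop as above. Leave controller.close() to the
      // loop's finally block: closing here as well makes that second close() throw.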
    },
  });
  // ...
}

Now a disconnected client stops the generation, which stops the bill from growing. With `max_tokens` set to 64,000, an abandoned stream that keeps generating is real money.
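A related point is the slow client rather than the vanished one. A loop inside `start()` enqueues as fast as the model produces, whether or not anyone is reading. One possible alternative, sketched here under assumptions, is to drive the stream from `pull()`, which only runs when the stream's queue has room, and to stop the producer from `cancel()`. `iteratorToStream` is a hypothetical helper name, and the sketch assumes the runtime cancels the response body when the client disconnects; where it does not, `request.signal` remains the hook to use.

```typescript
// Sketch only: wrap an async iterator of strings in a pull-driven stream.
function iteratorToStream(source: AsyncIterator<string>): ReadableStream<string> {
  let stopped = false;

  return new ReadableStream<string>({
    // pull() is called only when the internal queue has room, so a slow
    // reader slows the producer instead of growing an unbounded queue.
    async pull(controller) {
      const { done, value } = await source.next();
      if (stopped) return; // the consumer cancelled while we were waiting
      if (done) {
        stopped = true;
        controller.close();
        return;
      }
      controller.enqueue(value);
    },
    // cancel() is called when the consumer goes away. return() lets the
    // producer run its cleanup; for an SDK stream, abort it here instead.
    async cancel() {
      if (stopped) return;
      stopped = true;
      await source.return?.();
    },
  });
}
```

With the default queue size, the producer stays at most about one item ahead of the reader, and cancelling the reader runs the generator's `finally` block.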

On the browser side, `fetch` gives you a readable stream. The trick is that chunks arrive at arbitrary boundaries, so you buffer and split on the SSE delimiter:

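function streamCompletion(
  prompt: string,
  onToken: (t: string) => void,
  onError: (err: unknown) => void = console.error,
): AbortController {
  const controller = new AbortController();

  // Run the read loop in the background so the controller is returned right away;
  // returning it only after the loop ends would leave nothing to abort.
  (async () => {
    const res = await fetch("/api/stream", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ prompt }),
      signal: controller.signal,
    });
    if (!res.ok || !res.body) throw new Error(`stream request failed: ${res.status}`);

    const reader = res.body.getReader();
    const decoder = new TextDecoder();
    let buffer = "";

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      buffer += decoder.decode(value, { stream: true });

      const events = buffer.split("\n\n");
      buffer = events.pop() ?? ""; // keep the incomplete tail

      for (const evt of events) {
        const line = evt.split("\n").find((l) => l.startsWith("data: "));
        if (!line) continue;
        const data = JSON.parse(line.slice(6));
        if (data.text) onToken(data.text);
        if (data.error) throw new Error(data.error);
      }
    }
  })().catch((err) => {
    // An abort is expected when the UI unmounts; report everything else.
    if (!controller.signal.aborted) onError(err);
  });

  return controller; // hold this so the UI can abort on unmount
}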

Return the `AbortController` so a React component can call `controller.abort()` in its cleanup function. That is what propagates the abort all the way back to the server and stops the generation.
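The buffer-and-split step is the part most worth testing in isolation, since it is where chunk boundaries bite. One way to do that, sketched here with a hypothetical `parseSSE` function that is not part of the original code, is to pull the parsing into a pure function that returns the complete events plus the unparsed tail.

```typescript
// Hypothetical pure helper: split a buffer into complete SSE events and
// return whatever incomplete text is left over for the next chunk.
interface ParsedChunk {
  events: unknown[];
  rest: string;
}

function parseSSE(buffer: string): ParsedChunk {
  const parts = buffer.split("\n\n");
  const rest = parts.pop() ?? ""; // text after the last delimiter is incomplete
  const events: unknown[] = [];
  for (const part of parts) {
    const line = part.split("\n").find((l) => l.startsWith("data: "));
    if (line) events.push(JSON.parse(line.slice("data: ".length)));
  }
  return { events, rest };
}

// A frame cut in half by the network: nothing is emitted until it completes.
const first = parseSSE('data: {"text":"He');
const second = parseSSE(first.rest + 'llo"}\n\ndata: {"done":true}\n\n');
```

Here `first.events` is empty and `first.rest` holds the partial frame, while `second.events` contains both the completed token event and the `done` event.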

One performance note: a fast model emits tokens faster than the DOM wants to repaint. Updating React state on every single token thrashes. Buffer a few tokens (or use `requestAnimationFrame`) and flush in batches. The user cannot read faster than ~10 updates per second anyway, and the UI stays smooth.
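The batching idea can be sketched as a small wrapper around the token callback. This is an illustration under assumptions: `createTokenBatcher` and its `schedule` parameter are hypothetical names, and the 100 ms default is simply the ~10 updates per second mentioned above.

```typescript
// Hypothetical helper: collect tokens and hand them to `flush` in batches.
// `schedule` decides when a batch is flushed; it is injectable so the
// behaviour can be tested without real timers.
function createTokenBatcher(
  flush: (text: string) => void,
  schedule: (cb: () => void) => void = (cb) => {
    setTimeout(cb, 100); // roughly 10 flushes per second
  },
): (token: string) => void {
  let pending = "";
  let scheduled = false;

  return (token: string) => {
    pending += token;
    if (scheduled) return; // a flush is already queued for this batch
    scheduled = true;
    schedule(() => {
      scheduled = false;
      const text = pending;
      pending = "";
      flush(text);
    });
  };
}
```

In a component, the returned function would be passed as `onToken`, with `flush` appending the batch to state. In the browser, passing `(cb) => requestAnimationFrame(cb)` as `schedule` ties flushes to repaints instead of a fixed interval.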

The demo version of streaming works because nobody closes the tab and the network is perfect. Production is not that. The two things that separate a real implementation from a tutorial: disable proxy buffering so tokens actually flow, and propagate aborts end to end so an abandoned stream stops costing you money. Get those two right and streaming is genuinely robust. Skip them and it works right up until it matters.


Read original on dev.to → dev.to/pavelespitia/streaming-claude-to-the-brow…
