{"slug": "streaming-an-llm-response-in-4-gifs", "title": "Streaming an LLM response, in 4 GIFs", "summary": "Anthropic's SDK streaming method uses Server-Sent Events (SSE) to deliver LLM responses token-by-token over a persistent HTTP connection, rather than waiting for the full JSON blob. Setting `\"stream\": true` in the POST request cuts the perceived wait time from 4 seconds to 300 milliseconds for the first word, even though total generation time remains the same. The raw stream delivers text in `delta.text` fields within `content_block_delta` events, with chunk boundaries determined by network conditions rather than tokens or words.", "body_md": "You have probably watched tokens stream in from an LLM, appearing one at a time as if the model were typing. If you used the Anthropic SDK's .stream() method, it just worked and you probably never saw what was on the wire.\n\nThis post focuses on how a streamed response works and on the bugs the SDK handles for you under the hood.\n\n## 1. Why streaming exists\n\nEnabling streaming takes one change to the POST request, a single field, `\"stream\": true`, and it changes how the response feels.\n\nHere is what to take from the GIF:\n\n- The left side shows no streaming: the cursor blinks for 4 seconds, then the whole response lands at once.\n- The right side shows streaming: the first word shows up in about 300 milliseconds, and words flow in as the model generates them.\n\nBoth sides have the **same model, same prompt, same total time**. The right side simply starts responding almost 4 seconds earlier. A 4-second wait for a full reply feels broken. A streamed reply that finishes in four seconds feels fast. *Streaming doesn't make the model faster; it makes the wait disappear.*\n\n## 2. What's on the wire\n\nWhen you set `stream: true`, the API stops sending a single JSON blob. 
It opens a persistent HTTP connection and pushes events down the line as the model generates them. **The format is Server-Sent Events (SSE), a web standard.** Any SSE debugger will read this stream.\n\nHere's what comes through:\n\nA few things to notice:\n\n**The text lives in** `delta.text`, nested inside `content_block_delta` events. Those are the events we need to watch for.\n\n`stop_reason` moved. [In post 1](https://dev.to/jasmin/an-llm-api-call-in-4-gifs-33b1), we saw it right there in the response JSON. Here, it arrives at the very end inside a `message_delta` event, just before `message_stop`. If the loop bails out as soon as the text stops arriving, we will never see it.\n\n**Chunks don't line up with tokens or words.** You might get `\"Hello\"` in one chunk and `\" world\"` in the next, or both in one. The network decides where the cuts happen, not the model and not the API.\n\nThat's what the SDK has been hiding from you.\n\n## 3. Reading the stream\n\nStreaming sounds complicated until we write the loop. It's just reading bytes, buffering them, splitting on blank lines, and parsing JSON.\n\nHere's the flow:\n\n- The response body is a `ReadableStream`, which can be iterated with `for await`.\n- Each iteration gives us bytes, which we decode to a string.\n- Buffer the string. A chunk might end mid-message.\n- Split the buffer on `\\n\\n`, the SSE message separator.\n- Keep the last item in the buffer. It might be incomplete.\n- For each complete message, find the `data:` line, strip the prefix, and parse the JSON.\n- If the type is `content_block_delta`, print `delta.text`.\n- If it's `message_delta`, you've got your `stop_reason`.\n\nHere is the complete sample code you can try out:\n\nThis works because when a chunk ends in the middle of a message, `split(\"\\n\\n\")` leaves an incomplete fragment as the last item. 
`pop()` pulls it back into the buffer so the next chunk can finish it. Without this line, every split message crashes the parser.\n\nThe `data.delta.type === \"text_delta\"` check matters because `content_block_delta` can carry other delta types too: `input_json_delta` for tool arguments, `thinking_delta` for extended thinking, `signature_delta` for verification. For now we only care about text.\n\n*You can find the full implementation* [here on GitHub as well](https://github.com/Jasmin2895/TinyAgent/tree/main/streaming).\n\n## 4. Three bugs\n\nThe code above works on a good day. Here's what breaks it on a bad one.\n\n**The ghost stream.** The user navigates away, but the stream keeps running and tokens keep arriving with nobody to read them.\n\nThe fix is an `AbortController`: pass its signal to `fetch` and call `abort()` when you're done.\n\n**The silent truncation.** The API can send an `error` event mid-stream during overload. If the loop only handles `content_block_delta`, the error gets skipped and you end up with a truncated response and no exception. The fix is to handle `data.type === \"error\"` explicitly.\n\n**The split packet.** A single SSE message can arrive in two TCP packets. Without buffering, `JSON.parse` throws on the first half. This is what `buffer = messages.pop() ?? \"\"` fixes: it holds the incomplete piece until the next chunk completes it.\n\n### stop_reason, in a stream\n\nIn post 1, `stop_reason` was right there in the response JSON. In a stream, it's the same four values (`end_turn`, `max_tokens`, `tool_use`, `stop_sequence`), but they arrive inside a `message_delta` event near the end of the stream.\n\nThe same rule from post 1 applies: if you ignore `stop_reason`, you'll ship a bug. A `max_tokens` cutoff in a streamed response looks exactly like a normal end of stream. 
You won't know the model was cut off unless you read this event.\n\n### Three things to try before the next post\n\n**1.** Run the streaming code. Then change `\"stream\": true` to `false` and run it again. Notice how long you wait before seeing anything. That gap is what your users feel.\n\n**2.** Add `console.error(chunk.length)` inside the `for await` loop, before any parsing. Run the code and watch the numbers. You'll see chunks of wildly different sizes: 8 bytes here, 400 bytes there. The network decides, not the model. Tokens and chunks are not the same thing.\n\n**3.** Start a stream, then disconnect your wifi mid-response. Watch what happens. The loop hangs, then eventually throws, and you will only handle that cleanly if you have added error handling. This sets up the error handling post later in the series.\n\n### What's next\n\nTinyAgent can now stream a response. Tokens land as they arrive. `stop_reason` shows up at the end. It still has no memory, though: every call starts blank.\n\nThe upcoming posts in this series will cover more important details. 😁\n\n*Happy Coding! 👩💻*", "url": "https://wpnews.pro/news/streaming-an-llm-response-in-4-gifs", "canonical_source": "https://dev.to/jasmin/streaming-an-llm-response-in-4-gifs-16dh", "published_at": "2026-05-31 00:16:15+00:00", "updated_at": "2026-05-31 00:42:06.658194+00:00", "lang": "en", "topics": ["large-language-models", "generative-ai", "artificial-intelligence", "ai-tools", "ai-infrastructure"], "entities": ["Anthropic"], "alternates": {"html": "https://wpnews.pro/news/streaming-an-llm-response-in-4-gifs", "markdown": "https://wpnews.pro/news/streaming-an-llm-response-in-4-gifs.md", "text": "https://wpnews.pro/news/streaming-an-llm-response-in-4-gifs.txt", "jsonld": "https://wpnews.pro/news/streaming-an-llm-response-in-4-gifs.jsonld"}}