If you stream tool calls or structured output from an LLM, you have almost certainly seen one of these in production:
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 12 (char 11)
serde_json::Error: EOF while parsing a string at line 1 column 4096
It usually shows up under load, on your longest and most important responses, and it's maddening because the model did its job — it just got cut off. This post is about why that happens, why the obvious fixes don't really work, and a small proxy (Suture) that fixes it on the wire in microseconds without touching your code or your API keys.
When you stream a chat completion, the provider doesn't send you one JSON document. It sends a long sequence of Server-Sent Events, each a complete, valid little JSON object carrying a fragment:
data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"{\"ci"}}]}}]}
data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"ty\":\"Par"}}]}}]}
data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"is\"}"}}]}}]}
data: [DONE]
Your SDK reassembles the arguments
field across all those events into one string -— {"city":"Paris"}
— and then parses it. The catch: the thing that's actually JSON (the tool arguments, or your structured-output content
) lives inside those fragments and is only complete once the whole stream arrives.
So when the stream ends early — the model hits max_tokens
, blows the context window, or the socket just dies — you're left holding this:
{"city":"Par
The SSE envelope was fine. The reassembled JSON is not. Your parser throws.
try/except
and move on."}
.max_tokens
.The naive repair is "append the missing ]
and }
." Consider a tool-args stream truncated right after a comma:
{"items":[250,194,
The tempting fix is to append ]}
→ {"items":[250,194,]}
. That is invalid JSON — a trailing comma. A correct repair has to drop the dangling comma first, then close:
{"items":[250,194]}
. The same trap hides in partial numbers (1.
, 1e
), partial keywords (tru
), incomplete \uXXXX
escapes, and — the nastiest — a multibyte UTF-8 character sliced in half by the truncation, where naively appending "
produces invalid UTF-8 and a different crash.
Getting this right means treating it as what it is: a tiny, careful JSON parser. Suture's core is a byte-level state machine with one invariant, checked by a property test against serde_json
: for any prefix of any valid JSON value, the repaired output parses. That test caught the trailing-comma bug, the partial-scalar bugs, and a UTF-8-splitting panic before any of them could ship.
Suture is a reverse proxy. You point your SDK's base_url
at it and change nothing else:
client = OpenAI(base_url="http://localhost:8787/v1", api_key=os.environ["OPENAI_API_KEY"])
It forwards your request verbatim (your key just passes through — Suture stores nothing), watches the streaming response, tracks the reassembled tool-args / structured content with the byte-level engine, and at end-of-stream emits exactly the characters needed to close it — as a final, well-formed delta event before the terminator. Your client reassembles valid JSON and never knows anything was wrong.
Design choices that matter:
criterion
) — three orders of magnitude under the time you spend waiting on the model.content
only when it's actually JSON), so it never mangles prose.ConverseStream
, a binary CRC-checked frame protocol) are all supported.Suture forwards your credential and holds nothing. For AWS Bedrock it's even stronger: SigV4 signing means the secret access key never crosses the wire at all — only a per-request signature — so a compromised proxy can't steal a reusable AWS credential. (We validate the upstream Host
to AWS, too; an SSRF that tried to exploit the Host
header was caught and fixed in review.)
This isn't magic and it isn't for everything. Providers are shipping native structured-output guarantees (strict schemas, constrained decoding) that reduce malformed JSON — good. What they don't fix is truncation: a stream cut at the token cap or a dead socket still leaves you with valid-but-incomplete JSON, across the long tail of models, Bedrock, and older APIs. That residual is exactly what Suture is for. It also won't resurrect data that never arrived — it makes what did arrive parseable.
Suture is Rust, dual-licensed MIT/Apache-2.0, ~100 tests, on GitHub:
** https://github.com/tensorhq/suture-stream-repair**. The repair engine is a standalone library if you'd rather repair in-process and keep even the response bytes off the network.
If your structured-output pipeline has ever thrown on a truncated stream, it's a one-line base_url
change to find out whether this helps.