Why your LLM tool calls silently break — and a ~10µs fix

A developer has released Suture, a reverse proxy that repairs truncated JSON from LLM streaming responses in approximately 10 microseconds. The tool addresses a common production failure where tool calls or structured outputs break mid-stream due to max tokens, context window limits, or socket disconnections, leaving developers with unparseable JSON fragments. Suture operates as a byte-level state machine that correctly closes truncated JSON without introducing trailing commas or invalid UTF-8, and requires only changing the SDK's base URL to deploy.

If you stream tool calls or structured output from an LLM, you have almost certainly seen one of these in production:

    json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 12 (char 11)
    serde_json::Error: EOF while parsing a string at line 1 column 4096

It usually shows up under load, on your longest and most important responses, and it's maddening because the model did its job; it just got cut off.

This post is about why that happens, why the obvious fixes don't really work, and a small proxy, Suture (https://github.com/tensorhq/suture-stream-repair), that fixes it on the wire in microseconds without touching your code or your API keys.

When you stream a chat completion, the provider doesn't send you one JSON document. It sends a long sequence of Server-Sent Events, each a complete, valid little JSON object carrying a fragment:

    data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"{\"ci"}}]}}]}
    data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"ty\":\"Par"}}]}}]}
    data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"is\"}"}}]}}]}
    data: [DONE]

Your SDK reassembles the arguments field across all those events into one string, {"city":"Paris"}, and then parses it. The catch: the thing that's actually JSON (the tool arguments, or your structured-output content) lives inside those fragments and is only complete once the whole stream arrives. So when the stream ends early (the model hits max_tokens, blows the context window, or the socket just dies) you're left holding this:

    {"city":"Par

The SSE envelope was fine. The reassembled JSON is not. Your parser throws.

The obvious fixes (wrap the parse in try/except and move on, append the missing "}, raise max_tokens) don't really work.

The naive repair is "append the missing " and }." Consider a tool-args stream truncated right after a comma:

    {"items":[250,194,

The tempting fix is to append ]} and get {"items":[250,194,]}. That is invalid JSON: a trailing comma.
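
To make the failure concrete, here is a minimal Python sketch that replays the three example events, reassembles the arguments the way an SDK would, and then shows both the truncated parse and the naive close failing. The helper names (reassemble, parses) are illustrative and not part of any SDK, and the event shape is simply the OpenAI-style chunk from the example.

```python
import json

# The three example SSE payloads, without the "data: " prefix.
events = [
    r'{"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"{\"ci"}}]}}]}',
    r'{"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"ty\":\"Par"}}]}}]}',
    r'{"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"is\"}"}}]}}]}',
]


def reassemble(payloads):
    """Concatenate the tool-call argument fragments (hypothetical helper)."""
    parts = []
    for payload in payloads:
        chunk = json.loads(payload)  # every envelope is valid JSON on its own
        call = chunk["choices"][0]["delta"]["tool_calls"][0]
        parts.append(call["function"]["arguments"])
    return "".join(parts)


def parses(text):
    """True if text is a complete JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


full = reassemble(events)       # all three events arrived
cut = reassemble(events[:2])    # the stream died after the second event
naive = '{"items":[250,194,' + "]}"  # blindly closing after a trailing comma

print(full, parses(full))
print(cut, parses(cut))
print(naive, parses(naive))
```

Only the fully reassembled string parses; the truncated one and the naively closed one both raise in json.loads.
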
A correct repair has to drop the dangling comma first, then close: {"items":[250,194]}. The same trap hides in partial numbers (1., 1e), partial keywords (tru), incomplete \uXXXX escapes, and, nastiest of all, a multibyte UTF-8 character sliced in half by the truncation, where naively appending " produces invalid UTF-8 and a different crash.

Getting this right means treating it as what it is: a tiny, careful JSON parser. Suture's core is a byte-level state machine with one invariant, checked by a property test against serde_json: for any prefix of any valid JSON value, the repaired output parses. That test caught the trailing-comma bug, the partial-scalar bugs, and a UTF-8-splitting panic before any of them could ship.

Suture is a reverse proxy. You point your SDK's base_url at it and change nothing else:

    client = OpenAI(base_url="http://localhost:8787/v1", api_key=os.environ["OPENAI_API_KEY"])

It forwards your request verbatim (your key just passes through; Suture stores nothing), watches the streaming response, tracks the reassembled tool-args / structured content with the byte-level engine, and at end-of-stream emits exactly the characters needed to close it, as a final, well-formed delta event before the terminator. Your client reassembles valid JSON and never knows anything was wrong.

Design choices that matter:

- A repair takes about 10 µs, measured with criterion: three orders of magnitude under the time you spend waiting on the model.
- It repairs content only when it's actually JSON, so it never mangles prose.
- ConverseStream, a binary CRC-checked frame protocol, is among the supported stream formats.

Suture forwards your credential and holds nothing. For AWS Bedrock it's even stronger: SigV4 signing means the secret access key never crosses the wire at all, only a per-request signature, so a compromised proxy can't steal a reusable AWS credential. We also validate that the upstream Host points at AWS; an SSRF that tried to exploit the Host header was caught and fixed in review.

This isn't magic and it isn't for everything.
Providers are shipping native structured-output guarantees (strict schemas, constrained decoding) that reduce malformed JSON, which is good. What they don't fix is truncation: a stream cut at the token cap or by a dead socket still leaves you with JSON that is correct as far as it goes but incomplete, across the long tail of models, Bedrock, and older APIs. That residual is exactly what Suture is for. It also won't resurrect data that never arrived; it makes what did arrive parseable.

Suture is Rust, dual-licensed MIT/Apache-2.0, ~100 tests, on GitHub: https://github.com/tensorhq/suture-stream-repair. The repair engine is a standalone library if you'd rather repair in-process and keep even the response bytes off the network. If your structured-output pipeline has ever thrown on a truncated stream, it's a one-line base_url change to find out whether this helps.
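
As an illustration of the prefix-repair invariant described earlier, the following is a minimal in-process sketch in Python. It is not Suture's implementation, and every name in it (repair_prefix, scan_string, fix_scalar, all_prefixes_parse) is hypothetical. It makes several simplifying assumptions: it works on already-decoded text rather than bytes, so split UTF-8 sequences are out of scope; it assumes the input really is a prefix of valid JSON; it drops a half-written key or dangling comma by rolling back instead of emitting an append-only suffix; it does not pair up \uXXXX surrogates; and it returns null when nothing usable arrived.

```python
import json
import re

CLOSER = {"{": "}", "[": "]"}
NUMBER = re.compile(r"-?(0|[1-9]\d*)(\.\d+)?([eE][+-]?\d+)?")
WS = " \t\r\n"


def scan_string(text, start):
    """Scan the string opening at text[start].

    Returns (True, end) with end one past the closing quote, or
    (False, keep) if the input ran out; text[:keep] leaves out any
    escape sequence that was cut in half.
    """
    j, n = start + 1, len(text)
    while j < n:
        if text[j] == '"':
            return True, j + 1
        if text[j] == "\\":
            need = 6 if text[j + 1:j + 2] == "u" else 2
            if j + need > n:
                return False, j
            j += need
        else:
            j += 1
    return False, n


def fix_scalar(token):
    """Complete a cut-off literal or trim a cut-off number; None if hopeless."""
    for literal in ("true", "false", "null"):
        if literal.startswith(token):
            return literal
    while token and not NUMBER.fullmatch(token):
        token = token[:-1]
    return token or None


def repair_prefix(text):
    """Return parseable JSON built from a prefix of a valid JSON document."""
    stack = []          # open containers, innermost last
    safe_end = 0        # text[:safe_end] + closers parses whenever safe_end > 0
    expect_key = False  # True when the next string in an object is a key
    i, n = 0, len(text)

    def closers():
        return "".join(CLOSER[c] for c in reversed(stack))

    while i < n:
        ch = text[i]
        if ch in WS:
            i += 1
        elif ch in "{[":
            stack.append(ch)
            expect_key = ch == "{"
            i += 1
            safe_end = i
        elif ch in "}]":
            stack.pop()
            i += 1
            safe_end = i
        elif ch == ",":
            expect_key = bool(stack) and stack[-1] == "{"
            i += 1
        elif ch == ":":
            expect_key = False
            i += 1
        elif ch == '"':
            closed, end = scan_string(text, i)
            if not closed:
                if expect_key:
                    break  # half a key: fall back to the last safe point
                return text[:end] + '"' + closers()
            i = end
            if not expect_key:
                safe_end = i
        else:
            j = i
            while j < n and text[j] not in ",]}" + WS:
                j += 1
            if j == n:  # the scalar touches the cut, so it may be incomplete
                fixed = fix_scalar(text[i:j])
                if fixed is None:
                    break
                return text[:i] + fixed + closers()
            i = j
            safe_end = i
    return text[:safe_end] + closers() if safe_end else "null"


def all_prefixes_parse(document):
    """Check that every non-empty prefix of document repairs to valid JSON."""
    for cut in range(1, len(document) + 1):
        json.loads(repair_prefix(document[:cut]))  # raises if a repair is invalid
    return True
```

With this sketch, repair_prefix('{"city":"Par') gives {"city":"Par"} and repair_prefix('{"items":[250,194,') gives {"items":[250,194]}, matching the two examples in the text. The all_prefixes_parse helper is a small stand-in for the kind of property test the post describes, run here against Python's json module rather than serde_json.
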