{"slug": "why-your-llm-tool-calls-silently-break-and-a-10ms-fix", "title": "Why your LLM tool calls silently break — and a ~10µs fix", "summary": "A developer has released Suture, a reverse proxy that repairs truncated JSON from LLM streaming responses in approximately 10 microseconds. The tool addresses a common production failure where tool calls or structured outputs break mid-stream due to max tokens, context window limits, or socket disconnections, leaving developers with unparseable JSON fragments. Suture operates as a byte-level state machine that correctly closes truncated JSON without introducing trailing commas or invalid UTF-8, and requires only changing the SDK's base URL to deploy.", "body_md": "If you stream tool calls or structured output from an LLM, you have almost certainly seen one of these in production:\n\n```\njson.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 12 (char 11)\nserde_json::Error: EOF while parsing a string at line 1 column 4096\n```\n\nIt usually shows up under load, on your longest and most important responses, and it's maddening because *the model did its job* — it just got cut off. This post is about why that happens, why the obvious fixes don't really work, and a small proxy ([Suture](https://github.com/tensorhq/suture-stream-repair)) that fixes it on the wire in microseconds without touching your code or your API keys.\n\nWhen you stream a chat completion, the provider doesn't send you one JSON document. 
It sends a long sequence of Server-Sent Events, each a *complete, valid* little JSON object carrying a fragment:\n\n```\ndata: {\"choices\":[{\"delta\":{\"tool_calls\":[{\"function\":{\"arguments\":\"{\\\"ci\"}}]}}]}\ndata: {\"choices\":[{\"delta\":{\"tool_calls\":[{\"function\":{\"arguments\":\"ty\\\":\\\"Par\"}}]}}]}\ndata: {\"choices\":[{\"delta\":{\"tool_calls\":[{\"function\":{\"arguments\":\"is\\\"}\"}}]}}]}\ndata: [DONE]\n```\n\nYour SDK **reassembles** the `arguments` field across all those events into one string, `{\"city\":\"Paris\"}`, and *then* parses it. The catch: the thing that's actually JSON (the tool arguments, or your structured-output `content`) lives *inside* those fragments and is only complete once the whole stream arrives.\n\nSo when the stream ends early (the model hits `max_tokens`, blows the context window, or the socket just dies), you're left holding this:\n\n```\n{\"city\":\"Par\n```\n\nThe SSE envelope was fine. The reassembled JSON is not. Your parser throws.\n\n- `try/except` and move on.\n- `\"}`.\n- `max_tokens`.\n\nThe naive repair is \"append the missing `]` and `}`.\" Consider a tool-args stream truncated right after a comma:\n\n```\n{\"items\":[250,194,\n```\n\nThe tempting fix is to append `]}` → `{\"items\":[250,194,]}`. That is **invalid JSON**: a trailing comma. A correct repair has to *drop* the dangling comma first, then close: `{\"items\":[250,194]}`. The same trap hides in partial numbers (`1.`, `1e`), partial keywords (`tru`), incomplete `\\uXXXX` escapes, and, nastiest of all, a multibyte UTF-8 character sliced in half by the truncation, where naively appending `\"` produces invalid UTF-8 and a *different* crash.\n\nGetting this right means treating it as what it is: a tiny, careful JSON parser. 
Suture's core is a byte-level state machine with one invariant, checked by a property test against `serde_json`: *for any prefix of any valid JSON value, the repaired output parses.* That test caught the trailing-comma bug, the partial-scalar bugs, and a UTF-8-splitting panic before any of them could ship.\n\nSuture is a reverse proxy. You point your SDK's `base_url` at it and change nothing else:\n\n```\nclient = OpenAI(base_url=\"http://localhost:8787/v1\", api_key=os.environ[\"OPENAI_API_KEY\"])\n```\n\nIt forwards your request verbatim (your key just passes through; Suture stores nothing), watches the streaming response, tracks the reassembled tool-args / structured content with the byte-level engine, and at end-of-stream emits exactly the characters needed to close it, as a final, well-formed delta event before the terminator. Your client reassembles valid JSON and never knows anything was wrong.\n\nDesign choices that matter:\n\n- `criterion`), three orders of magnitude under the time you spend waiting on the model.\n- `content` only when it's actually JSON), so it never mangles prose.\n- `ConverseStream`, a binary CRC-checked frame protocol) are all supported.\n\nSuture forwards your credential and holds nothing. For **AWS Bedrock** it's even stronger: SigV4 signing means the secret access key never crosses the wire at all, only a per-request signature, so a compromised proxy can't steal a reusable AWS credential. (We also validate that the upstream `Host` is an AWS endpoint; an SSRF that tried to exploit the `Host` header was caught and fixed in review.)\n\nThis isn't magic and it isn't for everything. Providers are shipping native structured-output guarantees (strict schemas, constrained decoding) that reduce *malformed* JSON, which is good. What they don't fix is **truncation**: a stream cut at the token cap or by a dead socket still leaves you with valid-but-incomplete JSON, across the long tail of models, Bedrock, and older APIs. 
That residual is exactly what Suture is for. It also won't resurrect data that never arrived; it makes what *did* arrive parseable.\n\nSuture is Rust, dual-licensed MIT/Apache-2.0, ~100 tests, on GitHub: **https://github.com/tensorhq/suture-stream-repair**. The repair engine is a standalone library if you'd rather repair in-process and keep even the response bytes off the network.\n\nIf your structured-output pipeline has ever thrown on a truncated stream, it's a one-line `base_url` change to find out whether this helps.", "url": "https://wpnews.pro/news/why-your-llm-tool-calls-silently-break-and-a-10ms-fix", "canonical_source": "https://dev.to/wu_jiang_2ca3f4c2d1718f07/why-your-llm-tool-calls-silently-break-and-a-10us-fix-15mj", "published_at": "2026-06-04 02:05:27+00:00", "updated_at": "2026-06-04 02:12:12.471553+00:00", "lang": "en", "topics": ["large-language-models", "ai-tools", "ai-infrastructure"], "entities": ["Suture", "JSON", "LLM", "Server-Sent Events"], "alternates": {"html": "https://wpnews.pro/news/why-your-llm-tool-calls-silently-break-and-a-10ms-fix", "markdown": "https://wpnews.pro/news/why-your-llm-tool-calls-silently-break-and-a-10ms-fix.md", "text": "https://wpnews.pro/news/why-your-llm-tool-calls-silently-break-and-a-10ms-fix.txt", "jsonld": "https://wpnews.pro/news/why-your-llm-tool-calls-silently-break-and-a-10ms-fix.jsonld"}}