# Why your LLM tool calls silently break — and a ~10µs fix

> Source: <https://dev.to/wu_jiang_2ca3f4c2d1718f07/why-your-llm-tool-calls-silently-break-and-a-10us-fix-15mj>
> Published: 2026-06-04 02:05:27+00:00

If you stream tool calls or structured output from an LLM, you have almost certainly seen one of these in production:

```
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 12 (char 11)
serde_json::Error: EOF while parsing a string at line 1 column 4096
```

It usually shows up under load, on your longest and most important responses, and it's maddening because *the model did its job* — it just got cut off. This post is about why that happens, why the obvious fixes don't really work, and a small proxy ([Suture](https://github.com/tensorhq/suture-stream-repair)) that fixes it on the wire in microseconds without touching your code or your API keys.

When you stream a chat completion, the provider doesn't send you one JSON document. It sends a long sequence of Server-Sent Events, each a *complete, valid* little JSON object carrying a fragment:

```
data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"{\"ci"}}]}}]}
data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"ty\":\"Par"}}]}}]}
data: {"choices":[{"delta":{"tool_calls":[{"function":{"arguments":"is\"}"}}]}}]}
data: [DONE]
```

Your SDK **reassembles** the `arguments`

field across all those events into one string -— `{"city":"Paris"}`

— and *then* parses it. The catch: the thing that's actually JSON (the tool arguments, or your structured-output `content`

) lives *inside* those fragments and is only complete once the whole stream arrives.

So when the stream ends early — the model hits `max_tokens`

, blows the context window, or the socket just dies — you're left holding this:

```
{"city":"Par
```

The SSE envelope was fine. The reassembled JSON is not. Your parser throws.

`try/except`

and move on.`"}`

.`max_tokens`

.The naive repair is "append the missing `]`

and `}`

." Consider a tool-args stream truncated right after a comma:

```
{"items":[250,194,
```

The tempting fix is to append `]}`

→ `{"items":[250,194,]}`

. That is **invalid JSON** — a trailing comma. A correct repair has to *drop* the dangling comma first, then close:

`{"items":[250,194]}`

. The same trap hides in partial numbers (`1.`

, `1e`

), partial keywords (`tru`

), incomplete `\uXXXX`

escapes, and — the nastiest — a multibyte UTF-8 character sliced in half by the truncation, where naively appending `"`

produces invalid UTF-8 and a *different* crash.

Getting this right means treating it as what it is: a tiny, careful JSON parser. Suture's core is a byte-level state machine with one invariant, checked by a property test against `serde_json`

: *for any prefix of any valid JSON value, the repaired output parses.* That test caught the trailing-comma bug, the partial-scalar bugs, and a UTF-8-splitting panic before any of them could ship.

Suture is a reverse proxy. You point your SDK's `base_url`

at it and change nothing else:

```
client = OpenAI(base_url="http://localhost:8787/v1", api_key=os.environ["OPENAI_API_KEY"])
```

It forwards your request verbatim (your key just passes through — Suture stores nothing), watches the streaming response, tracks the reassembled tool-args / structured content with the byte-level engine, and at end-of-stream emits exactly the characters needed to close it — as a final, well-formed delta event before the terminator. Your client reassembles valid JSON and never knows anything was wrong.

Design choices that matter:

`criterion`

) — three orders of magnitude under the time you spend waiting on the model.`content`

only when it's actually JSON), so it never mangles prose.`ConverseStream`

, a binary CRC-checked frame protocol) are all supported.Suture forwards your credential and holds nothing. For **AWS Bedrock** it's even stronger: SigV4 signing means the secret access key never crosses the wire at all — only a per-request signature — so a compromised proxy can't steal a reusable AWS credential. (We validate the upstream `Host`

to AWS, too; an SSRF that tried to exploit the `Host`

header was caught and fixed in review.)

This isn't magic and it isn't for everything. Providers are shipping native structured-output guarantees (strict schemas, constrained decoding) that reduce *malformed* JSON — good. What they don't fix is **truncation**: a stream cut at the token cap or a dead socket still leaves you with valid-but-incomplete JSON, across the long tail of models, Bedrock, and older APIs. That residual is exactly what Suture is for. It also won't resurrect data that never arrived — it makes what *did* arrive parseable.

Suture is Rust, dual-licensed MIT/Apache-2.0, ~100 tests, on GitHub:

** https://github.com/tensorhq/suture-stream-repair**. The repair engine is a standalone library if you'd rather repair in-process and keep even the response bytes off the network.

If your structured-output pipeline has ever thrown on a truncated stream, it's a one-line `base_url`

change to find out whether this helps.
