Ultra-low-latency reverse proxy that repairs truncated and malformed JSON in LLM streaming responses, on the fly.
📝 The story: [Why your LLM tool calls silently break — and a ~10µs fix]
When an upstream LLM stream is cut off (by `max_tokens`, a context-window limit, or a dropped socket), the JSON it was emitting (a tool call's `arguments`, or structured-output `content`) is left unterminated, and your application throws `JSONDecodeError` / `serde_json` "EOF while parsing" errors. Suture sits between your app and the provider, watches the stream, and emits exactly the missing characters to make the reassembled JSON valid, without buffering the stream or adding meaningful latency.
A tool-call stream truncated at `max_tokens` leaves your client reassembling invalid JSON:
// what the client reassembles from the delta events:
{"city": "Par // ← unterminated → JSONDecodeError / serde_json: EOF while parsing a string
Suture closes it on the wire, so the client gets valid JSON instead — no SDK changes, no retry, no regenerated tokens:
{"city": "Par"} // ← valid; the string and object are safely closed
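The failure and the fix above can be reproduced with nothing but Python's standard `json` module. This is only an illustration of the symptom: the two appended characters are written by hand here, whereas Suture computes them on the wire.

```python
import json

truncated = '{"city": "Par'     # what the client reassembles after a cut-off stream
repaired = truncated + '"}'     # the two closing characters from the example above

try:
    json.loads(truncated)
    error = None
except json.JSONDecodeError as exc:
    error = exc.msg             # the parser's complaint about the truncated input

parsed = json.loads(repaired)   # parses cleanly once the string and object are closed
print(error)
print(parsed)
```

The value `"Par"` is still a partial city name; the repair makes the JSON parseable, and it is up to the application to decide what to do with a truncated value.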
You're in the right place if your LLM app has thrown any of these on a streaming response:
- `json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column …`
- `json.decoder.JSONDecodeError: Expecting value: line 1 column … (char …)`
- `serde_json::Error: EOF while parsing a string` / `EOF while parsing an object`
- `pydantic_core.ValidationError` on a truncated tool-call `arguments`
- Tool / function-call arguments that won't parse when the model hits `max_tokens`
- Truncated structured-output / JSON-mode `content` across streamed deltas
…on OpenAI, Anthropic, Google Vertex AI (Gemini / Claude), or AWS Bedrock.
- Repairs **OpenAI** (`/v1/chat/completions`), **Anthropic** (`/v1/messages`), **GCP Vertex AI** (Gemini + Claude-on-Vertex), and **AWS Bedrock** (`ConverseStream`) streaming responses.
- **SSE-aware**: repairs the *reassembled* tool-call arguments / structured content accumulated across delta events, not just raw wire bytes.
- **Streaming + compressed**: transparently decodes gzip/brotli/deflate, repairs, and re-encodes per the client's `Accept-Encoding`; never buffers the whole body. Added overhead is ~10 µs per chunk.
- **Holds no credentials**: your provider API key / bearer token is forwarded verbatim.
- The byte-level repair engine is usable as a standalone library: `cargo add suture-repair` (then `use suture::…`), or just the engine via `cargo add suture-repair-core` (`use suture_core::repair_str`).
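To make "reassembled across delta events" concrete, here is a minimal Python sketch of the client-side view for an OpenAI-style chat-completions stream. It is not Suture's code: the function name `reassemble_tool_args` and the sample events are made up for illustration, and only the `choices[].delta.tool_calls[].function.arguments` field and the `[DONE]` terminator are taken from the OpenAI streaming format.

```python
import json
from collections import defaultdict


def reassemble_tool_args(sse_text):
    """Concatenate tool-call argument fragments, keyed by tool-call index."""
    args = defaultdict(str)
    for line in sse_text.splitlines():
        if not line.startswith("data:"):
            continue                          # blank separators, comments, other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":               # stream terminator
            break
        event = json.loads(payload)           # each event is itself complete JSON
        for choice in event.get("choices", []):
            for call in choice.get("delta", {}).get("tool_calls") or []:
                fragment = call.get("function", {}).get("arguments") or ""
                args[call["index"]] += fragment
    return dict(args)


def _chunk(fragment):
    # Hypothetical helper: wrap one argument fragment in a delta event.
    event = {"choices": [{"delta": {"tool_calls": [
        {"index": 0, "function": {"arguments": fragment}}]}}]}
    return "data: " + json.dumps(event)


sample = "\n\n".join([_chunk('{"city"'), _chunk(': "Par'), "data: [DONE]"])
reassembled = reassemble_tool_args(sample)
print(reassembled)
```

Every event on the wire is valid JSON on its own, which is why a byte-level check of the raw stream would see nothing wrong; the breakage only shows up in the concatenated `arguments` string.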
cargo install suture-repair # installs the `suture` binary; or: docker build -t suture .
suture # listens on 127.0.0.1:8787
Point your SDK's base URL at Suture (your API key still flows through):
import os
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8787/v1", api_key=os.environ["OPENAI_API_KEY"])
Routes: `POST /v1/chat/completions` → OpenAI, `POST /v1/messages` → Anthropic, `POST /v1/projects/*` → Vertex, `POST /model/*` → Bedrock (each when enabled), `GET /health`.
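As a rough mental model of that routing table, the mapping can be sketched in Python. Everything here is illustrative: the `resolve` function, the provider labels, and the choice to return `None` for a disabled or unknown route are assumptions, not a description of how the proxy itself responds.

```python
from fnmatch import fnmatchcase

# (method, path pattern, target) taken from the route list above.
ROUTES = [
    ("POST", "/v1/chat/completions", "openai"),
    ("POST", "/v1/messages", "anthropic"),
    ("POST", "/v1/projects/*", "vertex"),
    ("POST", "/model/*", "bedrock"),
    ("GET", "/health", "health"),
]
OPT_IN = {"vertex", "bedrock"}   # routes that are off unless enabled


def resolve(method, path, enabled=frozenset()):
    """Return the target for a request, or None (assumed) if nothing matches."""
    for route_method, pattern, target in ROUTES:
        if route_method == method and fnmatchcase(path, pattern):
            if target in OPT_IN and target not in enabled:
                return None
            return target
    return None


print(resolve("POST", "/v1/chat/completions"))
print(resolve("POST", "/model/example-model/converse-stream", enabled={"bedrock"}))
```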
Three layers, each independently tested:
- `suture-core`: a byte-level JSON repair state machine. Given any prefix of a valid JSON value, it computes the characters needed to close it (or reports that the input is inconsistent and should pass through untouched). No allocation beyond nesting depth.
- `suture-sse`: an incremental SSE parser + per-provider extractors that reassemble the JSON-bearing field across delta events, drive the core engine, and synthesize a closing event at stream end (before the terminator).
- `suture`: an axum/reqwest reverse proxy. Forwards your request verbatim, then on the response: `text/event-stream` is repaired via the SSE layer; a single `application/json` body is closed with the core engine; anything else streams through unchanged.
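The idea behind the core layer (track nesting and string state, then emit only the closing characters) can be sketched in Python. This is a simplified illustration, not a port of `suture-core`: the name `closing_suffix` is made up, padding a missing value with `null` is an assumption of this sketch, partial literals and numbers such as `tru` or `1.` are passed through rather than completed, and the final `json.loads` check is a sketch-only safety net that a streaming engine would not use.

```python
import json
import re

NUMBER = re.compile(r"-?(0|[1-9]\d*)(\.\d+)?([eE][+-]?\d+)?")
LITERALS = ("true", "false", "null")
WS = " \t\n\r"


def closing_suffix(prefix):
    """Return the characters that close `prefix`, or None to pass it through."""
    stack = []          # open containers, innermost last: "{" or "["
    expect = "value"    # what the grammar allows next
    tail = ""           # text that finishes the token the input stopped in
    i, n = 0, len(prefix)

    def after_value():
        return "comma_or_end" if stack else "done"

    while i < n:
        c = prefix[i]
        if c in WS:
            i += 1
        elif c == '"':
            is_key = expect in ("key", "key_or_end")
            if not is_key and expect not in ("value", "value_or_end"):
                return None
            j = i + 1
            while j < n and prefix[j] != '"':
                j += 2 if prefix[j] == "\\" else 1   # skip escaped characters
            if j >= n:
                tail = '"'                           # input ended inside the string
            expect = "colon" if is_key else after_value()
            i = j + 1
        elif c in "{[":
            if expect not in ("value", "value_or_end"):
                return None
            stack.append(c)
            expect = "key_or_end" if c == "{" else "value_or_end"
            i += 1
        elif c in "}]":
            opener, empty = ("{", "key_or_end") if c == "}" else ("[", "value_or_end")
            if not stack or stack[-1] != opener or expect not in ("comma_or_end", empty):
                return None
            stack.pop()
            expect = after_value()
            i += 1
        elif c == ":":
            if expect != "colon":
                return None
            expect, i = "value", i + 1
        elif c == ",":
            if expect != "comma_or_end":
                return None
            expect, i = ("key" if stack[-1] == "{" else "value"), i + 1
        else:                                        # number or literal
            if expect not in ("value", "value_or_end"):
                return None
            j = i
            while j < n and prefix[j] not in WS + ",]}":
                j += 1
            token = prefix[i:j]
            if not (NUMBER.fullmatch(token) or token in LITERALS):
                return None                          # partial or invalid scalar
            expect, i = after_value(), j

    if expect == "colon":
        tail += ": null"                             # assumption: pad a missing value
    elif expect == "value" and stack:
        tail += "null"
    elif expect in ("key", "value"):
        return None                                  # dangling comma in object, or empty input
    suffix = tail + "".join("}" if c == "{" else "]" for c in reversed(stack))
    try:
        json.loads(prefix + suffix)                  # sketch-only safety net
    except ValueError:
        return None
    return suffix


print(repr(closing_suffix('{"city": "Par')))
print(repr(closing_suffix('{"a": [1, {"b"')))
```

Memory use in the sketch grows only with the `stack` of open containers, which mirrors the "no allocation beyond nesting depth" property described above; the final `json.loads` call does not, and is there only to keep the sketch honest.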
| Env var | Default | Purpose |
|---|---|---|
| `SUTURE_LISTEN` | `127.0.0.1:8787` | listen address |
| `SUTURE_OPENAI_BASE` | `https://api.openai.com` | OpenAI upstream |
| `SUTURE_ANTHROPIC_BASE` | `https://api.anthropic.com` | Anthropic upstream |
| `SUTURE_VERTEX_ENABLED` | `0` | enable the Vertex route (host derived from the path) |
| `SUTURE_VERTEX_BASE` | (none) | optional Vertex upstream override |
| `SUTURE_BEDROCK_ENABLED` | `0` | enable the Bedrock route (host from the validated `Host` header) |
| `SUTURE_BEDROCK_BASE` | (none) | optional Bedrock upstream override |
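For scripting a deployment, the table can be mirrored as a small Python helper that shows which values would be in effect for a given environment. The helper `effective_config` is hypothetical and only reproduces the defaults listed above; it does not describe how Suture parses its environment, and which values count as "enabled" for the two flags is not specified here.

```python
import os

# Defaults copied from the table above; None means "no default".
DEFAULTS = {
    "SUTURE_LISTEN": "127.0.0.1:8787",
    "SUTURE_OPENAI_BASE": "https://api.openai.com",
    "SUTURE_ANTHROPIC_BASE": "https://api.anthropic.com",
    "SUTURE_VERTEX_ENABLED": "0",
    "SUTURE_VERTEX_BASE": None,
    "SUTURE_BEDROCK_ENABLED": "0",
    "SUTURE_BEDROCK_BASE": None,
}


def effective_config(env=None):
    """Overlay the given environment (default: os.environ) on the defaults."""
    env = os.environ if env is None else env
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}


config = effective_config({"SUTURE_LISTEN": "0.0.0.0:9000"})
print(config["SUTURE_LISTEN"], config["SUTURE_OPENAI_BASE"])
```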
See `deploy/` for a `Dockerfile` and Cloud Run, ECS/Fargate, and Kubernetes-sidecar manifests, plus operational notes (don't buffer the stream, TLS at the edge, health checks). The sidecar pattern (co-located, localhost) best matches the low-latency design.

OpenAI, Anthropic, GCP Vertex AI, and AWS Bedrock (`ConverseStream`) are supported, with transparent compression handling. Bedrock uses credential-free SigV4 passthrough: the client signs for the real Bedrock host and Suture forwards verbatim, so Suture never sees a reusable AWS secret (the secret key never leaves the client; only a per-request signature transits).
Dual-licensed under either of MIT or Apache-2.0, at your option. Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you shall be dual-licensed as above, without any additional terms.