Show HN: Suture – a reverse proxy that repairs truncated JSON in LLM streams

Suture, an ultra-low-latency reverse proxy, repairs truncated and malformed JSON in LLM streaming responses on the fly to prevent JSONDecodeError and similar parsing failures. The tool sits between applications and providers like OpenAI, Anthropic, and Google Vertex AI, emitting missing characters to make reassembled JSON valid without buffering the stream or adding meaningful latency. Suture is available as a standalone binary or Rust library, requiring no SDK changes, retries, or regenerated tokens.

Ultra-low-latency reverse proxy that repairs truncated and malformed JSON in LLM streaming responses, on the fly.

📝 The story: Why your LLM tool calls silently break, and a ~10µs fix

When an upstream LLM stream is cut off by `max_tokens`, a context-window limit, or a dropped socket, the JSON it was emitting (a tool call's `arguments`, or structured-output content) is left unterminated, and your application throws `JSONDecodeError` / `serde_json` "EOF while parsing" errors. Suture sits between your app and the provider, watches the stream, and emits exactly the missing characters to make the reassembled JSON valid, without buffering the stream or adding meaningful latency.

A tool-call stream truncated at `max_tokens` leaves your client reassembling invalid JSON:

```json
// what the client reassembles from the delta events:
{"city": "Par        // ← unterminated → JSONDecodeError / serde_json: EOF while parsing a string
```

Suture closes it on the wire, so the client gets valid JSON instead (no SDK changes, no retry, no regenerated tokens):

```json
{"city": "Par"}      // ← valid; the string and object are safely closed
```

You're in the right place if your LLM app has thrown any of these on a streaming response:

```
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column …
json.decoder.JSONDecodeError: Expecting value: line 1 column … (char …)
serde_json::Error: EOF while parsing a string / EOF while parsing an object
pydantic_core.ValidationError on a truncated tool-call arguments
```

- Tool / function-call arguments that won't parse when the model hits `max_tokens`
- Truncated structured-output / JSON-mode content across streamed deltas

…on OpenAI, Anthropic, Google Vertex AI (Gemini / Claude), or AWS Bedrock.

- Repairs OpenAI `/v1/chat/completions`, Anthropic `/v1/messages`, GCP Vertex AI (Gemini + Claude-on-Vertex), and AWS Bedrock `ConverseStream` streaming responses.
- SSE-aware: repairs the reassembled tool-call arguments / structured content accumulated across delta events, not just raw wire bytes.
- Streaming + compressed: transparently decodes gzip/brotli/deflate, repairs, and re-encodes per the client's `Accept-Encoding`; never buffers the whole body. Added overhead is ~10 µs per chunk.
- Holds no credentials: your provider API key / bearer token is forwarded verbatim.
- The byte-level repair engine is usable as a standalone library: `cargo add suture-repair` then `use suture::…`, or just the engine via `cargo add suture-repair-core` (`use suture_core::repair_str`).

```sh
cargo install suture-repair   # installs the `suture` binary; or: docker build -t suture .
suture                        # listens on 127.0.0.1:8787
```

Point your SDK's base URL at Suture (your API key still flows through):

```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8787/v1",
    api_key=os.environ["OPENAI_API_KEY"],
)
```

Routes: `POST /v1/chat/completions` → OpenAI, `POST /v1/messages` → Anthropic, `POST /v1/projects/` → Vertex, `POST /model/` → Bedrock (each when enabled), `GET /health`.

Three layers, each independently tested:

- `suture-core`: a byte-level JSON repair state machine. Given any prefix of a valid JSON value, it computes the characters needed to close it (or reports that the input is inconsistent and should pass through untouched). No allocation beyond nesting depth.
- `suture-sse`: an incremental SSE parser + per-provider extractors that reassemble the JSON-bearing field across delta events, drive the core engine, and synthesize a closing event at stream end (before the terminator).
- `suture`: an axum/reqwest reverse proxy. Forwards your request verbatim, then on the response: `text/event-stream` is repaired via the SSE layer; a single `application/json` body is closed with the core engine; anything else streams through unchanged.
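To make the core engine's job concrete, here is a minimal sketch of the "compute the closing characters for a prefix" idea. It is written in Python for brevity, not in the Rust of the actual crates, and it is not Suture's implementation: `closing_suffix` is a hypothetical name, and the real `repair_str` signature is not shown in this document. The sketch only tracks string state and container nesting, so it assumes the prefix was cut inside a string or right after a complete value; dangling commas or colons, partial literals such as `tru`, and partial `\u` escapes are left out.

```python
import json
from typing import List, Optional

# Closing character for each JSON container opener.
CLOSERS = {"{": "}", "[": "]"}


def closing_suffix(prefix: str) -> Optional[str]:
    """Return the characters that close the open string and containers in
    `prefix`, or None if the input looks inconsistent (hypothetical helper)."""
    stack: List[str] = []   # closers still owed, innermost last
    in_string = False       # currently inside a JSON string
    escaped = False         # previous character was a backslash in a string
    for ch in prefix:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in CLOSERS:
            stack.append(CLOSERS[ch])
        elif ch in ("}", "]"):
            # A closer that does not match the innermost opener means the
            # input is not a prefix of valid JSON: signal pass-through.
            if not stack or stack.pop() != ch:
                return None
    if escaped:
        # Cut right after a backslash: this sketch declines to guess.
        return None
    return ('"' if in_string else "") + "".join(reversed(stack))


if __name__ == "__main__":
    # Fragments as a client might accumulate them across delta events.
    fragments = ['{"ci', 'ty": "', "Par"]
    truncated = "".join(fragments)
    suffix = closing_suffix(truncated)
    print(json.loads(truncated + suffix))  # {'city': 'Par'}
```

The only state kept is two flags plus a stack as deep as the current nesting, which mirrors the "no allocation beyond nesting depth" property described above. Returning `None` stands in for the "inconsistent, pass through untouched" outcome.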
| Env var | Default | Purpose |
|---|---|---|
| `SUTURE_LISTEN` | `127.0.0.1:8787` | listen address |
| `SUTURE_OPENAI_BASE` | `https://api.openai.com` | OpenAI upstream |
| `SUTURE_ANTHROPIC_BASE` | `https://api.anthropic.com` | Anthropic upstream |
| `SUTURE_VERTEX_ENABLED` | `0` | enable the Vertex route (host derived from the path) |
| `SUTURE_VERTEX_BASE` | (none) | optional Vertex upstream override |
| `SUTURE_BEDROCK_ENABLED` | `0` | enable the Bedrock route (host from the validated `Host` header) |
| `SUTURE_BEDROCK_BASE` | (none) | optional Bedrock upstream override |

See [`deploy/`](/tensorhq/suture-stream-repair/blob/main/deploy) for a Dockerfile and Cloud Run, ECS/Fargate, and Kubernetes-sidecar manifests, plus operational notes (don't buffer the stream, TLS at the edge, health checks). The sidecar pattern (co-located, localhost) best matches the low-latency design.

OpenAI, Anthropic, GCP Vertex AI, and AWS Bedrock `ConverseStream` are supported, with transparent compression handling. Bedrock uses credential-free SigV4 passthrough: the client signs for the real Bedrock host and Suture forwards verbatim, so Suture never sees a reusable AWS secret (the secret key never leaves the client; only a per-request signature transits).

Dual-licensed under either of [MIT](/tensorhq/suture-stream-repair/blob/main/LICENSE-MIT) or [Apache-2.0](/tensorhq/suture-stream-repair/blob/main/LICENSE-APACHE), at your option. Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you shall be dual-licensed as above, without any additional terms.