Show HN: Suture – a reverse proxy that repairs truncated JSON in LLM streams

Source: github.com · Published: 2026-06-04T03:36Z · Topic: large-language-models

Suture, an ultra-low-latency reverse proxy, repairs truncated and malformed JSON in LLM streaming responses on the fly to prevent JSONDecodeError and similar parsing failures. The tool sits between applications and providers like OpenAI, Anthropic, and Google Vertex AI, emitting missing characters to make reassembled JSON valid without buffering the stream or adding meaningful latency. Suture is available as a standalone binary or Rust library, requiring no SDK changes, retries, or regenerated tokens.

4 min read · Published Jun 4, 2026

Ultra-low-latency reverse proxy that repairs truncated and malformed JSON in LLM streaming responses, on the fly.

The story: Why your LLM tool calls silently break, and a ~10µs fix

When an upstream LLM stream is cut off (by `max_tokens`, a context-window limit, or a dropped socket), the JSON it was emitting (a tool call's `arguments`, or structured-output `content`) is left unterminated, and your application throws `JSONDecodeError` / `serde_json` "EOF while parsing" errors. Suture sits between your app and the provider, watches the stream, and emits exactly the missing characters to make the reassembled JSON valid, without buffering the stream or adding meaningful latency.

A tool-call stream truncated at `max_tokens` leaves your client reassembling invalid JSON:

// what the client reassembles from the delta events:
{"city": "Par      // ← unterminated → JSONDecodeError / serde_json: EOF while parsing a string

Suture closes it on the wire, so the client gets valid JSON instead — no SDK changes, no retry, no regenerated tokens:

{"city": "Par"}    // ← valid; the string and object are safely closed

You're in the right place if your LLM app has thrown any of these on a streaming response:

- `json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column …`
- `json.decoder.JSONDecodeError: Expecting value: line 1 column … (char …)`
- `serde_json::Error: EOF while parsing a string` / `EOF while parsing an object`
- `pydantic_core.ValidationError` on a truncated tool-call `arguments`
- Tool / function-call arguments that won't parse when the model hits `max_tokens`
- Truncated structured-output / JSON-mode `content` across streamed deltas

…on OpenAI, Anthropic, Google Vertex AI (Gemini / Claude), or AWS Bedrock.

- Repairs OpenAI (`/v1/chat/completions`), Anthropic (`/v1/messages`), GCP Vertex AI (Gemini + Claude-on-Vertex), and AWS Bedrock (`ConverseStream`) streaming responses.
- SSE-aware: repairs the reassembled tool-call arguments / structured content accumulated across delta events, not just raw wire bytes.
- Streaming + compressed: transparently decodes gzip/brotli/deflate, repairs, and re-encodes per the client's `Accept-Encoding`; never buffers the whole body. Added overhead is ~10 µs per chunk.
- Holds no credentials: your provider API key / bearer token is forwarded verbatim.
- The byte-level repair engine is usable as a standalone library: `cargo add suture-repair` (then `use suture::…`), or just the engine via `cargo add suture-repair-core` (`use suture_core::repair_str`).
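To make "reassembled across delta events" concrete, here is a minimal Python sketch of how a client accumulates tool-call arguments from an OpenAI-style chat-completions SSE stream. The helper names `reassemble_tool_args` and `sse` are hypothetical and exist only for this illustration; this is not Suture's extractor, and it ignores multi-line `data:` fields and other SSE details.

```python
import json


def reassemble_tool_args(sse_text):
    """Concatenate streamed tool-call argument fragments, keyed by tool-call index."""
    args = {}
    for line in sse_text.splitlines():
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # OpenAI-style stream terminator
        event = json.loads(payload)
        for choice in event.get("choices", []):
            for call in choice.get("delta", {}).get("tool_calls") or []:
                fragment = (call.get("function") or {}).get("arguments") or ""
                index = call.get("index", 0)
                args[index] = args.get(index, "") + fragment
    return args


def sse(fragment):
    """Build one delta event carrying a fragment of the arguments string."""
    chunk = {"choices": [{"delta": {"tool_calls": [
        {"index": 0, "function": {"arguments": fragment}}]}}]}
    return "data: " + json.dumps(chunk) + "\n\n"


# Every event is valid JSON on its own; only the reassembled string is truncated.
stream = sse('{"ci') + sse('ty": "Par') + "data: [DONE]\n\n"
print(reassemble_tool_args(stream))
```

Each individual event parses cleanly here, which is why a repair that looks only at raw wire bytes would find nothing to fix.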

cargo install suture-repair    # installs the `suture` binary; or: docker build -t suture .
suture                         # listens on 127.0.0.1:8787

Point your SDK's base URL at Suture (your API key still flows through):

import os
from openai import OpenAI
# The API key is still sent; Suture forwards it to the provider verbatim.
client = OpenAI(base_url="http://localhost:8787/v1", api_key=os.environ["OPENAI_API_KEY"])

Routes: `POST /v1/chat/completions` → OpenAI, `POST /v1/messages` → Anthropic, `POST /v1/projects/*` → Vertex, `POST /model/*` → Bedrock (each when enabled), `GET /health`.

Three layers, each independently tested:

- suture-core: a byte-level JSON repair state machine. Given any prefix of a valid JSON value, it computes the characters needed to close it (or reports that the input is inconsistent and should pass through untouched). No allocation beyond nesting depth.
- suture-sse: an incremental SSE parser + per-provider extractors that reassemble the JSON-bearing field across delta events, drive the core engine, and synthesize a closing event at stream end (before the terminator).
- suture: an axum/reqwest reverse proxy. Forwards your request verbatim, then on the response: `text/event-stream` is repaired via the SSE layer; a single `application/json` body is closed with the core engine; anything else streams through unchanged.
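The idea behind the core layer (track string state and open containers, then emit the closers) can be sketched in a few lines of Python. This is a simplified illustration, not the actual suture-core algorithm: the function name `closing_suffix` is hypothetical, and as a shortcut the sketch validates its candidate with `json.loads` and gives up (returns `None`) on prefixes it cannot close, such as a dangling `:` or a half-written key, which a real streaming engine would presumably handle without re-parsing.

```python
import json


def closing_suffix(prefix):
    """Return the characters that close `prefix`, or None to pass it through untouched."""
    stack = []          # currently open containers, "{" or "["
    in_string = False
    escaped = False
    for ch in prefix:
        if in_string:
            if escaped:
                escaped = False        # this character is consumed by the escape
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append(ch)
        elif ch in "}]":
            expected = "{" if ch == "}" else "["
            if not stack or stack.pop() != expected:
                return None            # inconsistent input
    if escaped:
        return None                    # cut in the middle of an escape sequence
    suffix = '"' if in_string else ""
    suffix += "".join("}" if c == "{" else "]" for c in reversed(stack))
    try:
        json.loads(prefix + suffix)    # sketch-only safety net, see lead-in
    except json.JSONDecodeError:
        return None
    return suffix


print(repr(closing_suffix('{"city": "Par')))
print(repr(closing_suffix('{"a": [1, {"b": "x')))
print(repr(closing_suffix('{"a": ')))
```

The only state kept while scanning is the two flags and the container stack, which is the sense in which memory grows with nesting depth rather than with input length.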

Env var	Default	Purpose
`SUTURE_LISTEN`	`127.0.0.1:8787`	listen address
`SUTURE_OPENAI_BASE`	`https://api.openai.com`	OpenAI upstream
`SUTURE_ANTHROPIC_BASE`	`https://api.anthropic.com`	Anthropic upstream
`SUTURE_VERTEX_ENABLED`	`0`	enable the Vertex route (host derived from the path)
`SUTURE_VERTEX_BASE`	(none)	optional Vertex upstream override
`SUTURE_BEDROCK_ENABLED`	`0`	enable the Bedrock route (host from the validated `Host` header)
`SUTURE_BEDROCK_BASE`	(none)	optional Bedrock upstream override
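As a rough illustration of how the defaults in this table combine with the two always-on routes, the following Python sketch resolves a request path to an upstream base URL. The helpers `setting` and `upstream_for` are hypothetical and are not part of Suture; the Vertex and Bedrock routes are left out because their hosts are derived from the request, and returning `None` for any other path is an assumption of this sketch.

```python
import os

# Defaults copied from the table; variables without a default are omitted.
DEFAULTS = {
    "SUTURE_LISTEN": "127.0.0.1:8787",
    "SUTURE_OPENAI_BASE": "https://api.openai.com",
    "SUTURE_ANTHROPIC_BASE": "https://api.anthropic.com",
    "SUTURE_VERTEX_ENABLED": "0",
    "SUTURE_BEDROCK_ENABLED": "0",
}


def setting(name, env=None):
    """Read a variable from `env` (os.environ when omitted), falling back to the default."""
    env = os.environ if env is None else env
    return env.get(name, DEFAULTS.get(name))


def upstream_for(path, env=None):
    """Hypothetical mapping from request path to upstream base URL."""
    if path == "/v1/chat/completions":
        return setting("SUTURE_OPENAI_BASE", env)
    if path == "/v1/messages":
        return setting("SUTURE_ANTHROPIC_BASE", env)
    return None  # assumption: other paths are not modelled in this sketch


print(upstream_for("/v1/chat/completions", env={}))
print(upstream_for("/v1/messages", env={"SUTURE_ANTHROPIC_BASE": "http://localhost:9000"}))
```

Overriding a base URL this way is handy for pointing the proxy at a local mock provider during tests.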

See deploy/ for a Dockerfile and Cloud Run, ECS/Fargate, and Kubernetes-sidecar manifests, plus operational notes (don't buffer the stream, TLS at the edge, health checks). The sidecar pattern (co-located, localhost) best matches the low-latency design.

OpenAI, Anthropic, GCP Vertex AI, and AWS Bedrock (`ConverseStream`) are supported, with transparent compression handling. Bedrock uses credential-free SigV4 passthrough: the client signs for the real Bedrock host and Suture forwards verbatim, so Suture never sees a reusable AWS secret (the secret key never leaves the client; only a per-request signature transits).

Dual-licensed under either of MIT or Apache-2.0, at your option. Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you shall be dual-licensed as above, without any additional terms.

Read original on github.com → github.com/tensorhq/suture-stream-repair