{"slug": "show-hn-suture-a-reverse-proxy-that-repairs-truncated-json-in-llm-streams", "title": "Show HN: Suture – a reverse proxy that repairs truncated JSON in LLM streams", "summary": "Suture, an ultra-low-latency reverse proxy, repairs truncated and malformed JSON in LLM streaming responses on the fly to prevent JSONDecodeError and similar parsing failures. The tool sits between applications and providers like OpenAI, Anthropic, and Google Vertex AI, emitting missing characters to make reassembled JSON valid without buffering the stream or adding meaningful latency. Suture is available as a standalone binary or Rust library, requiring no SDK changes, retries, or regenerated tokens.", "body_md": "**Ultra-low-latency reverse proxy that repairs truncated and malformed JSON in LLM streaming responses, on the fly.**\n\n📝 The story: [Why your LLM tool calls silently break — and a ~10µs fix]\n\nWhen an upstream LLM stream is cut off — by `max_tokens`, a context-window limit, or a\ndropped socket — the JSON it was emitting (a tool call's `arguments`, or structured-output\n`content`) is left unterminated, and your application throws `JSONDecodeError` /\n`serde_json` \"EOF while parsing\" errors. 
Suture sits between your app and the provider,\nwatches the stream, and emits exactly the missing characters to make the **reassembled**\nJSON valid — without buffering the stream or adding meaningful latency.\n\nA tool-call stream truncated at `max_tokens` leaves your client reassembling invalid JSON:\n\n```\n// what the client reassembles from the delta events:\n{\"city\": \"Par      // ← unterminated → JSONDecodeError / serde_json: EOF while parsing a string\n```\n\nSuture closes it on the wire, so the client gets valid JSON instead — no SDK changes, no retry, no regenerated tokens:\n\n```\n{\"city\": \"Par\"}    // ← valid; the string and object are safely closed\n```\n\nYou're in the right place if your LLM app has thrown any of these on a streaming response:\n\n- `json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column …`\n- `json.decoder.JSONDecodeError: Expecting value: line 1 column … (char …)`\n- `serde_json::Error: EOF while parsing a string` / `EOF while parsing an object`\n- `pydantic_core.ValidationError` on a truncated tool-call `arguments`\n- Tool / function-call arguments that won't parse when the model hits `max_tokens`\n- Truncated structured-output / JSON-mode `content` across streamed deltas\n\n…on OpenAI, Anthropic, Google Vertex AI (Gemini / Claude), or AWS Bedrock.\n\n- Repairs **OpenAI** (`/v1/chat/completions`), **Anthropic** (`/v1/messages`), **GCP Vertex AI** (Gemini + Claude-on-Vertex), and **AWS Bedrock** (`ConverseStream`) streaming responses.\n- **SSE-aware** — repairs the *reassembled* tool-call arguments / structured content accumulated across delta events, not just raw wire bytes.\n- **Streaming + compressed** — transparently decodes gzip/brotli/deflate, repairs, and re-encodes per the client's `Accept-Encoding`; never buffers the whole body. 
Added overhead is ~10 µs per chunk.\n- **Holds no credentials** — your provider API key / bearer token is forwarded verbatim.\n- The byte-level repair engine is usable as a standalone library: `cargo add suture-repair` (then `use suture::…`), or just the engine via `cargo add suture-repair-core` (`use suture_core::repair_str`).\n\n```\ncargo install suture-repair    # installs the `suture` binary; or: docker build -t suture .\nsuture                         # listens on 127.0.0.1:8787\n```\n\nPoint your SDK's base URL at Suture (your API key still flows through):\n\n```python\nimport os\nfrom openai import OpenAI\nclient = OpenAI(base_url=\"http://localhost:8787/v1\", api_key=os.environ[\"OPENAI_API_KEY\"])\n```\n\nRoutes: `POST /v1/chat/completions` → OpenAI, `POST /v1/messages` → Anthropic,\n`POST /v1/projects/*` → Vertex, `POST /model/*` → Bedrock (each when enabled), `GET /health`.\n\nThree layers, each independently tested:\n\n- `suture-core` — a byte-level JSON repair state machine. Given any prefix of a valid JSON value, it computes the characters needed to close it (or reports that the input is inconsistent and should pass through untouched). No allocation beyond nesting depth.\n- `suture-sse` — an incremental SSE parser + per-provider extractors that reassemble the JSON-bearing field across delta events, drive the core engine, and synthesize a closing event at stream end (before the terminator).\n- `suture` — an axum/reqwest reverse proxy. 
Forwards your request verbatim, then on the response: `text/event-stream` is repaired via the SSE layer; a single `application/json` body is closed with the core engine; anything else streams through unchanged.\n\n| Env var | Default | Purpose |\n|---|---|---|\n| `SUTURE_LISTEN` | `127.0.0.1:8787` | listen address |\n| `SUTURE_OPENAI_BASE` | `https://api.openai.com` | OpenAI upstream |\n| `SUTURE_ANTHROPIC_BASE` | `https://api.anthropic.com` | Anthropic upstream |\n| `SUTURE_VERTEX_ENABLED` | `0` | enable the Vertex route (host derived from the path) |\n| `SUTURE_VERTEX_BASE` | — | optional Vertex upstream override |\n| `SUTURE_BEDROCK_ENABLED` | `0` | enable the Bedrock route (host from the validated `Host` header) |\n| `SUTURE_BEDROCK_BASE` | — | optional Bedrock upstream override |\n\nSee [deploy/](/tensorhq/suture-stream-repair/blob/main/deploy) for a `Dockerfile` and Cloud Run, ECS/Fargate, and\nKubernetes-sidecar manifests, plus operational notes (don't buffer the stream, TLS at the\nedge, health checks). The sidecar pattern (co-located, localhost) best matches the\nlow-latency design.\n\nOpenAI, Anthropic, GCP Vertex AI, and **AWS Bedrock** (`ConverseStream`) are supported, with\ntransparent compression handling. Bedrock uses credential-free **SigV4 passthrough** — the\nclient signs for the real Bedrock host and Suture forwards verbatim, so Suture never sees a\nreusable AWS secret (the secret key never leaves the client; only a per-request signature\ntransits).\n\nDual-licensed under either of [MIT](/tensorhq/suture-stream-repair/blob/main/LICENSE-MIT) or [Apache-2.0](/tensorhq/suture-stream-repair/blob/main/LICENSE-APACHE), at your\noption. 
Unless you explicitly state otherwise, any contribution intentionally submitted for\ninclusion in the work by you shall be dual-licensed as above, without any additional terms.", "url": "https://wpnews.pro/news/show-hn-suture-a-reverse-proxy-that-repairs-truncated-json-in-llm-streams", "canonical_source": "https://github.com/tensorhq/suture-stream-repair", "published_at": "2026-06-04 03:36:32+00:00", "updated_at": "2026-06-04 03:46:24.618531+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-tools", "ai-products"], "entities": ["Suture"], "alternates": {"html": "https://wpnews.pro/news/show-hn-suture-a-reverse-proxy-that-repairs-truncated-json-in-llm-streams", "markdown": "https://wpnews.pro/news/show-hn-suture-a-reverse-proxy-that-repairs-truncated-json-in-llm-streams.md", "text": "https://wpnews.pro/news/show-hn-suture-a-reverse-proxy-that-repairs-truncated-json-in-llm-streams.txt", "jsonld": "https://wpnews.pro/news/show-hn-suture-a-reverse-proxy-that-repairs-truncated-json-in-llm-streams.jsonld"}}