{"slug": "sors-a-rust-proxy-that-reorders-prompts-to-maximize-vllm-prefix-cache-hits", "title": "Sors: a Rust proxy that reorders prompts to maximize vLLM prefix cache hits", "summary": "A new Rust-based reverse proxy called Sors reorders prompt content to maximize prefix cache hits in LLM inference engines like vLLM and SGLang, improving latency by placing static content before dynamic elements. The proxy supports tag-based and auto-detect modes, and benchmarks show significant speedups for cached prompts.", "body_md": "A minimal reverse proxy that reorders prompt content to maximize prefix cache hits in LLM inference engines (vLLM, SGLang, or any OpenAI-compatible backend with prefix caching enabled).\n\nPrefix caching matches cached KV blocks against the request's token sequence starting at position 0: vLLM's Automatic Prefix Caching hashes each block together with the tokens before it, and SGLang keeps a radix tree over token prefixes. If volatile content (timestamps, request IDs) appears before a large static block, everything after it misses the cache on every request.\n\n**sors** intercepts API requests, classifies prompt blocks as static/dynamic/unknown, and reorders them so that stable content sits at the start of the prompt, maximizing cache reuse.\n\n| Mode | Trigger | Mechanism |\n|---|---|---|\n| Tag-based | `<static>`, `<dynamic>`, `<query>` XML tags in content | Explicit extraction + reorder |\n| Auto-detect | No tags, `ENABLE_AUTO_DETECT=true` | SHA-256 fingerprints, hit-count tracking, stability scoring |\n\nOutput order: `[static (longest first)] → [unknown] → [dynamic]`\n\n```\ncargo build --release\n\n# Start the proxy (configure via env vars)\nVLLM_BACKEND=http://localhost:8000 ./target/release/sors\n```\n\nThe proxy listens on port 9000 by default. 
Point your OpenAI client at `http://localhost:9000` instead of the backend directly.\n\nAll settings are configured via environment variables:\n\n| Variable | Default | Description |\n|---|---|---|\n| `VLLM_BACKEND` | `http://localhost:8000` | Backend URL |\n| `PROXY_HOST` | `0.0.0.0` | Bind host |\n| `PROXY_PORT` | `9000` | Listen port |\n| `STABILITY_THRESHOLD` | `0.5` | Min stability score for \"static\" |\n| `MIN_HITS_FOR_STATIC` | `2` | Min times a block must appear |\n| `MIN_BLOCK_LENGTH` | `50` | Min chars to process a block |\n| `MAX_BLOCK_HISTORY` | `10000` | Max fingerprints stored |\n| `BACKEND_TIMEOUT` | `120.0` | HTTP timeout (seconds) |\n| `ENABLE_AUTO_DETECT` | `true` | Auto fingerprint mode |\n| `ENABLE_TAG_MODE` | `true` | XML tag mode |\n| `ENABLE_METRICS` | `true` | Record request metrics |\n| `ENABLE_ORDER_ANNOTATIONS` | `false` | Inject logical order header |\n\n| Method | Path | Description |\n|---|---|---|\n| POST | `/v1/chat/completions` | Optimize messages, forward (streaming supported) |\n| POST | `/v1/completions` | Optimize prompt string, forward |\n| GET | `/health` | Proxy + backend health check |\n| GET | `/stats` | Block engine statistics |\n| GET | `/metrics` | Prometheus text format |\n| GET | `/metrics/json` | JSON metrics summary |\n| * | `/{path}` | Passthrough to backend |\n\nTwo benchmark scripts compare proxy-optimized latency vs direct (unoptimized) requests.\n\n```\npip install requests\n```\n\nYou need a running vLLM backend with prefix caching enabled:\n\n```\nVLLM_CPU_KVCACHE_SPACE=10 VLLM_CPU_OMP_THREADS_BIND=auto \\\n  python -m vllm.entrypoints.openai.api_server \\\n  --model Qwen/Qwen2.5-0.5B --dtype bfloat16 --port 8000 --enable-prefix-caching\n```\n\nThe first script, `tests/test_cache.py`, tests explicit `<static>`, `<dynamic>`, `<query>` tag reordering. 
It sends requests through the proxy (port 9000) and directly to vLLM (port 8000), comparing latency.\n\n```\n# Terminal 1: start the proxy\ncargo run\n\n# Terminal 2: run the benchmark\npython tests/test_cache.py\n```\n\nThe second script, `tests/test_auto_detect.py`, tests the fingerprint-based auto-detection mode. The proxy learns which blocks are static by observing repeated content across requests, then begins reordering automatically.\n\n```\n# Terminal 1: start the proxy\ncargo run\n\n# Terminal 2: run the benchmark\npython tests/test_auto_detect.py\n```\n\nThe auto-detect test has two phases:\n\n- **Learning phase** (requests 1–3): the proxy observes traffic patterns and does not reorder yet\n- **Optimization phase** (requests 4–8): auto-reordering kicks in after blocks are classified as static\n\nBoth tests print per-request timings and a summary with average speedup.\n\n```\nClient → sors (:9000) → Parse → Classify → Reorder → Forward → vLLM (:8000)\n```\n\nThe proxy never buffers streaming responses: it reorders the request before forwarding, then streams backend bytes unchanged.\n\nReordering trades semantic coherence for cache efficiency. sors will perform poorly or degrade output quality when prompts are order-dependent: few-shot examples with interleaved input/output pairs, chain-of-thought reasoning that builds sequentially, code blocks split across paragraphs, structured data (tables, YAML) separated by blank lines, or content with relative references (\"as mentioned above\"). The \"lost in the middle\" attention effect can also hurt if important context gets pushed between a long static prefix and the query. For short prompts (under ~200 tokens) the parse overhead may exceed any cache savings. 
sors works best when static content is genuinely background reference material and the query is self-contained.", "url": "https://wpnews.pro/news/sors-a-rust-proxy-that-reorders-prompts-to-maximize-vllm-prefix-cache-hits", "canonical_source": "https://github.com/flouthoc/sors", "published_at": "2026-06-16 17:15:31+00:00", "updated_at": "2026-06-16 17:25:16.826088+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "developer-tools"], "entities": ["Sors", "vLLM", "SGLang", "Rust", "OpenAI", "Qwen"], "alternates": {"html": "https://wpnews.pro/news/sors-a-rust-proxy-that-reorders-prompts-to-maximize-vllm-prefix-cache-hits", "markdown": "https://wpnews.pro/news/sors-a-rust-proxy-that-reorders-prompts-to-maximize-vllm-prefix-cache-hits.md", "text": "https://wpnews.pro/news/sors-a-rust-proxy-that-reorders-prompts-to-maximize-vllm-prefix-cache-hits.txt", "jsonld": "https://wpnews.pro/news/sors-a-rust-proxy-that-reorders-prompts-to-maximize-vllm-prefix-cache-hits.jsonld"}}