Sors: a Rust proxy that reorders prompts to maximize vLLM prefix cache hits

A new Rust-based reverse proxy called Sors reorders prompt content to maximize prefix cache hits in LLM inference engines like vLLM and SGLang, improving latency by placing static content before dynamic elements. The proxy supports tag-based and auto-detect modes, and benchmarks show significant speedups for cached prompts.

sors is a minimal reverse proxy that reorders prompt content to maximize prefix cache hits in LLM inference engines (vLLM, SGLang, or any OpenAI-compatible backend with prefix caching enabled).

Prefix caches match requests sequentially from token position 0: vLLM's Automatic Prefix Caching hashes each KV block together with the tokens that precede it, and SGLang's RadixAttention stores prefixes in a radix tree. If volatile content (timestamps, request IDs) appears before a large static block, everything after it misses the cache on every request. sors intercepts API requests, classifies prompt blocks as static, dynamic, or unknown, and reorders them so that stable content sits at the front of the prompt, maximizing cache reuse.

| Mode | Trigger | Mechanism |
|---|---|---|
| Tag-based | `<static>`, `<dynamic>`, `<query>` XML tags in content | Explicit extraction + reorder |
| Auto-detect | No tags, `ENABLE_AUTO_DETECT=true` | SHA-256 fingerprints, hit-count tracking, stability scoring |

Output order: static (longest first) → unknown → dynamic

Build the proxy:

    cargo build --release

Start the proxy (configure via env vars):

    VLLM_BACKEND=http://localhost:8000 ./target/release/sors

The proxy listens on port 9000 by default. Point your OpenAI client at `http://localhost:9000` instead of the backend directly.
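
The tag-based mode described above can be illustrated with a small Python sketch. This is not the proxy's actual Rust implementation: the closing-tag syntax (`</static>` and so on), the `reorder` function name, the blank-line joiner, and the choice to place `<query>` blocks last are all assumptions made for illustration. Only the static (longest first) → unknown → dynamic order comes from the description above.

```python
import re

# Assumed tag grammar: <static>...</static>, <dynamic>...</dynamic>, <query>...</query>.
TAG_RE = re.compile(r"<(static|dynamic|query)>(.*?)</\1>", re.DOTALL)


def reorder(content: str) -> str:
    """Hypothetical helper: move static blocks to the front of the prompt."""
    buckets = {"static": [], "dynamic": [], "query": []}
    unknown = []  # untagged text found between or around tagged blocks
    pos = 0
    for match in TAG_RE.finditer(content):
        gap = content[pos:match.start()].strip()
        if gap:
            unknown.append(gap)
        buckets[match.group(1)].append(match.group(2).strip())
        pos = match.end()
    tail = content[pos:].strip()
    if tail:
        unknown.append(tail)

    # Longest static block first; sort is stable, so ties keep their original order.
    static = sorted(buckets["static"], key=len, reverse=True)
    # Assumption: the query goes last, after the dynamic blocks.
    return "\n\n".join(static + unknown + buckets["dynamic"] + buckets["query"])


def build_prompt(timestamp: str) -> str:
    """Example prompt whose volatile timestamp precedes the static manual."""
    return (
        f"<dynamic>Current time: {timestamp}</dynamic>"
        "<static>Reference manual: " + "lorem ipsum " * 20 + "</static>"
        "<query>Summarize section 2.</query>"
    )


first = reorder(build_prompt("2024-01-01T00:00:00Z"))
second = reorder(build_prompt("2024-01-01T00:00:05Z"))
print(first.startswith("Reference manual:"), second.startswith("Reference manual:"))
```

Before reordering, the two prompts differ from the very first characters because the timestamp comes first. After reordering, both start with the same static manual, so a prefix cache can reuse the blocks covering it.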

All settings are configured via environment variables:

| Variable | Default | Description |
|---|---|---|
| `VLLM_BACKEND` | `http://localhost:8000` | Backend URL |
| `PROXY_HOST` | `0.0.0.0` | Bind host |
| `PROXY_PORT` | `9000` | Listen port |
| `STABILITY_THRESHOLD` | `0.5` | Min stability score for "static" |
| `MIN_HITS_FOR_STATIC` | `2` | Min times a block must appear |
| `MIN_BLOCK_LENGTH` | `50` | Min chars to process a block |
| `MAX_BLOCK_HISTORY` | `10000` | Max fingerprints stored |
| `BACKEND_TIMEOUT` | `120.0` | HTTP timeout (seconds) |
| `ENABLE_AUTO_DETECT` | `true` | Auto fingerprint mode |
| `ENABLE_TAG_MODE` | `true` | XML tag mode |
| `ENABLE_METRICS` | `true` | Record request metrics |
| `ENABLE_ORDER_ANNOTATIONS` | `false` | Inject logical order header |

The proxy exposes the following endpoints:

| Method | Path | Description |
|---|---|---|
| POST | `/v1/chat/completions` | Optimize messages, forward (streaming supported) |
| POST | `/v1/completions` | Optimize prompt string, forward |
| GET | `/health` | Proxy + backend health check |
| GET | `/stats` | Block engine statistics |
| GET | `/metrics` | Prometheus text format |
| GET | `/metrics/json` | JSON metrics summary |
| | `/{path}` | Passthrough to backend |

Two benchmark scripts compare proxy-optimized latency against direct, unoptimized requests. They depend on the `requests` package:

    pip install requests

You need a running vLLM backend with prefix caching enabled:

    VLLM_CPU_KVCACHE_SPACE=10 VLLM_CPU_OMP_THREADS_BIND=auto \
    python -m vllm.entrypoints.openai.api_server \
      --model Qwen/Qwen2.5-0.5B --dtype bfloat16 --port 8000 --enable-prefix-caching

The first script tests explicit `<static>`, `<dynamic>`, `<query>` tag reordering. It sends requests through the proxy (port 9000) and directly to vLLM (port 8000), comparing latency.

    # Terminal 1: start the proxy
    cargo run

    # Terminal 2: run the benchmark
    python tests/test_cache.py

The second script tests the fingerprint-based auto-detection mode. The proxy learns which blocks are static by observing repeated content across requests, then begins reordering automatically.
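
As a rough illustration of how fingerprint-based learning could work, here is a minimal Python sketch. The README does not define the stability score, so the formula used here (requests containing the block divided by requests observed) is an assumption, as are the `BlockTracker` class, its method names, and the rule for when a block counts as "dynamic" rather than "unknown". The constants mirror the documented defaults.

```python
import hashlib
from collections import OrderedDict

# Defaults taken from the configuration table.
MIN_BLOCK_LENGTH = 50
MIN_HITS_FOR_STATIC = 2
STABILITY_THRESHOLD = 0.5
MAX_BLOCK_HISTORY = 10000


class BlockTracker:
    """Hypothetical tracker: counts how many requests contained each block."""

    def __init__(self) -> None:
        self.hits = OrderedDict()  # fingerprint -> number of requests seen in
        self.requests = 0

    @staticmethod
    def fingerprint(block: str) -> str:
        return hashlib.sha256(block.strip().encode("utf-8")).hexdigest()

    def observe(self, blocks: list) -> None:
        """Record the blocks of one request, counting each block at most once."""
        self.requests += 1
        seen = {self.fingerprint(b) for b in blocks if len(b) >= MIN_BLOCK_LENGTH}
        for fp in seen:
            self.hits[fp] = self.hits.get(fp, 0) + 1
            self.hits.move_to_end(fp)
        # Evict the least recently seen fingerprints once the history is full.
        while len(self.hits) > MAX_BLOCK_HISTORY:
            self.hits.popitem(last=False)

    def classify(self, block: str) -> str:
        """Classify a block using only previously observed requests."""
        if len(block) < MIN_BLOCK_LENGTH or self.requests < MIN_HITS_FOR_STATIC:
            return "unknown"  # too short to process, or not enough history yet
        hits = self.hits.get(self.fingerprint(block), 0)
        stability = hits / self.requests  # assumed definition of the score
        if hits >= MIN_HITS_FOR_STATIC and stability >= STABILITY_THRESHOLD:
            return "static"
        return "dynamic"


tracker = BlockTracker()
manual = "Reference manual. " * 5
for i in range(3):
    tracker.observe([manual, f"Request {i}: " + "volatile payload " * 4])
print(tracker.classify(manual))
```

Under these assumptions, a block that recurs in every request becomes "static" once enough requests have been seen, while a block that has never appeared before is treated as "dynamic". The real proxy may weigh these signals differently.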

Run the auto-detect benchmark in two terminals:

    # Terminal 1: start the proxy
    cargo run

    # Terminal 2: run the benchmark
    python tests/test_auto_detect.py

The auto-detect test has two phases:

- Learning phase (requests 1–3): the proxy observes traffic patterns, no reordering yet
- Optimization phase (requests 4–8): auto-reordering kicks in after blocks are classified as static

Both tests print per-request timings and a summary with the average speedup.

Request flow:

    Client → sors :9000 → Parse → Classify → Reorder → Forward → vLLM :8000

The proxy never buffers streaming responses: it reorders the prompt before forwarding, then streams the backend's bytes unchanged.

Reordering trades semantic coherence for cache efficiency. sors will perform poorly or degrade output quality when prompts are order-dependent:

- few-shot examples with interleaved input/output pairs
- chain-of-thought reasoning that builds sequentially
- code blocks split across paragraphs
- structured data (tables, YAML) separated by blank lines
- content with relative references ("as mentioned above")

The "lost in the middle" attention effect can also hurt if important context gets pushed between a long static prefix and the query. For short prompts (under ~200 tokens), the parse overhead may exceed any cache savings. sors works best when static content is genuinely background reference material and the query is self-contained.