cd /news/large-language-models/sors-a-rust-proxy-that-reorders-prom… · home topics large-language-models article
[ARTICLE · art-29864] src=github.com ↗ pub= topic=large-language-models verified=true sentiment=↑ positive

Sors: a Rust proxy that reorders prompts to maximize vLLM prefix cache hits

A new Rust-based reverse proxy called Sors reorders prompt content to maximize prefix cache hits in LLM inference engines like vLLM and SGLang, improving latency by placing static content before dynamic elements. The proxy supports tag-based and auto-detect modes, and benchmarks show significant speedups for cached prompts.

read3 min views2 publishedJun 16, 2026

A minimal reverse proxy that reorders prompt content to maximize prefix cache hits in LLM inference engines (vLLM, SGLang, or any OpenAI-compatible backend with prefix caching enabled).

vLLM's Automatic Prefix Caching uses a radix tree keyed on sequential tokens from position 0. If volatile content (timestamps, request IDs) appears before a large static block, the entire downstream prefix is invalidated every request.

sors intercepts API requests, classifies prompt blocks as static/dynamic/unknown, and reorders them to place stable content at the prefix position — maximizing cache reuse.

Mode Trigger Mechanism
Tag-based
<static> , <dynamic> , <query> XML tags in content
Explicit extraction + reorder
Auto-detect
No tags, ENABLE_AUTO_DETECT=true
SHA-256 fingerprints, hit-count tracking, stability scoring

Output order: [static (longest first)] → [unknown] → [dynamic]

cargo build --release

VLLM_BACKEND=http://localhost:8000 ./target/release/sors

The proxy listens on port 9000 by default. Point your OpenAI client at http://localhost:9000

instead of the backend directly.

All settings via environment variables:

Variable Default Description
VLLM_BACKEND
http://localhost:8000
Backend URL
PROXY_HOST
0.0.0.0
Bind host
PROXY_PORT
9000
Listen port
STABILITY_THRESHOLD
0.5
Min stability score for "static"
MIN_HITS_FOR_STATIC
2
Min times a block must appear
MIN_BLOCK_LENGTH
50
Min chars to process a block
MAX_BLOCK_HISTORY
10000
Max fingerprints stored
BACKEND_TIMEOUT
120.0
HTTP timeout (seconds)
ENABLE_AUTO_DETECT
true
Auto fingerprint mode
ENABLE_TAG_MODE
true
XML tag mode
ENABLE_METRICS
true
Record request metrics
ENABLE_ORDER_ANNOTATIONS
false
Inject logical order header
Method Path Description
POST /v1/chat/completions
Optimize messages, forward (streaming supported)
POST /v1/completions
Optimize prompt string, forward
GET /health
Proxy + backend health check
GET /stats
Block engine statistics
GET /metrics
Prometheus text format
GET /metrics/json
JSON metrics summary
* /{path}
Passthrough to backend

Two benchmark scripts compare proxy-optimized latency vs direct (unoptimized) requests.

pip install requests

You need a running vLLM backend with prefix caching enabled:

VLLM_CPU_KVCACHE_SPACE=10 VLLM_CPU_OMP_THREADS_BIND=auto \
  python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-0.5B --dtype bfloat16 --port 8000 --enable-prefix-caching

Tests explicit <static>

, <dynamic>

, <query>

tag reordering. Sends requests through the proxy (port 9000) and directly to vLLM (port 8000), comparing latency.

cargo run

python tests/test_cache.py

Tests the fingerprint-based auto-detection mode. The proxy learns which blocks are static by observing repeated content across requests, then begins reordering automatically.

cargo run

python tests/test_auto_detect.py

The auto-detect test has two phases:

Learning phase(requests 1–3): proxy observes traffic patterns, no reordering yet** Optimization phase**(requests 4–8): auto-reordering kicks in after blocks are classified as static

Both tests print per-request timings and a summary with average speedup.

Client → sors (:9000) → Parse → Classify → Reorder → Forward → vLLM (:8000)

The proxy never buffers streaming responses — it reorders before forwarding, then streams backend bytes unchanged.

Reordering trades semantic coherence for cache efficiency. sors will perform poorly or degrade output quality when prompts are order-dependent: few-shot examples with interleaved input/output pairs, chain-of-thought reasoning that builds sequentially, code blocks split across paragraphs, structured data (tables, YAML) separated by blank lines, or content with relative references ("as mentioned above"). The "lost in the middle" attention effect can also hurt if important context gets pushed between a long static prefix and the query. For short prompts (under ~200 tokens) the parse overhead may exceed any cache savings. sors works best when static content is genuinely background reference material and the query is self-contained.

── more in #large-language-models 4 stories · sorted by recency
── more on @sors 3 stories trending now
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/sors-a-rust-proxy-th…] indexed:0 read:3min 2026-06-16 ·