Sors: a Rust proxy that reorders prompts to maximize vLLM prefix cache hits

wpnews.pro

cd /news/large-language-models/sors-a-rust-proxy-that-reorders-prom… · home › topics › large-language-models › article

[ARTICLE · art-29864] src=github.com ↗ pub=2026-06-16T17:15Z topic=large-language-models verified=true sentiment=↑ positive

Sors: a Rust proxy that reorders prompts to maximize vLLM prefix cache hits

A new Rust-based reverse proxy called Sors reorders prompt content to maximize prefix cache hits in LLM inference engines like vLLM and SGLang, improving latency by placing static content before dynamic elements. The proxy supports tag-based and auto-detect modes, and benchmarks show significant speedups for cached prompts.

read3 min views27 publishedJun 16, 2026

A minimal reverse proxy that reorders prompt content to maximize prefix cache hits in LLM inference engines (vLLM, SGLang, or any OpenAI-compatible backend with prefix caching enabled).

vLLM's Automatic Prefix Caching uses a radix tree keyed on sequential tokens from position 0. If volatile content (timestamps, request IDs) appears before a large static block, the entire downstream prefix is invalidated every request.

sors intercepts API requests, classifies prompt blocks as static/dynamic/unknown, and reorders them to place stable content at the prefix position — maximizing cache reuse.

Mode	Trigger	Mechanism
Tag-based
`<static>` , `<dynamic>` , `<query>` XML tags in content
Explicit extraction + reorder
Auto-detect
No tags, `ENABLE_AUTO_DETECT=true`
SHA-256 fingerprints, hit-count tracking, stability scoring

Output order: [static (longest first)] → [unknown] → [dynamic]

cargo build --release

VLLM_BACKEND=http://localhost:8000 ./target/release/sors

The proxy listens on port 9000 by default. Point your OpenAI client at http://localhost:9000

instead of the backend directly.

All settings via environment variables:

Variable	Default	Description
`VLLM_BACKEND`
`http://localhost:8000`
Backend URL
`PROXY_HOST`
`0.0.0.0`
Bind host
`PROXY_PORT`
`9000`
Listen port
`STABILITY_THRESHOLD`
`0.5`
Min stability score for "static"
`MIN_HITS_FOR_STATIC`
`2`
Min times a block must appear
`MIN_BLOCK_LENGTH`
`50`
Min chars to process a block
`MAX_BLOCK_HISTORY`
`10000`
Max fingerprints stored
`BACKEND_TIMEOUT`
`120.0`
HTTP timeout (seconds)
`ENABLE_AUTO_DETECT`
`true`
Auto fingerprint mode
`ENABLE_TAG_MODE`
`true`
XML tag mode
`ENABLE_METRICS`
`true`
Record request metrics
`ENABLE_ORDER_ANNOTATIONS`
`false`
Inject logical order header

Method	Path	Description
POST	`/v1/chat/completions`
Optimize messages, forward (streaming supported)
POST	`/v1/completions`
Optimize prompt string, forward
GET	`/health`
Proxy + backend health check
GET	`/stats`
Block engine statistics
GET	`/metrics`
Prometheus text format
GET	`/metrics/json`
JSON metrics summary
*	`/{path}`
Passthrough to backend

Two benchmark scripts compare proxy-optimized latency vs direct (unoptimized) requests.

pip install requests

You need a running vLLM backend with prefix caching enabled:

VLLM_CPU_KVCACHE_SPACE=10 VLLM_CPU_OMP_THREADS_BIND=auto \
  python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-0.5B --dtype bfloat16 --port 8000 --enable-prefix-caching

Tests explicit <static>

, <dynamic>

, <query>

tag reordering. Sends requests through the proxy (port 9000) and directly to vLLM (port 8000), comparing latency.

cargo run

python tests/test_cache.py

Tests the fingerprint-based auto-detection mode. The proxy learns which blocks are static by observing repeated content across requests, then begins reordering automatically.

cargo run

python tests/test_auto_detect.py

The auto-detect test has two phases:

Learning phase(requests 1–3): proxy observes traffic patterns, no reordering yet** Optimization phase**(requests 4–8): auto-reordering kicks in after blocks are classified as static

Both tests print per-request timings and a summary with average speedup.

Client → sors (:9000) → Parse → Classify → Reorder → Forward → vLLM (:8000)

The proxy never buffers streaming responses — it reorders before forwarding, then streams backend bytes unchanged.

Reordering trades semantic coherence for cache efficiency. sors will perform poorly or degrade output quality when prompts are order-dependent: few-shot examples with interleaved input/output pairs, chain-of-thought reasoning that builds sequentially, code blocks split across paragraphs, structured data (tables, YAML) separated by blank lines, or content with relative references ("as mentioned above"). The "lost in the middle" attention effect can also hurt if important context gets pushed between a long static prefix and the query. For short prompts (under ~200 tokens) the parse overhead may exceed any cache savings. sors works best when static content is genuinely background reference material and the query is self-contained.

source & further reading

github.com — original article

~/api · this article 200

$curl api.wpnews.pro/v1/news/sors-a-rust-proxy-that-r…

Read original on github.com → github.com/flouthoc/sors

mentioned entities

Sors

vLLM

SGLang

Rust

OpenAI

Qwen

metadata

slugsors-a-rust-proxy-that-reorders-prompts-to-maximize-vllm-prefix-cache-hits

topic#large-language-models

secondary2 topics

sentimentpositive

canonicalgithub.com

navigation

← prevFable 5 Export Ban: ‘Fix This Co…

next →American250 Time Capsule reveale…

── more in #large-language-models 4 stories · sorted by recency

byteiota.com · 2 Aug · #large-language-models

VS Code 1.131: See Your Subagents, Speak Your Code

startupfortune.com · 2 Aug · #large-language-models

DeepSeek's V4-Flash Undercuts OpenAI and Anthropic on Price Again

startupfortune.com · 2 Aug · #large-language-models

AMD's MI355X Undercuts Nvidia's B300 on Cost to Run China's Kimi K3

dev.to · 31 Jul · #large-language-models

Impact of Inference Backends on LLM Reproducibility: Notes from a Research Paper

── more on @sors 3 stories trending now

wpnews · 1 Aug · #ai-products

OpenAI Atlas Shuts Down August 9: Migration Guide

wpnews · 1 Aug · #ai-agents

Quality Isn't Accidental — Maker/Checker Separation and Automated Validation

wpnews · 2 Aug · #developer-tools

Agent-Browser – Browser Automation for AI

sponsored brought to you by zahid.host 4,200+ EU-deployed projects

reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main

→ Live at https://your-agent.zahid.host ✓

Get free account → Pricing

from €0/mo · no card required