A minimal reverse proxy that reorders prompt content to maximize prefix cache hits in LLM inference engines (vLLM, SGLang, or any OpenAI-compatible backend with prefix caching enabled).
vLLM's Automatic Prefix Caching uses a radix tree keyed on sequential tokens from position 0. If volatile content (timestamps, request IDs) appears before a large static block, the entire downstream prefix is invalidated every request.
sors intercepts API requests, classifies prompt blocks as static/dynamic/unknown, and reorders them to place stable content at the prefix position — maximizing cache reuse.
| Mode | Trigger | Mechanism |
|---|---|---|
| Tag-based | ||
<static> , <dynamic> , <query> XML tags in content |
||
| Explicit extraction + reorder | ||
| Auto-detect | ||
No tags, ENABLE_AUTO_DETECT=true |
||
| SHA-256 fingerprints, hit-count tracking, stability scoring |
Output order: [static (longest first)] → [unknown] → [dynamic]
cargo build --release
VLLM_BACKEND=http://localhost:8000 ./target/release/sors
The proxy listens on port 9000 by default. Point your OpenAI client at http://localhost:9000
instead of the backend directly.
All settings via environment variables:
| Variable | Default | Description |
|---|---|---|
VLLM_BACKEND |
||
http://localhost:8000 |
||
| Backend URL | ||
PROXY_HOST |
||
0.0.0.0 |
||
| Bind host | ||
PROXY_PORT |
||
9000 |
||
| Listen port | ||
STABILITY_THRESHOLD |
||
0.5 |
||
| Min stability score for "static" | ||
MIN_HITS_FOR_STATIC |
||
2 |
||
| Min times a block must appear | ||
MIN_BLOCK_LENGTH |
||
50 |
||
| Min chars to process a block | ||
MAX_BLOCK_HISTORY |
||
10000 |
||
| Max fingerprints stored | ||
BACKEND_TIMEOUT |
||
120.0 |
||
| HTTP timeout (seconds) | ||
ENABLE_AUTO_DETECT |
||
true |
||
| Auto fingerprint mode | ||
ENABLE_TAG_MODE |
||
true |
||
| XML tag mode | ||
ENABLE_METRICS |
||
true |
||
| Record request metrics | ||
ENABLE_ORDER_ANNOTATIONS |
||
false |
||
| Inject logical order header |
| Method | Path | Description |
|---|---|---|
| POST | /v1/chat/completions |
|
| Optimize messages, forward (streaming supported) | ||
| POST | /v1/completions |
|
| Optimize prompt string, forward | ||
| GET | /health |
|
| Proxy + backend health check | ||
| GET | /stats |
|
| Block engine statistics | ||
| GET | /metrics |
|
| Prometheus text format | ||
| GET | /metrics/json |
|
| JSON metrics summary | ||
| * | /{path} |
|
| Passthrough to backend |
Two benchmark scripts compare proxy-optimized latency vs direct (unoptimized) requests.
pip install requests
You need a running vLLM backend with prefix caching enabled:
VLLM_CPU_KVCACHE_SPACE=10 VLLM_CPU_OMP_THREADS_BIND=auto \
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-0.5B --dtype bfloat16 --port 8000 --enable-prefix-caching
Tests explicit <static>
, <dynamic>
, <query>
tag reordering. Sends requests through the proxy (port 9000) and directly to vLLM (port 8000), comparing latency.
cargo run
python tests/test_cache.py
Tests the fingerprint-based auto-detection mode. The proxy learns which blocks are static by observing repeated content across requests, then begins reordering automatically.
cargo run
python tests/test_auto_detect.py
The auto-detect test has two phases:
Learning phase(requests 1–3): proxy observes traffic patterns, no reordering yet** Optimization phase**(requests 4–8): auto-reordering kicks in after blocks are classified as static
Both tests print per-request timings and a summary with average speedup.
Client → sors (:9000) → Parse → Classify → Reorder → Forward → vLLM (:8000)
The proxy never buffers streaming responses — it reorders before forwarding, then streams backend bytes unchanged.
Reordering trades semantic coherence for cache efficiency. sors will perform poorly or degrade output quality when prompts are order-dependent: few-shot examples with interleaved input/output pairs, chain-of-thought reasoning that builds sequentially, code blocks split across paragraphs, structured data (tables, YAML) separated by blank lines, or content with relative references ("as mentioned above"). The "lost in the middle" attention effect can also hurt if important context gets pushed between a long static prefix and the query. For short prompts (under ~200 tokens) the parse overhead may exceed any cache savings. sors works best when static content is genuinely background reference material and the query is self-contained.