
IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

Researchers released IndexCache, a patch for SGLang and vLLM that accelerates sparse attention in DeepSeek-V3.2 and GLM-5 models by reusing index computations across layers, achieving up to 1.82× prefill and 1.48× decode speedups with negligible quality loss.

Published Jun 17, 2026 · 3 min read

This repository provides a patch for SGLang and vLLM that enables IndexCache inference acceleration for models using DeepSeek Sparse Attention (DSA), including DeepSeek-V3.2 and GLM-5.

TL;DR: IndexCache eliminates up to 75% of indexer computations in DSA through cross-layer index reuse, achieving up to 1.82× prefill speedup and 1.48× decode speedup with negligible quality degradation. One if/else branch, zero extra GPU memory.

In DSA, the lightning indexer selects the top-k most relevant tokens at each layer to make attention sparse. While cheap per FLOP, it runs independently at every layer with O(L²) complexity in the context length L. At long context lengths, it becomes the dominant bottleneck:

At 200K context, the indexer consumes 81% of prefill time.
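
As a rough back-of-the-envelope check, Amdahl's law gives an upper bound on the prefill speedup when part of the indexer work is removed. The sketch below assumes the 81% share quoted above and the 75% reduction from the TL;DR describe the same setup, which the source does not state; the function name is illustrative.

```python
def amdahl_speedup(indexer_share: float, removed_fraction: float) -> float:
    """Upper-bound speedup when `removed_fraction` of a component that takes
    `indexer_share` of total time is eliminated and nothing else changes."""
    if not (0.0 <= indexer_share <= 1.0 and 0.0 <= removed_fraction <= 1.0):
        raise ValueError("both arguments must lie in [0, 1]")
    remaining = 1.0 - indexer_share * removed_fraction
    if remaining == 0.0:
        return float("inf")  # the whole runtime was removed
    return 1.0 / remaining


# Assumed inputs: 81% indexer share, 75% of indexer work removed.
bound = amdahl_speedup(indexer_share=0.81, removed_fraction=0.75)
print(f"{bound:.2f}x")  # 2.55x
```

Under these assumptions the bound is about 2.5×. A measured speedup should land below it, since the bound ignores every cost other than the removed indexer work.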

We measured pairwise top-k index overlap across all 47 DSA layers and found that adjacent layers share 70–100% of their selected tokens:

Cross-layer top-k overlap heatmap. Most indexer computations are redundant.
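
The overlap measurement can be sketched as follows. This is a minimal illustration, not the authors' measurement code: `topk_indices` is a hypothetical list holding, for each layer, the token positions chosen by that layer's indexer for one query, and overlap is defined here as the shared positions divided by the size of the first layer's selection.

```python
from typing import List, Sequence


def pairwise_overlap(topk_indices: Sequence[Sequence[int]]) -> List[List[float]]:
    """Return a square matrix whose entry [i][j] is |S_i & S_j| / |S_i|,
    where S_i is the set of token positions selected at layer i."""
    sets = [set(layer) for layer in topk_indices]
    matrix = []
    for a in sets:
        row = []
        for b in sets:
            # An empty selection is treated as zero overlap by convention.
            row.append(len(a & b) / len(a) if a else 0.0)
        matrix.append(row)
    return matrix


# Toy example: three layers, k = 4 (values are made up for illustration).
toy = [[0, 3, 7, 9], [0, 3, 7, 8], [1, 2, 7, 9]]
overlap = pairwise_overlap(toy)
print(overlap[0][1])  # 0.75: layers 0 and 1 share 3 of 4 positions
```

In a real measurement the per-layer selections would come from the model's indexer outputs, averaged over many queries and prompts.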

IndexCache partitions layers into Full (F) layers that retain their indexer and Shared (S) layers that reuse the nearest F layer's cached indices.
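
The Full/Shared split amounts to one branch per layer. The sketch below is a toy model of that control flow, assuming a Shared layer reuses the indices of the closest preceding Full layer; `run_indexer` and `sparse_attention` are hypothetical stand-ins, not SGLang or vLLM functions.

```python
from typing import Callable, List, Optional


def forward_with_index_cache(
    pattern: str,
    run_indexer: Callable[[int], List[int]],
    sparse_attention: Callable[[int, List[int]], None],
) -> int:
    """Walk the layers in order and return how many times the indexer ran."""
    cached: Optional[List[int]] = None
    indexer_calls = 0
    for layer_id, kind in enumerate(pattern):
        if kind == "F":
            # Full layer: compute fresh top-k indices and cache them.
            cached = run_indexer(layer_id)
            indexer_calls += 1
        elif kind == "S":
            # Shared layer: skip the indexer and reuse the cached indices.
            if cached is None:
                raise ValueError("a Shared layer needs an earlier Full layer")
        else:
            raise ValueError(f"unexpected layer kind {kind!r}")
        sparse_attention(layer_id, cached)
    return indexer_calls


used = []
calls = forward_with_index_cache(
    "FSSSFSSS",
    run_indexer=lambda i: [i, i + 1],  # fake indices, for illustration only
    sparse_attention=lambda i, idx: used.append((i, tuple(idx))),
)
print(calls)  # 2 indexer runs for 8 layers
```

In this toy run, layers 1 to 3 receive the indices computed at layer 0, and layers 5 to 7 receive those computed at layer 4.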

We propose two complementary approaches:

Approach | Description | Requires Training?
Training-free | Greedy search selects which indexers to remove based on LM loss on a calibration set | No
Training-aware | Multi-layer distillation trains each retained indexer to serve all layers it covers | Yes

Both retain only 1/4 of indexers with negligible quality degradation.
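
One way to picture the training-free variant is a greedy loop that removes one indexer at a time, each time picking the removal that hurts calibration loss least. The sketch below is an assumption about the general shape of such a search, not the authors' procedure; `calibration_loss` is a hypothetical callable that would evaluate LM loss for a given F/S pattern.

```python
from typing import Callable


def greedy_pattern(
    num_layers: int,
    keep: int,
    calibration_loss: Callable[[str], float],
) -> str:
    """Start with every layer Full and greedily convert layers to Shared
    until only `keep` Full layers remain. Layer 0 always stays Full."""
    if not 1 <= keep <= num_layers:
        raise ValueError("keep must be between 1 and num_layers")
    pattern = ["F"] * num_layers
    while pattern.count("F") > keep:
        best_layer, best_loss = None, float("inf")
        for i in range(1, num_layers):
            if pattern[i] != "F":
                continue
            trial = pattern.copy()
            trial[i] = "S"
            loss = calibration_loss("".join(trial))
            if best_layer is None or loss < best_loss:
                best_layer, best_loss = i, loss
        pattern[best_layer] = "S"
    return "".join(pattern)


def toy_loss(pattern: str) -> float:
    # Made-up loss: sharing an even-numbered layer costs more than an odd one.
    return sum(1.0 if i % 2 == 0 else 0.1 for i, c in enumerate(pattern) if c == "S")


print(greedy_pattern(num_layers=8, keep=4, calibration_loss=toy_loss))  # FSFSFSFS
```

Each round evaluates the loss once per remaining Full layer, so a search of this shape needs on the order of L² loss evaluations for L layers.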

Metric | Baseline | IndexCache (1/4) | Speedup
Prefill (200K) | 19.5 s | 10.7 s | 1.82×
Decode (200K) | 58 tok/s | 86 tok/s | 1.48×

9 benchmarks virtually unchanged ✅

~1.2× E2E speedup with negligible degradation across 10 benchmarks (long-context + reasoning).

For SGLang:

git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout b638b25b

This patch is built and tested against commit b638b25b. It may apply cleanly to newer versions, but if you encounter conflicts, use this specific commit.

For vLLM:

git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout 4508532fb

Then apply the patch from inside the checked-out repository:

git apply /path/to/indexcache.patch  # for vLLM, use /path/to/indexcache_vllm.patch

Configure via --json-model-override-args for SGLang or --hf-overrides for vLLM. There are two options (the examples use SGLang):

Every N-th layer keeps its indexer:

python -m sglang.launch_server \
    --model-path zai-org/GLM-5-FP8 \
    --json-model-override-args '{"index_topk_freq": 2}' \
    ...  # your other args (tp, dp, etc.)

index_topk_freq=2 → every 2nd layer is Full, the rest are Shared (50% of indexers removed).
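
To make the frequency setting concrete, the helper below expands `index_topk_freq` into an F/S string. It assumes layer `i` (counting from 0) is Full when `i % N == 0`; the patch may number layers differently, so treat the exact offsets as illustrative.

```python
def pattern_from_freq(num_layers: int, index_topk_freq: int) -> str:
    """Expand a keep-every-N setting into a per-layer F/S string.

    Assumption: layer 0 is Full, then every N-th layer after it.
    """
    if num_layers < 0:
        raise ValueError("num_layers must be non-negative")
    if index_topk_freq < 1:
        raise ValueError("index_topk_freq must be at least 1")
    return "".join(
        "F" if i % index_topk_freq == 0 else "S" for i in range(num_layers)
    )


print(pattern_from_freq(8, 1))  # FFFFFFFF (every layer keeps its indexer)
print(pattern_from_freq(8, 2))  # FSFSFSFS
print(pattern_from_freq(8, 4))  # FSSSFSSS
```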

Specify per-layer F/S assignment:

python -m sglang.launch_server \
    --model-path zai-org/GLM-5-FP8 \
    --json-model-override-args '{"index_topk_pattern": "FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSS"}' \
    ...  # your other args

Each character maps to one DSA layer: F = Full (runs indexer), S = Shared (reuses cached indices).
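
Before launching a server it can help to sanity-check a hand-written pattern. The checker below is a sketch with assumed rules: the string must contain only `F` and `S`, match the number of DSA layers, and start with `F` so that the first Shared layer has indices to reuse. Whether the patch enforces the same rules is not stated in the source.

```python
from collections import Counter


def summarize_pattern(pattern: str, num_dsa_layers: int) -> dict:
    """Validate an F/S pattern and report how many indexers it keeps."""
    bad = set(pattern) - {"F", "S"}
    if bad:
        raise ValueError(f"unexpected characters in pattern: {sorted(bad)}")
    if len(pattern) != num_dsa_layers:
        raise ValueError(
            f"pattern has {len(pattern)} entries, expected {num_dsa_layers}"
        )
    if not pattern.startswith("F"):
        raise ValueError("the first layer must be Full")
    counts = Counter(pattern)
    return {
        "full": counts["F"],
        "shared": counts["S"],
        "removed_fraction": counts["S"] / len(pattern),
    }


print(summarize_pattern("FSSSFSSS", 8))
# {'full': 2, 'shared': 6, 'removed_fraction': 0.75}
```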

Default behavior: When neither parameter is set, all layers run their indexer, which is identical to standard DSA.

Parameter | Type | Default | Description
index_topk_freq | int | 1 | Keep indexer every N layers. 1 = disabled, 4 = keep 1/4
index_topk_pattern | string | null | Per-layer F/S pattern. Overrides index_topk_freq if set
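
The JSON string passed on the command line can be generated rather than typed by hand. The helper below is a convenience sketch (the function name is hypothetical); it mirrors the precedence described in the table by emitting only the pattern when both settings are given, and emits an empty object when the defaults are in effect.

```python
import json
from typing import Optional


def override_json(
    index_topk_freq: int = 1, index_topk_pattern: Optional[str] = None
) -> str:
    """Build the JSON override string for the two IndexCache parameters."""
    args = {}
    if index_topk_pattern is not None:
        # The pattern overrides the frequency, so the frequency is omitted.
        args["index_topk_pattern"] = index_topk_pattern
    elif index_topk_freq != 1:
        args["index_topk_freq"] = index_topk_freq
    return json.dumps(args)


print(override_json(index_topk_freq=4))  # {"index_topk_freq": 4}
print(override_json(index_topk_freq=4, index_topk_pattern="FSSF"))
# {"index_topk_pattern": "FSSF"}
```

The resulting string is what would be quoted after --json-model-override-args (SGLang) or --hf-overrides (vLLM).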

Which to use?

index_topk_freq=4: Simple, good default. Best with training-aware models.

Custom pattern: Optimal for training-free deployment. The example above is the greedy-searched pattern for the GLM-5 model.

Model | Architecture | Supported
DeepSeek-V3.2 | DeepseekV32ForCausalLM | Yes
GLM-5 (744B) | GlmMoeDsaForCausalLM | Yes

Any model that uses the DSA indexer benefits from this patch.

@article{bai2025indexcache,
  title={IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse},
  author={Bai, Yushi and Dong, Qian and Jiang, Ting and Lv, Xin and Du, Zhengxiao and Zeng, Aohan and Tang, Jie and Li, Juanzi},
  journal={arXiv preprint arXiv:2603.12201},
  year={2026}
}

This patch is released under the Apache 2.0 License, consistent with SGLang.
