IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

Researchers released IndexCache, a patch for SGLang and vLLM that accelerates sparse attention in DeepSeek-V3.2 and GLM-5 models by reusing index computations across layers, achieving up to 1.82× prefill and 1.48× decode speedups with negligible quality loss.

This repository provides a patch for [SGLang](https://github.com/sgl-project/sglang) and [vLLM](https://github.com/vllm-project/vllm) that enables IndexCache inference acceleration for models using DeepSeek Sparse Attention (DSA), including DeepSeek-V3.2 and GLM-5.

TL;DR: IndexCache eliminates up to 75% of indexer computations in DSA through cross-layer index reuse, achieving up to 1.82× prefill speedup and 1.48× decode speedup with negligible quality degradation. One if/else branch, zero extra GPU memory.

In DSA, the lightning indexer selects the top-k most relevant tokens at each layer to make attention sparse. While cheap per FLOP, it runs independently at every layer with O(L²) complexity in the context length L. At long context lengths, it becomes the dominant bottleneck:

Figure: At 200K context, the indexer consumes 81% of prefill time.

We measured pairwise top-k index overlap across all 47 DSA layers and found that adjacent layers share 70–100% of their selected tokens:

Figure: Cross-layer top-k overlap heatmap. Most indexer computations are redundant.

IndexCache partitions layers into Full (F) layers that retain their indexer and Shared (S) layers that reuse the nearest F layer's cached indices.

We propose two complementary approaches:

| Approach | Description | Requires Training? |
|---|---|---|
| Training-free | Greedy search selects which indexers to remove based on LM loss on a calibration set | ✗ |
| Training-aware | Multi-layer distillation trains each retained indexer to serve all layers it covers | ✓ |

Both retain only 1/4 of indexers with negligible quality degradation.

| | Baseline | IndexCache (1/4) | Speedup |
|---|---|---|---|
| Prefill (200K) | 19.5s | 10.7s | 1.82× |
| Decode (200K) | 58 tok/s | 86 tok/s | 1.48× |

9 benchmarks: virtually unchanged ✅

~1.2× E2E speedup with negligible degradation across 10 benchmarks (long-context + reasoning).
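The Full/Shared split described above can be illustrated with a small, framework-free sketch. This is not the patch's actual code: the function names (`top_k_indices`, `run_layers`, `overlap`) are hypothetical, indexer outputs are plain Python lists of scores, and "nearest F layer" is assumed to mean the most recent preceding F layer.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for a layer's indexer: returns one score per token.
Indexer = Callable[[], Sequence[float]]


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the positions of the k highest scores, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def run_layers(pattern: str, indexers: List[Indexer], k: int) -> Tuple[List[List[int]], int]:
    """Walk the layers; F layers run their indexer, S layers reuse the cache.

    Assumption: an S layer reuses the indices of the closest preceding F layer,
    so the first layer must be F.
    """
    if len(pattern) != len(indexers):
        raise ValueError("pattern needs one character per layer")
    if not pattern or pattern[0] != "F":
        raise ValueError("the first layer must be Full (F)")

    cached: List[int] = []
    per_layer: List[List[int]] = []
    indexer_calls = 0
    for kind, indexer in zip(pattern, indexers):
        if kind == "F":
            cached = top_k_indices(indexer(), k)  # recompute and refresh the cache
            indexer_calls += 1
        elif kind != "S":
            raise ValueError(f"unknown layer kind: {kind!r}")
        # S layers fall through and reuse `cached` unchanged.
        per_layer.append(cached)
    return per_layer, indexer_calls


def overlap(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of the indices in `a` that also appear in `b`."""
    return len(set(a) & set(b)) / len(a)


if __name__ == "__main__":
    layer_scores = [
        [0.1, 0.9, 0.3, 0.8, 0.2],
        [0.2, 0.8, 0.4, 0.9, 0.1],
        [0.7, 0.6, 0.1, 0.9, 0.2],
        [0.6, 0.7, 0.2, 0.8, 0.1],
    ]
    indexers = [lambda s=s: s for s in layer_scores]
    selected, calls = run_layers("FSFS", indexers, k=2)
    print(selected, "indexer calls:", calls)
```

In this toy setup the pattern `FSFS` runs two of the four indexers. The `overlap` helper mirrors the kind of pairwise comparison behind the heatmap, applied here to toy data only.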
For SGLang:

    git clone https://github.com/sgl-project/sglang.git
    cd sglang
    git checkout b638b25b

This patch is built and tested against commit b638b25b. It may apply cleanly to newer versions, but if you encounter conflicts, use this specific commit.

For vLLM:

    git clone https://github.com/vllm-project/vllm.git
    cd vllm
    git checkout 4508532fb

Then apply the patch:

    git apply /path/to/indexcache.patch  # /path/to/indexcache_vllm.patch for the vLLM patch

Configure via `--json-model-override-args` for SGLang or `--hf-overrides` for vLLM. There are two options (taking SGLang as the example).

Every N-th layer keeps its indexer:

    python -m sglang.launch_server \
      --model-path zai-org/GLM-5-FP8 \
      --json-model-override-args '{"index_topk_freq": 2}' \
      ... your other args (tp, dp, etc.)

`index_topk_freq=2` → every 2nd layer is Full, the rest are Shared (50% of indexers removed).

Specify per-layer F/S assignment:

    python -m sglang.launch_server \
      --model-path zai-org/GLM-5-FP8 \
      --json-model-override-args '{"index_topk_pattern": "FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSS"}' \
      ... your other args

Each character maps to one DSA layer: F = Full (runs indexer), S = Shared (reuses cached indices).

Default behavior: When neither parameter is set, all layers run their indexer, which is identical to standard DSA.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `index_topk_freq` | int | 1 | Keep indexer every N layers. 1 = disabled, 4 = keep 1/4 |
| `index_topk_pattern` | string | null | Per-layer F/S pattern. Overrides `index_topk_freq` if set |

Which to use?

- `index_topk_freq: 4`: Simple, good default. Best with training-aware models.
- Custom pattern: Optimal for training-free deployment. The example above is the greedy-searched pattern for the GLM-5 model.

| Model | Architecture | Supported |
|---|---|---|
| DeepSeek-V3.2 | DeepseekV32ForCausalLM | ✅ |
| GLM-5 (744B) | GlmMoeDsaForCausalLM | ✅ |

Any model using the DSA indexer benefits from this patch.
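To see how the two configuration options relate, the sketch below expands a frequency into an explicit F/S string and builds the JSON override value. The helper names (`pattern_from_freq`, `removed_fraction`, `override_args`) are hypothetical and not part of the patch. The rule that layers 0, N, 2N, ... are the Full ones is an assumption: the patch may place its Full layers differently.

```python
import json


def pattern_from_freq(num_layers: int, freq: int) -> str:
    """Expand a keep-every-N setting into an explicit F/S string.

    Assumption: layers 0, N, 2N, ... keep their indexer, so the first layer
    is always Full and every Shared layer has a preceding Full layer.
    """
    if num_layers <= 0 or freq < 1:
        raise ValueError("num_layers must be positive and freq must be >= 1")
    return "".join("F" if i % freq == 0 else "S" for i in range(num_layers))


def removed_fraction(pattern: str) -> float:
    """Fraction of layers whose indexer is skipped (the S layers)."""
    if not pattern or set(pattern) - {"F", "S"}:
        raise ValueError("pattern must be a non-empty string of 'F' and 'S'")
    return pattern.count("S") / len(pattern)


def override_args(**params) -> str:
    """Serialize parameters as the JSON string passed on the command line."""
    return json.dumps(params)


if __name__ == "__main__":
    for freq in (1, 2, 4):
        pattern = pattern_from_freq(8, freq)
        print(freq, pattern, f"{removed_fraction(pattern):.0%} of indexers removed")
    print(override_args(index_topk_freq=4))
```

Under that assumption, a frequency of 2 removes half of the indexers and a frequency of 4 removes three quarters, matching the figures quoted above. `removed_fraction` can also be applied to a hand-written `index_topk_pattern` string to check how many indexers it skips.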
    @article{bai2025indexcache,
      title={IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse},
      author={Bai, Yushi and Dong, Qian and Jiang, Ting and Lv, Xin and Du, Zhengxiao and Zeng, Aohan and Tang, Jie and Li, Juanzi},
      journal={arXiv preprint arXiv:2603.12201},
      year={2025}
    }

This patch is released under the Apache 2.0 License (see the LICENSE file in the repository), consistent with SGLang.