{"slug": "indexcache-accelerating-sparse-attention-via-cross-layer-index-reuse", "title": "IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse", "summary": "Researchers released IndexCache, a patch for SGLang and vLLM that accelerates sparse attention in DeepSeek-V3.2 and GLM-5 models by reusing index computations across layers, achieving up to 1.82× prefill and 1.48× decode speedups with negligible quality loss.", "body_md": "This repository provides a patch for [SGLang](https://github.com/sgl-project/sglang) and [vLLM](https://github.com/vllm-project/vllm) that enables **IndexCache** inference acceleration for models using DeepSeek Sparse Attention (DSA), including **DeepSeek-V3.2** and **GLM-5**.\n\nTL;DR: IndexCache eliminates up to 75% of indexer computations in DSA through cross-layer index reuse, achieving up to 1.82× prefill speedup and 1.48× decode speedup with negligible quality degradation. One `if/else` branch, zero extra GPU memory.\n\nIn DSA, the *lightning indexer* selects the top-k most relevant tokens at each layer to make attention sparse. While cheap per FLOP, it runs independently at **every** layer with O(L²) complexity in the sequence length L. At long context lengths, it becomes the dominant bottleneck:\n\n[Figure](https://github.com/THUDM/IndexCache/blob/main/assets/figure_profiling.png)\n\n*At 200K context, the indexer consumes 81% of prefill time.*\n\nWe measured pairwise top-k index overlap across all 47 DSA layers and found that **adjacent layers share 70–100% of their selected tokens**:\n\n[Figure](https://github.com/THUDM/IndexCache/blob/main/assets/figure_results.png)\n\n*Cross-layer top-k overlap heatmap. Most indexer computations are redundant.*\n\nIndexCache partitions layers into **Full** (F) layers that retain their indexer and **Shared** (S) layers that reuse the nearest F layer's cached indices.\n\nWe propose two complementary approaches:\n\n| Approach | Description | Requires Training? 
|\n|---|---|---|\n| Training-free | Greedy search selects which indexers to remove based on LM loss on a calibration set | ✗ |\n| Training-aware | Multi-layer distillation trains each retained indexer to serve all layers it covers | ✓ |\n\nBoth retain only **1/4 of indexers** with negligible quality degradation.\n\n| | Baseline | IndexCache (1/4) | Speedup |\n|---|---|---|---|\n| Prefill (200K) | 19.5s | 10.7s | 1.82× |\n| Decode (200K) | 58 tok/s | 86 tok/s | 1.48× |\n\n9 benchmarks virtually unchanged ✅\n\n**~1.2× E2E speedup** with negligible degradation across 10 benchmarks (long-context + reasoning).\n\n```\ngit clone https://github.com/sgl-project/sglang.git\ncd sglang\ngit checkout b638b25b\n```\n\nThis patch is built and tested against commit `b638b25b`. It may apply cleanly to newer versions, but if you encounter conflicts, use this specific commit.\n\nFor vLLM:\n\n```\ngit clone https://github.com/vllm-project/vllm.git\ncd vllm\ngit checkout 4508532fb\ngit apply /path/to/indexcache.patch # for vLLM, use /path/to/indexcache_vllm.patch instead\n```\n\nConfigure via `--json-model-override-args` for SGLang or `--hf-overrides` for vLLM. There are two options (SGLang is used as the example):\n\nEvery N-th layer keeps its indexer:\n\n```\npython -m sglang.launch_server \\\n    --model-path zai-org/GLM-5-FP8 \\\n    --json-model-override-args '{\"index_topk_freq\": 2}' \\\n    ...  # your other args (tp, dp, etc.)\n```\n\n`index_topk_freq=2` → every 2nd layer is Full, the rest are Shared (50% of indexers removed).\n\nSpecify per-layer F/S assignment:\n\n```\npython -m sglang.launch_server \\\n    --model-path zai-org/GLM-5-FP8 \\\n    --json-model-override-args '{\"index_topk_pattern\": \"FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSS\"}' \\\n    ...  
# your other args\n```\n\nEach character maps to one DSA layer: `F` = Full (runs indexer), `S` = Shared (reuses cached indices).\n\nDefault behavior: When neither parameter is set, all layers run their indexer, identical to standard DSA.\n\n| Parameter | Type | Default | Description |\n|---|---|---|---|\n| `index_topk_freq` | int | `1` | Keep indexer every N layers. `1` = disabled, `4` = keep 1/4 |\n| `index_topk_pattern` | string | `null` | Per-layer F/S pattern. Overrides `index_topk_freq` if set |\n\n**Which to use?**\n\n`index_topk_freq: 4`: Simple, good default. Best with training-aware models.\n\n**Custom pattern**: Optimal for training-free deployment. The example above is the greedy-searched pattern for the GLM-5 model.\n\n| Model | Architecture | Supported |\n|---|---|---|\n| DeepSeek-V3.2 | `DeepseekV32ForCausalLM` | ✅ |\n| GLM-5 (744B) | `GlmMoeDsaForCausalLM` | ✅ |\n\nAny model that uses the DSA indexer benefits from this patch.\n\n```\n@article{bai2025indexcache,\n  title={IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse},\n  author={Bai, Yushi and Dong, Qian and Jiang, Ting and Lv, Xin and Du, Zhengxiao and Zeng, Aohan and Tang, Jie and Li, Juanzi},\n  journal={arXiv preprint arXiv:2603.12201},\n  year={2025}\n}\n```\n\nThis patch is released under the [Apache 2.0 License](https://github.com/THUDM/IndexCache/blob/main/LICENSE), consistent with SGLang.", "url": "https://wpnews.pro/news/indexcache-accelerating-sparse-attention-via-cross-layer-index-reuse", "canonical_source": "https://github.com/THUDM/IndexCache", "published_at": "2026-06-17 12:00:31+00:00", "updated_at": "2026-06-17 12:23:08.299806+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-research", "machine-learning", "generative-ai"], "entities": ["SGLang", "vLLM", "DeepSeek-V3.2", "GLM-5", "DeepSeek", "THUDM"], "alternates": {"html": "https://wpnews.pro/news/indexcache-accelerating-sparse-attention-via-cross-layer-index-reuse", "markdown": 
"https://wpnews.pro/news/indexcache-accelerating-sparse-attention-via-cross-layer-index-reuse.md", "text": "https://wpnews.pro/news/indexcache-accelerating-sparse-attention-via-cross-layer-index-reuse.txt", "jsonld": "https://wpnews.pro/news/indexcache-accelerating-sparse-attention-via-cross-layer-index-reuse.jsonld"}}