IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse


[ARTICLE · art-30988] src=github.com ↗ pub=2026-06-17T12:00Z topic=large-language-models verified=true sentiment=↑ positive

Researchers released IndexCache, a patch for SGLang and vLLM that accelerates sparse attention in DeepSeek-V3.2 and GLM-5 models by reusing index computations across layers, achieving up to 1.82× prefill and 1.48× decode speedups with negligible quality loss.

Published Jun 17, 2026 · 3 min read

This repository provides a patch for SGLang and vLLM that enables IndexCache inference acceleration for models using DeepSeek Sparse Attention (DSA), including DeepSeek-V3.2 and GLM-5.

TL;DR: IndexCache eliminates up to 75% of indexer computations in DSA through cross-layer index reuse, achieving up to 1.82× prefill speedup and 1.48× decode speedup with negligible quality degradation. One if/else branch, zero extra GPU memory.

In DSA, the lightning indexer selects the top-k most relevant tokens at each layer to make attention sparse. Although the indexer is cheap per FLOP, it runs independently at every layer with O(L²) complexity in the context length L. At long context lengths, it becomes the dominant bottleneck:

At 200K context, the indexer consumes 81% of prefill time.

We measured pairwise top-k index overlap across all 47 DSA layers and found that adjacent layers share 70–100% of their selected tokens:

Cross-layer top-k overlap heatmap. Most indexer computations are redundant.
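A minimal sketch of how such an overlap could be measured, assuming each layer's selected token positions for one query are available as a list of integers. The function names and the toy data are hypothetical and are not taken from the IndexCache code.

```python
from itertools import combinations


def topk_overlap(indices_a, indices_b):
    """Fraction of layer A's selected tokens that layer B also selected."""
    a, b = set(indices_a), set(indices_b)
    if not a:
        return 0.0
    return len(a & b) / len(a)


def pairwise_overlap(per_layer_indices):
    """Map every layer pair (i, j) with i < j to its top-k overlap."""
    return {
        (i, j): topk_overlap(per_layer_indices[i], per_layer_indices[j])
        for i, j in combinations(range(len(per_layer_indices)), 2)
    }


# Toy data (made up): three layers, k = 4 selected positions each.
layers = [
    [0, 3, 7, 9],
    [0, 3, 7, 8],
    [1, 3, 7, 9],
]
overlaps = pairwise_overlap(layers)
print(overlaps[(0, 1)])  # 0.75
print(overlaps[(1, 2)])  # 0.5
```

Because every layer selects the same number k of tokens, this ratio is symmetric; a heatmap is simply this dictionary laid out as a matrix.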

IndexCache partitions layers into Full (F) layers, which retain their indexer, and Shared (S) layers, which reuse the cached indices of the nearest preceding F layer.
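The F/S dispatch can be sketched as a single branch inside the layer loop. This is an illustration only, not the patch's code: `run_layers`, `indexer`, and `sparse_attention` are hypothetical names, and the sketch assumes a Shared layer reuses the indices of the closest earlier Full layer.

```python
def run_layers(pattern, hidden, indexer, sparse_attention):
    """Run DSA layers, recomputing top-k indices only on Full layers.

    pattern: string of 'F'/'S', one character per layer.
    indexer(layer, hidden) -> list of selected token positions.
    sparse_attention(layer, hidden, indices) -> new hidden state.
    """
    cached_indices = None
    for layer, kind in enumerate(pattern):
        if kind == "F":
            cached_indices = indexer(layer, hidden)  # Full: run the indexer
        elif kind != "S":
            raise ValueError(f"unexpected layer kind {kind!r}")
        elif cached_indices is None:
            raise ValueError("a Shared layer needs an earlier Full layer")
        # Shared layers fall through and reuse cached_indices unchanged.
        hidden = sparse_attention(layer, hidden, cached_indices)
    return hidden


# Toy stand-ins (hypothetical) that record which layers ran the indexer.
calls = []


def toy_indexer(layer, hidden):
    calls.append(layer)
    return [0, 1]


def toy_attention(layer, hidden, indices):
    return hidden + len(indices)


result = run_layers("FSSSFSSS", 0, toy_indexer, toy_attention)
print(calls)  # [0, 4]
```

Only one set of indices is alive at a time in this sketch, which is consistent with the claim that the scheme needs no extra GPU memory.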

We propose two complementary approaches:

Approach	Description	Requires Training?
Training-free	Greedy search selects which indexers to remove based on LM loss on a calibration set	✗
Training-aware	Multi-layer distillation trains each retained indexer to serve all layers it covers	✓
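One way the training-free greedy search could look is sketched below. It is an assumption-laden illustration rather than the authors' procedure: `calibration_loss` is a hypothetical caller-supplied function that returns LM loss for a given F/S pattern, and the sketch keeps layer 0 Full because it has no earlier layer to reuse.

```python
def greedy_select(num_layers, num_keep, calibration_loss):
    """Greedily turn Full layers into Shared ones until num_keep remain.

    calibration_loss(pattern) -> float is hypothetical: it should return
    the LM loss on a calibration set when the model runs with `pattern`.
    """
    pattern = ["F"] * num_layers
    while pattern.count("F") > num_keep:
        best_loss, best_layer = None, None
        # Layer 0 stays Full in this sketch (assumption, see lead-in).
        for layer in range(1, num_layers):
            if pattern[layer] != "F":
                continue
            trial = pattern.copy()
            trial[layer] = "S"
            loss = calibration_loss("".join(trial))
            if best_loss is None or loss < best_loss:
                best_loss, best_layer = loss, layer
        if best_layer is None:
            break  # nothing left that can be removed
        pattern[best_layer] = "S"
    return "".join(pattern)


# Toy loss (made up): each removed indexer adds a fixed per-layer penalty.
penalty = [9.0, 0.1, 0.5, 0.2, 0.9, 0.3, 0.4, 0.6]


def toy_loss(pattern):
    return sum(p for p, kind in zip(penalty, pattern) if kind == "S")


print(greedy_select(8, 2, toy_loss))  # FSSSFSSS
```

Each round evaluates every remaining candidate once, so a full search costs on the order of the square of the layer count in calibration passes.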

Both retain only 1/4 of indexers with negligible quality degradation.

Metric	Baseline	IndexCache (1/4)	Speedup
Prefill (200K)	19.5s	10.7s	1.82×
Decode (200K)	58 tok/s	86 tok/s	1.48×
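The speedup column follows directly from the measured values. A short check, using only the numbers in the table (variable names are illustrative):

```python
baseline_prefill_s, indexcache_prefill_s = 19.5, 10.7
baseline_decode_tps, indexcache_decode_tps = 58, 86

# Prefill is a latency (lower is better), so divide baseline by IndexCache.
prefill_speedup = baseline_prefill_s / indexcache_prefill_s
# Decode is a throughput (higher is better), so divide IndexCache by baseline.
decode_speedup = indexcache_decode_tps / baseline_decode_tps

print(f"{prefill_speedup:.2f}x prefill, {decode_speedup:.2f}x decode")
# 1.82x prefill, 1.48x decode
```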

9 benchmarks virtually unchanged ✅

~1.2× E2E speedup with negligible degradation across 10 benchmarks (long-context + reasoning).

For SGLang:

git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout b638b25b
git apply /path/to/indexcache.patch

This patch is built and tested against commit b638b25b. It may apply cleanly to newer versions, but if you encounter conflicts, use this specific commit.

For vLLM:

git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout 4508532fb
git apply /path/to/indexcache_vllm.patch  # the vLLM patch; indexcache.patch is the SGLang patch

Configure via --json-model-override-args for SGLang or --hf-overrides for vLLM. There are two options (shown here for SGLang):

Every N-th layer keeps its indexer:

python -m sglang.launch_server \
    --model-path zai-org/GLM-5-FP8 \
    --json-model-override-args '{"index_topk_freq": 2}' \
    ...  # your other args (tp, dp, etc.)

index_topk_freq=2 → every 2nd layer is Full, the rest are Shared (50% of indexers removed).
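To visualize what a given frequency means, the sketch below expands it into an F/S string. It assumes layers are counted from 0 and that layer `i` is Full when `i % freq == 0`, so the first layer always keeps its indexer; the patch may use a different offset, and `pattern_from_freq` is a hypothetical helper, not part of SGLang or vLLM.

```python
def pattern_from_freq(num_layers, freq):
    """Expand a keep-every-N setting into an F/S string (assumed layout)."""
    if freq < 1:
        raise ValueError("freq must be >= 1")
    return "".join("F" if i % freq == 0 else "S" for i in range(num_layers))


print(pattern_from_freq(8, 1))  # FFFFFFFF (every layer runs its indexer)
print(pattern_from_freq(8, 2))  # FSFSFSFS (half of the indexers removed)
print(pattern_from_freq(8, 4))  # FSSSFSSS (one in four indexers kept)
```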

Specify per-layer F/S assignment:

python -m sglang.launch_server \
    --model-path zai-org/GLM-5-FP8 \
    --json-model-override-args '{"index_topk_pattern": "FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSS"}' \
    ...  # your other args

Each character maps to one DSA layer: F = Full (runs indexer), S = Shared (reuses cached indices).
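Before launching a server it can help to sanity-check a hand-written pattern. The sketch below is a hypothetical helper (not part of the patch) that validates the characters and reports, for each layer, which Full layer's indices it would use, assuming Shared layers reuse the closest earlier Full layer.

```python
def source_layers(pattern):
    """For each layer, return the index of the Full layer whose indices it uses."""
    sources, last_full = [], None
    for layer, kind in enumerate(pattern):
        if kind == "F":
            last_full = layer
        elif kind != "S":
            raise ValueError(f"unexpected character {kind!r} at layer {layer}")
        elif last_full is None:
            raise ValueError("pattern must start with 'F'")
        sources.append(last_full)
    return sources


def removed_fraction(pattern):
    """Share of layers whose indexer is skipped."""
    return pattern.count("S") / len(pattern)


print(source_layers("FFSFSSS"))  # [0, 1, 1, 3, 3, 3, 3]
print(removed_fraction("FSSS"))  # 0.75
```

Checking that `len(pattern)` equals the model's number of DSA layers is a sensible extra step, since each character maps to exactly one layer.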

Default behavior: When neither parameter is set, all layers run their indexer, which is identical to standard DSA.

Parameter	Type	Default	Description
`index_topk_freq`	int	`1`	Keep indexer every N layers. `1` = disabled, `4` = keep 1/4
`index_topk_pattern`	string	`null`	Per-layer F/S pattern. Overrides `index_topk_freq` if set
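Hand-quoting JSON on a command line is easy to get wrong, so the override string can be generated instead. The helper below is a hypothetical convenience, not part of SGLang or vLLM; it only emits the two keys from the table and sends the pattern alone when both are given, since the pattern takes precedence.

```python
import json
import shlex


def override_args(index_topk_freq=None, index_topk_pattern=None):
    """Build the JSON value for --json-model-override-args or --hf-overrides."""
    overrides = {}
    if index_topk_pattern is not None:
        if set(index_topk_pattern) - {"F", "S"}:
            raise ValueError("pattern may contain only 'F' and 'S'")
        overrides["index_topk_pattern"] = index_topk_pattern
    elif index_topk_freq is not None:
        overrides["index_topk_freq"] = int(index_topk_freq)
    return json.dumps(overrides)


flag_value = override_args(index_topk_freq=4)
# shlex.quote wraps the JSON so a POSIX shell passes it as one argument.
print("--json-model-override-args", shlex.quote(flag_value))
# --json-model-override-args '{"index_topk_freq": 4}'
```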

Which to use?

`index_topk_freq: 4`: Simple, good default. Best with training-aware models.

Custom pattern: Optimal for training-free deployment. The example above is the greedy-searched pattern for the GLM-5 model.

Model	Architecture	Supported
DeepSeek-V3.2	`DeepseekV32ForCausalLM`	✅
GLM-5 (744B)	`GlmMoeDsaForCausalLM`	✅

Any model that uses the DSA indexer can benefit from this patch.

@article{bai2025indexcache,
  title={IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse},
  author={Bai, Yushi and Dong, Qian and Jiang, Ting and Lv, Xin and Du, Zhengxiao and Zeng, Aohan and Tang, Jie and Li, Juanzi},
  journal={arXiv preprint arXiv:2603.12201},
  year={2025}
}

This patch is released under the Apache 2.0 License, consistent with SGLang.


Read original on github.com → github.com/THUDM/IndexCache
