# IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

> Source: <https://github.com/THUDM/IndexCache>
> Published: 2026-06-17 12:00:31+00:00

This repository provides a patch for [SGLang](https://github.com/sgl-project/sglang) and [vLLM](https://github.com/vllm-project/vllm) that enables **IndexCache** inference acceleration for models using DeepSeek Sparse Attention (DSA), including **DeepSeek-V3.2** and **GLM-5**.

**TL;DR:** IndexCache eliminates up to 75% of indexer computations in DSA through cross-layer index reuse, achieving up to **1.82× prefill speedup** and **1.48× decode speedup** with negligible quality degradation. One `if/else` branch, zero extra GPU memory.

In DSA, the *lightning indexer* selects the top-k most relevant tokens at each layer to make attention sparse. While cheap per-FLOP, it runs independently at **every** layer with O(L²) complexity in the context length L. At long context lengths, it becomes the dominant bottleneck:

[Figure: assets/figure_profiling.png](/THUDM/IndexCache/blob/main/assets/figure_profiling.png)

*At 200K context, the indexer consumes 81% of prefill time.*
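To make the quadratic cost concrete, here is a toy sketch of per-layer top-k selection. The scoring function (a plain dot product) and the names `select_topk`, `queries`, and `keys` are illustrative assumptions, not the actual lightning indexer. The point is only that each query scores every visible token, and that this work is repeated at every layer.

```python
import heapq

def select_topk(queries, keys, k):
    """For each query position, return the indices of the k highest-scoring
    visible (causal) key positions, using a toy dot-product score."""
    selected = []
    for t, q in enumerate(queries):
        # Scoring every visible token is what makes the indexer
        # quadratic in context length.
        scores = [(sum(a * b for a, b in zip(q, keys[j])), j) for j in range(t + 1)]
        top = heapq.nlargest(k, scores)
        selected.append(sorted(j for _, j in top))
    return selected

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
queries = keys  # toy setup: reuse the keys as queries
indices = select_topk(queries, keys, k=2)
```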

We measured pairwise top-k index overlap across all 47 DSA layers and found that **adjacent layers share 70–100% of their selected tokens**:

[Figure: assets/figure_results.png](/THUDM/IndexCache/blob/main/assets/figure_results.png)

*Cross-layer top-k overlap heatmap. Most indexer computations are redundant.*
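One simple way to quantify this kind of overlap is the fraction of one layer's selected indices that another layer also selects. The README does not spell out the exact metric, so the function below (`topk_overlap`, a hypothetical name) is a sketch under that assumption, with made-up index lists.

```python
def topk_overlap(indices_a, indices_b):
    """Fraction of layer A's selected tokens that layer B also selected."""
    a, b = set(indices_a), set(indices_b)
    if not a:
        return 0.0
    return len(a & b) / len(a)

# Made-up selections for two adjacent layers (k = 8).
layer_3 = [0, 5, 9, 12, 40, 41, 77, 90]
layer_4 = [0, 5, 9, 13, 40, 41, 77, 91]
overlap = topk_overlap(layer_3, layer_4)  # 6 of the 8 indices are shared
```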

IndexCache partitions layers into **Full** (F) layers that retain their indexer and **Shared** (S) layers that reuse the nearest F layer's cached indices.
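The control flow can be sketched in a few lines. This is not the patch's code: `run_layers`, `toy_indexer`, and `toy_attend` are hypothetical stand-ins, and the sketch assumes "nearest F layer" means the most recent Full layer before the Shared one.

```python
def run_layers(pattern, compute_indices, attend, hidden):
    """Walk the layers in order. Full layers refresh the cached indices;
    Shared layers reuse whatever the most recent Full layer produced."""
    if not pattern or pattern[0] != "F":
        raise ValueError("the first layer must be Full so there is something to reuse")
    cached = None
    for layer_id, kind in enumerate(pattern):
        if kind == "F":
            cached = compute_indices(layer_id, hidden)  # run the indexer
        # Shared layers fall through and reuse `cached` unchanged.
        hidden = attend(layer_id, hidden, cached)
    return hidden

indexer_calls = []

def toy_indexer(layer_id, hidden):
    indexer_calls.append(layer_id)
    return [0, 1]  # pretend these are the selected token positions

def toy_attend(layer_id, hidden, indices):
    return hidden + len(indices)

out = run_layers("FSSSFSSS", toy_indexer, toy_attend, hidden=0)
```

With this pattern the toy indexer runs only at layers 0 and 4, while all eight layers still receive a set of indices.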

We propose two complementary approaches:

| Approach | Description | Requires Training? |
|---|---|---|
| Training-free | Greedy search selects which indexers to remove based on LM loss on a calibration set | ✗ |
| Training-aware | Multi-layer distillation trains each retained indexer to serve all layers it covers | ✓ |
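As a rough illustration of the training-free approach, the sketch below removes indexers one at a time, each time picking the removal that gives the lowest loss. The README does not describe the actual search procedure in detail, so everything here is an assumption: `greedy_search` and `eval_loss` are hypothetical names, `eval_loss` stands in for an LM-loss evaluation on a calibration set, and layer 0 is kept Full because it has no earlier layer to reuse.

```python
def greedy_search(num_layers, num_keep, eval_loss):
    """Greedily convert Full layers to Shared, always choosing the
    conversion that yields the lowest calibration loss."""
    pattern = ["F"] * num_layers
    # Layer 0 is never a candidate in this sketch.
    while pattern.count("F") > max(num_keep, 1):
        best_loss, best_layer = None, None
        for layer in range(1, num_layers):
            if pattern[layer] != "F":
                continue
            trial = pattern.copy()
            trial[layer] = "S"
            loss = eval_loss("".join(trial))
            if best_loss is None or loss < best_loss:
                best_loss, best_layer = loss, layer
        pattern[best_layer] = "S"
    return "".join(pattern)

# Hypothetical per-layer penalty for dropping each indexer (toy data).
cost = [9, 1, 5, 1, 1, 4, 1, 1]

def toy_loss(pattern):
    return sum(c for c, kind in zip(cost, pattern) if kind == "S")

result = greedy_search(num_layers=8, num_keep=3, eval_loss=toy_loss)
```

In a real setting each `eval_loss` call would be a forward pass over calibration data, so the number of evaluations grows with the square of the layer count.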

Both retain only **1/4 of indexers** with negligible quality degradation.

| | Baseline | IndexCache (1/4) | Speedup |
|---|---|---|---|
| Prefill (200K) | 19.5s | 10.7s | 1.82× |
| Decode (200K) | 58 tok/s | 86 tok/s | 1.48× |
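The speedup column follows directly from the other two columns. Note the direction of each ratio: prefill is a latency (lower is better), decode is a throughput (higher is better). The variable names below are illustrative only.

```python
prefill_baseline_s, prefill_indexcache_s = 19.5, 10.7
decode_baseline_tps, decode_indexcache_tps = 58, 86

prefill_speedup = prefill_baseline_s / prefill_indexcache_s   # latency ratio
decode_speedup = decode_indexcache_tps / decode_baseline_tps  # throughput ratio
print(f"{prefill_speedup:.2f}x prefill, {decode_speedup:.2f}x decode")
```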

9 benchmarks virtually unchanged ✅

**~1.2× E2E speedup** with negligible degradation across 10 benchmarks (long-context + reasoning).

```
git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout b638b25b
git apply /path/to/indexcache.patch
```

This patch is built and tested against commit `b638b25b`. It may apply cleanly to newer versions, but if you encounter conflicts, use this specific commit.

For vLLM:

```
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout 4508532fb
git apply /path/to/indexcache_vllm.patch  # the vLLM patch
```

Configure via `--json-model-override-args` for SGLang or `--hf-overrides` for vLLM. There are two options (shown here for SGLang):

Every N-th layer keeps its indexer:

```
python -m sglang.launch_server \
    --model-path zai-org/GLM-5-FP8 \
    --json-model-override-args '{"index_topk_freq": 2}' \
    ...  # your other args (tp, dp, etc.)
```

`index_topk_freq=2` → every 2nd layer is Full, the rest are Shared (50% of indexers removed).
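For intuition, a frequency can be expanded into an explicit F/S string. The README does not state which layer the counting starts from, so this sketch assumes layer 0 is Full and every `freq`-th layer after it is Full as well. `pattern_from_freq` is a hypothetical helper, not part of the patch.

```python
def pattern_from_freq(num_layers, freq):
    """Expand a frequency into an F/S string, assuming layer 0 is Full
    and every `freq`-th layer after it is Full too."""
    if freq < 1:
        raise ValueError("freq must be >= 1")
    return "".join("F" if i % freq == 0 else "S" for i in range(num_layers))

p1 = pattern_from_freq(8, 1)  # all Full: standard DSA
p2 = pattern_from_freq(8, 2)  # half of the indexers kept
p4 = pattern_from_freq(8, 4)  # a quarter of the indexers kept
```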

Specify per-layer F/S assignment:

```
python -m sglang.launch_server \
    --model-path zai-org/GLM-5-FP8 \
    --json-model-override-args '{"index_topk_pattern": "FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSS"}' \
    ...  # your other args
```

Each character maps to one DSA layer: `F` = Full (runs indexer), `S` = Shared (reuses cached indices).
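Before launching a server with a hand-written pattern, it can help to check it and see which Full layer each Shared layer would draw from. The helper below (`index_sources`, a hypothetical name) is a sketch that assumes a Shared layer reuses the most recent Full layer before it and that the first layer must therefore be `F`. The patch itself may validate patterns differently.

```python
def index_sources(pattern):
    """For each layer, return the layer whose indexer output it uses."""
    if set(pattern) - {"F", "S"}:
        raise ValueError("pattern may only contain 'F' and 'S'")
    if not pattern.startswith("F"):
        raise ValueError("the first layer must be 'F'")
    sources, last_full = [], None
    for layer_id, kind in enumerate(pattern):
        if kind == "F":
            last_full = layer_id
        sources.append(last_full)
    return sources

# First eight characters of the GLM-5 pattern shown above.
sources = index_sources("FFSFSSSF")
```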

**Default behavior:** When neither parameter is set, all layers run their indexer, which is identical to standard DSA.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `index_topk_freq` | int | `1` | Keep indexer every N layers. `1` = disabled, `4` = keep 1/4 |
| `index_topk_pattern` | string | `null` | Per-layer F/S pattern. Overrides `index_topk_freq` if set |
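When launching from a script, building the JSON payload programmatically avoids shell-quoting mistakes. The sketch below uses only the Python standard library. `override_args` is a hypothetical helper, and sending only the pattern when one is given is a design choice of this sketch (the pattern takes precedence anyway).

```python
import json
import shlex

def override_args(freq=1, pattern=None):
    """Build the JSON override payload. If a pattern is given, only the
    pattern is sent, since it overrides the frequency."""
    overrides = {"index_topk_pattern": pattern} if pattern else {"index_topk_freq": freq}
    return json.dumps(overrides)

payload = override_args(freq=4)
sglang_flag = "--json-model-override-args " + shlex.quote(payload)
vllm_flag = "--hf-overrides " + shlex.quote(payload)
```

The resulting strings can be appended to the `sglang.launch_server` command line or to the vLLM invocation, respectively.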

**Which to use?**

- **`index_topk_freq: 4`**: Simple, good default. Best with training-aware models.
- **Custom pattern**: Optimal for training-free deployment. The example above is the greedy-searched pattern for the GLM-5 model.

| Model | Architecture | Supported |
|---|---|---|
| DeepSeek-V3.2 | `DeepseekV32ForCausalLM` | ✅ |
| GLM-5 (744B) | `GlmMoeDsaForCausalLM` | ✅ |

Any model that uses the DSA indexer can benefit from this patch.

```
@article{bai2025indexcache,
  title={IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse},
  author={Bai, Yushi and Dong, Qian and Jiang, Ting and Lv, Xin and Du, Zhengxiao and Zeng, Aohan and Tang, Jie and Li, Juanzi},
  journal={arXiv preprint arXiv:2603.12201},
  year={2025}
}
```

This patch is released under the [Apache 2.0 License](/THUDM/IndexCache/blob/main/LICENSE), consistent with SGLang.
