GML5 IndexCache Researchers from Tsinghua University and Z.ai have proposed IndexCache, a method to reduce the computational cost of DeepSeek Sparse Attention (DSA) in GLM-5.2. IndexCache exploits the observation that adjacent layers in DSA share 70–100% of their top-k token selections, allowing most layers to reuse cached indices instead of recomputing them. This eliminates the O(NL²) cost of running the indexer at every layer, adding zero extra GPU memory over standard DSA. Notes from my notebook on GLM-5.2 / DeepSeek Sparse Attention DSA , reconstructed from the IndexCache paper Bai, Dong et al., Tsinghua + Z.ai, 2026 — the mechanism behind GLM-5.2's "IndexShare." DSA's whole pitch is: don't do full O L² attention, instead let a cheap lightning indexer look at all preceding tokens and pick the top-k k=2048 that actually matter, then do real attention only on those. That drops core attention from O L² → O Lk . Great — except I missed this the first time I read DSA: the indexer itself is still O L² . It has to score every preceding token against the query to decide who's in the top-k. So across N layers you've traded one O L² cost for N separate O L² costs — total O NL² . At long context this indexer becomes the dominant cost, not the attention it was supposed to fix. Adding the indexer is "DSA on steroids" because it kills DSA's one real bottleneck full attention — but in doing so, it grows its own. The indexer is cheap per-FLOP few heads, low-rank, FP8 but it still runs at every single layer. The fix the paper proposes isn't a smarter indexer — it's don't run it every layer at all. If you measure pairwise overlap between the top-k token sets selected by each layer's indexer, adjacent layers share 70–100% of their picks. The heatmap even shows block structure — clusters of layers e.g. layers 3–5, 17–30, etc. that all converge on roughly the same "important" tokens. So most of the O NL² indexer cost is redundant computation of the same answer. This motivates IndexCache : split the N layers into two roles — The first layer is always F has to seed the cache . Standard DSA: for l = 1 to N: I⁽ˡ⁾ ← Indexer l X T⁽ˡ⁾ ← top-k I⁽ˡ⁾ X ← SparseAttn l X, T⁽ˡ⁾ X ← FFN l X + norm, residual IndexCache: for l = 1 to N: if c l == F: I⁽ˡ⁾ ← Indexer l X T⁽ˡ⁾ ← top-k I⁽ˡ⁾ T cache ← T⁽ˡ⁾ else: c l == S T⁽ˡ⁾ ← T cache reuse X ← SparseAttn l X, T⁽ˡ⁾ X ← FFN l X T cache is just a temp buffer holding the current index tensor — it gets overwritten at every F layer, so it adds zero extra GPU memory over standard DSA. The only real change to the loop is one if/else branch. That's the whole elegance of this method — no architecture surgery, just a routing decision. This part is just DSA's own lightning indexer, for reference since it's what gets shared: Compatibility between query q and each candidate position i , per block/head: s i = q · W i + b i — raw score for position i g i = max 0, s i — ReLU gate this is the "lightning" part: cheap, no softmax needed before selection Top-k = argmax i g i over all i — pick the k highest-scoring positionsThis sits underneath MLA Multi-head Latent Attention . The reason MLA matters here: instead of every head keeping its own full KV, MLA squeezes all heads' KV into one shared low-rank latent vector — latent = x·W^D down-projection . The indexer scores against this compressed representation, which is part of why it's so much cheaper per-FLOP than the main attention. The question is: which layers do you keep as F? Two answers, training-free and training-aware — and notably, the "obvious" third answer similarity-based fails . Order of discovery matters here, so I'm keeping it in the order the paper actually tried things. The dumbest idea: just alternate uniformly, e.g. F S S S F S S S ... 1 F every 4 layers . This doesn't work well . Why: indexer "importance" is not uniform across depth. Some layers — especially early/transitional ones — are way more sensitive to losing their own indexer than others. A fixed period can easily land an F on a redundant layer and an S on a critical one. You need the model or data to tell you which layers are safe to share. No weight updates at all. Just: {2, 3, ..., N} layer 1 is always F — has to seed the cache .This is literally a greedy "convert layers one-by-one, always pick the one with minimum loss increase" search — full search is O N² forward passes, but if you've got pipeline-parallel stages P of them , you can split layers into P blocks and search them in parallel, cutting total passes by roughly P×. What you get out of this empirically, from the paper's 30B DSA model + GLM-5 : If you're willing to retrain continued pretraining, not from scratch , you can go further: force the indexer to actually learn to serve multiple layers , instead of hoping a pattern search finds layers that happen to tolerate sharing. Standard DSA already trains each layer's indexer via KL-divergence distillation against that same layer's aggregated attention distribution p t⁽ˡ⁾ . The extension here: if layer ℓ is F and serves S layers ℓ+1, ..., ℓ+m , train its indexer against all of them jointly : L multi = Σ {j=0}^{m} 1/ m+1 · Σ t D KL p t^ ℓ+j || q t^ ℓ where: q t⁽ˡ⁾ = indexer's own output distribution softmax of its scores at layer ℓ p t⁽ˡ⁾ = the real aggregated attention distribution at layer ℓ averaged across heads 1/ m+1 = just averaging over however many layers reuse this same index Important note training detail I almost missed : you don't do this from random init. A randomly initialized model's attention distribution has no real structure yet — forcing the indexer to chase an undefined target just injects noise. So this is always done as continued pretraining / fine-tuning on top of an already-trained DSA model, in two stages: a frozen "dense warm-up" that trains only the indexer, then a "sparse training" phase that activates top-k and trains everything jointly. This is the part of my notes that was the messiest, so here's the clean derivation. Define the averaged target distribution across the m+1 served layers: p̄ t = Σ {j=0}^{m} 1/ m+1 · p t^ ℓ+j and the single-target loss using that averaged target: L avg = Σ t D KL p̄ t || q t^ ℓ Claim: ∇ θ L multi = ∇ θ L avg . Proof. The key trick: in D KL p || q , only q depends on the trainable parameters θ p is just data — the real attention distribution, treated as a fixed target with stop-gradient . So when you differentiate KL divergence w.r.t. θ, the entropy term of p which doesn't depend on θ vanishes entirely. What's left is just the cross-entropy term: ∇ θ D KL p || q t^ ℓ = -∇ θ Σ s p s · log q t^ ℓ s This is the step I got stuck on in my notebook — I wasn't sure why only the log q term survives. The answer is straightforward once you write KL out fully: D KL p || q = Σ s p s log p s − Σ s p s log q s └──────┬──────┘ └───────┬───────┘ entropy term of p cross-entropy term no θ dependence, only term with θ, gradient = 0 via q = softmax indexer Now apply this to L multi : ∇ θ L multi = - Σ {j=0}^{m} 1/ m+1 Σ t ∇ θ Σ s p t^ ℓ+j s log q t^ ℓ s Since the sum over j and the sum over s are both linear, swap their order and pull the constant log term out: = - Σ t ∇ θ Σ s Σ {j=0}^{m} 1/ m+1 p t^ ℓ+j s · log q t^ ℓ s └──────────────────┬──────────────────┘ = p̄ t s = - Σ t ∇ θ Σ s p̄ t s log q t^ ℓ s = ∇ θ L avg. ∎ So averaging before taking KL and summing the KL terms after are mathematically identical at the gradient level — the indexer ends up being pulled toward the centroid of all the attention distributions it serves, not toward any one layer. Then why use L multi in practice if they're equivalent? Pure memory/engineering reason: with L multi , each S layer only needs to send its own predicted q value backward. With L avg , you'd need to pass both p and q for every served layer to compute the average first — which means extra memory overhead and extra runtime cost for no actual gain, since the gradient comes out identical either way. My takeaway after sitting with this for a while: a lot of "novel" architecture papers ultimately reduce to "design the right loss function for what you want, and let the network figure out the rest." This derivation is a good concrete example — the multi-layer trick isn't a new optimization method, it's just an equivalent and cheaper way to write the same gradient. | Metric | Standard DSA | + IndexCache 1/4 retained | |---|---|---| | Prefill latency | 19.5 s | 10.7 s 1.82× speedup | | Decode throughput per request | 58 tok/s | 86 tok/s 1.48× speedup | Why the training-aware version works where uniform static doesn't: the greedy search has to avoid sensitive layers because the model was never trained to tolerate sharing — without retraining, certain layers are tightly coupled to their own indexer's exact top-k, and feeding them someone else's indices causes a distribution shift that breaks things. Once you train with the multi-layer distillation loss, the S layers themselves learn to adapt to inherited indices, and the F layer's indexer learns to produce a selection that generalizes across all the layers it serves. That joint adaptation is what makes even a dumb uniform pattern work fine after training — the layer-specific sensitivity just disappears. Extra structural note from the overlap heatmap: the first layer is always kept as a full F layer it has to seed the index cache, and early layers attend to a fundamentally different token subset than later ones — overlap with deep layers is ≤0.4 . The strongest, most similar index regions cluster near the diagonal — i.e., a layer's indexer output looks most like its immediate neighbors, decaying as you move further away. Before landing on the greedy LM-loss search, the natural-seeming alternative was tried: pick the sharing pattern by directly maximizing cosine similarity between attention outputs, since that's cheaper to compute than running full LM-loss evaluations. Build an N×N similarity matrix S i j = cosine similarity between layer i's attention output using its own indexer vs. using layer j's indexer instead. Then solve for the best F/S assignment with dynamic programming: dp i k = max over j