{"slug": "gml5-indexcache", "title": "GML5 IndexCache", "summary": "Researchers from Tsinghua University and Z.ai have proposed IndexCache, a method to reduce the computational cost of DeepSeek Sparse Attention (DSA) in GLM-5.2. IndexCache exploits the observation that adjacent layers in DSA share 70–100% of their top-k token selections, allowing most layers to reuse cached indices instead of recomputing them. This eliminates the O(NL²) cost of running the indexer at every layer, adding zero extra GPU memory over standard DSA.", "body_md": "*Notes from my notebook on GLM-5.2 / DeepSeek Sparse Attention (DSA), reconstructed from the IndexCache paper (Bai, Dong et al., Tsinghua + Z.ai, 2026) — the mechanism behind GLM-5.2's \"IndexShare.\"*\n\nDSA's whole pitch is: don't do full O(L²) attention, instead let a cheap **lightning indexer** look at all preceding tokens and pick the top-k (k=2048) that actually matter, then do real attention only on those. That drops core attention from O(L²) → O(Lk).\n\nGreat — except I missed this the first time I read DSA: **the indexer itself is still O(L²)**. It has to score every preceding token against the query to decide who's in the top-k. So across N layers you've traded one O(L²) cost for N separate O(L²) costs — total O(NL²). At long context this *indexer* becomes the dominant cost, not the attention it was supposed to fix.\n\nAdding the indexer is \"DSA on steroids\" because it kills DSA's one real bottleneck (full attention) — but in doing so, it grows its own. The indexer is cheap per-FLOP (few heads, low-rank, FP8) but it still runs at every single layer.\n\nThe fix the paper proposes isn't a smarter indexer — it's **don't run it every layer at all.**\n\nIf you measure pairwise overlap between the top-k token sets selected by each layer's indexer, **adjacent layers share 70–100% of their picks.** The heatmap even shows block structure — clusters of layers (e.g. layers 3–5, 17–30, etc.) that all converge on roughly the same \"important\" tokens.\n\nSo most of the O(NL²) indexer cost is *redundant computation of the same answer.*\n\nThis motivates **IndexCache**: split the N layers into two roles —\n\nThe first layer is always F (has to seed the cache).\n\n**Standard DSA:**\n\n```\nfor l = 1 to N:\n    I⁽ˡ⁾ ← Indexer_l(X)\n    T⁽ˡ⁾ ← top-k(I⁽ˡ⁾)\n    X ← SparseAttn_l(X, T⁽ˡ⁾)\n    X ← FFN_l(X)  # + norm, residual\n```\n\n**IndexCache:**\n\n```\nfor l = 1 to N:\n    if c_l == F:\n        I⁽ˡ⁾ ← Indexer_l(X)\n        T⁽ˡ⁾ ← top-k(I⁽ˡ⁾)\n        T_cache ← T⁽ˡ⁾\n    else:  # c_l == S\n        T⁽ˡ⁾ ← T_cache         # reuse\n    X ← SparseAttn_l(X, T⁽ˡ⁾)\n    X ← FFN_l(X)\n```\n\n`T_cache`\n\nis just a temp buffer holding the *current* index tensor — it gets overwritten at every F layer, so it adds **zero extra GPU memory** over standard DSA. The only real change to the loop is one if/else branch. That's the whole elegance of this method — no architecture surgery, just a routing decision.\n\nThis part is just DSA's own lightning indexer, for reference since it's what gets shared:\n\nCompatibility between query `q`\n\nand each candidate position `i`\n\n, per block/head:\n\n`s_i = q · W_i + b_i`\n\n— raw score for position i`g_i = max(0, s_i)`\n\n— ReLU gate (this is the \"lightning\" part: cheap, no softmax needed before selection)`Top-k = argmax_i(g_i)`\n\nover all i — pick the k highest-scoring positionsThis sits *underneath* MLA (Multi-head Latent Attention). The reason MLA matters here: instead of every head keeping its own full KV, MLA squeezes all heads' KV into one shared **low-rank latent vector** — `latent = x·W^D`\n\n(down-projection). The indexer scores against this compressed representation, which is part of why it's so much cheaper per-FLOP than the main attention.\n\nThe question is: which layers do you keep as F? Two answers, training-free and training-aware — and notably, the \"obvious\" third answer (similarity-based) **fails**. Order of discovery matters here, so I'm keeping it in the order the paper actually tried things.\n\nThe dumbest idea: just alternate uniformly, e.g. `F S S S F S S S ...`\n\n(1 F every 4 layers). This **doesn't work well**. Why: indexer \"importance\" is *not* uniform across depth. Some layers — especially early/transitional ones — are way more sensitive to losing their own indexer than others. A fixed period can easily land an F on a redundant layer and an S on a critical one. You need the model (or data) to tell you which layers are safe to share.\n\nNo weight updates at all. Just:\n\n`{2, 3, ..., N}`\n\n(layer 1 is always F — has to seed the cache).This is literally a greedy \"convert layers one-by-one, always pick the one with minimum loss increase\" search — full search is O(N²) forward passes, but if you've got pipeline-parallel stages (P of them), you can split layers into P blocks and search them in parallel, cutting total passes by roughly P×.\n\n**What you get out of this (empirically, from the paper's 30B DSA model + GLM-5):**\n\nIf you're willing to retrain (continued pretraining, not from scratch), you can go further: **force the indexer to actually learn to serve multiple layers**, instead of hoping a pattern search finds layers that happen to tolerate sharing.\n\nStandard DSA already trains each layer's indexer via KL-divergence distillation against that *same* layer's aggregated attention distribution `p_t⁽ˡ⁾`\n\n. The extension here: if layer `ℓ`\n\nis F and serves S layers `ℓ+1, ..., ℓ+m`\n\n, train its indexer against **all of them jointly**:\n\n```\nL_multi = Σ_{j=0}^{m} [ 1/(m+1) · Σ_t D_KL( p_t^(ℓ+j) || q_t^(ℓ) ) ]\n```\n\nwhere:\n\n`q_t⁽ˡ⁾`\n\n= indexer's own output distribution (softmax of its scores) at layer ℓ`p_t⁽ˡ⁾`\n\n= the real aggregated attention distribution at layer ℓ (averaged across heads)`1/(m+1)`\n\n= just averaging over however many layers reuse this same index**Important note (training detail I almost missed):** you don't do this from random init. A randomly initialized model's attention distribution has no real structure yet — forcing the indexer to chase an undefined target just injects noise. So this is always done as continued pretraining / fine-tuning on top of an already-trained DSA model, in two stages: a frozen \"dense warm-up\" that trains only the indexer, then a \"sparse training\" phase that activates top-k and trains everything jointly.\n\nThis is the part of my notes that was the messiest, so here's the clean derivation.\n\nDefine the **averaged target distribution** across the m+1 served layers:\n\n```\np̄_t = Σ_{j=0}^{m} [ 1/(m+1) · p_t^(ℓ+j) ]\n```\n\nand the single-target loss using that averaged target:\n\n```\nL_avg = Σ_t D_KL( p̄_t || q_t^(ℓ) )\n```\n\n**Claim:** `∇_θ L_multi = ∇_θ L_avg`\n\n.\n\n**Proof.** The key trick: in `D_KL(p || q)`\n\n, only `q`\n\ndepends on the trainable parameters θ (p is just data — the real attention distribution, treated as a fixed target with `stop-gradient`\n\n). So when you differentiate KL divergence w.r.t. θ, the entropy term of `p`\n\n(which doesn't depend on θ) vanishes entirely. What's left is just the cross-entropy term:\n\n```\n∇_θ D_KL(p || q_t^(ℓ)) = -∇_θ Σ_s p(s) · log q_t^(ℓ)(s)\n```\n\nThis is the step I got stuck on in my notebook — I wasn't sure *why* only the `log q`\n\nterm survives. The answer is straightforward once you write KL out fully:\n\n```\nD_KL(p || q) = Σ_s p(s) log p(s)  −  Σ_s p(s) log q(s)\n                └──────┬──────┘     └───────┬───────┘\n              entropy term of p      cross-entropy term\n              (no θ dependence,        (only term with θ,\n               gradient = 0)             via q = softmax(indexer))\n```\n\nNow apply this to `L_multi`\n\n:\n\n```\n∇_θ L_multi = - Σ_{j=0}^{m} [1/(m+1)] Σ_t ∇_θ Σ_s p_t^(ℓ+j)(s) log q_t^(ℓ)(s)\n```\n\nSince the sum over j and the sum over s are both linear, swap their order and pull the constant log term out:\n\n```\n            = - Σ_t ∇_θ Σ_s [ Σ_{j=0}^{m} (1/(m+1)) p_t^(ℓ+j)(s) ] · log q_t^(ℓ)(s)\n                                  └──────────────────┬──────────────────┘\n                                                    = p̄_t(s)\n\n            = - Σ_t ∇_θ Σ_s p̄_t(s) log q_t^(ℓ)(s)\n            = ∇_θ L_avg.   ∎\n```\n\nSo averaging *before* taking KL and summing the KL terms *after* are mathematically identical at the gradient level — the indexer ends up being pulled toward the **centroid** of all the attention distributions it serves, not toward any one layer.\n\n**Then why use L_multi in practice if they're equivalent?** Pure memory/engineering reason: with `L_multi`\n\n, each S layer only needs to send its own predicted `q`\n\nvalue backward. With `L_avg`\n\n, you'd need to pass both `p`\n\nand `q`\n\nfor *every* served layer to compute the average first — which means extra memory overhead and extra runtime cost for no actual gain, since the gradient comes out identical either way.\n\nMy takeaway after sitting with this for a while: a lot of \"novel\" architecture papers ultimately reduce to \"design the right loss function for what you want, and let the network figure out the rest.\" This derivation is a good concrete example — the multi-layer trick isn't a new optimization method, it's just an equivalent (and cheaper) way to write the same gradient.\n\n| Metric | Standard DSA | + IndexCache (1/4 retained) |\n|---|---|---|\n| Prefill latency | 19.5 s | 10.7 s (1.82× speedup) |\n| Decode throughput (per request) | 58 tok/s | 86 tok/s (1.48× speedup) |\n\n**Why the training-aware version works where uniform static doesn't:** the greedy search has to *avoid* sensitive layers because the model was never trained to tolerate sharing — without retraining, certain layers are tightly coupled to their own indexer's exact top-k, and feeding them someone else's indices causes a distribution shift that breaks things. Once you train with the multi-layer distillation loss, the S layers themselves *learn to adapt* to inherited indices, and the F layer's indexer learns to produce a selection that generalizes across all the layers it serves. That joint adaptation is what makes even a dumb uniform pattern work fine after training — the layer-specific sensitivity just disappears.\n\n**Extra structural note from the overlap heatmap:** the first layer is *always* kept as a full F layer (it has to seed the index cache, and early layers attend to a fundamentally different token subset than later ones — overlap with deep layers is ≤0.4). The strongest, most similar index regions cluster near the diagonal — i.e., a layer's indexer output looks most like its *immediate* neighbors, decaying as you move further away.\n\nBefore landing on the greedy LM-loss search, the natural-seeming alternative was tried: **pick the sharing pattern by directly maximizing cosine similarity** between attention outputs, since that's cheaper to compute than running full LM-loss evaluations.\n\nBuild an N×N similarity matrix `S[i][j]`\n\n= cosine similarity between layer i's attention output using its own indexer vs. using layer j's indexer instead. Then solve for the best F/S assignment with dynamic programming:\n\n```\ndp[i][k] = max over j<i, c_j=F of:\n              dp[j][k-1] + Σ_{m=j+1}^{i-1} S[m][j]\n```\n\n— i.e., find the best previous F layer to \"branch\" from, accumulating similarity scores for every S layer that would reuse it. Solvable exactly by backtracking through the DP table.\n\n**This failed.** The similarity-optimal pattern performed about the same as plain uniform interleaving — both clearly worse than the greedy LM-loss search. The reason is the core insight of the whole negative result:\n\nCosine similarity is a\n\nlocalmetric — it only tells you how well-preserved a single layer's output isin isolation. It can't see how small token-selection mismatchespropagate and compoundthrough all the downstream layers. Two layers can have near-identical attention outputs (similarity ≈ 1) yet differ in exactly the handful of tokens that turn out to matter several layers later. Those subtle errors accumulate — and a layer-local similarity score has no way to predict that.\n\nThe LM-loss-based greedy search avoids this because it's a **global, end-to-end** signal — it measures the actual downstream effect of a sharing decision on the whole model's output, not just on one layer's local activation. This is the real lesson: local geometric similarity is a tempting cheap proxy, but for anything where errors compound across depth, you need an end-to-end metric.\n\nDSA's indexer recomputes \"who matters\" from scratch at every layer even though the answer barely changes between adjacent layers — IndexCache just caches that answer and reuses it, and the only real engineering question is *which* layers are allowed to skip recomputation, which can be found either by greedy search (no training) or learned directly via a provably-equivalent averaged-KL loss (with training).\n\nif you found any mismatched detail in this post or want to contribute in paper or working code for indexcache please open issue on\n\n[github.link](https://github.com/Mahendra1706/DecodeAI)", "url": "https://wpnews.pro/news/gml5-indexcache", "canonical_source": "https://dev.to/mahendra4/gml5-indexcache-10nf", "published_at": "2026-06-30 03:42:43+00:00", "updated_at": "2026-06-30 03:48:57.020694+00:00", "lang": "en", "topics": ["large-language-models", "machine-learning", "ai-research", "neural-networks", "developer-tools"], "entities": ["Tsinghua University", "Z.ai", "GLM-5.2", "DeepSeek Sparse Attention", "IndexCache", "Bai", "Dong"], "alternates": {"html": "https://wpnews.pro/news/gml5-indexcache", "markdown": "https://wpnews.pro/news/gml5-indexcache.md", "text": "https://wpnews.pro/news/gml5-indexcache.txt", "jsonld": "https://wpnews.pro/news/gml5-indexcache.jsonld"}}