Notes from my notebook on GLM-5.2 / DeepSeek Sparse Attention (DSA), reconstructed from the IndexCache paper (Bai, Dong et al., Tsinghua + Z.ai, 2026) β the mechanism behind GLM-5.2's "IndexShare."
DSA's whole pitch is: don't do full O(LΒ²) attention, instead let a cheap lightning indexer look at all preceding tokens and pick the top-k (k=2048) that actually matter, then do real attention only on those. That drops core attention from O(LΒ²) β O(Lk).
Great β except I missed this the first time I read DSA: the indexer itself is still O(LΒ²). It has to score every preceding token against the query to decide who's in the top-k. So across N layers you've traded one O(LΒ²) cost for N separate O(LΒ²) costs β total O(NLΒ²). At long context this indexer becomes the dominant cost, not the attention it was supposed to fix.
Adding the indexer is "DSA on steroids" because it kills DSA's one real bottleneck (full attention) β but in doing so, it grows its own. The indexer is cheap per-FLOP (few heads, low-rank, FP8) but it still runs at every single layer.
The fix the paper proposes isn't a smarter indexer β it's don't run it every layer at all.
If you measure pairwise overlap between the top-k token sets selected by each layer's indexer, adjacent layers share 70β100% of their picks. The heatmap even shows block structure β clusters of layers (e.g. layers 3β5, 17β30, etc.) that all converge on roughly the same "important" tokens.
So most of the O(NLΒ²) indexer cost is redundant computation of the same answer.
This motivates IndexCache: split the N layers into two roles β
The first layer is always F (has to seed the cache).
Standard DSA:
for l = 1 to N:
Iβ½Λ‘βΎ β Indexer_l(X)
Tβ½Λ‘βΎ β top-k(Iβ½Λ‘βΎ)
X β SparseAttn_l(X, Tβ½Λ‘βΎ)
X β FFN_l(X) # + norm, residual
IndexCache:
for l = 1 to N:
if c_l == F:
Iβ½Λ‘βΎ β Indexer_l(X)
Tβ½Λ‘βΎ β top-k(Iβ½Λ‘βΎ)
T_cache β Tβ½Λ‘βΎ
else: # c_l == S
Tβ½Λ‘βΎ β T_cache # reuse
X β SparseAttn_l(X, Tβ½Λ‘βΎ)
X β FFN_l(X)
T_cache
is just a temp buffer holding the current index tensor β it gets overwritten at every F layer, so it adds zero extra GPU memory over standard DSA. The only real change to the loop is one if/else branch. That's the whole elegance of this method β no architecture surgery, just a routing decision.
This part is just DSA's own lightning indexer, for reference since it's what gets shared:
Compatibility between query q
and each candidate position i
, per block/head:
s_i = q Β· W_i + b_i
β raw score for position ig_i = max(0, s_i)
β ReLU gate (this is the "lightning" part: cheap, no softmax needed before selection)Top-k = argmax_i(g_i)
over all i β pick the k highest-scoring positionsThis sits underneath MLA (Multi-head Latent Attention). The reason MLA matters here: instead of every head keeping its own full KV, MLA squeezes all heads' KV into one shared low-rank latent vector β latent = xΒ·W^D
(down-projection). The indexer scores against this compressed representation, which is part of why it's so much cheaper per-FLOP than the main attention.
The question is: which layers do you keep as F? Two answers, training-free and training-aware β and notably, the "obvious" third answer (similarity-based) fails. Order of discovery matters here, so I'm keeping it in the order the paper actually tried things.
The dumbest idea: just alternate uniformly, e.g. F S S S F S S S ...
(1 F every 4 layers). This doesn't work well. Why: indexer "importance" is not uniform across depth. Some layers β especially early/transitional ones β are way more sensitive to losing their own indexer than others. A fixed period can easily land an F on a redundant layer and an S on a critical one. You need the model (or data) to tell you which layers are safe to share.
No weight updates at all. Just:
{2, 3, ..., N}
(layer 1 is always F β has to seed the cache).This is literally a greedy "convert layers one-by-one, always pick the one with minimum loss increase" search β full search is O(NΒ²) forward passes, but if you've got pipeline-parallel stages (P of them), you can split layers into P blocks and search them in parallel, cutting total passes by roughly PΓ.
What you get out of this (empirically, from the paper's 30B DSA model + GLM-5):
If you're willing to retrain (continued pretraining, not from scratch), you can go further: force the indexer to actually learn to serve multiple layers, instead of hoping a pattern search finds layers that happen to tolerate sharing.
Standard DSA already trains each layer's indexer via KL-divergence distillation against that same layer's aggregated attention distribution p_tβ½Λ‘βΎ
. The extension here: if layer β
is F and serves S layers β+1, ..., β+m
, train its indexer against all of them jointly:
L_multi = Ξ£_{j=0}^{m} [ 1/(m+1) Β· Ξ£_t D_KL( p_t^(β+j) || q_t^(β) ) ]
where:
q_tβ½Λ‘βΎ
= indexer's own output distribution (softmax of its scores) at layer βp_tβ½Λ‘βΎ
= the real aggregated attention distribution at layer β (averaged across heads)1/(m+1)
= just averaging over however many layers reuse this same indexImportant note (training detail I almost missed): you don't do this from random init. A randomly initialized model's attention distribution has no real structure yet β forcing the indexer to chase an undefined target just injects noise. So this is always done as continued pretraining / fine-tuning on top of an already-trained DSA model, in two stages: a frozen "dense warm-up" that trains only the indexer, then a "sparse training" phase that activates top-k and trains everything jointly.
This is the part of my notes that was the messiest, so here's the clean derivation.
Define the averaged target distribution across the m+1 served layers:
pΜ_t = Ξ£_{j=0}^{m} [ 1/(m+1) Β· p_t^(β+j) ]
and the single-target loss using that averaged target:
L_avg = Ξ£_t D_KL( pΜ_t || q_t^(β) )
Claim: β_ΞΈ L_multi = β_ΞΈ L_avg
.
Proof. The key trick: in D_KL(p || q)
, only q
depends on the trainable parameters ΞΈ (p is just data β the real attention distribution, treated as a fixed target with stop-gradient
). So when you differentiate KL divergence w.r.t. ΞΈ, the entropy term of p
(which doesn't depend on ΞΈ) vanishes entirely. What's left is just the cross-entropy term:
β_ΞΈ D_KL(p || q_t^(β)) = -β_ΞΈ Ξ£_s p(s) Β· log q_t^(β)(s)
This is the step I got stuck on in my notebook β I wasn't sure why only the log q
term survives. The answer is straightforward once you write KL out fully:
D_KL(p || q) = Ξ£_s p(s) log p(s) β Ξ£_s p(s) log q(s)
ββββββββ¬βββββββ βββββββββ¬ββββββββ
entropy term of p cross-entropy term
(no ΞΈ dependence, (only term with ΞΈ,
gradient = 0) via q = softmax(indexer))
Now apply this to L_multi
:
β_ΞΈ L_multi = - Ξ£_{j=0}^{m} [1/(m+1)] Ξ£_t β_ΞΈ Ξ£_s p_t^(β+j)(s) log q_t^(β)(s)
Since the sum over j and the sum over s are both linear, swap their order and pull the constant log term out:
= - Ξ£_t β_ΞΈ Ξ£_s [ Ξ£_{j=0}^{m} (1/(m+1)) p_t^(β+j)(s) ] Β· log q_t^(β)(s)
ββββββββββββββββββββ¬βββββββββββββββββββ
= pΜ_t(s)
= - Ξ£_t β_ΞΈ Ξ£_s pΜ_t(s) log q_t^(β)(s)
= β_ΞΈ L_avg. β
So averaging before taking KL and summing the KL terms after are mathematically identical at the gradient level β the indexer ends up being pulled toward the centroid of all the attention distributions it serves, not toward any one layer.
Then why use L_multi in practice if they're equivalent? Pure memory/engineering reason: with L_multi
, each S layer only needs to send its own predicted q
value backward. With L_avg
, you'd need to pass both p
and q
for every served layer to compute the average first β which means extra memory overhead and extra runtime cost for no actual gain, since the gradient comes out identical either way.
My takeaway after sitting with this for a while: a lot of "novel" architecture papers ultimately reduce to "design the right loss function for what you want, and let the network figure out the rest." This derivation is a good concrete example β the multi-layer trick isn't a new optimization method, it's just an equivalent (and cheaper) way to write the same gradient.
| Metric | Standard DSA | + IndexCache (1/4 retained) |
|---|---|---|
| Prefill latency | 19.5 s | 10.7 s (1.82Γ speedup) |
| Decode throughput (per request) | 58 tok/s | 86 tok/s (1.48Γ speedup) |
Why the training-aware version works where uniform static doesn't: the greedy search has to avoid sensitive layers because the model was never trained to tolerate sharing β without retraining, certain layers are tightly coupled to their own indexer's exact top-k, and feeding them someone else's indices causes a distribution shift that breaks things. Once you train with the multi-layer distillation loss, the S layers themselves learn to adapt to inherited indices, and the F layer's indexer learns to produce a selection that generalizes across all the layers it serves. That joint adaptation is what makes even a dumb uniform pattern work fine after training β the layer-specific sensitivity just disappears.
Extra structural note from the overlap heatmap: the first layer is always kept as a full F layer (it has to seed the index cache, and early layers attend to a fundamentally different token subset than later ones β overlap with deep layers is β€0.4). The strongest, most similar index regions cluster near the diagonal β i.e., a layer's indexer output looks most like its immediate neighbors, decaying as you move further away.
Before landing on the greedy LM-loss search, the natural-seeming alternative was tried: pick the sharing pattern by directly maximizing cosine similarity between attention outputs, since that's cheaper to compute than running full LM-loss evaluations.
Build an NΓN similarity matrix S[i][j]
= cosine similarity between layer i's attention output using its own indexer vs. using layer j's indexer instead. Then solve for the best F/S assignment with dynamic programming:
dp[i][k] = max over j<i, c_j=F of:
dp[j][k-1] + Ξ£_{m=j+1}^{i-1} S[m][j]
β i.e., find the best previous F layer to "branch" from, accumulating similarity scores for every S layer that would reuse it. Solvable exactly by backtracking through the DP table.
This failed. The similarity-optimal pattern performed about the same as plain uniform interleaving β both clearly worse than the greedy LM-loss search. The reason is the core insight of the whole negative result:
Cosine similarity is a
localmetric β it only tells you how well-preserved a single layer's output isin isolation. It can't see how small token-selection mismatchespropagate and compoundthrough all the downstream layers. Two layers can have near-identical attention outputs (similarity β 1) yet differ in exactly the handful of tokens that turn out to matter several layers later. Those subtle errors accumulate β and a layer-local similarity score has no way to predict that.
The LM-loss-based greedy search avoids this because it's a global, end-to-end signal β it measures the actual downstream effect of a sharing decision on the whole model's output, not just on one layer's local activation. This is the real lesson: local geometric similarity is a tempting cheap proxy, but for anything where errors compound across depth, you need an end-to-end metric.
DSA's indexer recomputes "who matters" from scratch at every layer even though the answer barely changes between adjacent layers β IndexCache just caches that answer and reuses it, and the only real engineering question is which layers are allowed to skip recomputation, which can be found either by greedy search (no training) or learned directly via a provably-equivalent averaged-KL loss (with training).
if you found any mismatched detail in this post or want to contribute in paper or working code for indexcache please open issue on