{"slug": "tensor-cache-eviction-conditioned-associative-memory-for-transformers", "title": "Tensor Cache: Eviction-conditioned Associative Memory for Transformers", "summary": "Researchers introduced Tensor Cache, a two-level memory system for Transformer models that compresses key-value pairs evicted from a sliding-window cache into a fixed-size outer-product fast-weight memory. A learned scalar gate fuses exact local attention with reads from the compressed memory. The authors also show that the common chunked-mean training shortcut introduces spurious cross-token outer products, and they replace it with a parallel weighted-sum scan that matches per-token writes. Tensor Cache improves the memory-quality tradeoff over existing bounded-state baselines across long-context language modeling and associative recall tasks.", "body_md": "arXiv:2605.22884v1 Announce Type: new\nAbstract: Autoregressive Transformer KV caches grow linearly with context length; sliding-window caching bounds memory but discards evicted tokens entirely, so relevant evidence outside the window becomes inaccessible. We introduce *Tensor Cache*, a two-level cache that pairs sliding-window softmax attention as a first-level cache (L1) with a fixed-size outer-product fast-weight memory as a second-level cache (L2) fed by KV pairs evicted from the window. Recent tokens remain in exact local attention; evicted pairs are compressed into a per-layer matrix $A$ and read by future queries through a single matrix multiplication, exploiting the linear-attention identity $q_t(k_i \\otimes v_i)=\\langle q_t,k_i\\rangle v_i$. A learned scalar gate fuses the L1 and L2 outputs, and per-head decay and write-rate parameters are trained end-to-end. 
The outer-product memory and the read identity are well-known; our contribution is their use as an L2 cache fed exclusively by sliding-window evictions, plus identifying that the common chunked-mean training shortcut $A\\!\\leftarrow\\!\\lambda A\\!+\\!\\eta(\\bar k\\!\\otimes\\!\\bar v)$ silently introduces $C^2{-}C$ spurious cross-token outer products per chunk, and closing the gap with a parallel weighted-sum scan equivalent to per-token writes within float32 epsilon. Across systems scaling, controlled associative recall, long-context language modeling, and memory-capacity diagnostics, Tensor Cache improves the memory-quality frontier over bounded-state baselines.", "url": "https://wpnews.pro/news/tensor-cache-eviction-conditioned-associative-memory-for-transformers", "canonical_source": "https://arxiv.org/abs/2605.22884", "published_at": "2026-05-25 04:00:00+00:00", "updated_at": "2026-05-25 15:13:38.858556+00:00", "lang": "en", "topics": ["machine-learning", "large-language-models", "neural-networks", "artificial-intelligence", "ai-research"], "entities": ["Tensor Cache"], "alternates": {"html": "https://wpnews.pro/news/tensor-cache-eviction-conditioned-associative-memory-for-transformers", "markdown": "https://wpnews.pro/news/tensor-cache-eviction-conditioned-associative-memory-for-transformers.md", "text": "https://wpnews.pro/news/tensor-cache-eviction-conditioned-associative-memory-for-transformers.txt", "jsonld": "https://wpnews.pro/news/tensor-cache-eviction-conditioned-associative-memory-for-transformers.jsonld"}}