Presenting TIS (Token Importance Scoring) - A new way to compress KV cache

A developer released TIS (Token Importance Scoring), a learned method for compressing the KV cache in large language models, achieving 100% accuracy on synthetic retrieval at 50% cache budget. The approach uses constraint-aware learning to identify important tokens, outperforming static heuristics, and runs on consumer GPUs like the RTX 5070.

Token Importance Scoring for KV Cache Compression

I have spent some time experimenting with learned token importance for efficient KV cache compression in LLMs. The result is a simple mechanism that works surprisingly well, especially on synthetic retrieval tasks.

What’s This About

Most KV cache compression methods rely on static heuristics like position-based selection. I asked myself: what if we just learned which tokens matter? It turns out constraint-aware learning is the key - hard anchor forcing removes trivial optimization paths, letting gradient descent actually find what’s important.

The Results

On NIAH (synthetic retrieval), this system hit 100% accuracy with a learned model at 50% cache budget. On LITM (semantic QA), it gets 52.8% at 50% budget. That is not the highest score on that benchmark, but it is solid given that there is no query-specific training.

The real value is showing that learned importance can match oracle performance on structural tasks, and the setup runs fine on consumer GPUs (validated on RTX 5070, 8GB VRAM).

What’s Included

Three checkpoints are available:

1. tis-stage3-ert (https://huggingface.co/oldman-dev/tis-stage3-ert) - The main one. 100% NIAH, 52.8% LITM at 50% budget. This is what I recommend using.
2. tis-v8b-hard-anchor (https://huggingface.co/oldman-dev/tis-v8b-hard-anchor) - Hard-anchor plus some tuning of the stability loss. Gets 82% NIAH at 25% budget. Useful if you care more about extreme compression.
3. tis-stage1-oracle (https://huggingface.co/oldman-dev/tis-stage1-oracle) - Oracle labels showing the theoretical ceiling. Gets 100% at all budgets but uses ground truth importance. Good for understanding what’s possible.

Get Started

Full code and setup instructions are on GitHub. The repo includes training scripts (ERT objective), evaluation code for NIAH/LITM/NarrativeQA, and detailed reproducibility guides. I’ve also documented the full evolution of the project - what worked, what didn’t, and why.

Technical Details

- Base model: Mistral-7B-v0.3
- Training: KL divergence loss on evicted vs full cache logits
- Optimization: Gradient accumulation, 4-bit quantization, mixed precision
- Validation: RTX 5070 (8GB VRAM), full results reproducible on consumer hardware

The main insight is that good objectives matter more than complex architectures. Language modeling objectives led to memorization. Direct optimization for eviction quality worked.

What’s Next

The natural next step is query-aware importance, which I started exploring. Initial results suggest you can squeeze out another 7-9 percentage points on LITM with query-specific signals. See the repo for details.

I’m also happy to collaborate if you’re interested in this area. Feel free to open issues on GitHub or reach out.

License: MIT
Citation: See GitHub repo for BibTeX
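Appendix sketch: budgeted eviction with forced anchors

To make the "cache budget" and "hard anchor" ideas from the post concrete, here is a minimal pure-Python sketch of how a per-token importance score could be turned into a keep/evict decision. It is not taken from the repo: the function name `select_kept_tokens`, the `anchors` argument, and the reading of "hard anchor forcing" as "these positions are always kept, and learned scores only compete for the remaining slots" are all assumptions made for illustration.

```python
def select_kept_tokens(scores, budget, anchors=()):
    """Return sorted indices of cache positions to keep.

    scores  -- one importance score per cached token (hypothetical scorer output)
    budget  -- fraction of the cache to keep, e.g. 0.5 for a 50% budget
    anchors -- positions that are always kept (assumed meaning of "hard anchors")
    """
    n = len(scores)
    n_keep = max(1, int(n * budget))

    # Anchors are forced in first, so they cannot be evicted by low scores.
    kept = {i for i in anchors if 0 <= i < n}

    # The remaining slots go to the highest-scoring non-anchor positions.
    remaining = max(0, n_keep - len(kept))
    candidates = sorted(
        (i for i in range(n) if i not in kept),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept.update(candidates[:remaining])
    return sorted(kept)


scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.3, 0.7, 0.4]
print(select_kept_tokens(scores, budget=0.5, anchors=(0, 7)))  # [0, 1, 3, 7]
```

With 8 tokens and a 50% budget, 4 positions survive: the two anchors plus the two highest-scoring remaining tokens. In a real model the same index set would be used to slice the key and value tensors of each layer.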
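Appendix sketch: KL divergence between full-cache and evicted-cache logits

The training objective listed under Technical Details compares the model's output with the full cache against its output after eviction. The sketch below shows that comparison for a single position's next-token logits, in pure Python so it runs anywhere. The direction KL(full || evicted) and the helper names are assumptions; the post does not say which direction or reduction the actual training script uses.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_full_vs_evicted(full_logits, evicted_logits):
    """KL(p_full || p_evicted) for one next-token distribution.

    full_logits    -- logits computed with the complete KV cache
    evicted_logits -- logits computed after dropping low-importance tokens
    """
    log_p = log_softmax(full_logits)
    log_q = log_softmax(evicted_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


full = [2.0, 0.5, -1.0]
evicted = [1.5, 0.7, -0.8]
print(kl_full_vs_evicted(full, full))     # zero: eviction changed nothing
print(kl_full_vs_evicted(full, evicted))  # positive: eviction shifted the distribution
```

A loss of this shape rewards the scorer for keeping exactly those tokens whose removal would change the model's predictions, which matches the post's point about optimizing directly for eviction quality rather than for language modeling.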