
Evidence for feature-specific error correction in LLMs

Researchers propose an empirical test for error correction in large language models based on perturbing residual-stream activations. They find that activations are robust to small perturbations and that candidate feature directions are privileged over generic ones, consistent with computation in superposition. Modeling the perturbation effect as a function of the L^p-norm of its decomposition into feature components, they find p>2 for feature directions (contrastive, MELBO, and SAE-decoder) and p≈2 for random and PCA controls. The results replicate across six LLMs, including Gemma-2-9B and Llama-3.1-8B.

Published Jun 25, 2026
arXiv:2606.24964v1
Abstract: Understanding the features of large language models (LLMs) is a central goal of interpretability. LLMs are commonly assumed to use superposition to represent more features than they have dimensions. They may not only represent features in superposition but also perform computation in superposition. Theory predicts that computing in superposition requires error correction that privileges feature directions over generic ones, but this prediction has not been tested empirically. We propose an empirical test of error correction in LLMs based on activation perturbations. Perturbing residual-stream activations, we find that they are robust to small perturbations--forming activation plateaus consistent with error correction--but less robust along candidate feature directions ("pure" directions, constructed from contrastive prompt pairs) than along mixtures of two such directions, indicating that the pure directions are privileged. We quantify this privilegedness by modeling the perturbation effect as a function of the $L^p$-norm of its decomposition into feature components. For $p=2$ the response is a quadratic form with at most as many nonzero eigenvalues as the residual-stream dimension, which cannot privilege the many feature directions superposition requires. $p>2$ lifts this constraint and is consistent with feature-specific error correction. We find $p>2$ for contrastive, MELBO, and SAE-decoder directions, and $p\approx2$ for random and PCA directions (controls). These results replicate across Gemma-2-9B, Qwen3-1.7B, Llama-3.1-8B, Mistral-7B-v0.3, Aya-Expanse-8B, and Yi-1.5-9B. We further validate our method on a toy model of error correction with known ground-truth features, recovering $p>2$ for true feature directions, degrading toward $2$ as we rotate away from them.
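The role of the exponent p can be illustrated with a small numerical sketch. It is not the authors' fitting procedure: it assumes two orthonormal feature directions and a response that depends only on the $L^p$-norm of the feature coefficients, and the function names (`lp_norm`, `mixture_to_pure_ratio`, `p_from_ratio`) are hypothetical. It compares a pure direction with an equal mixture of two directions at the same Euclidean length. For $p=2$ the two are indistinguishable. For $p>2$ the mixture has the smaller norm, which is in line with the abstract's observation that activations are less robust along pure directions than along mixtures.

```python
import numpy as np


def lp_norm(coeffs, p):
    """L^p norm of a vector of feature coefficients."""
    coeffs = np.asarray(coeffs, dtype=float)
    return float(np.sum(np.abs(coeffs) ** p) ** (1.0 / p))


def mixture_to_pure_ratio(p):
    """L^p norm of an equal two-feature mixture divided by that of a pure
    feature, where both coefficient vectors have unit L^2 norm.

    Assumption: the two feature directions are orthonormal, so the
    coefficient vector fully describes the perturbation.
    """
    pure = [1.0, 0.0]
    mixed = [1.0 / np.sqrt(2.0), 1.0 / np.sqrt(2.0)]
    return lp_norm(mixed, p) / lp_norm(pure, p)


def p_from_ratio(ratio):
    """Invert ratio = 2**(1/p - 1/2) for p.

    Toy inversion for this two-feature setup only; it is not the
    estimator used in the paper.
    """
    return 1.0 / (0.5 + np.log2(ratio))


for p in (2.0, 3.0, 4.0):
    r = mixture_to_pure_ratio(p)
    print(f"p={p}: mixture/pure ratio={r:.4f}, recovered p={p_from_ratio(r):.4f}")
```

In this toy setup the ratio equals $2^{1/p - 1/2}$, so it is 1 at $p=2$ and falls below 1 as p grows. A measured mixture-to-pure ratio below 1 would therefore point toward $p>2$ under these assumptions.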