{"slug": "hallucination-is-linearly-decodable-from-mid-layer-hidden-states-in-quantized", "title": "Hallucination Is Linearly Decodable from Mid-Layer Hidden States in Quantized LLMs", "summary": "A linear probe on the hidden states of a single mid-network layer detects hallucinations in 4-bit quantized 7B-8B LLMs with 0.904 to 1.000 AUROC on held-out splits, while sampling-based detectors do not exceed 0.541 AUROC under the same protocol. On natural-language benchmarks the truthfulness signal peaks in a consistent band of layers across model families, and first-block attention entropy provides a complementary signal in knowledge-grounded settings at no additional inference cost. The authors attribute the weak sampling results to a mismatch between paired-label evaluation and the information those methods access, not to an inherent limitation, and release code and data reproducible on a single 8 GB GPU.", "body_md": "arXiv:2606.02628v1\nAbstract: We investigate whether open-source LLMs encode a linearly separable truthfulness signal in their hidden states, and at which network depth this signal is strongest. Across three $7$B--$8$B instruction-tuned models (Llama-3.1-8B, Mistral-7B, Qwen2.5-7B) loaded in $4$-bit NF4 quantization, we extract per-layer hidden states on four hallucination benchmarks (TruthfulQA, HaluEval-QA, FEVER, and a controlled synthetic set) and compare four detection approaches: linear and MLP probes, INSIDE EigenScore, self-consistency, and attention entropy. A linear probe on a single mid-network layer achieves $0.904$--$1.000$ AUROC on held-out splits, while sampling-based detectors do not exceed $0.541$ AUROC under the same protocol. The truthfulness signal is approximately linear: MLP probes rarely surpass linear probes by more than $0.01$ AUROC. Peak probing layers fall in a consistent band across model families on natural-language benchmarks -- blocks $13$--$18$ of $32$ for Llama and Mistral, and blocks $19$--$25$ of $28$ for Qwen. 
First-block attention entropy provides a complementary signal in knowledge-grounded settings ($0.866$--$0.941$ AUROC on HaluEval-QA) at no additional inference cost. The low discriminability of sampling methods under this protocol reflects a structural mismatch between paired-label evaluation and the information these methods access, rather than an inherent limitation of those methods. Code and data are released for full reproducibility on a single $8$ GB GPU.", "url": "https://wpnews.pro/news/hallucination-is-linearly-decodable-from-mid-layer-hidden-states-in-quantized", "canonical_source": "https://arxiv.org/abs/2606.02628", "published_at": "2026-06-03 04:00:00+00:00", "updated_at": "2026-06-03 04:03:39.526961+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "machine-learning", "natural-language-processing", "ai-research"], "entities": ["Llama-3.1-8B", "Mistral-7B", "Qwen2.5-7B", "TruthfulQA", "HaluEval-QA", "FEVER", "INSIDE EigenScore"], "alternates": {"html": "https://wpnews.pro/news/hallucination-is-linearly-decodable-from-mid-layer-hidden-states-in-quantized", "markdown": "https://wpnews.pro/news/hallucination-is-linearly-decodable-from-mid-layer-hidden-states-in-quantized.md", "text": "https://wpnews.pro/news/hallucination-is-linearly-decodable-from-mid-layer-hidden-states-in-quantized.txt", "jsonld": "https://wpnews.pro/news/hallucination-is-linearly-decodable-from-mid-layer-hidden-states-in-quantized.jsonld"}}