{"slug": "decoding-hidden-deception-in-reasoning-llms-activation-explainers-for-deception", "title": "Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing", "summary": "Researchers introduced STATEWITNESS, an activation explainer that decodes hidden states from reasoning LLMs to audit deceptive behavior. It reaches 0.916 mean AUROC, a relative gain of 11.6% over the best black-box text monitor and 25.0% over the best activation-probe baseline. The tool returns natural-language answers and evidence traces for human inspection, and the authors view it as a potential building block for broader interpretability and alignment tools.", "body_md": "arXiv:2606.17478v1\nAbstract: As LLMs acquire stronger reasoning capabilities, deceptive behavior becomes an increasingly serious safety concern. Existing deception monitors either score visible transcripts or derive scalar probe scores from representation vectors, leaving little inspectable evidence about why a response is suspicious. We introduce STATEWITNESS, an activation explainer for deception auditing. A separate decoder reads a target model's hidden states, then answers natural-language queries or emits structured reports about them. We evaluate STATEWITNESS on two target reasoning LLMs across seven deception datasets. STATEWITNESS reaches 0.916 mean AUROC, a relative gain of 11.6% over the best black-box text monitor and 25.0% over the best activation-probe baseline under the same evaluation protocol. When combined with existing monitors, STATEWITNESS reduces missed deceptive examples in simple threshold ensembles. Beyond scalar detection, the decoder returns query-level answers, schema reports, and token- or sentence-level evidence traces for human inspection. 
We view this interface as a potential building block for broader interpretability and alignment tools.", "url": "https://wpnews.pro/news/decoding-hidden-deception-in-reasoning-llms-activation-explainers-for-deception", "canonical_source": "https://arxiv.org/abs/2606.17478", "published_at": "2026-06-17 04:00:00+00:00", "updated_at": "2026-06-17 04:27:49.568362+00:00", "lang": "en", "topics": ["large-language-models", "ai-safety", "ai-ethics", "ai-research"], "entities": ["STATEWITNESS", "arXiv"], "alternates": {"html": "https://wpnews.pro/news/decoding-hidden-deception-in-reasoning-llms-activation-explainers-for-deception", "markdown": "https://wpnews.pro/news/decoding-hidden-deception-in-reasoning-llms-activation-explainers-for-deception.md", "text": "https://wpnews.pro/news/decoding-hidden-deception-in-reasoning-llms-activation-explainers-for-deception.txt", "jsonld": "https://wpnews.pro/news/decoding-hidden-deception-in-reasoning-llms-activation-explainers-for-deception.jsonld"}}