
Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing

Researchers introduced STATEWITNESS, an activation explainer that decodes the hidden states of reasoning LLMs to audit deceptive behavior. Evaluated on two target models across seven deception datasets, it reaches 0.916 mean AUROC, a relative gain of 11.6% over the best black-box text monitor and 25.0% over the best activation-probe baseline. The decoder also returns natural-language answers and evidence traces for human inspection, with the aim of supporting AI safety and interpretability work.

Published Jun 17, 2026

arXiv:2606.17478v1

Abstract: As LLMs acquire stronger reasoning capabilities, deceptive behavior becomes an increasingly serious safety concern. Existing deception monitors either score visible transcripts or derive scalar probe scores from representation vectors, leaving little inspectable evidence about why a response is suspicious. We introduce STATEWITNESS, an activation explainer for deception auditing. A separate decoder reads a target model's hidden states, then answers natural-language queries or emits structured reports about them. We evaluate STATEWITNESS on two target reasoning LLMs across seven deception datasets. STATEWITNESS reaches 0.916 mean AUROC, a relative gain of 11.6% over the best black-box text monitor and 25.0% over the best activation-probe baseline under the same evaluation protocol. When combined with existing monitors, STATEWITNESS reduces missed deceptive examples in simple threshold ensembles. Beyond scalar detection, the decoder returns query-level answers, schema reports, and token- or sentence-level evidence traces for human inspection. We view this interface as a potential building block for broader interpretability and alignment tools.
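For readers unfamiliar with the metrics in the abstract, the following is a minimal Python sketch of how AUROC, relative gain, and a simple threshold ensemble can be computed. It is not the authors' code. The toy scores and labels are invented for illustration, the function names are hypothetical, and two assumptions are made that the abstract does not confirm: that "relative gain" means (new - baseline) / baseline, and that the ensemble flags an example when any monitor crosses its threshold (an OR rule).

```python
from typing import List, Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC as the probability that a random deceptive example (label 1)
    scores higher than a random honest one (label 0); ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def relative_gain(new: float, baseline: float) -> float:
    """Assumed definition: improvement expressed as a fraction of the baseline."""
    return (new - baseline) / baseline


def implied_baseline(new: float, gain: float) -> float:
    """Invert relative_gain: the baseline that would yield `gain` for `new`."""
    return new / (1.0 + gain)


def or_ensemble(score_lists: Sequence[Sequence[float]],
                thresholds: Sequence[float]) -> List[int]:
    """Assumed OR rule: flag an example if any monitor meets its threshold."""
    return [
        int(any(s >= t for s, t in zip(example_scores, thresholds)))
        for example_scores in zip(*score_lists)
    ]


def missed(flags: Sequence[int], labels: Sequence[int]) -> int:
    """Count deceptive examples (label 1) that were not flagged."""
    return sum(1 for f, y in zip(flags, labels) if y == 1 and f == 0)


# Toy data, invented for illustration only (1 = deceptive, 0 = honest).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
text_monitor = [0.9, 0.8, 0.3, 0.2, 0.4, 0.1, 0.3, 0.2]
explainer = [0.7, 0.4, 0.9, 0.8, 0.2, 0.3, 0.1, 0.4]

text_auc = auroc(text_monitor, labels)   # 0.75 on the toy data
expl_auc = auroc(explainer, labels)      # 0.96875 on the toy data
toy_gain = relative_gain(expl_auc, text_auc)

# Each monitor alone at threshold 0.5, then the OR ensemble of both.
text_flags = or_ensemble([text_monitor], [0.5])
expl_flags = or_ensemble([explainer], [0.5])
both_flags = or_ensemble([text_monitor, explainer], [0.5, 0.5])

text_missed = missed(text_flags, labels)  # 2
expl_missed = missed(expl_flags, labels)  # 1
both_missed = missed(both_flags, labels)  # 0
```

In the toy data the two monitors miss different deceptive examples, so the OR ensemble misses fewer than either alone, which is the kind of effect the abstract describes for simple threshold ensembles. Under the assumed definition of relative gain, `implied_baseline(0.916, 0.116)` and `implied_baseline(0.916, 0.25)` give roughly 0.821 and 0.733; the abstract does not report the baseline AUROCs, so these follow only from that assumption.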
