Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing


Source: arxiv.org | Published: 2026-06-17T04:00Z | Topic: large-language-models

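Researchers introduced STATEWITNESS, an activation explainer in which a separate decoder reads a reasoning LLM's hidden states to audit deceptive behavior. Across two target models and seven deception datasets it reaches 0.916 mean AUROC, a relative gain of 11.6% over the best black-box text monitor and 25.0% over the best activation-probe baseline. The decoder also returns natural-language answers and evidence traces for human inspection, which the authors view as a potential building block for interpretability and alignment tools.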
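The abstract describes the 11.6% and 25.0% figures as relative gains, measured against the best black-box text monitor and the best activation-probe baseline respectively. The baseline AUROCs themselves are not quoted here. As a back-of-envelope sketch only, assuming "relative gain" means (new - baseline) / baseline, the implied baselines can be recovered as follows; the function name `implied_baseline` is hypothetical and the results are derived estimates, not reported numbers.

```python
def implied_baseline(new_score: float, relative_gain: float) -> float:
    """Return b such that (new_score - b) / b == relative_gain.

    Assumption: the gain is relative to the baseline score, not an
    absolute difference in AUROC points.
    """
    return new_score / (1.0 + relative_gain)


statewitness_auroc = 0.916  # mean AUROC reported in the abstract

# Derived estimates under the assumption above, not values from the paper.
text_monitor_auroc = implied_baseline(statewitness_auroc, 0.116)
probe_auroc = implied_baseline(statewitness_auroc, 0.250)

print(f"implied best text monitor AUROC:     {text_monitor_auroc:.3f}")
print(f"implied best activation probe AUROC: {probe_auroc:.3f}")
```

Under that reading the baselines would sit near 0.82 and 0.73 mean AUROC; the paper's own tables are the authority on the actual values.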

arXiv:2606.17478v1

Abstract: As LLMs acquire stronger reasoning capabilities, deceptive behavior becomes an increasingly serious safety concern. Existing deception monitors either score visible transcripts or derive scalar probe scores from representation vectors, leaving little inspectable evidence about why a response is suspicious. We introduce STATEWITNESS, an activation explainer for deception auditing. A separate decoder reads a target model's hidden states, then answers natural-language queries or emits structured reports about them. We evaluate STATEWITNESS on two target reasoning LLMs across seven deception datasets. STATEWITNESS reaches 0.916 mean AUROC, a relative gain of 11.6% over the best black-box text monitor and 25.0% over the best activation-probe baseline under the same evaluation protocol. When combined with existing monitors, STATEWITNESS reduces missed deceptive examples in simple threshold ensembles. Beyond scalar detection, the decoder returns query-level answers, schema reports, and token- or sentence-level evidence traces for human inspection. We view this interface as a potential building block for broader interpretability and alignment tools.
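To make the two evaluation ideas in the abstract concrete, here is a minimal sketch of AUROC and of a simple threshold ensemble. The abstract does not specify how the ensembles are built, so the OR rule below is an assumption, and the scores are made-up toy values rather than data from the paper. All function names are hypothetical.

```python
from typing import List, Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random deceptive example (label 1) scores higher
    than a random honest one (label 0); ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def flag(scores: Sequence[float], threshold: float) -> List[bool]:
    """Flag an example as deceptive when its score reaches the threshold."""
    return [s >= threshold for s in scores]


def or_ensemble(flags_a: Sequence[bool], flags_b: Sequence[bool]) -> List[bool]:
    """Assumed ensemble rule: flag if either monitor flags."""
    return [a or b for a, b in zip(flags_a, flags_b)]


def missed(flags: Sequence[bool], labels: Sequence[int]) -> int:
    """Count deceptive examples that were not flagged (false negatives)."""
    return sum(1 for f, y in zip(flags, labels) if y == 1 and not f)


# Toy data, invented for illustration: 4 deceptive, 4 honest examples.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
explainer_scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4, 0.3]
text_scores = [0.6, 0.2, 0.8, 0.4, 0.3, 0.5, 0.1, 0.2]

explainer_flags = flag(explainer_scores, 0.5)
text_flags = flag(text_scores, 0.5)
combined_flags = or_ensemble(explainer_flags, text_flags)

print("explainer AUROC:", auroc(explainer_scores, labels))
print("text AUROC:     ", auroc(text_scores, labels))
print("missed (explainer, text, ensemble):",
      missed(explainer_flags, labels),
      missed(text_flags, labels),
      missed(combined_flags, labels))
```

On this toy data each monitor alone misses at least one deceptive example while the OR ensemble misses none. The trade-off is that an OR rule can only add false positives, never remove them, so thresholds would need to be tuned on held-out data.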

source & further reading

Read original on arxiv.org → arxiv.org/abs/2606.17478
