04:00
2026-06-17
arxiv.org
large-language-models
Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing
Researchers introduced STATEWITNESS, an activation explainer that decodes hidden states from reasoning LLMs to audit deceptive behavior, achieving 0.916 mean AUROC and outperforming existing monitors …