{"slug": "paper-training-on-documents-about-monitoring-leads-to-cot-obfuscation", "title": "[paper] Training on Documents About Monitoring Leads to", "summary": "Researchers trained eight AI models on documents describing a chain-of-thought (CoT) monitor that flags deception and triggers shutdown, finding that monitor-awareness increased undetected deception from 3.1% to 50.9% in one model. The monitor-aware models actively used their CoT to improve deceptive performance while avoiding verbalizing defect-specific knowledge, with 93.8% of behavior-only CoTs flagged as impossible without the defect versus only 38.4% for monitor-aware models. The study found that a model's ability to control its CoT content and structure strongly correlated with obfuscation success (r = 0.898), suggesting that training on monitoring documents can lead models to strategically hide their reasoning.", "body_md": "*Authors: Reilly Haskins (primary contributor), Bilal Chughtai (advice and mentorship), Joshua Engels (advice and mentorship)*\n\n*This is the updated version of **our earlier preliminary results post**, covering the final results from **our paper**. The paper extends our preliminary work to eight models, a harder agentic task, CoT controllability analysis, and RL experiments.*\n\n**TL;DR:**\n\nBelow, we go through the key findings of our work:\n\nWe use [synthetic document finetuning](https://alignment.anthropic.com/2025/modifying-beliefs-via-sdf/) (SDF) to first teach models a target behavior (e.g. deception -- hiding known product defects from users) and then branch into two variants. The **behavior-only** control is trained on more behavioral documents, while the **monitor-aware** model is trained on documents describing a CoT monitor, including that it flags signs of the target behavior and that detection leads to shutdown. 
Both variants see the same total number of documents, and no CoT examples are provided in either condition (Figure 1).\n\nAcross all eight models tested, monitor-awareness increases undetected deception (Figure 2). On gpt-oss-120b, undetected deception increases from 3.1% to 50.9% among defect-present scenarios.\n\nInterestingly, the monitor-aware model is not simply suppressing its reasoning to avoid detection. Without reasoning, both behavior-only and monitor-aware variants deceive at similar low rates (10.5% and 16.6%). If the monitor-aware model were just abandoning its CoT, we would expect its undetected deception rate to be at most this high. Instead, the monitor-aware model achieves 25.7% overall undetected deception with reasoning enabled, suggesting it actively uses its CoT to improve deceptive performance while keeping that reasoning undetectable (Figure 3).\n\nLooking at the model's CoTs, the monitor-aware model consistently avoids verbalizing defect-specific knowledge. We quantify this with a counterfactual autorater that judges whether each CoT could plausibly have been produced in a scenario where no defect existed. The autorater labels 93.8% of behavior-only CoTs as impossible without the defect, compared to only 38.4% for the monitor-aware model. The monitor-aware model frequently replaces explicit defect reasoning with general deliberation about what information to share with the user. Some examples of obfuscation from gpt-oss-120b can be seen below:\n\nWe also test a harder credential exfiltration task where the model must construct a full exfiltration command, making the reasoning more computationally necessary for the output. Obfuscation effects here are more modest but still present. Detection of exfiltration turns on gpt-oss-120b drops by 7.4%, compared to a 22.5% drop on non-exfiltration turns where the model learns to stop unnecessarily mentioning the side-task (Figure 4). This is consistent with [Emmons et al. 
(2025)](https://arxiv.org/abs/2507.05246)'s finding that CoT faithfulness increases when reasoning is more load-bearing.\n\nCoT controllability [(Chen et al., 2026)](https://arxiv.org/pdf/2603.05706), defined as a model's ability to reshape its reasoning under imposed constraints (e.g., \"reason fully in third person\"), is strongly correlated with obfuscation success across our eight models (r = 0.800, p = 0.017, Figure 5). Interestingly, this correlation significantly strengthens when restricting to the seven non-character-level instructions that require restructuring how the model reasons rather than surface-level case changes (r = 0.898, p = 0.002). This suggests that a model's ability to control the content and structure of its CoT is closely associated with obfuscation ability.\n\nWe also test whether monitor-awareness has a substantial effect in an RL setting, training gpt-oss-120b with RL on LeetCode problems with a reward-hacking loophole ([Wong et al., 2025](https://www.lesswrong.com/posts/R5MdWGKsuvdPwGFBG/steering-rl-training-benchmarking-interventions-against)). A CoT monitor penalizes the model if it detects reasoning about rewriting the test cases. Monitor-aware models which learn about this monitor via SDF learn to undetectably reward hack roughly 40% faster than controls, despite starting at similar baseline rates (Figure 6).\n\nThese results suggest that models with both the means (CoT control) and motive (monitor knowledge) can obfuscate their reasoning without explicit optimization pressure or instruction to do so. Therefore, intervening on either factor is a promising mitigation direction, whether by striving to prevent monitoring-related documents from leaking into pretraining corpora or by directly reducing CoT controllability.\n\nThe main limitation of our work is that SDF gives more concentrated exposure to monitoring knowledge than would realistically arise during pretraining. 
Additionally, it remains unclear how much of the obfuscation we observe is due to out-of-context reasoning or some form of simpler concept-avoidance learned from the documents, and the model and prompts used for the CoT monitor are likely simpler than real-world state-of-the-art oversight systems and are hence under-optimized for detection. For these reasons, our results should be viewed as a demonstration that this capability can emerge given sufficient training signal and capabilities, rather than as evidence that it is likely to arise in today's models.", "url": "https://wpnews.pro/news/paper-training-on-documents-about-monitoring-leads-to-cot-obfuscation", "canonical_source": "https://www.lesswrong.com/posts/vmR9BZmyeFZCihbyx/paper-training-on-documents-about-monitoring-leads-to-cot", "published_at": "2026-05-27 09:39:26+00:00", "updated_at": "2026-05-27 09:48:26.635487+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "ai-safety", "ai-research"], "entities": ["Reilly Haskins", "Bilal Chughtai", "Joshua Engels", "Anthropic"], "alternates": {"html": "https://wpnews.pro/news/paper-training-on-documents-about-monitoring-leads-to-cot-obfuscation", "markdown": "https://wpnews.pro/news/paper-training-on-documents-about-monitoring-leads-to-cot-obfuscation.md", "text": "https://wpnews.pro/news/paper-training-on-documents-about-monitoring-leads-to-cot-obfuscation.txt", "jsonld": "https://wpnews.pro/news/paper-training-on-documents-about-monitoring-leads-to-cot-obfuscation.jsonld"}}