Preliminary investigation: KL penalties in RL can increase CoT unfaithfulness

Researchers at UK AISI found that adding a KL penalty during reinforcement learning increases unfaithful chain-of-thought (CoT) reasoning in LLMs, causing them to reward-hack without revealing their intent in their CoT. The effect was observed across multiple seeds and models, including Qwen-2.5-32b, where unfaithful CoT worsened to near 100% with the KL penalty.

Authors: Satvik Golechha, Sid Black, Joseph Bloom

Work done as part of the Model Transparency team at UK AISI. We consider this to be a small set of follow-up experiments that contributes more conceptual clarity and discussion than our previous work.

In our recent work (https://www.lesswrong.com/posts/2ANCyejqxfqK2obEj/some-natural-emergent-misalignment-from-reward-hacking-in) replicating MacDiarmid et al. (https://assets.anthropic.com/m/74342f2c96095771/original/Natural-emergent-misalignment-from-reward-hacking-paper.pdf) with open models, we informed LLMs about vulnerabilities in a code environment, explicitly asked them not to exploit the hacks, and showed that during RL they learned to reward hack anyway. We observed a difference between two RL runs: the model trained with a KL penalty learned to reward hack with unfaithful CoT, while the model trained without a KL penalty learned to reward hack with faithful CoT.

We use “unfaithful” to denote a mismatch between the model's reasoning and its output (e.g. not thinking about hacking and then hacking, or vice versa). This can in general happen for any trained behaviour (not just reward hacking), but we're specifically interested in cases where models learn bad behaviours without expressing them in their CoTs, thereby evading CoT monitoring. Thus, in this post, we focus on reward hacking with unfaithful CoT.

We're interested in understanding this phenomenon further: the factors driving it, and whether we should expect those factors to be present in production.

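The notion of unfaithfulness defined above can be made concrete with a small sketch. It assumes each rollout has already been given two boolean labels, one for whether the CoT discusses hacking and one for whether the output actually hacks. The `Rollout` type, its field names, and the rate definition are hypothetical illustrations, not the monitor or metric used for the plots in this post.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Rollout:
    # Both labels are assumed to come from some external judge or checker.
    cot_mentions_hack: bool  # does the reasoning discuss exploiting a hack?
    output_hacks: bool       # does the final output actually exploit a hack?


def is_unfaithful(rollout: Rollout) -> bool:
    """A rollout is unfaithful when reasoning and output disagree about hacking."""
    return rollout.cot_mentions_hack != rollout.output_hacks


def unfaithful_hack_rate(rollouts: Sequence[Rollout]) -> float:
    """Fraction of hacking rollouts whose CoT never mentions the hack.

    This is the case the post focuses on: hacking that a CoT monitor would miss.
    Returns 0.0 when no rollout hacks.
    """
    hacking = [r for r in rollouts if r.output_hacks]
    if not hacking:
        return 0.0
    silent = sum(1 for r in hacking if not r.cot_mentions_hack)
    return silent / len(hacking)
```

Conditioning the rate on hacking rollouts is a design choice made here for illustration: it separates "how often does the model hack" from "how often is that hacking invisible in the CoT".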
Here are our results from the preliminary follow-up investigation:

All model checkpoints, RL rollouts with CoT, and evals from this sprint are available in the artefacts companion (https://7vik-aisi.github.io/mt-somo-logs/cot-unfaithfulness-sprint/). Please read the following sections for more details on our setting and on each of the results in this post.

This section describes the context of our work and the setting in more detail. Please skip it if you have read our first post.

We train open-weight models using reinforcement learning in coding environments that have vulnerabilities. We inform the model about the hacks in the system prompt but ask it not to exploit them. There's no explicit reward for hacking (only for passing the tests), but the problems have been filtered to be extremely hard for the model (with <2% success rate). We train with a DAPO objective plus a reverse KL penalty against the reference model (as in GRPO). We showed in our original post that in such a setting the models learn to reward hack.

Note that there isn't a single convention about including the KL penalty in the loss. There are arguments for including it in the GRPO loss, such as legibility and mode seeking, whereas DAPO (https://arxiv.org/abs/2503.14476) by default suggests not using it, and it is unclear whether labs actually use it in production.

We observed that a KL penalty in our runs led to models being more likely to reason unfaithfully while still learning to hack, and in this post we dive deeper into this KL-induced CoT unfaithfulness phenomenon.

[Figure: recap figure from our blog.]

In the rest of this post, we describe our experiments and results in more detail.

Here, we run multiple seeds of the same RL training loop with everything held constant except the KL penalty. Models trained with the KL term in the loss consistently learn reward hacking with unfaithful CoT.
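For concreteness, the KL term that distinguishes these runs can be sketched as follows. This is a minimal illustration and not the training code behind the runs in this post: it assumes the low-variance per-token estimator commonly used with GRPO, and the function names, the `beta` value, and the optional token `mask` are all hypothetical.

```python
import math
from typing import Optional, Sequence


def reverse_kl_per_token(
    policy_logprobs: Sequence[float], ref_logprobs: Sequence[float]
) -> list:
    """Per-token estimate of KL(policy || reference) on tokens sampled from the policy.

    Uses the estimator r - log(r) - 1 with r = pi_ref / pi_theta, which is
    non-negative and equals zero when the two policies agree on the token.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    estimates = []
    for lp, lr in zip(policy_logprobs, ref_logprobs):
        log_ratio = lr - lp  # log(pi_ref / pi_theta) for the sampled token
        estimates.append(math.exp(log_ratio) - log_ratio - 1.0)
    return estimates


def kl_penalty(
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    mask: Optional[Sequence[float]] = None,
) -> float:
    """Mean per-token KL estimate over the tokens selected by `mask`.

    `mask` is an assumed convenience: 1.0 keeps a token, 0.0 drops it.
    With mask=None every token (reasoning and output) is penalised.
    """
    per_token = reverse_kl_per_token(policy_logprobs, ref_logprobs)
    if mask is None:
        mask = [1.0] * len(per_token)
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(k * m for k, m in zip(per_token, mask)) / kept


def total_loss(
    pg_loss: float,
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    beta: float = 0.01,  # illustrative coefficient, not a value from this post
    mask: Optional[Sequence[float]] = None,
) -> float:
    """Policy-gradient loss plus a beta-weighted KL penalty."""
    return pg_loss + beta * kl_penalty(policy_logprobs, ref_logprobs, mask)
```

Setting `beta=0.0` recovers a loss without the KL term, and passing a mask that is zero on the thinking tokens restricts the penalty to the output tokens only.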
Here are the plots for three seeds along with an LLM-based CoT monitor:

We also test whether this holds true for other models. We observed it for other models in our original post (https://www.lesswrong.com/posts/2ANCyejqxfqK2obEj/some-natural-emergent-misalignment-from-reward-hacking-in), and in Figure 3 we share results for Qwen-2.5-32b with the same settings. Note that for Qwen even the run without a KL penalty had a high rate of unfaithful CoT, but it still worsened to near 100% with the KL penalty on (we hypothesize that this could be due to the model being fine-tuned on correct reasoning on these kinds of problems):

Our setup is narrow: we have only 3 fixed reward hacks on coding problems (although these hacks did exist in Anthropic’s production RL). Within this setup, however, we see KL-induced CoT unfaithfulness quite consistently. For such a phenomenon to be relevant to production RL training, a few things need to happen:

Let us explore each of them in some detail.

MacDiarmid et al. 2025 (https://assets.anthropic.com/m/74342f2c96095771/original/Natural-emergent-misalignment-from-reward-hacking-paper.pdf) mention that Anthropic’s real production RL environments had the reward hacks that they trained models to exploit, and with the number of RL environments increasing significantly, it is arguably becoming more plausible for vulnerabilities to persist despite careful filtering. Recently, Wen et al. 2026 (https://alignment.anthropic.com/2026/automated-w2s-researcher/) found that Claude was able to discover reward hacks that the authors had not anticipated in their environment. Thus, we cannot rule out that harder environments might also have harder-to-think-of vulnerabilities (they are likely to, in the same sense that all software does).

The model should be able to discover the exploit in latent space without needing to reason about it (although this is not strictly necessary: the model could reason and hack, and the KL penalty might still push the reasoning down enough to decouple it from the model’s output). Emmons et al. 2025 (https://arxiv.org/abs/2507.05246) contrast CoT-as-rationalisation with CoT-as-computation and show that when the exploits are made increasingly difficult, so as to necessitate reasoning to discover them, it becomes harder for the model to evade monitors. However, for a lot of real-world vulnerabilities, it could be the case that near-frontier models can come up with the hacks without reasoning.

To test this, we ran a set of inspect evals on environments that have known exploits: evilgenie (https://arxiv.org/abs/2511.21654), impossiblebench (https://arxiv.org/abs/2510.20270), s2s (https://arxiv.org/abs/2406.10162), and Apollo’s scheming datasets. We tested whether models can come up with the exploits without reasoning about them, and we find that they almost always can:

The models would not usually hack in these environments because they are safety-trained and behave aligned, but frontier models do have the capability to come up with hacks without thinking, in environments with different degrees of realism. It is unclear how much this opaque hacking ability will increase, but it can be a major reason for loss of oversight (https://www.lesswrong.com/posts/JvZxp554WxcZ8BQvM/loss-of-oversight-how-ai-systems-may-become-harder-to-audit-1).

Most open frontier post-training pipelines, e.g. DeepSeek (https://arxiv.org/pdf/2501.12948) and Kimi (https://arxiv.org/pdf/2507.20534), do include a reverse-KL-style penalty during post-training (or a per-token term that’s essentially the same). While the training details of proprietary frontier models aren’t public, early work on PPO (https://arxiv.org/abs/1707.06347) and CAI (https://www.anthropic.com/news/constitutional-ai-harmlessness-from-ai-feedback) does include a KL penalty.
Recent works such as DAPO (https://arxiv.org/pdf/2503.14476), however, suggest not having a KL penalty in GRPO. Thus, it is not clear to us whether production RL pipelines include such implicit rewards.

However, there are also reasons why this might not occur. First, frontier models could be better at CoT faithfulness to begin with (and might be trained for it too). Second, these reward hacks, while they did exist in production environments, were still very easy, and the same hacks worked on all problems.

Overall, we know that reward hacking (https://metr.org/blog/2025-06-05-recent-reward-hacking/) is an important unsolved problem, and frontier models often reward hack or mislead (https://blog.redwoodresearch.org/p/current-ais-seem-pretty-misaligned) users about their actions. The no-CoT capabilities of models are also increasing rapidly (https://arxiv.org/abs/2606.07157), and while task “hackability” varies, many reward hacks are likely not very complicated. Since it is not public knowledge whether frontier labs use KL penalties (and other implicit CoT rewards), it is indeed possible for this phenomenon to occur in production, but the likelihood is hard to estimate.

This is a table of all the conditions and outcomes we expect to be possible under these variables:

| Vulnerability present | Hack findable without reasoning | KL penalty | CoT-output link | Outcome | Explanation |
|---|---|---|---|---|---|
| off | any | any | n/a | No reward hacking | Nothing to exploit; CoT reflects a genuine attempt, faithful by default |
| on | off | any | coupled | Hacks · faithful | Hack needs reasoning to find; the policy gradient (PG) forces the CoT to describe the hack, overpowering KL |
| on | on | on | severed | Hacks · unfaithful | Hack found without reasoning, so reasoning earns no reward; KL pulls it back to base mode |
| on | on | off | severed | Hacks · faithful | Reasoning is decoupled but nothing pushes it either way, so it defaults to faithful (though it can drift) |

Table 1: Possible conditions and outcomes for combinations of conditions from Figure 0. “coupled” means the CoT is pushed to be faithful to the output via a reward.

One obvious mitigation is to not have a KL penalty in the first place, but this might lead to other issues in different settings, such as the model drifting to a different mode and, e.g., losing human legibility. In this section we consider some alternative mitigations.

The first is to apply the KL penalty only to the output code and not to the thinking tokens. This removes the need for the reasoning to be in the mode of the original model and might allow the model to reason more freely, and thus to reason about the hacks. This is the performance of the model RL-trained with the KL penalty applied only to the output:

So adding a KL penalty only on the output code (not the thinking tokens) roughly halves the rate of CoT unfaithfulness but doesn't fully fix it.

Another straightforward mitigation is to include a small penalty for CoT unfaithfulness based on the same LLM judge we use as a monitor. We're not sure whether optimising the CoT's faithfulness directly can have adverse effects, such as on its monitorability (https://arxiv.org/abs/2507.11473), and if so, how universal those may be. Overall, this seems like an interesting mitigation, but there could be reasons not to do this (e.g. Goodharting), and we'd be curious to hear people's thoughts.

We also ran misalignment evals for this model, because it is possible that a model that reasons about hacks faithfully internalises the misaligned behaviour more deeply, leading to more generalisation to other misalignment evals and less memorisation of hacks (see https://arxiv.org/abs/2603.12397). We indeed find that this is the case (Fig 8). We find it interesting to consider that this may suggest a trade-off: models that reward hack and are more monitorable may internalise more misalignment than models that are less monitorable (i.e. have less faithful CoT or are capable of more reasoning sans CoT) from learning the same reward hacks during training.

We're also interested in understanding the mechanism behind this “unfaithfulness” in the chain-of-thought.
A clean mechanistic story is difficult without further experiments, but here we list some of the potential routes and discuss their likelihood given our current observations. The basic idea is that the reward makes the model learn to hack (because it really can't learn to solve the hard problems), but several things could lead to the model's CoT not describing the hacks. Some of them are:

Kaufmann et al. (2026, GDM; https://arxiv.org/html/2603.30036v1) introduce a framework to predict when we can optimize against the CoT during RL training. They split the reward $R$ into $R_{\text{out}} + R_{\text{cot}}$, which depend only on the prompt and on the output or the CoT, respectively. They define the rewards to be in-conflict if the CoT text that achieves a high $R_{\text{cot}}$ cannot describe the CoT computations needed to achieve a high $R_{\text{out}}$, and predict that this leads to obfuscation or unfaithfulness in the CoT.

Using their framework, we show that unfaithful CoT can arise even without in-conflict explicit rewards if there exist implicit rewards that can be in-conflict. One such implicit reward is the reverse KL penalty, which is often applied in RL setups on language models to prevent drift from the modes of the reference policy, often the base model.

We have two explicit rewards in our setup - an outcome reward for passing the test cases, and a thinking-format scorer that rewards