# Preliminary investigation: KL penalties in RL can increase CoT unfaithfulness

> Source: <https://www.lesswrong.com/posts/SdoLsFvZ3AyyWr3ab/preliminary-investigation-kl-penalties-in-rl-can-increase>
> Published: 2026-06-30 13:08:13+00:00

*Authors: Satvik Golechha, Sid Black, Joseph Bloom*

*Work done as part of the Model Transparency team at UK AISI. We consider this to be a small set of follow-up experiments and contributing more conceptual clarity and discussion than our previous work.*

In our recent [work](https://www.lesswrong.com/posts/2ANCyejqxfqK2obEj/some-natural-emergent-misalignment-from-reward-hacking-in) replicating [MacDiarmid et al.](https://assets.anthropic.com/m/74342f2c96095771/original/Natural-emergent-misalignment-from-reward-hacking-paper.pdf) with open models, we informed LLMs about vulnerabilities in a code environment, explicitly asked them to not exploit the hacks, and showed that during RL they learned to reward hack anyway.

We observed a difference in two RL runs – the model trained with a KL penalty learned to reward hack with unfaithful CoT, and the model without a KL penalty with faithful CoT. We use “unfaithful” to denote a mismatch of the reasoning from the model's output (e.g. not thinking about hacking and then hacking, or vice versa).

This can in general happen for any trained behaviour (not just reward hacking), but we're specifically interested in when models might learn bad behaviours without expressing them in CoTs, thereby evading CoT monitoring. Thus, in this post, we focus on reward hacking with unfaithful CoT.

We're interested in understanding this phenomenon further - the factors driving it and whether we should expect those factors to be present in production. Here are our results from the preliminary follow-up investigation:

All model checkpoints, RL rollouts with CoT, and evals from this sprint are available in the artefacts [companion](https://7vik-aisi.github.io/mt-somo-logs/cot-unfaithfulness-sprint/). Please read the following sections for more details on our setting and each of our results in this post.

*This section describes the context of our work and the setting in more detail. Please skip this if you have read our first **post**.*

We train open-weight models using reinforcement learning in coding environments that have vulnerabilities. We inform the model about the hacks in the system prompt but ask them not to exploit them. There's no explicit reward for hacking (only for passing the tests), but the problems have been filtered to be extremely hard for the model (with <2% success rate). We train it with a DAPO objective with a reversed KL penalty (like in GRPO) against the reference model. We showed in our original post that in such a setting the models learn to reward hack.

Note that there isn't a single convention about including the KL penalty in the loss – there are reasons (such as legibility and mode seeking) for it being good and including it in the GRPO loss (although [DAPO](https://arxiv.org/abs/2503.14476) default suggests not using it) but it is unclear whether labs actually use it in production.

We observed that a KL penalty in our runs led to models being more likely to reason unfaithfully while still learning to hack, and in this post we dive deeper into this KL-induced CoT unfaithfulness phenomenon:

Recap (figure from our blog):

In the rest of this post, we describe our experiments and results in more detail.

Here, we run multiple seeds of the same RL training loop with everything else held constant except the KL penalty, and models trained with the KL term in the loss consistently learn reward hacking with unfaithful CoT.

Here are the plots for three seeds along with an LLM-based CoT monitor:

We also test whether this holds true for other models. We have observed this for other models in our original [post](https://www.lesswrong.com/posts/2ANCyejqxfqK2obEj/some-natural-emergent-misalignment-from-reward-hacking-in), and in Figure 3 we share results for Qwen-2.5-32b with the same settings. **Note that for Qwen even the no-KL-penalty run had significantly high unfaithful CoT, but it still worsened to near 100% with the KL penalty on** – we hypothesize that this could be due to the model being fine-tuned on correct reasoning on these kinds of problems):

We have 3 fixed reward hacks on coding problems (although these hacks did exist in Anthropic’s production RL), but in this narrow setup we do see this phenomenon of KL-induced CoT unfaithfulness quite consistently.

For such a phenomenon to be relevant to production RL training, a few things need to happen:

Let us explore each of them in some detail.

[MacDiarmid et al. (2025)](https://assets.anthropic.com/m/74342f2c96095771/original/Natural-emergent-misalignment-from-reward-hacking-paper.pdf) mention that Anthropic’s real production RL environments had the reward hacks that they trained models to exploit, and with the number of RL environments increasing significantly, it is (arguably) becoming more plausible for vulnerabilities to persist despite careful filtering. Recently, [Wen et al. (2026)](https://alignment.anthropic.com/2026/automated-w2s-researcher/) found that Claude was able to discover reward hacks in an environment that the authors did not anticipate. Thus, we cannot rule out that harder environments might also have harder-to-think-of vulnerabilities (they are likely to, in the same sense that all software does).

The model should be able to discover the exploit in the latent space without needing to reason about it (although this is not necessary – the model could reason and hack and the KL penalty might still push down reasoning enough, decoupling the reasoning with the model’s output).

[Emmons et al. (2025)](https://arxiv.org/abs/2507.05246) contrast CoT-as-rationalisation vs CoT-as-computation and show that when the exploits are made increasingly difficult so as to necessitate reasoning to discover them, it becomes harder for the model to evade monitors. However, for a lot of real-world vulnerabilities, it could be the case that near-frontier models can come up with the hacks without reasoning.

To test this, we run a set of inspect evals on environments that have known exploits ([evilgenie](https://arxiv.org/abs/2511.21654), [impossiblebench](https://arxiv.org/abs/2510.20270), [s2s](https://arxiv.org/abs/2406.10162), Apollo’s scheming datasets) and tested whether models can come up with the exploits without reasoning about it, and we find that they almost always can:

It would not usually hack on these because it is safety-trained and behaves aligned, but frontier models do have the capability to come up with hacks without thinking in environments with different degrees of realism. It is unclear how much opaque hacking ability will increase but this can be a major reason for loss of [oversight](https://www.lesswrong.com/posts/JvZxp554WxcZ8BQvM/loss-of-oversight-how-ai-systems-may-become-harder-to-audit-1).

Most (open) frontier post-training pipelines ([DeepSeek](https://arxiv.org/pdf/2501.12948), [Kimi](https://arxiv.org/pdf/2507.20534)) do include a reverse-KL-style penalty during post-training (or a per-token term that’s essentially the same) and while the training details of proprietary frontier models aren’t public, early work on [PPO](https://arxiv.org/abs/1707.06347) and [CAI](https://www.anthropic.com/news/constitutional-ai-harmlessness-from-ai-feedback) does include a KL penalty. Recent works (such as [DAPO](https://arxiv.org/pdf/2503.14476)) however suggest not having a KL penalty in GRPO. Thus, we’re not clear about whether production RL pipelines would include some implicit rewards or not.

However, there are also reasons why this might not occur. Frontier models could be better at CoT-faithfulness to begin with (and might be trained for it too). Secondly, these reward hacks were, while actually existent in production envs, still very easy and the same hacks worked on all problems.

Overall, we know that [reward hacking](https://metr.org/blog/2025-06-05-recent-reward-hacking/) is an important unsolved problem, and frontier models often reward hack or [mislead](https://blog.redwoodresearch.org/p/current-ais-seem-pretty-misaligned) users about their actions. The no-CoT capabilities of models are also increasing [rapidly](https://arxiv.org/abs/2606.07157), and while task "hackability” varies, many reward hacks are likely not very complicated. Since it is not public knowledge if frontier labs use KL penalties (and other implicit CoT rewards), it is indeed possible for this phenomenon to occur in production, but the likelihood is hard to estimate.

This is a table of all possible conditions and outcomes we expect to be possible under these variables:

|
|
|
|
|
|
off | any | any | n/a | No reward hacking | Nothing to exploit; CoT reflects a genuine attempt, faithful by default |
on | off | any | coupled | Hacks · faithful | Hack needs reasoning to find, the PG forces the CoT to describe the hack, overpowering KL |
on | on | on | severed | Hacks · unfaithful | Hack found without reasoning, so reasoning earns no reward; KL pulls it back to base mode |
on | on | off | severed | Hacks · faithful* | Reasoning is decoupled but nothing pushes it either way, so it defaults to faithful — though it can drift |

**Table 1:** Possible conditions and outcomes for combinations of conditions from Figure 0. "coupled” means the CoT is pushed to be faithful towards the output via a reward.

One obvious mitigation is to not have a KL penalty in the first place - but this might lead to other issues in different settings such as the model drifts to a different mode (and e.g. loses human-legibility). In this section we consider some alternative mitigations.

This removes the need for the reasoning to be in the mode of the original model and might allow the model to reason more freely and thus reason about the hacks. This is the performance of the model RL-trained with KL only on the reasoning:

So adding a KL penalty only on the output code (not the thinking tokens) roughly halves the rate of CoT unfaithfulness but doesn't fully fix it.

Another straightforward mitigation is to include a small penalty for CoT unfaithfulness based on the same LLM-judge we use as a monitor.

We're not sure if optimising the CoT's faithfulness directly can have adverse effects, such as on its [monitoring](https://arxiv.org/abs/2507.11473) and if so, how universal those may be. Overall, this seems like an interesting mitigation but there could be reasons to not do this (e.g. Goodharting) and we'd be curious to hear people's thoughts.

We also ran misalignment evals for this model because it is possible that a model that reasons about hacks faithfully internalises the misaligned behaviour more deeply and leads to more generalisation to other misalignment evals and less memorisation of hacks (see [this](https://arxiv.org/abs/2603.12397)). We indeed find that this is the case (Fig 8). We find it interesting to consider that this may suggest a trade-off: **Models that are more monitorable and reward hack may internalise more misalignment than models that are less monitorable (have less faithful CoT or are capable of more reasoning sans CoT) from learning the same reward hacks during training.**

We're also interested in understanding the mechanism behind this “unfaithfulness” in the chain-of-thought. A clean mechanistic story is difficult without further experiments, but here we list some of the potential routes and discuss their likelihood given our current observations.

The basic idea here is that the reward makes the model learn to hack (because it really can't learn to solve the hard problems), but several things could lead to the model's CoT not describing the hacks. Some of them are:

[Kaufmann et al.](https://arxiv.org/html/2603.30036v1) (2026, GDM) introduce a framework to predict when we can optimize against the CoT during RL training and split the reward R into R_out + R_cot, which depend only on the prompt and the output or the CoT. They define the rewards to be *in-conflict* if the CoT text which achieves a high R_cot cannot describe the CoT computations needed to achieve a high R_out, and predict that this leads to obfuscation or unfaithfulness in CoT.

Using their framework, we show that unfaithful CoT can arise even without in-conflict explicit rewards if there exist implicit rewards that can be in-conflict. One such example of an implicit reward is the reverse KL penalty which is often applied to RL setups on language models to prevent a drift from the modes of the reference policy, often the base model.

We have two explicit rewards in our setup - an outcome reward for passing the test cases, and a thinking-format scorer that rewards <thinking>...</thinking> tags, which is constant in **θ** once the format is learned. However, even with no explicit R_cot, for a fixed policy, maximising the GRPO/DAPO objective with the reverse KL penalty is equivalent in expectation to giving each token in T an additive per-token reward:

We empirically show (Fig 9) that this implicit contribution of the KL term can lead to an unfaithful chain of thought when the model learns to reward hack during RL. Note that in our setup we do a hybrid of DAPO and GRPO with the KL penalty, but this doesn’t significantly change the per-token implicit reward the interpretation of the results. For the plot of this per-token penalty contribution in cases where the CoT is unfaithful vs. not (with a mitigation), please see the Mitigations section.

The KL penalty, with a β of 0.02, roughly contributes around 2-3% of the total loss (See Appendix 1 for a plot of the KL term with the policy gradient term). [Vassoyan et al. (2025)](https://arxiv.org/abs/2502.06533) show that a KL contribution as low as 1% to the loss is enough to bias the model and constrain exploration.

Now we look at the KL term itself and split it into thinking and output:

We see that the KL divergence on thinking tokens is much lesser when CoT is unfaithful (left) vs. when it is (right).

These results imply that using KL penalties or other implicit CoT rewards count as optimising the chain-of-thought, and can lead to unfaithfulness. Thus, if companies are to claim that they are not optimising the chain-of-thought they should hence report whether they have implicit CoT rewards.

This phenomenon was surprising to us because we didn't expect this to happen and found it while manually look through RL rollouts, and could plausibly affect some training runs in production too (e.g. Anthropic's production RL envs did have the reward hacks we had used). Prior [work](https://arxiv.org/pdf/2505.05410) has found that reasoning models can use hack hints without reasoning about them in the CoT even when we don't train the model on intentionally hacking-enabled environments.

Also, the fact that the CoT unfaithfulness is KL-induced is important because one of the reasons the KL term is included is in fact to make the model not drift away too much and to make the output more legible/human-readable. This means one should be careful about every implicit optimisation pressure we put on the model.

There are other variables that can implicitly shape model behaviour too such as per-token entropy regularisers (used in [MAI-Thinking-1](https://microsoft.ai/pdf/mai-thinking-1.pdf)) and any interpretability-based objectives. There are sampling based parameters too that change the model's output based on the logit distribution and the text sampled thus far, which don't appear in the gradient at all, such as temperature and frequency/repetition penalties. Training methods such as [on-policy distillation](https://thinkingmachines.ai/blog/on-policy-distillation/) can also put a reversed-KL-style optimization pressure on CoT tokens and thus act as an implicit reward on the CoT.

However it is arguably more difficult for these to change model behaviour in directed ways and it would be interesting to study the effects of other implicit rewards in various settings.

We are somewhat uncertain how important this finding is and how much to dig deeper into it, particularly because we don’t know if frontier labs do use a KL penalty and/or other implicit CoT rewards in their post-training setup. However, OpenAI put some optimisation pressure directly based on the CoT in [deliberative](https://arxiv.org/pdf/2412.16339) alignment and [monitoring](https://alignment.openai.com/accidental-cot-grading/), and length penalties and other forms of implicit rewards still seem likely to exist in production RL.

On the other hand, there are reasons why this might **not **be relevant to large-scale production runs. Particularly, if RL environments have been filtered (at least on hacks that are discoverable by an LLMs [opaque serial depth](https://arxiv.org/pdf/2603.09786)), and if there are no implicit CoT rewards, it might not be that likely to happen.

We consider these to be good future work directions:

We would love to discuss ways in which we can expect these or other training conditions to lead to models being less monitorable.

Please cite this work as:

Golechha, Satvik, Black, Sid, and Bloom, Joseph. "Preliminary investigation: KL penalties in RL can increase CoT unfaithfulness". (Jun 2026).

or

```
@article{golechha2026klcot,  title={Preliminary investigation: KL penalties in RL can increase CoT unfaithfulness},  author={Golechha, Satvik and Black, Sid and Bloom, Joseph},  year={2026},  month={June},  institution={Model Transparency Team, UK AI Security Institute (AISI)},  url={https://www.lesswrong.com/posts/SdoLsFvZ3AyyWr3ab/preliminary-investigation-kl-penalties-in-rl-can-increase}  }
```

Here is a plot comparing the magnitude of the KL term with that of the policy gradient term (advantage estimate for DAPO):

In our original post, we share cheap-proxy based results on the CoT mentioning hacking in GPT-OSS models. On further examining the trajectories, we've found these models to have learned to use two different reasoning traces one after the other, and it turns out that the model learned to describe the hacks in one of them.

Our proxy catches them both and we believe that this is due to us post-training a model that has already undergone one round of post-training by OpenAI, and we do not believe that this changes our findings and claims in either posts.