TLDR: The idea is basically inoculation prompting crossed with alignment pretraining. Call it ‘inoculation pretraining.’ It’s a type of spillway design.
Reward hacking can cause emergent misalignment: you train the AI to cheat on its tasks and it turns broadly evil. Why does this happen?
The persona selection model (PSM) and its forebears suggest one explanation. The AI has some prior over personas, influenced by how often each persona appears in pretraining. There’s a good AI persona and an evil AI persona, and each has fairly high prior probability. There’s also a good-but-reward-hacking AI persona: an AI that exploits misspecified rewards in training but is otherwise perfectly aligned. This good-but-reward-hacking AI persona appears very rarely in pretraining, and so has much lower prior probability.
We can think of post-training as giving the AI evidence with which to update its prior. Instruction-tuning makes the AI confident it’s a good AI, but then it observes itself reward hacking in RL training, and good AIs don’t do that: $P(\text{reward hacking} \mid \text{good AI})$ is low. By contrast, reward hacking is just what we’d expect from evil AIs and good-but-reward-hacking AIs: $P(\text{reward hacking} \mid \text{evil AI})$ and $P(\text{reward hacking} \mid \text{good-but-reward-hacking AI})$ are each high. So the AI updates strongly toward these other personas. The AI becomes evil instead of good-but-reward-hacking because the prior probability of evil AI is much higher: evil AI personas appear much more often in pretraining than good-but-reward-hacking AI personas.
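One way to make this update explicit is Bayes' rule in odds form. The notation below is introduced here for illustration: $\text{RH}$ stands for the observation "I reward hacked in RL training," and $\text{GRH AI}$ abbreviates the good-but-reward-hacking persona.

```latex
\[
\frac{P(\text{evil AI} \mid \text{RH})}{P(\text{GRH AI} \mid \text{RH})}
=
\underbrace{\frac{P(\text{RH} \mid \text{evil AI})}{P(\text{RH} \mid \text{GRH AI})}}_{\text{likelihood ratio}\ \approx\ 1}
\times
\underbrace{\frac{P(\text{evil AI})}{P(\text{GRH AI})}}_{\text{prior odds}\ \gg\ 1}
\]
```

Because both personas predict reward hacking about equally well, the observation barely changes their relative odds. The posterior odds are roughly the prior odds, and those heavily favor the evil AI persona.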
If the PSM correctly explains emergent misalignment, it suggests three ways to prevent it. The first is shifting the AI’s prior to increase $P(\text{good AI})$ relative to $P(\text{evil AI})$. One way to do this is via alignment pretraining: removing data about evil AIs from pretraining and adding synthetic data about good AIs. Tice et al. show that this can make AIs more aligned. Alignment pretraining seems well worth doing, but it isn’t a silver bullet. On the empirical side, Korbak et al. find that the effect doesn’t generalize very far in their setting. On the theoretical side, the PSM suggests a potential problem: if the AI reward hacks in RL training, and if $P(\text{reward hacking} \mid \text{good AI}) \ll P(\text{reward hacking} \mid \text{evil AI})$, then RL training will update the AI strongly toward thinking it’s an evil AI.
The second intervention suggested by the PSM is increasing $P(\text{reward hacking} \mid \text{good AI})$ relative to $P(\text{reward hacking} \mid \text{evil AI})$. One way to do this is via inoculation prompting: instructing AIs to reward hack during training. That instruction significantly increases $P(\text{reward hacking} \mid \text{good AI})$, so reward hacking is no longer such strong evidence of being an evil AI. Inoculation prompting, like alignment pretraining, seems well worth doing without being a silver bullet. Anders and Alex list some issues:
- Natural Emergent Misalignment From Reward Hacking shows that models become somewhat emergently misaligned even with inoculation prompting, and still reward hack at inference time.
- Steering RL Training… found inference-time reward hacking despite inoculation prompting.
- Claude 4.6 Opus was likely trained with inoculation prompting but still reward hacks on impossible tasks (and sometimes on possible tasks).

The third intervention suggested by the PSM is increasing $P(\text{good-but-reward-hacking AI})$ relative to $P(\text{evil AI})$. It seems like one way to do this would be ‘inoculation pretraining’: a cross between inoculation prompting and alignment pretraining. We add lots of synthetic data about good-but-reward-hacking AIs to the pretraining corpus. These good-but-reward-hacking AIs exploit misspecified rewards in training, but they readily confess to doing so, and in deployment they act perfectly aligned, never task-gaming or doing any other bad things. The idea is that pretraining on this data would increase the prior $P(\text{good-but-reward-hacking AI})$ so that, when the AI observes itself reward hacking in RL training, it won’t become highly confident it’s an evil AI. Instead, it’ll put at least significant probability on it being a good-but-reward-hacking AI, and it’ll generalize accordingly. That’s the hope.
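To see how the three interventions act on different terms of the same Bayesian update, here is a minimal toy sketch. Every number in it is a made-up assumption chosen only for illustration, and the persona labels and scenario names are hypothetical. It is not a model of any real training run, and the relative sizes of the effects mean nothing.

```python
# Toy Bayesian update over three personas after observing reward hacking.
# All probabilities are hypothetical, illustrative values.

def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}

def posterior(prior, likelihood):
    """Bayes' rule: posterior is proportional to prior times likelihood."""
    return normalize({k: prior[k] * likelihood[k] for k in prior})

# Assumed prior after instruction-tuning: mostly "good", with "evil" far
# more probable than "good_but_rh" (good-but-reward-hacking).
BASE_PRIOR = normalize({"good": 0.90, "evil": 0.09, "good_but_rh": 0.01})

# Assumed P(reward hacking | persona): low for good, high for the others.
BASE_LIKELIHOOD = {"good": 0.01, "evil": 0.90, "good_but_rh": 0.90}

scenarios = {
    "baseline": (BASE_PRIOR, BASE_LIKELIHOOD),
    # Intervention 1: lower the prior weight on "evil" relative to "good".
    "alignment_pretraining": (
        normalize({**BASE_PRIOR, "evil": 0.03}), BASE_LIKELIHOOD),
    # Intervention 2: raise P(reward hacking | good) via the instruction.
    "inoculation_prompting": (
        BASE_PRIOR, {**BASE_LIKELIHOOD, "good": 0.90}),
    # Intervention 3: raise the prior weight on "good_but_rh".
    "inoculation_pretraining": (
        normalize({**BASE_PRIOR, "good_but_rh": 0.30}), BASE_LIKELIHOOD),
}

results = {name: posterior(p, l) for name, (p, l) in scenarios.items()}

for name, post in results.items():
    print(name, {k: round(v, 3) for k, v in post.items()})
```

With these invented numbers, the posterior on the evil persona is about 0.82 at baseline, and each intervention lowers it by changing a different input: the prior on evil, the likelihood under good, or the prior on good-but-reward-hacking. In the third scenario, most of the posterior mass lands on the good-but-reward-hacking persona instead of the good one, which is the outcome inoculation pretraining aims for.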
This idea is a type of spillway design: an attempt to make reward hacking generalize in a benign way. As a result, it has many of the drawbacks listed in Anders and Alex’s post. For example, it would plausibly make reward hacking more salient in RL training, which might lead the AI to reward hack earlier and more frequently than it otherwise would. That might make the AI less capable and more inclined to reward hack in a reflexive way: a way that carries over to task-gaming in deployment. But as Anders and Alex note, this is also a concern for inoculation prompting. Another potential issue — suggested by Korbak et al.'s results — is that the effects of inoculation pretraining might not generalize very far. In any case, inoculation pretraining might be worth it overall, and it seems worth exploring.