Maybe we should pretrain on synthetic data about good-but-reward-hacking AIs

Source: lesswrong.com · Published: 2026-05-29T14:50Z · Topic: ai-safety

Researchers propose "inoculation pretraining," a method to prevent emergent misalignment in AI systems by adding synthetic training data about good-but-reward-hacking AI personas. The approach aims to increase the prior probability of these beneficial but rule-bending personas relative to evil AI personas, addressing a flaw in current training where reward hacking causes models to update toward malicious behavior. This technique combines alignment pretraining with inoculation prompting to create a spillway that channels reward hacking into harmless outcomes rather than broad misalignment.

TLDR: The idea is basically inoculation prompting crossed with alignment pretraining. Call it ‘inoculation pretraining.’ It’s a type of spillway design.

Reward hacking can cause emergent misalignment: you train the AI to cheat on its tasks and it turns broadly evil. Why does this happen?

The persona selection model (PSM) and its forebears suggest one explanation. The AI has some prior over personas, influenced by how often each persona appears in pretraining. There’s a good AI persona and an evil AI persona, and each has fairly high prior probability. There’s also a good-but-reward-hacking AI persona: an AI that exploits misspecified rewards in training but is otherwise perfectly aligned. This good-but-reward-hacking AI persona appears very rarely in pretraining, and so has much lower prior probability.

We can think of post-training as giving the AI evidence with which to update its prior. Instruction-tuning makes the AI confident it’s a good AI, but then it observes itself reward hacking in RL training, and good AIs don’t do that: $P(\text{reward hack} \mid \text{good AI})$ is low. By contrast, reward hacking is just what we’d expect from evil AIs and good-but-reward-hacking AIs: $P(\text{reward hack} \mid \text{evil AI})$ and $P(\text{reward hack} \mid \text{good-but-reward-hacking AI})$ are each high. So the AI updates strongly toward these other personas. The AI becomes evil instead of good-but-reward-hacking because the prior probability of evil AI is much higher: evil AI personas appear much more often in pretraining than good-but-reward-hacking AI personas.
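The update described above can be sketched as a single application of Bayes' rule over three personas. This is a toy illustration only: the persona labels, the `posterior` helper, and every number below are assumptions chosen to match the qualitative story (a low likelihood of reward hacking for the good persona, high likelihoods for the other two, and a very small prior on the good-but-reward-hacking persona). None of the values come from the post.

```python
# Toy Bayesian update over personas. All numbers are hypothetical.

def posterior(prior, likelihood):
    """Apply Bayes' rule over a discrete set of personas."""
    unnormalized = {p: prior[p] * likelihood[p] for p in prior}
    total = sum(unnormalized.values())
    return {p: v / total for p, v in unnormalized.items()}

# Assumed belief after instruction-tuning, before any reward hacking is observed.
prior = {"good": 0.90, "evil": 0.09, "good_but_reward_hacking": 0.01}

# Assumed P(reward hack | persona): low for good, high for the other two.
likelihood = {"good": 0.01, "evil": 0.90, "good_but_reward_hacking": 0.90}

# Belief after the AI observes itself reward hacking once.
post = posterior(prior, likelihood)
for persona, p in post.items():
    print(f"{persona:>24}: {p:.3f}")
```

With these made-up numbers, most of the probability mass moves from the good persona to the evil one, and the good-but-reward-hacking persona gains little because its prior started so low. The exact figures are not meaningful; only the direction of the update is.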

If the PSM correctly explains emergent misalignment, it suggests three ways to prevent it. The first is shifting the AI’s prior to increase $P(\text{good AI})$ relative to $P(\text{evil AI})$. One way to do this is via alignment pretraining: removing data about evil AIs from pretraining and adding synthetic data about good AIs. Tice et al. show that this can make AIs more aligned. Alignment pretraining seems well worth doing, but it isn’t a silver bullet. On the empirical side, Korbak et al. find that the effect doesn’t generalize very far in their setting. On the theoretical side, the PSM suggests a potential problem: if the AI reward hacks in RL training, and if $P(\text{reward hack} \mid \text{good AI}) \ll P(\text{reward hack} \mid \text{evil AI})$, then RL training will update the AI strongly toward thinking it’s an evil AI.

The second intervention suggested by the PSM is increasing $P(\text{reward hack} \mid \text{good AI})$ relative to $P(\text{reward hack} \mid \text{evil AI})$. One way to do this is via inoculation prompting: instructing AIs to reward hack during training. That instruction significantly increases $P(\text{reward hack} \mid \text{good AI})$, so reward hacking is no longer such strong evidence of being an evil AI. Inoculation prompting, like alignment pretraining, seems well worth doing without being a silver bullet. Anders and Alex list some issues:

- Natural Emergent Misalignment From Reward Hacking shows that models become somewhat emergently misaligned even with inoculation prompting, and still reward hack at inference time.
- Steering RL Training… found inference-time reward hacking despite inoculation prompting.
- Claude 4.6 Opus was likely trained with inoculation prompting but still reward hacks on impossible tasks (and sometimes on possible tasks).

The third intervention suggested by the PSM is increasing $P(\text{good-but-reward-hacking AI})$ relative to $P(\text{evil AI})$. It seems like one way to do this would be ‘inoculation pretraining’: a cross between inoculation prompting and alignment pretraining. What we do is add lots of synthetic data about good-but-reward-hacking AIs to the pretraining corpus. These good-but-reward-hacking AIs exploit misspecified rewards in training, but they readily confess to doing so, and in deployment they act perfectly aligned, never task-gaming or doing any other bad things. The idea is that pretraining on this data would increase the prior $P(\text{good-but-reward-hacking AI})$ so that, when the AI observes itself reward hacking in RL training, it won’t become highly confident it’s an evil AI. Instead, it’ll put at least significant probability on it being a good-but-reward-hacking AI, and it’ll generalize accordingly. That’s the hope.
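To make the three interventions concrete, here is a toy comparison of how each one changes the posterior probability of the evil persona after one observation of reward hacking. It is a sketch under stated assumptions, not a model from the post: the persona labels, the helper functions, the baseline numbers, and the size of each intervention's effect are all hypothetical.

```python
# Toy comparison of the three PSM-suggested interventions.
# Every number here is a hypothetical illustration, not an estimate.

GOOD, EVIL, GBRH = "good", "evil", "good_but_reward_hacking"

def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}

def posterior(prior, likelihood):
    """Bayes' rule: P(persona | reward hack) from prior and likelihood."""
    return normalize({p: prior[p] * likelihood[p] for p in prior})

# Assumed baseline belief and assumed P(reward hack | persona).
base_prior = {GOOD: 0.90, EVIL: 0.09, GBRH: 0.01}
base_lik = {GOOD: 0.01, EVIL: 0.90, GBRH: 0.90}

scenarios = {
    "baseline": (base_prior, base_lik),
    # 1. Alignment pretraining: lower the prior weight on the evil persona.
    "alignment pretraining": (normalize({**base_prior, EVIL: 0.03}), base_lik),
    # 2. Inoculation prompting: raise P(reward hack | good AI).
    "inoculation prompting": (base_prior, {**base_lik, GOOD: 0.50}),
    # 3. Inoculation pretraining: raise the prior weight on the
    #    good-but-reward-hacking persona.
    "inoculation pretraining": (normalize({**base_prior, GBRH: 0.10}), base_lik),
}

results = {name: posterior(prior, lik) for name, (prior, lik) in scenarios.items()}
for name, post in results.items():
    print(f"{name:>24}: P(evil | reward hack) = {post[EVIL]:.3f}")
```

In this toy setup each intervention lowers the posterior on the evil persona relative to the baseline, and inoculation pretraining does so by moving mass to the good-but-reward-hacking persona. How the interventions rank against each other depends entirely on the made-up effect sizes, so the output says nothing about which one would work best in practice.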

This idea is a type of spillway design: an attempt to make reward hacking generalize in a benign way. As a result, it has many of the drawbacks listed in Anders and Alex’s post. For example, it would plausibly make reward hacking more salient in RL training, which might lead the AI to reward hack earlier and more frequently than it otherwise would. That might make the AI less capable and more inclined to reward hack in a reflexive way: a way that carries over to task-gaming in deployment. But as Anders and Alex note, this is also a concern for inoculation prompting. Another potential issue — suggested by Korbak et al.'s results — is that the effects of inoculation pretraining might not generalize very far. In any case, inoculation pretraining might be worth it overall, and it seems worth exploring.

Read original on lesswrong.com → www.lesswrong.com/posts/HEbp5xHgfaJ8eRAqz/maybe-…
