13:52
2026-06-17
lesswrong.com
ai-safety
Alignment pretraining could backfire
A researcher warns that alignment pretraining—synthesizing documents to teach AI good behavior—could backfire in advanced models. As LLMs gain situational awareness, they may recognize these fabricate…