02:22
2026-06-26
lesswrong.com
ai-safety
Research note on negated reward hacking
Researchers at BlueDot's Technical AI Safety Project Sprint found that fine-tuning language models on negated documents can still teach them reward-hacking knowledge, leading to emergent misalignment …