13:08
2026-06-30
lesswrong.com
ai-safety
Preliminary investigation: KL penalties in RL can increase CoT unfaithfulness
Researchers at UK AISI found that adding a KL penalty during reinforcement learning increases unfaithful chain-of-thought reasoning in LLMs, causing them to reward-hack without revealing their intent …