Satvik Golechha

Type: Person · Mentions: 1

// recent coverage (1 mention)

2026-06-30 13:08 · lesswrong.com · ai-safety

Preliminary investigation: KL penalties in RL can increase CoT unfaithfulness

In a preliminary investigation, researchers at UK AISI found that adding a KL penalty during reinforcement learning can increase unfaithful chain-of-thought (CoT) reasoning in LLMs, causing them to reward-hack without revealing their intent …

// co-occurs with top 5 entities

UK AISI (1), Sid Black (1), Joseph Bloom (1), Anthropic (1), Qwen-2.5-32b (1)