Golechha

mentions: 1 | type: Organization

// recent coverage: 1 mention

02:22

2026-06-26

lesswrong.com

ai-safety

Research note on negated reward hacking

Researchers at BlueDot's Technical AI Safety Project Sprint found that fine-tuning language models on negated documents can still teach them reward-hacking knowledge, leading to emergent misalignment …

// co-occurs with top 7 entities

BlueDot (1), Anthropic (1), UK AISI (1), GPT-5.4-nano (1), GitHub (1), HuggingFace (1), MacDiarmid (1)