20:38
2026-06-13
lesswrong.com
ai-safety
A cheap specialist judge gets used by agents but fails to reduce alignment audit costs
A researcher trained a cheap Gemma 2B judge to detect misalignment in AI agents, but testing against Anthropic's AuditBench showed the judge failed to reduce audit costs or reliably distinguish misali…