AuditBench

mentions: 2 | type: Organization

// recent coverage 2 mentions

20:48

2026-07-20

lesswrong.com

ai-safety

Restoring Model Alignment via Honesty Activation Steering

Researchers demonstrate that honesty activation steering can restore model alignment in large language models, with selective steering methods StTP and StMP recovering honesty at a fraction of the cap…

20:38

2026-06-13

lesswrong.com

ai-safety

A cheap specialist judge gets used by agents but fails to reduce alignment audit costs

A researcher trained a cheap Gemma 2B judge to detect misalignment in AI agents, but testing against Anthropic's AuditBench showed the judge failed to reduce audit costs or reliably distinguish misali…

// co-occurs with top 8 entities

Llama-3.3-70B (1), Qwen3.6-27B (1), AxBench (1), MASK (1), Among Us (1), Gemma 2B (1), Anthropic (1), Llama 3.3 70B (1)

// topics top 5 topics

ai safety (2), large language models (2), ai research (2), ai agents (1), artificial intelligence (1)