GPQA-diamond

mentions: 1 | type: Organization

// recent coverage: 1 mention

09:32

2026-07-03

lesswrong.com

large-language-models

Fragile Correctness: Cases of reasoning harming performance

A new study reveals that increased reasoning in AI models can sometimes reduce accuracy, a phenomenon termed 'fragile correctness.' Researchers found that 14.9% of answers switched from correct to inc…

// co-occurs with top 7 entities

Opus 4.8 (1), Anthropic (1), Fable and Mythos (1), Ghosal et al (1), Ballon et al (1), MMLU-pro (1), Gemma 4-12b-it (1)