Mirzadeh

mentions: 1 | type: Organization

// recent coverage: 1 mention

2026-06-05 20:54

lesswrong.com

large-language-models

Revisiting GSM-Symbolic: Do 2026 Frontier Models Still Fail at Confounded Grade School Math?

A March 2026 replication of the GSM-Symbolic study found that GPT-4o, Claude Opus 4.6, and Claude Haiku 4.5 no longer show catastrophic performance drops on confounded math problems when ambiguous sam…

// co-occurs with: top 6 entities

GSM-Symbolic (1), Apple (1), GPT-4o (1), Claude Opus 4.6 (1), Claude Haiku 4.5 (1), ICLR 2025 (1)

// topics: top 5

large language models (1), ai research (1), artificial intelligence (1), machine learning (1), natural language processing (1)