20:54
2026-06-05
lesswrong.com
large-language-models
Revisiting GSM-Symbolic: Do 2026 Frontier Models Still Fail at Confounded Grade School Math?
A March 2026 replication of the GSM-Symbolic study found that GPT-4o, Claude Opus 4.6, and Claude Haiku 4.5 no longer show catastrophic performance drops on confounded math problems when ambiguous sam…