Revisiting GSM-Symbolic: Do 2026 Frontier Models Still Fail at Confounded Grade School Math?

A March 2026 replication of the GSM-Symbolic study found that GPT-4o, Claude Opus 4.6, and Claude Haiku 4.5 no longer show catastrophic performance drops on confounded math problems once ambiguous samples are removed. The original 2024 paper claimed that language models rely on pattern matching, but in the new results performance declines by only 0-2 percentage points, suggesting that frontier models now make reasonable judgments about potentially relevant data. The finding challenges a widely cited claim about AI reasoning limitations, one that was still being shared as recently as last week.

The GSM-Symbolic paper (https://arxiv.org/abs/2410.05229, ICLR 2025) purported to show that language models rely on pattern matching rather than genuine reasoning, by demonstrating that perturbing questions so that they break the pattern of the original question would catastrophically reduce model performance. Rerunning the experiment in March 2026 with GPT-4o, Claude Opus 4.6, and Claude Haiku 4.5, we replicate the original findings only when we do not audit out examples that may actually be ambiguous for the model. When we carefully remove samples in which the added clause may genuinely affect the calculation, the effect is drastically reduced. This suggests that the massive drop in performance is simply the models making a reasonable judgement to act on the “irrelevant” added data because it might be important.

This result likely comes as little surprise, but I have seen this paper and its results shared triumphantly as recently as last week, as though they still applied to current models. This frustrated me enough that I wanted to see exactly how far models have come, in the hope of setting the record a little straighter. I also haven’t seen any follow-up work explicitly demonstrating that the original result no longer holds.

In October 2024, Mirzadeh et al. from Apple published “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models.” The paper made three claims:

The paper was published at ICLR 2025 and continues to be commonly referenced and discussed. However, the models it evaluated (GPT-4o, Llama 3 8B, Phi-3, Gemma 2, etc.) are now around 18 months old. We wanted to test whether the results still hold.

For context, here are selected results from the original paper:

The original results were striking. Even GPT-4o, the strongest model tested, dropped from 94.9% to 63.0% when irrelevant confounders were added.

We followed the original paper’s evaluation protocol as closely as possible:

We found that this filtering step was genuinely challenging: most of the samples generated by Claude Opus 4.6 had to be filtered out, even though the model had been specifically prompted to avoid ambiguous samples.

Examples of distractors that passed our audit:

Examples of distractors that were filtered out as ambiguous:

We evaluated GPT-4o, Claude Opus 4.6, and Claude Haiku 4.5, capturing the best performer from the original paper, the current frontier model, and a small but reasonably strong reasoning model. GPT-4o is gpt-4o-2024-08-06, the snapshot that was current when the paper was published. The two auditors disagree on borderline cases (κ = 0.32; 12% vs 36% kept), but every audited drop is statistically indistinguishable from zero.

Our NoOp data was LLM-generated. Apple did not release the original GSM-NoOp dataset. We generated our distractor clauses using Claude Opus 4.6 and had GPT-5.5 audit them for relevance. These may be systematically easier than, or otherwise different from, Apple’s hand-crafted distractors, so we cannot make a direct apples-to-apples comparison on NoOp. However, the fact that our unfiltered dataset exhibits almost exactly the same drop as the original paper is reasonable evidence that the distribution of our data was very similar to theirs. The original paper does not say whether the data was generated by hand or by an LLM, or whether any auditing or quality controls were applied. Because the data was not shared, we cannot confirm either way. What we do observe is that, on our dataset that aims to contain exclusively irrelevant but potentially misleading distractors, the models’ scores decrease only slightly (0-2 percentage points, none statistically distinguishable from zero).

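
As a point of reference for the auditor agreement figure quoted above, here is a minimal sketch of how an agreement statistic of this kind can be computed, assuming κ refers to Cohen's kappa for two raters (the post does not say which variant was used). The function name `cohens_kappa` and the toy labels are hypothetical and are not the study's audit data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items on which the raters agree.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)

# Toy labels, invented for illustration only (NOT the study's audit data).
auditor_1 = ["keep", "drop", "drop", "drop", "keep",
             "drop", "drop", "drop", "drop", "drop"]
auditor_2 = ["keep", "keep", "drop", "keep", "keep",
             "drop", "drop", "keep", "drop", "drop"]
kappa = cohens_kappa(auditor_1, auditor_2)
print(f"kappa = {kappa:.2f}")
```

In this toy example one auditor keeps far fewer samples than the other, which caps how high kappa can go even when the stricter auditor's kept set is contained in the other's. That is one way two auditors can agree on clear cases yet still produce a low κ.
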
The fact that the models so closely track the ambiguity classification indicates that they are responding specifically to the content of the distractors and not simply to their presence, since they handle the non-ambiguous cases without a problem. From the model’s perspective, the instruction is simply not clear, and the test doesn’t tell us about a failure to reason.

We used 1,000 questions instead of the full 5,000, using 100 templates with 10 instances each. Because the 10 instances of a template are number-swapped variants of one underlying problem, we report template-clustered bootstrap intervals, resampling the 100 templates rather than treating the 1,000 rows as independent.

This doesn’t mean LLMs can “truly reason.” GSM-Symbolic tests grade-school math. Saturating this benchmark tells us these models handle elementary arithmetic robustly, not that they can do advanced mathematics or formal reasoning. Harder benchmarks (FrontierMath, ARC-AGI-2, Humanity’s Last Exam) exist to test true reasoning abilities in models.

The GSM-Symbolic paper made valid and important observations about the models available in mid-2024. Those observations no longer describe the current frontier. The original claims do not seem defensible given that a quality audit of the distractors erases the effect almost completely. The results indicate that the model is simply making a reasonable inference that the confounders added to the NoOp dataset are real signals to adjust its calculations. Why else would the questioner mention that “4 out of the 20 apricots on a given tree are unusually small”? If the questioner didn’t intend for them to be excluded from the calculation, then why mention it?

We believe that, given the reach of this paper in shaping the discourse, the NoOp dataset should be shared to allow true independent replication of the original findings, so that the claims that models cannot reason can be checked fully.

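
The template-clustered bootstrap mentioned above can be sketched roughly as follows. This is a minimal illustration and not the code from the linked repository: it assumes paired per-question 0/1 correctness on the base and NoOp versions and uses a simple percentile interval. The function name `clustered_bootstrap_drop`, its parameters, and the synthetic demo data are all hypothetical.

```python
import random

def clustered_bootstrap_drop(base_correct, noop_correct, template_ids,
                             n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for the accuracy drop (base minus NoOp),
    resampling whole templates instead of individual questions."""
    # Group the paired (base, noop) outcomes by template.
    clusters = {}
    for b, n, t in zip(base_correct, noop_correct, template_ids):
        clusters.setdefault(t, []).append((b, n))
    groups = list(clusters.values())
    rng = random.Random(seed)

    def drop(sample):
        rows = [row for group in sample for row in group]
        base_acc = sum(b for b, _ in rows) / len(rows)
        noop_acc = sum(n for _, n in rows) / len(rows)
        return base_acc - noop_acc

    point = drop(groups)
    # Each resample draws len(groups) templates with replacement.
    stats = sorted(drop(rng.choices(groups, k=len(groups)))
                   for _ in range(n_resamples))
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return point, lo, hi

# Synthetic demo data, invented for illustration (NOT the study's results):
# 100 templates with 10 number-swapped instances each.
demo_rng = random.Random(1)
template_ids = [t for t in range(100) for _ in range(10)]
base = [int(demo_rng.random() < 0.95) for _ in template_ids]
noop = [int(demo_rng.random() < 0.94) for _ in template_ids]
point, lo, hi = clustered_bootstrap_drop(base, noop, template_ids)
print(f"drop = {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Resampling at the template level keeps all instances of a template together, so correlated errors within a template widen the interval instead of being counted as independent evidence. An interval that contains zero corresponds to a drop that is not statistically distinguishable from zero.
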
Given that some researchers still cite this work, the replication is all the more valuable.

Code: github.com/BenSturgeon/gsm-symbolic-revisited · Data: huggingface.co/datasets/Experimental-Orange/gsm-noop-audited