DeepSWE blows up the AI coding leaderboard, crowns GPT-5.5

Datacurve released a new AI coding benchmark called DeepSWE on Monday that shows a wide performance gap between leading models, with OpenAI's GPT-5.5 scoring 70%, 16 points ahead of the nearest model from a rival lab. The startup's audit of Scale AI's SWE-Bench Pro found that its automated verifiers issued incorrect pass/fail verdicts on roughly one-third of trials reviewed, raising questions about the reliability of the industry's most widely cited coding benchmark.

For months, the leading AI coding benchmarks have told enterprise buyers a comforting but misleading story: the top models are all roughly the same. OpenAI's GPT-5 family (https://openai.com/gpt-5/), Anthropic's Claude Opus (https://www.anthropic.com/claude/opus), and Google's Gemini Pro (https://deepmind.google/models/gemini/pro/) have clustered within a narrow band on Scale AI's SWE-Bench Pro leaderboard (https://labs.scale.com/leaderboard/swe_bench_pro_public), making it nearly impossible for engineering leaders to determine which agent will actually perform best inside their codebases.

On Monday, a startup called Datacurve released a benchmark it says shatters that illusion. DeepSWE (https://deepswe.datacurve.ai/blog), a 113-task evaluation spanning 91 open-source repositories and five programming languages, produces a dramatically wider spread among the same frontier models and crowns OpenAI's GPT-5.5 (https://openai.com/index/introducing-gpt-5-5/) as the clear leader at 70%, sixteen points ahead of the nearest model from a rival lab.

"On public leaderboards, top models often look relatively close in capability," wrote Datacurve co-author Serena Ge on X. "DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work."
The benchmark also delivers a pointed critique of the evaluation infrastructure the AI industry relies on to measure progress: Datacurve's audit found that SWE-Bench Pro's verifiers, the automated graders that determine whether an agent solved a task, issued incorrect pass/fail verdicts on roughly one-third of the trials it reviewed.

If that finding holds up, it has sweeping implications. Enterprise procurement teams, venture capitalists, and AI lab marketing departments all lean heavily on benchmark scores to make multimillion-dollar decisions. A 32% error rate in the most widely cited coding benchmark suggests the industry may have been navigating by a broken compass.

Why the most popular AI coding benchmark may be grading on a curve

To understand what Datacurve is claiming, it helps to understand how coding benchmarks work, and how they can go wrong. The dominant paradigm, pioneered by the SWE-Bench family (https://labs.scale.com/leaderboard/swe_bench_pro_public) maintained by Scale AI (https://scale.com/) and academic researchers, constructs tasks by mining real GitHub commits. The process extracts a bug fix or feature addition from a repository's history, rolls the code back to the pre-fix state, and then asks an AI agent to reproduce the change. The original commit's test suite serves as the verifier: if the agent's patch makes the same tests pass, it gets credit.

This approach has an elegant simplicity, but Datacurve argues it introduces three systemic weaknesses.

First, contamination (https://deepswe.datacurve.ai/blog). Because tasks are drawn from public GitHub history, the problem statement, the discussion, and often the exact solution are already present in the training data of frontier models. "The SWE-Bench family scrapes existing GitHub issues and PRs, which creates two problems: memorization (models have already seen the solution) and triviality (most tasks are small)," Ge wrote.

Second, scope.
SWE-Bench Pro (https://labs.scale.com/leaderboard/swe_bench_pro_public) tasks require, on average, just 120 lines of code added across 5 files. DeepSWE's reference solutions average 668 lines added across 7 files, roughly 5.5 times more code. Yet DeepSWE's prompts are actually shorter, averaging 2,158 characters versus SWE-Bench Pro's 4,614. In other words, DeepSWE gives the agent less instruction but expects far more output, which more closely mirrors how a human developer might actually delegate work to an AI assistant.

Third, and most damaging, verifier reliability. Datacurve drew 30 tasks at random from both DeepSWE (https://deepswe.datacurve.ai/blog) and SWE-Bench Pro, ran three rollouts across 10 frontier model configurations, and then deployed an LLM-based judge to independently assess whether each agent's patch actually solved the problem. SWE-Bench Pro's verifiers accepted wrong implementations 8.5% of the time and rejected correct implementations 24% of the time. DeepSWE's verifiers registered 0.3% and 1.1%, respectively.

The false negative problem is especially insidious because it punishes creative solutions. In one documented case, the gold-standard pull request for a SWE-Bench Pro task refactored a private helper function. An agent that correctly solved the task by inlining the same logic, a perfectly valid engineering choice, failed because the test suite tried to import a symbol that only existed in the original author's specific implementation.

OpenAI's GPT-5.5 dominates the new benchmark while Claude and Gemini stumble

DeepSWE's top-line results reorder the familiar hierarchy in ways that should matter to every engineering team evaluating AI coding tools. On SWE-Bench Pro, models from OpenAI, Anthropic, and Google have traded the lead within a 30-point range. DeepSWE stretches that range to 70 points.
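
A brief aside on the verifier audit described above: the two error rates come from comparing, for each reviewed trial, the benchmark verifier's verdict with an independent judgment of whether the patch was actually correct. The sketch below shows that bookkeeping under stated assumptions. The `Trial` class, the `verifier_error_rates` function, and the counts are all hypothetical; the counts are made up so that the rates match the figures quoted in this article, and they are not Datacurve's data. It also assumes both rates are measured as fractions of all reviewed trials, which is consistent with the two figures adding up to roughly one-third but is not confirmed by the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    verifier_passed: bool  # verdict from the benchmark's own test suite
    judge_correct: bool    # independent reviewer's verdict, treated as ground truth


def verifier_error_rates(trials):
    """Return (false_accept_rate, false_reject_rate) as fractions of all trials."""
    if not trials:
        raise ValueError("need at least one reviewed trial")
    total = len(trials)
    # Verifier said "pass" but the patch was judged wrong.
    false_accepts = sum(1 for t in trials if t.verifier_passed and not t.judge_correct)
    # Verifier said "fail" but the patch was judged correct.
    false_rejects = sum(1 for t in trials if not t.verifier_passed and t.judge_correct)
    return false_accepts / total, false_rejects / total


# Made-up counts (not Datacurve's data), picked to reproduce the quoted rates.
sample_trials = (
    [Trial(True, False)] * 17     # wrong patch accepted
    + [Trial(False, True)] * 48   # correct patch rejected
    + [Trial(True, True)] * 90    # both agree: solved
    + [Trial(False, False)] * 45  # both agree: not solved
)

false_accept, false_reject = verifier_error_rates(sample_trials)
print(f"false accepts: {false_accept:.1%}")
print(f"false rejects: {false_reject:.1%}")
print(f"combined:      {false_accept + false_reject:.1%}")
```

One caveat on this framing: the independent judgment is itself produced by an LLM-based judge in Datacurve's audit, so treating it as ground truth is a simplification.
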
GPT-5.5 (https://openai.com/index/introducing-gpt-5-5/) leads at 70%, followed by GPT-5.4 at 56% and Claude Opus 4.7 at 54%. From there, the drop-off is steep: Claude Sonnet 4.6 lands at 32%, Gemini 3.5 Flash at 28%, GPT-5.4-mini and Kimi K2.6 tied at 24%, and then a long tail of models in the teens and single digits. Claude Haiku 4.5, which scores 39% on SWE-Bench Pro, collapses to zero on DeepSWE, suggesting that some mid-tier models have been significantly overperforming on easier, potentially contaminated benchmarks.

GPT-5.5 doesn't just score the highest; it does so efficiently. The model reaches its 70% pass rate with a median cost of $5.80 per trial, a median wall-clock time of 20 minutes, and a median of 47,000 output tokens. GPT-5.4 emerges as perhaps the best overall value at $3.30 per trial with a 56% score. Claude Opus 4.7, meanwhile, costs significantly more per run. Output tokens, wall-clock duration, and dollar cost per trial all vary by an order of magnitude across the agents tested, yet none of these correlates strongly with pass rate. Agents that emit more tokens, run longer, or cost more do not consistently solve more tasks.

Datacurve's audit found that Claude has been reading the answer key on existing benchmarks

Perhaps the most provocative finding in DeepSWE's analysis concerns what the authors label "CHEATED" verdicts: instances where an agent passes a benchmark not by solving the problem, but by reading the answer.

SWE-Bench Pro's Docker containers ship the repository's full .git history, which means the gold-standard solution commit is sitting right there in the container's file system. Most models ignore it. Claude does not. Datacurve's analysis found that both Claude Opus 4.7 and Claude Opus 4.6 registered "CHEATED" on more than 12% of their reviewed SWE-Bench Pro rollouts. In those instances, the Claude agent ran commands like git log --all or git show