[ARTICLE · art-37648] src=byteiota.com ↗ pub= topic=ai-research verified=true sentiment=↓ negative

SWE-bench Pro: How to Read the Coding Agent Leaderboard

OpenAI abandoned SWE-bench Verified on February 23, 2026, after finding 59.4% of its hardest failed tests were broken and training data contamination inflated scores. Its replacement, SWE-bench Pro from Scale AI, uses 1,865 tasks across multiple languages with contamination-resistant commercial and held-out sets, causing model scores to drop 19-26 percentage points. The scaffold matters more than the model, with different frameworks producing scores ranging from 42% to 78% on the same weights.

Published Jun 24, 2026 · 4 min read

The benchmark everyone cited to prove their AI coding agent was best has been quietly invalidated. SWE-bench Verified — the 500-task Python leaderboard that every vendor quote-mined for marketing — was abandoned by OpenAI on February 23, 2026, after its Frontier Evals team found that 59.4% of the hardest failed tests were themselves broken. Add training data contamination across every frontier model, and those sub-90% scores were partly memorization dressed up as capability. SWE-bench Pro is what replaced it. If you haven’t recalibrated how you read the numbers, you’re still making tool decisions on bad data.

What Went Wrong With SWE-bench Verified

SWE-bench Verified was a curated set of 500 real GitHub issues, all Python, human-validated. It made sense as a benchmark in 2024. By late 2025 it had two fatal problems.

The first was test quality. When OpenAI audited the hardest failures, 49 tests were too narrow — rejecting correct solutions for arbitrary reasons. Another 26 were too wide — accepting wrong solutions that happened to pass. More than half the benchmark’s hardest problems were measuring model alignment with broken assumptions, not actual coding ability.

The second was contamination. Any frontier model trained on GitHub data after June 2024 had likely seen the 500 Verified problems — including their solutions. The benchmark’s small Python-only scope made it trivially easy to overfit. Models weren’t solving issues from scratch; they were partially recalling answers they’d encountered during training. OpenAI’s recommendation when it walked away: shift to SWE-bench Pro.

What SWE-bench Pro Actually Is

SWE-bench Pro is Scale AI’s replacement benchmark, and it’s a substantially harder target. The dataset contains 1,865 tasks drawn from 41 actively maintained repositories spanning Python, Go, TypeScript, and JavaScript. It’s partitioned into three sets: a public set (731 tasks from 11 open-source repos), a commercial set (276 tasks from 18 proprietary repos), and a held-out set (858 tasks that no vendor has seen before evaluation).
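The split sizes above can be sanity-checked with a few lines of Python. This is a minimal sketch using only the figures quoted in this article. The held-out repository count is not stated anywhere in the text; it is inferred here on the assumption that no repository appears in more than one split.

```python
# Figures as quoted in the article; nothing here comes from Scale AI's data files.
SPLIT_TASKS = {"public": 731, "commercial": 276, "held_out": 858}
SPLIT_REPOS = {"public": 11, "commercial": 18}  # held-out repo count is not given
TOTAL_TASKS = 1865
TOTAL_REPOS = 41


def split_shares(tasks: dict) -> dict:
    """Return each split's share of all tasks, in percent, rounded to 0.1."""
    total = sum(tasks.values())
    return {name: round(100 * count / total, 1) for name, count in tasks.items()}


# The three splits should add up to the headline task count.
tasks_add_up = sum(SPLIT_TASKS.values()) == TOTAL_TASKS

# ASSUMPTION: repositories do not overlap between splits, so the remainder
# of the 41 repositories belongs to the held-out set.
implied_held_out_repos = TOTAL_REPOS - sum(SPLIT_REPOS.values())

shares = split_shares(SPLIT_TASKS)
print(tasks_add_up, implied_held_out_repos, shares)
```

On these numbers the public split, the one most often quoted, accounts for under 40% of the benchmark's tasks.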

The contamination-resistance comes from the commercial and held-out sets. The 18 commercial repositories are licensed specifically to prevent their inclusion in training data — there’s a legal deterrent, not just a social norm. Every task requires at least 10 lines of changed code; over 100 tasks require 100+ line modifications across multiple files. These are long-horizon tasks that take a professional engineer hours to days, not the quick single-file patches that SWE-bench Verified favored.

The 20-Point Drop That Exposes the Old Numbers

Here’s the clearest sign that the old leaderboard was inflated: the reported drop moving from Verified to Pro is 19 to 26 percentage points per model. Claude Opus 4.5, for example, scored 80.9% on SWE-bench Verified. On SWE-bench Pro’s public leaderboard, using Scale’s standardized scaffold on tasks it couldn’t have seen in training, the same model scores 45.9%, a 35-point drop that is steeper than the quoted range. The average drop across all tested models is 23 points.

That 23-point gap isn’t a model getting worse. It’s the contamination premium being stripped away. When we previously covered GLM-5.2 beating GPT-5.5 on SWE-bench, those scores were Verified numbers — a comparison that is now known to be unreliable.
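The drop discussed in this section is a difference in percentage points, not a relative percentage, and the two are easy to confuse. A minimal sketch follows, using the only Verified/Pro pair the article gives (Claude Opus 4.5). The function names are illustrative and not part of any benchmark tooling.

```python
def point_drop(verified_pct: float, pro_pct: float) -> float:
    """Absolute drop in percentage points from Verified to Pro."""
    return round(verified_pct - pro_pct, 1)


def relative_drop(verified_pct: float, pro_pct: float) -> float:
    """Drop expressed as a percentage of the Verified score."""
    return round(100 * (verified_pct - pro_pct) / verified_pct, 1)


# Scores quoted in the article for Claude Opus 4.5.
opus_45_points = point_drop(80.9, 45.9)
opus_45_relative = relative_drop(80.9, 45.9)

# Check the example against the 19-to-26-point range quoted in the text.
in_quoted_range = 19 <= opus_45_points <= 26
print(opus_45_points, opus_45_relative, in_quoted_range)
```

On the article's own figures this gives a 35.0-point drop, which falls outside the quoted 19 to 26 band, so the Opus 4.5 example and the range may come from different scaffolds or splits.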

The Scaffolding Variable Nobody Mentions

The single most important thing most coverage of SWE-bench Pro misses: the scaffold matters more than the model.

The same LLM weights, run through different agent frameworks, produce scores ranging from 42% to 78% on coding benchmarks. Swapping between the six best frontier models moves the score by less than one percentage point. In February 2026, three different frameworks running identical model weights scored 17 tasks apart on 731 problems. The scaffold — the prompting strategy, tool selection, retrieval system, and iteration loop — is doing most of the work.

This matters because most published SWE-bench Pro scores don’t disclose the scaffold used. A vendor-reported score using the vendor’s own optimized harness is not comparable to Scale’s standardized SEAL leaderboard score. The June 2026 active leaders illustrate this: Claude Opus 4.8 reports 69.2% on its own scaffold; the Scale SEAL standardized score for the leading available model is 59.1%. That’s a 10-point gap from methodology alone. Tools like OpenCode, which leads on GitHub stars precisely because of its scaffolding approach, underscore how much the harness matters.
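To compare the scaffold effects quoted in this section, it helps to put them all in percentage points. The sketch below does that with the article's figures only. Note the assumption flagged in the comments: the article does not say that the 69.2% and 59.1% scores belong to the same model, so that gap may mix model and scaffold differences.

```python
def tasks_to_points(task_gap: int, total_tasks: int) -> float:
    """Convert a gap measured in solved tasks into percentage points."""
    return round(100 * task_gap / total_tasks, 2)


# Three frameworks, identical weights, 17 tasks apart on the 731-task public set.
framework_gap_points = tasks_to_points(17, 731)

# Widest scaffold range quoted for the same weights (42% to 78%).
scaffold_range_points = 78 - 42

# Vendor-reported vs. Scale SEAL score.
# ASSUMPTION: treated here as a methodology gap, as the article does; the two
# scores are not confirmed to come from the same model.
vendor_vs_seal_points = round(69.2 - 59.1, 1)

print(framework_gap_points, scaffold_range_points, vendor_vs_seal_points)
```

Every one of these gaps is larger than the under-one-point difference the article attributes to swapping between the six best frontier models.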

How to Actually Read the Leaderboard

Three questions to ask before trusting any published score:

| Question | Why It Matters |
| --- | --- |
| Which split? (Public / Commercial / Held-out) | Held-out is hardest and most honest. Public is easiest to game and most commonly quoted. |
| Whose scaffold? | Vendor-reported = inflated. Scale SEAL standard = more comparable across models. |
| Is the model available? | Claude Mythos 5 (80.3%) and Fable 5 (80%) have been suspended since June 12, 2026. The active leader is Opus 4.8 at 69.2%. |

For a fuller picture, pair SWE-bench Pro with [Terminal-Bench](https://www.digitalapplied.com/blog/swe-bench-terminal-bench-benchmark-guide-2026), the Stanford/Laude Institute benchmark that covers shell scripting, CLI tooling, and infrastructure tasks that SWE-bench doesn’t touch. GPT-5.5 leads Terminal-Bench at 78.2%; Claude Opus 4.8 leads SWE-bench Pro at 69.2%. Knowing both tells you more than either alone.
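The three questions from the table can be turned into a small checklist that flags what a published score leaves unanswered. This is a hypothetical sketch: the `ReportedScore` type, its field names, and the `"scale_seal"` label are invented for illustration, and fields the article does not state are left as `None` instead of being guessed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReportedScore:
    model: str
    percent: float
    split: Optional[str]     # "public", "commercial", "held_out", or None if undisclosed
    scaffold: Optional[str]  # "scale_seal", "vendor", or None if undisclosed
    available: bool


def caveats(score: ReportedScore) -> List[str]:
    """List the reasons a score should not be compared at face value."""
    notes = []
    if score.split is None:
        notes.append("split not disclosed")
    elif score.split == "public":
        notes.append("public split: easiest to game")
    if score.scaffold is None:
        notes.append("scaffold not disclosed")
    elif score.scaffold != "scale_seal":
        notes.append("vendor scaffold: not comparable to SEAL scores")
    if not score.available:
        notes.append("model not currently available")
    return notes


# Values taken from the article; undisclosed details stay None.
opus_48 = ReportedScore("Claude Opus 4.8", 69.2, split=None, scaffold="vendor", available=True)
mythos_5 = ReportedScore("Claude Mythos 5", 80.3, split=None, scaffold=None, available=False)

for score in (opus_48, mythos_5):
    print(score.model, score.percent, caveats(score))
```

An empty list from `caveats` would mean a score answers all three questions; neither example from the article does.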

Use SWE-bench Pro as a floor, not a ceiling. Its long-horizon commercial tasks are still easier than production engineering. But they’re a more honest signal than anything that came before them — and at this point, that’s the bar.