I tracked Claude Code and Codex pass-rates for 95 days — what "getting dumber" actually looks like

wpnews.pro

[ARTICLE · art-18349] src=dev.to ↗ pub=2026-05-30T05:04Z topic=ai-agents verified=true sentiment=· neutral

A developer tracked daily SWE-Bench-Pro pass rates for Claude Code and Codex over 95 days, revealing that Claude Code's pass rate jumped 11 percentage points to ~65% after a release but recently drifted down to ~52%, a statistically significant decline. In contrast, three Codex releases showed no meaningful improvement, with pass rates oscillating in a 54–58% band. The data confirms that Claude Code is both significantly better than six months ago and currently experiencing a real, recent performance drop.

2 min read · 21 views · published May 30, 2026

Every few weeks a thread blows up: "Is Claude Code getting worse?" Someone swears Opus got lazy after an update; someone else says it's placebo. The arguments are always vibes — nobody posts numbers.

So I built a tracker. For ~95 days it's logged the daily SWE-Bench-Pro pass rate for Claude Code and Codex — the % of real coding tasks each agent completes unassisted — and plotted them as candlesticks (open = yesterday, close = today, wick = the 90% confidence interval for that day's sample). Same idea as a stock K-line, except the "price" is how often the agent actually solves the task.
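A minimal sketch of how one such candle could be assembled from raw pass counts. The post does not say how many tasks run per day or which interval method the tracker uses, so the sample size of 100 and the Wilson score interval here are assumptions, and the names `Candle`, `wilson_interval`, and `make_candle` are hypothetical.

```python
import math
from dataclasses import dataclass

Z_90 = 1.6449  # two-sided 90% quantile of the standard normal


@dataclass
class Candle:
    open_rate: float   # yesterday's pass rate
    close_rate: float  # today's pass rate
    low: float         # lower end of today's 90% interval (wick)
    high: float        # upper end of today's 90% interval (wick)


def wilson_interval(passed: int, total: int, z: float = Z_90) -> tuple[float, float]:
    """Wilson score interval for a pass rate (assumed method, not the tracker's confirmed one)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def make_candle(prev_passed: int, prev_total: int, passed: int, total: int) -> Candle:
    """Open comes from yesterday's sample; close and wick come from today's."""
    low, high = wilson_interval(passed, total)
    return Candle(prev_passed / prev_total, passed / total, low, high)


# Hypothetical counts: 65/100 tasks passed yesterday, 52/100 today.
candle = make_candle(65, 100, 52, 100)
print(candle)
```

The wick width depends heavily on the daily sample size: with fewer tasks per day the interval widens, and a single low day says less.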

Here's what the data says — and it's more interesting than "it got dumber."

Plotting per-model-version baselines (median of the first 14 days after each release) makes the story obvious:

That 4.6 → 4.7 jump is a genuine +11 percentage point step. Not placebo — the model got materially better at finishing tasks, and it held ~65% steady for a month.
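The per-version baseline described above (median of the first 14 days after a release) can be sketched as follows. The function name `version_baseline`, the choice to count the release day itself as day one, and the synthetic numbers are assumptions for illustration, not the tracker's actual code or data.

```python
from datetime import date, timedelta
from statistics import median


def version_baseline(
    daily_rates: dict[date, float],
    release_day: date,
    window_days: int = 14,
) -> float:
    """Median daily pass rate over the first `window_days` days of a release."""
    window_end = release_day + timedelta(days=window_days)
    window = [
        rate
        for day, rate in daily_rates.items()
        if release_day <= day < window_end  # release day counts as day 1 (assumption)
    ]
    if not window:
        raise ValueError("no observations in the baseline window")
    return median(window)


# Synthetic series: 0.54 before the release, 0.65 for 14 days, then 0.52.
release = date(2026, 1, 1)
rates = {}
for offset in range(-5, 20):
    if offset < 0:
        rates[release + timedelta(days=offset)] = 0.54
    elif offset < 14:
        rates[release + timedelta(days=offset)] = 0.65
    else:
        rates[release + timedelta(days=offset)] = 0.52

baseline = version_baseline(rates, release)
print(baseline)
```

Using a median rather than a mean keeps one unusually good or bad day in the window from shifting the baseline.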

Then the last ~7 days: today's pass rate is ~52%, well below the 65% baseline and past the significance threshold (p < 0.05). So the "Claude Code feels worse lately" crowd isn't imagining it — there's a real, recent drift below the current model's own established baseline. Whether it's a quantization change, a routing tweak, or load — the number moved, and it moved past noise.
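One way to check a claim like "below baseline at p < 0.05" is a one-sided exact binomial test of today's pass count against the baseline rate. The post does not name the test the tracker runs or the daily sample size, so both the test and the counts below are assumptions, and `binomial_lower_tail` is a hypothetical helper.

```python
from math import comb


def binomial_lower_tail(passed: int, total: int, baseline: float) -> float:
    """P(X <= passed) for X ~ Binomial(total, baseline).

    A small value means a result this low would be unlikely
    if the true pass rate were still at the baseline.
    """
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    return sum(
        comb(total, k) * baseline**k * (1 - baseline) ** (total - k)
        for k in range(passed + 1)
    )


BASELINE = 0.65  # baseline rate quoted in the post
ALPHA = 0.05     # significance threshold quoted in the post

# Hypothetical daily samples with roughly the same ~52% pass rate.
p_large = binomial_lower_tail(52, 100, BASELINE)  # 52 of 100 tasks
p_small = binomial_lower_tail(16, 30, BASELINE)   # 16 of 30 tasks

print(f"n=100: p={p_large:.4f}, significant={p_large < ALPHA}")
print(f"n=30:  p={p_small:.4f}, significant={p_small < ALPHA}")
```

The same ~52% reading clears the threshold with 100 tasks but not with 30, so the daily sample size matters as much as the headline number.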

The nuance most threads miss: Claude Code is both "much better than 6 months ago" and "drifting down this week." Both are true. Vibes can't hold two facts at once; data can.

Now the part nobody expects. Across three Codex releases:

Three "major" version bumps, and the pass rate just oscillates in a 54–58% band. No step change. The releases didn't move the benchmark needle the way Opus 4.7 did. If you've felt like "new Codex doesn't feel smarter" — the data agrees: it's been flat.

Two design choices that matter if you want to read drift honestly:

The live, daily-updating version is the Drift K-Line tracker. It has a red/green toggle for the Chinese vs. Western color convention, daily and weekly candles, and 30-day, 90-day, and all-time windows per agent.

Author note: I build keaiapi, a pay-as-you-go aggregator that routes Claude, GPT, Gemini, DeepSeek and 20+ models through one OpenAI-compatible endpoint — so when a model drifts, you can switch the one you point at without rewriting code or eating a subscription. The tracker above is a free tool we run; no signup needed to read it. Methodology notes are on the tracker page.

Read original on dev.to → dev.to/dylan_brown_4c803aefcfe51/i-tracked-claud…
