[ARTICLE · art-18349] src=dev.to pub= topic=ai-agents verified=true sentiment=· neutral

I tracked Claude Code and Codex pass-rates for 95 days — what "getting dumber" actually looks like

A developer tracked daily SWE-Bench-Pro pass rates for Claude Code and Codex over 95 days, revealing that Claude Code's pass rate jumped 11 percentage points to ~65% after a release but recently drifted down to ~52%, a statistically significant decline. In contrast, three Codex releases showed no meaningful improvement, with pass rates oscillating in a 54–58% band. The data confirms that Claude Code is both significantly better than six months ago and currently experiencing a real, recent performance drop.

2 min read · published May 30, 2026

Every few weeks a thread blows up: "Is Claude Code getting worse?" Someone swears Opus got lazy after an update; someone else says it's placebo. The arguments are always vibes — nobody posts numbers.

So I built a tracker. For ~95 days it's logged the daily SWE-Bench-Pro pass rate for Claude Code and Codex — the % of real coding tasks each agent completes unassisted — and plotted them as candlesticks (open = yesterday, close = today, wick = the 90% confidence interval for that day's sample). Same idea as a stock K-line, except the "price" is how often the agent actually solves the task.
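
The candle construction described above can be sketched in a few lines. This is a minimal illustration, not the tracker's actual code: the post does not say how the 90% interval is computed, so the Wilson score interval here is an assumption, and `Candle`, `wilson_interval`, and `make_candle` are hypothetical names.

```python
import math
from dataclasses import dataclass

Z_90 = 1.6449  # normal quantile for a two-sided 90% interval


@dataclass
class Candle:
    open: float   # yesterday's pass rate
    close: float  # today's pass rate
    low: float    # lower end of today's 90% interval (bottom of the wick)
    high: float   # upper end of today's 90% interval (top of the wick)


def wilson_interval(passed: int, total: int, z: float = Z_90) -> tuple[float, float]:
    """Wilson score interval for a pass rate (assumed method, not confirmed by the post)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


def make_candle(prev_rate: float, passed: int, total: int) -> Candle:
    """Build one daily candle from yesterday's rate and today's pass/fail counts."""
    low, high = wilson_interval(passed, total)
    return Candle(open=prev_rate, close=passed / total, low=low, high=high)


if __name__ == "__main__":
    # Hypothetical day: 52 of 100 tasks solved, after a 65% day.
    print(make_candle(0.65, 52, 100))
```

One consequence of this design is that the wick length depends on the day's sample size: a day with fewer tasks produces a longer wick even when the pass rate is unchanged.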

Here's what the data says — and it's more interesting than "it got dumber."

Plotting per-model-version baselines (median of the first 14 days after each release) makes the story obvious:

That 4.6 → 4.7 jump is a genuine +11 percentage point step. Not placebo — the model got materially better at finishing tasks, and it held ~65% steady for a month.
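
For readers who want to reproduce the baseline idea on their own logs, here is a rough sketch of "median of the first 14 days after each release". The data layout (a date-to-rate mapping plus a version-to-release-date mapping) and the function name `version_baselines` are assumptions for illustration, and the dates and rates in the usage example are made up.

```python
from datetime import date, timedelta
from statistics import median


def version_baselines(
    daily_rates: dict[date, float],
    releases: dict[str, date],
    window_days: int = 14,
) -> dict[str, float]:
    """Median daily pass rate over the first `window_days` days after each release."""
    baselines: dict[str, float] = {}
    for version, released in releases.items():
        end = released + timedelta(days=window_days)
        window = [rate for day, rate in daily_rates.items() if released <= day < end]
        if window:  # skip versions with no logged days in their window
            baselines[version] = median(window)
    return baselines


if __name__ == "__main__":
    # Hypothetical log: 20 days near 54%, then 20 days near 65% after a release.
    start = date(2026, 1, 1)
    rates = {start + timedelta(days=i): 0.54 for i in range(20)}
    rates.update({start + timedelta(days=20 + i): 0.65 for i in range(20)})
    releases = {"old": start, "new": start + timedelta(days=20)}
    print(version_baselines(rates, releases))
```

Using the median rather than the mean keeps a single bad day in the window from dragging the baseline.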

Then the last ~7 days: today's pass rate is ~52%, well below the 65% baseline and past the significance threshold (p < 0.05). So the "Claude Code feels worse lately" crowd isn't imagining it — there's a real, recent drift below the current model's own established baseline. Whether it's a quantization change, a routing tweak, or load — the number moved, and it moved past noise.
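
One way to check whether a day sits "past noise" is a one-sided exact binomial test against the baseline rate. The post does not name the test behind its p < 0.05 threshold or give the daily sample size, so everything below is an assumption: the test choice, the hypothetical function names, and the 100-task sample in the usage example. It also treats the baseline as a known constant and the tasks as independent, which is a simplification.

```python
import math


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), computed exactly from the definition."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def drift_p_value(passed: int, total: int, baseline: float) -> float:
    """Lower-tail p-value: chance of `passed` or fewer solves if the true rate were `baseline`."""
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    return binom_cdf(passed, total, baseline)


if __name__ == "__main__":
    # Hypothetical day: 52 of 100 tasks solved, tested against a 65% baseline.
    p = drift_p_value(52, 100, 0.65)
    print(f"p = {p:.4f}, below 0.05: {p < 0.05}")
```

A smaller daily sample makes the same 13-point gap much harder to distinguish from noise, which is why the sample size matters as much as the headline rate.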

The nuance most threads miss: Claude Code is both "much better than 6 months ago" and "drifting down this week." Both are true. Vibes can't hold two facts at once; data can.

Now the part nobody expects. Across three Codex releases:

Three "major" version bumps, and the pass rate just oscillates in a 54–58% band. No step change. The releases didn't move the benchmark needle the way Opus 4.7 did. If you've felt like "new Codex doesn't feel smarter" — the data agrees: it's been flat.

Two design choices that matter if you want to read drift honestly:

The live, daily-updating version (red/green toggle for CN vs Western convention, daily/weekly K, 30/90/all windows per agent) is here: Drift K-Line tracker →

Author note: I build keaiapi, a pay-as-you-go aggregator that routes Claude, GPT, Gemini, DeepSeek and 20+ models through one OpenAI-compatible endpoint — so when a model drifts, you can switch the one you point at without rewriting code or eating a subscription. The tracker above is a free tool we run; no signup needed to read it. Methodology notes are on the tracker page.
