# I tracked Claude Code and Codex pass-rates for 95 days — what "getting dumber" actually looks like

> Source: <https://dev.to/dylan_brown_4c803aefcfe51/i-tracked-claude-code-and-codex-pass-rates-for-95-days-what-getting-dumber-actually-looks-like-21le>
> Published: 2026-05-30 05:04:45+00:00

Every few weeks a thread blows up: *"Is Claude Code getting worse?"* Someone swears Opus got lazy after an update; someone else says it's placebo. The arguments are always vibes — nobody posts numbers.

So I built a tracker. For ~95 days it's logged the daily **SWE-Bench-Pro pass rate** for Claude Code and Codex — the % of real coding tasks each agent completes unassisted — and plotted them as candlesticks (open = yesterday, close = today, wick = the 90% confidence interval for that day's sample). Same idea as a stock K-line, except the "price" is *how often the agent actually solves the task*.

Here's what the data says — and it's more interesting than "it got dumber."

Plotting per-model-version baselines (median of the first 14 days after each release) makes the story obvious:

That 4.6 → 4.7 jump is a genuine **+11 percentage point** step. Not placebo — the model got materially better at finishing tasks, and it held ~65% steady for a month.

Then the last ~7 days: **today's pass rate is ~52%**, well below the 65% baseline and past the significance threshold (p < 0.05). So the "Claude Code feels worse lately" crowd isn't imagining it — there's a real, recent drift *below the current model's own established baseline.* Whether it's a quantization change, a routing tweak, or load — the number moved, and it moved past noise.

The nuance most threads miss: Claude Code is **both** "much better than 6 months ago" **and** "drifting down this week." Both are true. Vibes can't hold two facts at once; data can.

Now the part nobody expects. Across three Codex releases:

Three "major" version bumps, and the pass rate just oscillates in a **54–58% band**. No step change. The releases didn't move the benchmark needle the way Opus 4.7 did. If you've felt like "new Codex doesn't feel smarter" — the data agrees: it's been flat.

Two design choices that matter if you want to read drift honestly:

The live, daily-updating version (red/green toggle for CN vs Western convention, daily/weekly K, 30/90/all windows per agent) is here: [Drift K-Line tracker →](https://keaiapi.com/coding-agent-tracker)

*Author note: I build keaiapi, a pay-as-you-go aggregator that routes Claude, GPT, Gemini, DeepSeek and 20+ models through one OpenAI-compatible endpoint — so when a model drifts, you can switch the one you point at without rewriting code or eating a subscription. The tracker above is a free tool we run; no signup needed to read it. Methodology notes are on the tracker page.*
