Every few weeks a thread blows up: "Is Claude Code getting worse?" Someone swears Opus got lazy after an update; someone else says it's placebo. The arguments are always vibes — nobody posts numbers.
So I built a tracker. For ~95 days it's logged the daily SWE-Bench-Pro pass rate for Claude Code and Codex — the % of real coding tasks each agent completes unassisted — and plotted them as candlesticks (open = yesterday, close = today, wick = the 90% confidence interval for that day's sample). Same idea as a stock K-line, except the "price" is how often the agent actually solves the task.
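The candlestick mapping above can be sketched in a few lines. This is a minimal illustration, not the tracker's actual code: the `Candle` type, the `build_candles` helper, and the choice of a Wilson score interval for the 90% wick are all assumptions, as is the input shape of one `(passed, total)` pair per day.

```python
import math
from dataclasses import dataclass

Z_90 = 1.6449  # normal quantile for a two-sided 90% interval


@dataclass
class Candle:
    open: float   # yesterday's pass rate
    close: float  # today's pass rate
    low: float    # lower end of today's 90% interval (bottom of the wick)
    high: float   # upper end of today's 90% interval (top of the wick)


def wilson_interval(passed: int, total: int, z: float = Z_90) -> tuple[float, float]:
    """Wilson score interval for a pass rate (an assumed choice of interval)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def build_candles(daily: list[tuple[int, int]]) -> list[Candle]:
    """Turn consecutive (passed, total) daily counts into candles.

    The first day has no "yesterday", so it produces no candle.
    """
    candles = []
    for (prev_passed, prev_total), (passed, total) in zip(daily, daily[1:]):
        low, high = wilson_interval(passed, total)
        candles.append(Candle(prev_passed / prev_total, passed / total, low, high))
    return candles
```

The width of the wick depends entirely on how many tasks run each day: fewer tasks means a longer wick, so a big candle body on a thin sample may still be noise.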
Here's what the data says — and it's more interesting than "it got dumber."
Plotting per-model-version baselines (median of the first 14 days after each release) makes the story obvious:
That 4.6 → 4.7 jump is a genuine +11 percentage-point step. It is not placebo: the model got materially better at finishing tasks, and it held steady at ~65% for a month.
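For reference, the per-version baseline (median of the first 14 days after a release) could be computed roughly like this. The function name, the `dict` of date to pass rate, and the decision to count the release day itself as day one are assumptions for the sketch, not details from the tracker.

```python
from datetime import date, timedelta
from statistics import median


def version_baseline(
    daily_rates: dict[date, float],
    release_day: date,
    window_days: int = 14,
) -> float:
    """Median pass rate over the first `window_days` days of a release.

    Days with no measurement are skipped rather than filled in.
    Assumption: the window starts on the release day itself.
    """
    window = [
        daily_rates[day]
        for day in (release_day + timedelta(days=i) for i in range(window_days))
        if day in daily_rates
    ]
    if not window:
        raise ValueError("no measurements inside the baseline window")
    return median(window)
```

Using the median rather than the mean keeps one bad day in the window from dragging the baseline, and freezing it after 14 days means later drift is measured against the release, not against itself.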
Then the last ~7 days: today's pass rate is ~52%, well below the 65% baseline and past the significance threshold (p < 0.05). So the "Claude Code feels worse lately" crowd isn't imagining it. There's a real, recent drift below the current model's own established baseline. The data can't say whether the cause is a quantization change, a routing tweak, or load, but the number moved, and it moved past noise.
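One way to check "past noise" is a one-sided exact binomial test of today's count against the baseline rate. This is a sketch under stated assumptions: the post does not say which test the tracker uses, the helper names are hypothetical, and the baseline is treated as a fixed known rate (its own uncertainty is ignored).

```python
from math import comb


def binomial_lower_tail(passed: int, total: int, baseline_rate: float) -> float:
    """P(X <= passed) when X ~ Binomial(total, baseline_rate).

    This is the one-sided p-value for "today is below the baseline".
    """
    return sum(
        comb(total, k) * baseline_rate**k * (1 - baseline_rate) ** (total - k)
        for k in range(passed + 1)
    )


def is_below_baseline(
    passed: int, total: int, baseline_rate: float, alpha: float = 0.05
) -> bool:
    """True when the drop below the baseline is significant at level alpha."""
    return binomial_lower_tail(passed, total, baseline_rate) < alpha


# Hypothetical sample size: 100 tasks per day is an assumption, not a tracker fact.
print(is_below_baseline(52, 100, 0.65))  # 52% against a 65% baseline
print(is_below_baseline(63, 100, 0.65))  # 63% against the same baseline
```

The same 13-point gap can be significant on a large daily sample and indistinguishable from noise on a small one, which is why the raw percentage alone doesn't settle the argument.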
The nuance most threads miss: Claude Code is both "much better than 6 months ago" and "drifting down this week." Both are true. Vibes can't hold two facts at once; data can.
Now the part nobody expects. Across three Codex releases:
Three "major" version bumps, and the pass rate just oscillates in a 54–58% band. No step change. The releases didn't move the benchmark needle the way Opus 4.7 did. If you've felt like "new Codex doesn't feel smarter" — the data agrees: it's been flat.
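The difference between "a step" and "flat" can be made explicit by comparing consecutive release baselines against a minimum step size. Everything here is illustrative: `step_changes` is a hypothetical helper, the 5-point threshold is an arbitrary assumption, the release labels A/B/C are placeholders, and the Codex values are placeholder numbers picked inside the 54–58% band rather than measured baselines.

```python
def step_changes(
    baselines: list[tuple[str, float]],
    min_step: float = 0.05,
) -> list[tuple[str, str, float]]:
    """Return (from_release, to_release, delta) for each jump of at least min_step.

    `baselines` is a chronological list of (release label, baseline pass rate).
    """
    steps = []
    for (prev_name, prev_rate), (name, rate) in zip(baselines, baselines[1:]):
        delta = rate - prev_rate
        if abs(delta) >= min_step:
            steps.append((prev_name, name, delta))
    return steps


# ~54% -> ~65% follows from the +11 point jump described for Claude Code.
claude = [("4.6", 0.54), ("4.7", 0.65)]
# Placeholder values inside the 54-58% band; not real measurements.
codex = [("release A", 0.56), ("release B", 0.54), ("release C", 0.58)]

print(step_changes(claude))  # one step is reported
print(step_changes(codex))   # no step clears the threshold
```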
Two design choices that matter if you want to read drift honestly:
The live, daily-updating version (red/green toggle for CN vs Western convention, daily/weekly K, 30/90/all windows per agent) is here: Drift K-Line tracker →
Author note: I build keaiapi, a pay-as-you-go aggregator that routes Claude, GPT, Gemini, DeepSeek and 20+ models through one OpenAI-compatible endpoint — so when a model drifts, you can switch the one you point at without rewriting code or eating a subscription. The tracker above is a free tool we run; no signup needed to read it. Methodology notes are on the tracker page.