{"slug": "i-tracked-claude-code-and-codex-pass-rates-for-95-days-what-getting-dumber-looks", "title": "I tracked Claude Code and Codex pass-rates for 95 days — what \"getting dumber\" actually looks like", "summary": "A developer tracked daily SWE-Bench-Pro pass rates for Claude Code and Codex over 95 days, revealing that Claude Code's pass rate jumped 11 percentage points to ~65% after a release but recently drifted down to ~52%, a statistically significant decline. In contrast, three Codex releases showed no meaningful improvement, with pass rates oscillating in a 54–58% band. The data confirms that Claude Code is both significantly better than six months ago and currently experiencing a real, recent performance drop.", "body_md": "Every few weeks a thread blows up: *\"Is Claude Code getting worse?\"* Someone swears Opus got lazy after an update; someone else says it's placebo. The arguments are always vibes — nobody posts numbers.\n\nSo I built a tracker. For ~95 days it's logged the daily **SWE-Bench-Pro pass rate** for Claude Code and Codex — the % of real coding tasks each agent completes unassisted — and plotted them as candlesticks (open = yesterday, close = today, wick = the 90% confidence interval for that day's sample). Same idea as a stock K-line, except the \"price\" is *how often the agent actually solves the task*.\n\nHere's what the data says — and it's more interesting than \"it got dumber.\"\n\nPlotting per-model-version baselines (median of the first 14 days after each release) makes the story obvious:\n\nThat 4.6 → 4.7 jump is a genuine **+11 percentage point** step. Not placebo — the model got materially better at finishing tasks, and it held ~65% steady for a month.\n\nThen the last ~7 days: **today's pass rate is ~52%**, well below the 65% baseline and past the significance threshold (p < 0.05). So the \"Claude Code feels worse lately\" crowd isn't imagining it — there's a real, recent drift *below the current model's own established baseline.* Whether it's a quantization change, a routing tweak, or load — the number moved, and it moved past noise.\n\nThe nuance most threads miss: Claude Code is **both** \"much better than 6 months ago\" **and** \"drifting down this week.\" Both are true. Vibes can't hold two facts at once; data can.\n\nNow the part nobody expects. Across three Codex releases:\n\nThree \"major\" version bumps, and the pass rate just oscillates in a **54–58% band**. No step change. The releases didn't move the benchmark needle the way Opus 4.7 did. If you've felt like \"new Codex doesn't feel smarter\" — the data agrees: it's been flat.\n\nTwo design choices that matter if you want to read drift honestly:\n\nThe live, daily-updating version (red/green toggle for CN vs Western convention, daily/weekly K, 30/90/all windows per agent) is here: [Drift K-Line tracker →](https://keaiapi.com/coding-agent-tracker)\n\n*Author note: I build keaiapi, a pay-as-you-go aggregator that routes Claude, GPT, Gemini, DeepSeek and 20+ models through one OpenAI-compatible endpoint — so when a model drifts, you can switch the one you point at without rewriting code or eating a subscription. The tracker above is a free tool we run; no signup needed to read it. Methodology notes are on the tracker page.*", "url": "https://wpnews.pro/news/i-tracked-claude-code-and-codex-pass-rates-for-95-days-what-getting-dumber-looks", "canonical_source": "https://dev.to/dylan_brown_4c803aefcfe51/i-tracked-claude-code-and-codex-pass-rates-for-95-days-what-getting-dumber-actually-looks-like-21le", "published_at": "2026-05-30 05:04:45+00:00", "updated_at": "2026-05-30 05:11:05.955869+00:00", "lang": "en", "topics": ["ai-agents", "large-language-models", "ai-products", "ai-tools", "ai-research"], "entities": ["Claude Code", "Codex", "Opus", "SWE-Bench-Pro"], "alternates": {"html": "https://wpnews.pro/news/i-tracked-claude-code-and-codex-pass-rates-for-95-days-what-getting-dumber-looks", "markdown": "https://wpnews.pro/news/i-tracked-claude-code-and-codex-pass-rates-for-95-days-what-getting-dumber-looks.md", "text": "https://wpnews.pro/news/i-tracked-claude-code-and-codex-pass-rates-for-95-days-what-getting-dumber-looks.txt", "jsonld": "https://wpnews.pro/news/i-tracked-claude-code-and-codex-pass-rates-for-95-days-what-getting-dumber-looks.jsonld"}}