{"slug": "claude-code-degraded-before-opus-4-8-release", "title": "Claude Code Degraded Before Opus 4.8 Release", "summary": "Anthropic's Claude Code agent suffered a statistically significant five-day performance degradation immediately before the Opus 4.8 model release, with pass rates dropping from a 65% baseline to as low as 50%. The drop coincided with the deployment of Claude Code version 2.1.150 and was accompanied by a 60% spike in tool calls and reduced input tokens, before recovering fully when version 2.1.153 launched alongside Opus 4.8 on May 28. The findings, captured by daily SWE-Bench-Pro tracking, suggest a harness issue tied to a CLI update rather than a model regression, raising questions about pre-release performance patterns.", "body_md": "# Claude Code degraded for the week before Opus 4.8's release\n\nOur SWE-Bench-Pro tracker caught a statistically significant, weeklong drop in Claude Code's pass rate just before Opus 4.8 shipped, and the recovery that followed.\n\nWe run [Claude Code](/trackers/claude-code) against a curated subset of [SWE-Bench-Pro](/explorers/swe-bench-pro) every day, in the\nplain CLI with the current SOTA model and no custom harness. The point is to catch the thing that benchmarks published at launch can’t:\nsilent, day-to-day changes in how a model actually performs in production.\n\nIn the week leading up to the Opus 4.8 release, the tracker caught exactly that. Opus 4.7’s pass rate dropped well below its established baseline and stayed there for five consecutive days, a sustained and statistically significant degradation, before recovering the moment Opus 4.8 took over.\n\n## What the tracker showed\n\nBelow is the tracker through today. The dashed line is the Opus 4.7 baseline (65%) and the shaded band is the significance threshold. Anything below the band is a statistically meaningful drop. 
You can see the five-day collapse in late May, followed by a sharp rebound once Opus 4.8 shipped on May 28.\n\n### Daily Trend\n\nPass rate over time\n\nDashed line at 65% baseline with ±13.0% significance threshold\n\n### Weekly Trend\n\nAggregated 7-day pass rate\n\nDashed line at 65% baseline with ±4.3% significance threshold\n\n## Other Clues\n\nWe track a variety of other per-run metrics alongside pass rate, and most of them, including output tokens and average runtime, were unaffected through the dip. Two stood out: tool calls and input tokens both showed a clear pattern over the same window.\n\n### Other Metrics\n\nDaily benchmark resource and execution trends\n\n#### Tool Calls\n\nTotal tool invocations by day\n\n#### Input Tokens\n\nDaily total input usage\n\nTool calls spiked by roughly 60% per task across the degraded days while input tokens dropped, then both snapped back the moment Opus 4.8 took over. You can explore these series live on the [tracker](/trackers/claude-code).\n\n## The 2.1.150 release\n\nEvery benchmark run records the Claude Code CLI version it ran on, and lining those versions up against the daily pass rate is the most suggestive clue of all. The drop starts on the exact day version 2.1.150 was installed, holds across 2.1.150 and 2.1.152, and lifts the day 2.1.153 lands.\n\n| Date | Claude Code | Daily pass rate |\n|---|---|---|\n| May 21 | 2.1.148 | 64% |\n| May 22 | 2.1.150 | 50% |\n| May 23 | 2.1.150 | 54% |\n| May 24 | 2.1.150 | 50% |\n| May 25 | 2.1.150 | 50% |\n| May 26 | 2.1.152 | 52% |\n| May 27 | 2.1.153 | 66% |\n| May 28 | 2.1.156 | 72% |\n\n## Conclusion\n\nThis appears to be a harness issue, not a model regression. The drop tracks the Claude Code version rather than the model, the onset lines up with a CLI update while Opus 4.7 was unchanged, and the agent started making far more tool calls per task. 
The clues point to something introduced around Claude Code 2.1.150 and 2.1.151.\n\nThe issue appears to have been resolved, but this is not the first time a degradation has lined up with the run-up to a new model release, which leaves open questions. We will continue to track frontier agents to detect future degradations.", "url": "https://wpnews.pro/news/claude-code-degraded-before-opus-4-8-release", "canonical_source": "https://marginlab.ai/blog/claude-code-degraded-before-opus-4-8/", "published_at": "2026-05-29 12:43:59+00:00", "updated_at": "2026-05-29 13:17:43.068821+00:00", "lang": "en", "topics": ["large-language-models", "ai-products", "ai-tools", "ai-research", "ai-agents"], "entities": ["Claude Code", "Opus 4.8", "Opus 4.7", "SWE-Bench-Pro"], "alternates": {"html": "https://wpnews.pro/news/claude-code-degraded-before-opus-4-8-release", "markdown": "https://wpnews.pro/news/claude-code-degraded-before-opus-4-8-release.md", "text": "https://wpnews.pro/news/claude-code-degraded-before-opus-4-8-release.txt", "jsonld": "https://wpnews.pro/news/claude-code-degraded-before-opus-4-8-release.jsonld"}}