Claude Code Degraded Before Opus 4.8 Release

Anthropic's Claude Code agent suffered a statistically significant five-day performance degradation immediately before the Opus 4.8 model release, with daily pass rates dropping from a 65% baseline to as low as 50%. The drop coincided with the deployment of Claude Code version 2.1.150 and was accompanied by a roughly 60% spike in tool calls and a fall in input tokens. Pass rates recovered once version 2.1.153 landed, around the Opus 4.8 release on May 28. The findings, captured by daily SWE-Bench-Pro tracking, suggest a harness issue tied to a CLI update rather than a model regression, and they raise questions about pre-release performance patterns.

Our SWE-Bench-Pro tracker caught a statistically significant, five-day drop in Claude Code's pass rate just before Opus 4.8 shipped, and the recovery that followed.

We run Claude Code (/trackers/claude-code) against a curated subset of SWE-Bench-Pro (/explorers/swe-bench-pro) every day, in the plain CLI with the current SOTA model and no custom harness. The point is to catch what benchmarks published at launch can’t: silent, day-to-day changes in how a model actually performs in production.

In the week leading up to the Opus 4.8 release, the tracker caught exactly that. Opus 4.7’s pass rate dropped well below its established baseline and stayed there for five consecutive days, a sustained and statistically significant degradation, before recovering the moment Opus 4.8 took over.

What the tracker showed

Below is the tracker through today. The dashed line is the Opus 4.7 baseline (65%) and the shaded band is the significance threshold. Anything below the band is a statistically meaningful drop. You can see the five-day collapse in late May, followed by a sharp rebound once Opus 4.8 shipped on May 28.

Figure: Daily Trend, pass rate over time.
Dashed line at 65% baseline with ±13.0% significance threshold.

Figure: Weekly Trend, aggregated 7-day pass rate. Dashed line at 65% baseline with ±4.3% significance threshold.

Other Clues

We track a variety of other per-run metrics alongside pass rate, and most of them, including output tokens and average runtime, were unaffected through the dip. Two stood out: tool calls and input tokens both showed a clear pattern over the same window.

Figure: Other Metrics, daily benchmark resource and execution trends. Panels: Tool Calls (total tool invocations by day) and Input Tokens (daily total input usage).

Tool calls spiked by roughly 60% per task across the degraded days while input tokens dropped, then both snapped back the moment Opus 4.8 took over. You can explore these series live on the tracker (/trackers/claude-code).

The 2.1.150 release

Every benchmark run records the Claude Code CLI version it ran on, and lining those versions up against the daily pass rate is the most suggestive clue of all. The drop starts on the exact day version 2.1.150 was installed, holds across 2.1.150 and 2.1.152, and lifts the day 2.1.153 lands.

| Date | Claude Code | Daily pass rate |
|---|---|---|
| May 21 | 2.1.148 | 64% |
| May 22 | 2.1.150 | 50% |
| May 23 | 2.1.150 | 54% |
| May 24 | 2.1.150 | 50% |
| May 25 | 2.1.150 | 50% |
| May 26 | 2.1.152 | 52% |
| May 27 | 2.1.153 | 66% |
| May 28 | 2.1.156 | 72% |

Conclusion

This appears to be a harness issue, not a model regression. The drop tracks the Claude Code version rather than the model: the onset lines up with a CLI update while Opus 4.7 was unchanged, and the agent started making far more tool calls per task. The clues point to something introduced around Claude Code 2.1.150 and 2.1.151. The issue appears to have been resolved, but this is not the first time a degradation has preceded a new model release, and it leaves open questions.
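To make the version-by-version comparison concrete, here is a minimal sketch that groups the daily pass rates from the table above by Claude Code version and reports how far each version's average sits from the 65% baseline. The names `RUNS`, `mean_pass_rate_by_version`, and `delta_from_baseline` are hypothetical and not part of the tracker, and the plain average of daily rates assumes each day ran the same number of tasks, which the article does not state.

```python
from collections import defaultdict

# Daily results copied from the table above:
# (date, Claude Code version, daily pass rate in percent).
RUNS = [
    ("May 21", "2.1.148", 64),
    ("May 22", "2.1.150", 50),
    ("May 23", "2.1.150", 54),
    ("May 24", "2.1.150", 50),
    ("May 25", "2.1.150", 50),
    ("May 26", "2.1.152", 52),
    ("May 27", "2.1.153", 66),
    ("May 28", "2.1.156", 72),
]

BASELINE = 65  # Opus 4.7 baseline pass rate, in percent


def mean_pass_rate_by_version(runs):
    """Average the daily pass rate over the days each CLI version was installed.

    Assumption: every day ran the same number of tasks, so an unweighted
    mean of daily rates is a fair per-version summary.
    """
    by_version = defaultdict(list)
    for _date, version, pass_rate in runs:
        by_version[version].append(pass_rate)
    return {version: sum(rates) / len(rates) for version, rates in by_version.items()}


def delta_from_baseline(runs, baseline=BASELINE):
    """Return each version's mean pass rate minus the baseline, in points."""
    means = mean_pass_rate_by_version(runs)
    return {version: mean - baseline for version, mean in means.items()}


if __name__ == "__main__":
    for version, delta in delta_from_baseline(RUNS).items():
        print(f"{version}: {delta:+.1f} points vs. baseline")
```

On these eight days, versions 2.1.150 and 2.1.152 average 13 to 14 points under the baseline, while 2.1.148 and 2.1.153 sit within a point of it. That is the pattern the conclusion leans on, though a sample this small cannot by itself separate a CLI change from other things that changed on the same days.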
We will continue to track frontier agents to detect future degradations.