[ARTICLE · art-17630] src=marginlab.ai pub= topic=large-language-models verified=true sentiment=↓ negative

Claude Code Degraded Before Opus 4.8 Release

Anthropic's Claude Code agent suffered a statistically significant five-day performance degradation immediately before the Opus 4.8 model release, with pass rates dropping from a 65% baseline to as low as 50%. The drop coincided with the deployment of Claude Code version 2.1.150 and was accompanied by a roughly 60% spike in tool calls per task and a fall in input tokens. Pass rates returned to baseline on May 27, when version 2.1.153 was installed, one day before Opus 4.8 shipped on May 28. The findings, captured by daily SWE-Bench-Pro tracking, suggest a harness issue tied to a CLI update rather than a model regression, raising questions about pre-release performance patterns.

3 min read · Published May 29, 2026

Our SWE-Bench-Pro tracker caught a statistically significant, five-day drop in Claude Code's pass rate just before Opus 4.8 shipped, and the recovery that followed.

We run Claude Code against a curated subset of SWE-Bench-Pro every day, in the plain CLI with the current SOTA model and no custom harness. The point is to catch the thing that benchmarks published at launch can’t: silent, day-to-day changes in how a model actually performs in production.

In the week leading up to the Opus 4.8 release, the tracker caught exactly that. Opus 4.7’s pass rate dropped well below its established baseline and stayed there for five consecutive days, a sustained and statistically significant degradation, before recovering right around the Opus 4.8 release.

What the tracker showed

Below is the tracker through today. The dashed line is the Opus 4.7 baseline (65%) and the shaded band is the significance threshold. Anything below the band is a statistically meaningful drop. You can see the five-day collapse in late May, followed by a sharp rebound once Opus 4.8 shipped on May 28.

[Figure: Daily Trend. Pass rate over time, with a dashed line at the 65% baseline and a ±13.0% significance threshold.]

[Figure: Weekly Trend. Aggregated 7-day pass rate, with a dashed line at the 65% baseline and a ±4.3% significance threshold.]
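
As a rough illustration of how to read the band, the sketch below flags pass rates that fall below the baseline minus the threshold. The 65% baseline and the ±13.0% and ±4.3% thresholds come from the chart captions, and the daily figures are the ones listed in the version table in this post. The function name, the strict less-than comparison, and the use of a plain mean for the 7-day window are assumptions made for illustration, not the tracker's actual implementation.

```python
from statistics import mean

BASELINE = 65.0  # Opus 4.7 baseline pass rate (percent), from the chart caption

# Daily pass rates (percent) as listed in the version table in this post.
daily = [
    ("May 21", 64), ("May 22", 50), ("May 23", 54), ("May 24", 50),
    ("May 25", 50), ("May 26", 52), ("May 27", 66), ("May 28", 72),
]

def below_band(rate, threshold, baseline=BASELINE):
    """Return True when a pass rate is strictly below baseline - threshold."""
    return rate < baseline - threshold

# Daily view: +/-13.0 point threshold applied to each day on its own.
daily_flags = [day for day, rate in daily if below_band(rate, 13.0)]

# Weekly view: +/-4.3 point threshold on one 7-day window (May 21 to May 27).
# Assumption: the plain mean of daily rates stands in for the tracker's
# aggregated 7-day pass rate, which may be computed differently.
week_rate = mean(rate for _, rate in daily[:7])
week_flag = below_band(week_rate, 4.3)

print(daily_flags)
print(round(week_rate, 1), week_flag)
```

Under these assumptions, a single day has to fall below 52% to clear the daily band on its own, while the tighter weekly threshold of 60.7% is what a sustained multi-day dip is measured against.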

Other Clues

We track a variety of other per-run metrics alongside pass rate, and most of them, including output tokens and average runtime, were unaffected through the dip. Two stood out: tool calls and input tokens both showed a clear pattern over the same window.

[Figure: Other Metrics. Daily benchmark resource and execution trends, with two panels: Tool Calls (total tool invocations by day) and Input Tokens (daily total input usage).]

Tool calls spiked by roughly 60% per task across the degraded days while input tokens dropped, then both snapped back the moment Opus 4.8 took over. You can explore these series live on the tracker.

The 2.1.150 release

Every benchmark run records the Claude Code CLI version it ran on, and lining those versions up against the daily pass rate is the most suggestive clue of all. The drop starts on the exact day version 2.1.150 was installed, holds across 2.1.150 and 2.1.152, and lifts the day 2.1.153 lands.

Date      Claude Code version   Daily pass rate
May 21    2.1.148               64%
May 22    2.1.150               50%
May 23    2.1.150               54%
May 24    2.1.150               50%
May 25    2.1.150               50%
May 26    2.1.152               52%
May 27    2.1.153               66%
May 28    2.1.156               72%
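
One way to line versions up against pass rate is to group the daily results by CLI version and average them. The sketch below does that with the rows from the table above; the function name and the choice of a plain mean are illustrative assumptions, not a description of how the tracker itself aggregates runs.

```python
from collections import defaultdict
from statistics import mean

# (date, Claude Code CLI version, daily pass rate in percent), from the table.
runs = [
    ("May 21", "2.1.148", 64),
    ("May 22", "2.1.150", 50),
    ("May 23", "2.1.150", 54),
    ("May 24", "2.1.150", 50),
    ("May 25", "2.1.150", 50),
    ("May 26", "2.1.152", 52),
    ("May 27", "2.1.153", 66),
    ("May 28", "2.1.156", 72),
]

def pass_rate_by_version(rows):
    """Group daily pass rates by CLI version and return the mean per version."""
    grouped = defaultdict(list)
    for _date, version, rate in rows:
        grouped[version].append(rate)
    return {version: mean(rates) for version, rates in grouped.items()}

by_version = pass_rate_by_version(runs)
for version, rate in by_version.items():
    print(f"{version}: {rate:.1f}%")
```

Grouped this way, versions 2.1.150 and 2.1.152 sit in the low 50s, while 2.1.148, 2.1.153, and 2.1.156 are all at or above 64%.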

Conclusion

This appears to be a harness issue, not a model regression. The drop tracks the Claude Code version rather than the model, the onset lines up with a CLI update while Opus 4.7 was unchanged, and the agent started making far more tool calls per task. The clues point to something introduced around Claude Code 2.1.150 and 2.1.151.

The issue appears to have been resolved, but this is not the first time a degradation has appeared just before a new model release, and that leaves open questions. We will continue to track frontier agents to detect future degradations.
