Claude Code Degraded Before Opus 4.8 Release

wpnews.pro

[ARTICLE · art-17630] src=marginlab.ai ↗ pub=2026-05-29T12:43Z topic=large-language-models verified=true sentiment=↓ negative

Anthropic's Claude Code agent suffered a statistically significant five-day performance degradation immediately before the Opus 4.8 model release, with daily pass rates dropping from a 65% baseline to as low as 50%. The drop coincided with the deployment of Claude Code version 2.1.150 and was accompanied by a roughly 60% spike in tool calls per task and a drop in input tokens. The pass rate recovered to 66% on May 27, when version 2.1.153 landed, one day before Opus 4.8 shipped on May 28. The findings, captured by daily SWE-Bench-Pro tracking, suggest a harness issue tied to a CLI update rather than a model regression, and raise questions about pre-release performance patterns.

3 min read · Published May 29, 2026

Our SWE-Bench-Pro tracker caught a statistically significant, five-day drop in Claude Code's pass rate just before Opus 4.8 shipped, and the recovery that followed.

We run Claude Code against a curated subset of SWE-Bench-Pro every day, in the plain CLI with the current SOTA model and no custom harness. The point is to catch the thing that benchmarks published at launch can’t: silent, day-to-day changes in how a model actually performs in production.

In the week leading up to the Opus 4.8 release, the tracker caught exactly that. Opus 4.7’s pass rate dropped well below its established baseline and stayed there for five consecutive days, a sustained and statistically significant degradation, before recovering the moment Opus 4.8 took over.

What the tracker showed

Below is the tracker through today. The dashed line is the Opus 4.7 baseline (65%) and the shaded band is the significance threshold. Anything below the band is a statistically meaningful drop. You can see the five-day collapse in late May, followed by a sharp rebound once Opus 4.8 shipped on May 28.

[Chart: Daily Trend. Pass rate over time, with optional 95% confidence intervals around each point; dashed line at the 65% baseline with a ±13.0% significance threshold.]

[Chart: Weekly Trend. Aggregated 7-day pass rate, with the same optional confidence intervals; dashed line at the 65% baseline with a ±4.3% significance threshold.]
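As a rough illustration of how the daily band is read, the sketch below takes the 65% baseline and the ±13.0% daily threshold quoted above as given and checks them against the daily pass rates reported in the version table later in this article. The function name, the strict less-than comparison, and the reading of "±13.0%" as percentage points are assumptions made for illustration, not the tracker's actual implementation.

```python
# Values quoted in the article. Treating "±13.0%" as percentage points
# around the baseline is an assumption.
BASELINE = 65.0
DAILY_THRESHOLD = 13.0

# Daily pass rates (percent), copied from the version table in this article.
daily_pass_rates = {
    "May 21": 64, "May 22": 50, "May 23": 54, "May 24": 50,
    "May 25": 50, "May 26": 52, "May 27": 66, "May 28": 72,
}


def days_below_band(rates, baseline=BASELINE, threshold=DAILY_THRESHOLD):
    """Return the days whose pass rate is strictly below baseline - threshold."""
    lower_edge = baseline - threshold  # 52.0 for the daily band
    return [day for day, rate in rates.items() if rate < lower_edge]


print(days_below_band(daily_pass_rates))
# ['May 22', 'May 24', 'May 25']
```

Under this strict comparison, the 54% and 52% days sit at or just inside the 52% lower edge of the daily band. The narrower ±4.3% band applies to the aggregated 7-day rate, which this sketch does not compute.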

Other Clues

We track a variety of other per-run metrics alongside pass rate, and most of them, including output tokens and average runtime, were unaffected through the dip. Two stood out: tool calls and input tokens both showed a clear pattern over the same window.

[Chart: Other Metrics. Daily benchmark resource and execution trends, with panels for Tool Calls (total tool invocations by day) and Input Tokens (daily total input usage).]

Tool calls spiked by roughly 60% per task across the degraded days while input tokens dropped, then both snapped back the moment Opus 4.8 took over. You can explore these series live on the tracker.

The 2.1.150 release

Every benchmark run records the Claude Code CLI version it ran on, and lining those versions up against the daily pass rate is the most suggestive clue of all. The drop starts on the exact day version 2.1.150 was installed, holds across 2.1.150 and 2.1.152, and lifts the day 2.1.153 lands.

Date	Claude Code	Daily pass rate
May 21	2.1.148	64%
May 22	2.1.150	50%
May 23	2.1.150	54%
May 24	2.1.150	50%
May 25	2.1.150	50%
May 26	2.1.152	52%
May 27	2.1.153	66%
May 28	2.1.156	72%
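The version-by-version comparison can be reproduced from the table alone. The following is a minimal sketch that groups the daily pass rates by Claude Code version and averages each group. The helper name is hypothetical, and the unweighted mean assumes each day ran the same number of tasks, which the article does not state.

```python
from collections import defaultdict
from statistics import mean

# (date, Claude Code version, daily pass rate in percent), copied from the table.
RUNS = [
    ("May 21", "2.1.148", 64),
    ("May 22", "2.1.150", 50),
    ("May 23", "2.1.150", 54),
    ("May 24", "2.1.150", 50),
    ("May 25", "2.1.150", 50),
    ("May 26", "2.1.152", 52),
    ("May 27", "2.1.153", 66),
    ("May 28", "2.1.156", 72),
]


def mean_pass_rate_by_version(runs):
    """Group daily pass rates by CLI version and average each group.

    Assumes equal task counts per day, so a plain mean is used.
    """
    grouped = defaultdict(list)
    for _date, version, pass_rate in runs:
        grouped[version].append(pass_rate)
    return {version: mean(rates) for version, rates in grouped.items()}


for version, rate in mean_pass_rate_by_version(RUNS).items():
    print(f"{version}: {rate:.1f}%")
# 2.1.148: 64.0%
# 2.1.150: 51.0%
# 2.1.152: 52.0%
# 2.1.153: 66.0%
# 2.1.156: 72.0%
```

Grouped this way, the two versions installed during the dip average 51% and 52%, while the versions before and after sit at 64% or higher.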

Conclusion

This appears to be a harness issue, not a model regression. The drop tracks the Claude Code version rather than the model: the onset lines up with a CLI update while Opus 4.7 was unchanged, and the agent started making far more tool calls per task. The clues point to something introduced around Claude Code 2.1.150 and 2.1.151.

The issue appears to have been resolved, but this is not the first time a degradation has lined up with the run-up to a new model release, which leaves open questions. We will continue to track frontier agents to detect future degradations.

source & further reading

marginlab.ai — original article
