# Claude Code Degraded Before Opus 4.8 Release

> Source: <https://marginlab.ai/blog/claude-code-degraded-before-opus-4-8/>
> Published: 2026-05-29 12:43:59+00:00

# Claude Code degraded for the week before Opus 4.8's release

Our SWE-Bench-Pro tracker caught a statistically significant, weeklong drop in Claude Code's pass rate just before Opus 4.8 shipped, and the recovery that followed.

We run [Claude Code](/trackers/claude-code) against a curated subset of [SWE-Bench-Pro](/explorers/swe-bench-pro) every day, in the
plain CLI with the current SOTA model and no custom harness. The point is to catch the thing that benchmarks published at launch can’t:
silent, day-to-day changes in how a model actually performs in production.

In the week leading up to the Opus 4.8 release, the tracker caught exactly that. Opus 4.7’s pass rate dropped well below its established baseline and stayed there for five consecutive days, a sustained and statistically significant degradation, before recovering the moment Opus 4.8 took over.

## What the tracker showed

Below is the tracker through today. The dashed line is the Opus 4.7 baseline (65%) and the shaded band is the significance threshold. Anything below the band is a statistically meaningful drop. You can see the five-day collapse in late May, followed by a sharp rebound once Opus 4.8 shipped on May 28.

### Daily Trend

[Chart: daily pass rate over time, with an optional 95% CI around each point. Dashed line at the 65% baseline with a ±13.0% significance threshold.]

### Weekly Trend

[Chart: aggregated 7-day pass rate, with an optional 95% CI around each 7-day window. Dashed line at the 65% baseline with a ±4.3% significance threshold.]
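To make the band logic concrete, here is a minimal sketch of checking a pass rate against the published baseline and thresholds. The function name and the strict "below the band" comparison are assumptions for illustration; the tracker's actual significance test is not described in this post.

```python
# Values published on the tracker charts, in percentage points.
BASELINE_PCT = 65.0
DAILY_BAND_PCT = 13.0    # daily significance threshold
WEEKLY_BAND_PCT = 4.3    # 7-day significance threshold


def below_band(pass_rate_pct: float, band_pct: float,
               baseline_pct: float = BASELINE_PCT) -> bool:
    """Return True when the pass rate falls strictly below the shaded band.

    Hypothetical helper: it only compares against the published band and
    does not reproduce however the tracker derives that band.
    """
    return pass_rate_pct < baseline_pct - band_pct


# A 50% day sits 15 points under the baseline, outside the daily band.
print(below_band(50.0, DAILY_BAND_PCT))
# A 64% day sits 1 point under the baseline, inside the daily band.
print(below_band(64.0, DAILY_BAND_PCT))
```

The weekly band is much tighter than the daily one, so a 7-day aggregate can register as a significant drop even when some individual days sit right at the edge of the daily band.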

## Other Clues

We track a variety of other per-run metrics alongside pass rate, and most of them, including output tokens and average runtime, were unaffected through the dip. Two stood out: tool calls and input tokens both showed a clear pattern over the same window.

### Other Metrics

Daily benchmark resource and execution trends.

[Chart: Tool Calls, total tool invocations by day.]

[Chart: Input Tokens, daily total input usage.]

Tool calls spiked by roughly 60% per task across the degraded days while input tokens dropped, then both snapped back the moment Opus 4.8
took over. You can explore these series live on the [tracker](/trackers/claude-code).

## The 2.1.150 release

Every benchmark run records the Claude Code CLI version it ran on, and lining those versions up against the daily pass rate is the most suggestive clue of all. The drop starts on the exact day version 2.1.150 was installed, holds across 2.1.150 and 2.1.152, and lifts the day 2.1.153 lands.

| Date | Claude Code | Daily pass rate |
|---|---|---|
| May 21 | 2.1.148 | 64% |
| May 22 | 2.1.150 | 50% |
| May 23 | 2.1.150 | 54% |
| May 24 | 2.1.150 | 50% |
| May 25 | 2.1.150 | 50% |
| May 26 | 2.1.152 | 52% |
| May 27 | 2.1.153 | 66% |
| May 28 | 2.1.156 | 72% |
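
The version alignment in the table can be checked with a short sketch that groups the daily pass rates by CLI version and compares each group's mean to the 65% baseline. The data is copied from the table; the helper name is hypothetical, and the unweighted mean assumes each day ran the same number of tasks, which this post does not state.

```python
from collections import defaultdict

# (date, Claude Code version, daily pass rate in %), copied from the table.
ROWS = [
    ("May 21", "2.1.148", 64),
    ("May 22", "2.1.150", 50),
    ("May 23", "2.1.150", 54),
    ("May 24", "2.1.150", 50),
    ("May 25", "2.1.150", 50),
    ("May 26", "2.1.152", 52),
    ("May 27", "2.1.153", 66),
    ("May 28", "2.1.156", 72),
]
BASELINE_PCT = 65


def mean_pass_rate_by_version(rows):
    """Average the daily pass rates for each CLI version (unweighted)."""
    by_version = defaultdict(list)
    for _date, version, rate in rows:
        by_version[version].append(rate)
    return {v: sum(rates) / len(rates) for v, rates in by_version.items()}


means = mean_pass_rate_by_version(ROWS)
for version, mean in means.items():
    delta = mean - BASELINE_PCT
    print(f"{version}: {mean:.1f}% ({delta:+.1f} pts vs baseline)")
```

Grouped this way, the days on 2.1.150 and 2.1.152 average well below the baseline, while the versions on either side sit at or above it.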

## Conclusion

This appears to be a harness issue, not a model regression. The drop tracks the Claude Code version rather than the model, the onset lines up with a CLI update while Opus 4.7 was unchanged, and the agent started making far more tool calls per task. The clues point to something introduced around Claude Code 2.1.150 and 2.1.151.

The issue appears to have been resolved, but this is not the first time a degradation has preceded a new model release, and it leaves open questions. We will continue to track frontier agents to detect future degradations.

