SWE-Bench-Pro

Mentions: 2 · Type: Organization

// recent coverage 2 mentions

2026-05-30 05:04 · dev.to · ai-agents

I tracked Claude Code and Codex pass-rates for 95 days — what "getting dumber" actually looks like

A developer tracked daily SWE-Bench-Pro pass rates for Claude Code and Codex over 95 days, revealing that Claude Code's pass rate jumped 11 percentage points to ~65% after a release but recently drift…

2026-05-29 12:43 · marginlab.ai · large-language-models

Claude Code Degraded Before Opus 4.8 Release

Anthropic's Claude Code agent suffered a statistically significant five-day performance degradation immediately before the Opus 4.8 model release, with pass rates dropping from a 65% baseline to as lo…

// co-occurs with top 5 entities

Claude Code (2), Opus 4.8 (1), Opus 4.7 (1), Codex (1), Opus (1)