{"slug": "claude-opus-4-8-is-out-the-benchmark-isn-t-why-i-m-switching", "title": "Claude Opus 4.8 is out. The benchmark isn't why I'm switching.", "summary": "Anthropic released Claude Opus 4.8 today, a model that a developer says is worth switching to not because of benchmark scores but because it is roughly four times less likely than its predecessor to let a code flaw pass without flagging it. The model proactively points out uncertainty, questions sketchy inputs, and pushes back on unsound plans, addressing what the developer calls the real bottleneck in agentic workflows: silent failure. Databricks reported 61% lower token costs compared to Opus 4.7 on their workloads, as the new model uses tools more efficiently and takes fewer steps.", "body_md": "Anthropic shipped Claude Opus 4.8 today. The benchmark numbers went up, as they always do. But that's not why I'm switching my default model, and I want to explain the part that actually changed how I work.\n\nHere's the official comparison:\n\nThe highlights:\n\nAnd one I'll call out honestly: on **Terminal-Bench 2.1, Opus 4.8 scores 74.6% and GPT-5.5 wins at 78.2%**. 4.8 jumped a lot from its predecessor (66.1%), but it isn't first on that one. Pick your model for what you actually do.\n\nOpus 4.8 is roughly **4x less likely than 4.7 to let a code flaw pass without flagging it.** It proactively points out uncertainty, questions sketchy inputs, and pushes back on plans it thinks are unsound.\n\nThat sounds small. It isn't.\n\nWhen you hand work to an agent, raw capability was never the real bottleneck — *silent failure* was. The model that writes a subtle off-by-one and says nothing costs you more than the model that's slightly worse but says \"I'm not sure this input is ever non-null, can you confirm?\"\n\nConcretely:\n\nBefore:it writes a function that looks clean, ships a hidden edge-case bug, says nothing. You find it in production.\n\nAfter:it writes the same function and adds \"there's an edge case here I'm not confident about — double-check the input is non-empty,\" or flat-out tells you your plan has a hole.\n\nFor anyone treating Claude as a coworker that ships work unattended, that calibrated honesty is worth more than a few benchmark points.\n\n`system`\n\nentries mid-array without breaking the prompt cacheDatabricks reported **61% lower token cost** vs 4.7 on their workloads, because 4.8 uses tools more efficiently and takes fewer steps.\n\nModel ID is `claude-opus-4-8`\n\n, available everywhere today.\n\nThe next moat in agents isn't IQ. It's calibrated honesty — the model that tells you when it's unsure is the one you can actually delegate to. That's the upgrade I care about here.\n\n*Numbers and image from Anthropic's announcement. Full evals are in the system card.*", "url": "https://wpnews.pro/news/claude-opus-4-8-is-out-the-benchmark-isn-t-why-i-m-switching", "canonical_source": "https://dev.to/hunter_g_50e2ec233acd07b5/claude-opus-48-is-out-the-benchmark-isnt-why-im-switching-5flo", "published_at": "2026-05-29 00:00:36+00:00", "updated_at": "2026-05-29 00:11:00.374090+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-products", "ai-agents", "ai-safety"], "entities": ["Anthropic", "Claude Opus 4.8", "GPT-5.5", "Terminal-Bench 2.1"], "alternates": {"html": "https://wpnews.pro/news/claude-opus-4-8-is-out-the-benchmark-isn-t-why-i-m-switching", "markdown": "https://wpnews.pro/news/claude-opus-4-8-is-out-the-benchmark-isn-t-why-i-m-switching.md", "text": "https://wpnews.pro/news/claude-opus-4-8-is-out-the-benchmark-isn-t-why-i-m-switching.txt", "jsonld": "https://wpnews.pro/news/claude-opus-4-8-is-out-the-benchmark-isn-t-why-i-m-switching.jsonld"}}