# Claude Opus 4.8 is out. The benchmark isn't why I'm switching.

> Source: <https://dev.to/hunter_g_50e2ec233acd07b5/claude-opus-48-is-out-the-benchmark-isnt-why-im-switching-5flo>
> Published: 2026-05-29 00:00:36+00:00

Anthropic shipped Claude Opus 4.8 today. The benchmark numbers went up, as they always do. But that's not why I'm switching my default model, and I want to explain the part that actually changed how I work.

Here's the official comparison:

The highlights:

And one I'll call out honestly: on **Terminal-Bench 2.1, Opus 4.8 scores 74.6% and GPT-5.5 wins at 78.2%**. 4.8 jumped a lot from its predecessor (66.1%), but it isn't first on that one. Pick your model for what you actually do.

Opus 4.8 is roughly **4x less likely than 4.7 to let a code flaw pass without flagging it.** It proactively points out uncertainty, questions sketchy inputs, and pushes back on plans it thinks are unsound.

That sounds small. It isn't.

When you hand work to an agent, raw capability was never the real bottleneck — *silent failure* was. The model that writes a subtle off-by-one and says nothing costs you more than the model that's slightly worse but says "I'm not sure this input is ever non-null, can you confirm?"

Concretely:

Before:it writes a function that looks clean, ships a hidden edge-case bug, says nothing. You find it in production.

After:it writes the same function and adds "there's an edge case here I'm not confident about — double-check the input is non-empty," or flat-out tells you your plan has a hole.

For anyone treating Claude as a coworker that ships work unattended, that calibrated honesty is worth more than a few benchmark points.

`system`

entries mid-array without breaking the prompt cacheDatabricks reported **61% lower token cost** vs 4.7 on their workloads, because 4.8 uses tools more efficiently and takes fewer steps.

Model ID is `claude-opus-4-8`

, available everywhere today.

The next moat in agents isn't IQ. It's calibrated honesty — the model that tells you when it's unsure is the one you can actually delegate to. That's the upgrade I care about here.

*Numbers and image from Anthropic's announcement. Full evals are in the system card.*