Anthropic shipped Claude Sonnet 5 on June 30 with a blunt pitch: Opus-class agentic performance at 60% of the price. On Terminal-Bench 2.1 — the benchmark that measures real terminal and tool-use workflows — it beats Opus 4.8 outright. On SWE-Bench Pro it closes the gap to within six points. The introductory pricing of $2/$10 per million tokens expires August 31. Here is what you need to know before you swap model IDs.
The Benchmark Story
Numbers first, because the claim needs to hold up.
Sonnet 5 scores 63.2% on SWE-Bench Pro versus Opus 4.8’s 69.2%. That six-point gap still matters for the hardest coding tasks, but for most use cases it no longer justifies paying $5/$25 instead of $3/$15 per million tokens, a premium of about 67% at standard pricing. The bigger story is Terminal-Bench 2.1, where Sonnet 5 posts 80.4% against Opus 4.8’s 74.6%. If your agents spend significant time running shell commands and terminal sessions, the mid-tier model now wins outright.
Additional benchmarks reinforce the pattern: 85.2% on SWE-Bench Verified, 84.7% on BrowseComp single-agent (86.6% multi-agent), and 81.2% on OSWorld-Verified. Sonnet 5 beats Sonnet 4.6 on every published benchmark without exception.
The context window also jumped: 1 million tokens, matching Opus 4.8, up from Sonnet 4.6’s 200k. If you were running Opus just for long-context tasks, that reason is now gone.
| Model | SWE-Bench Pro | Terminal-Bench 2.1 | Input ($/M) | Output ($/M) | Context |
|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 58.1% | — | $3 | $15 | 200k |
| Claude Sonnet 5 | 63.2% | 80.4% | $3* | $15* | 1M |
| Claude Opus 4.8 | 69.2% | 74.6% | $5 | $25 | 1M |

Prices marked * are standard rates; Sonnet 5 is billed at the introductory $2 / $10 per million tokens through August 31.
The Pricing Window Is Real, and It Closes August 31
Through August 31, Sonnet 5 runs at $2 input / $10 output per million tokens. After that, standard pricing kicks in at $3 / $15. Opus 4.8 stays at $5 / $25.
The introductory period is a migration incentive. At standard pricing, Sonnet 5 will cost the same as Sonnet 4.6 — but with a 1M context window and meaningfully better agentic performance. That is the durable value proposition. The next eight weeks just make it cheaper to test.
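To make the price tiers concrete, here is a minimal sketch that prices one workload at each rate quoted above. The tier names and the workload size (500M input and 100M output tokens per month) are illustrative assumptions, not figures from Anthropic.

```python
# Prices in USD per million tokens, taken from the figures quoted above.
PRICES = {
    "sonnet-5-intro": (2.00, 10.00),     # through August 31
    "sonnet-5-standard": (3.00, 15.00),  # after August 31
    "opus-4.8": (5.00, 25.00),
}


def monthly_cost(input_tokens: int, output_tokens: int, tier: str) -> float:
    """Return the USD cost of a workload at the given price tier."""
    input_price, output_price = PRICES[tier]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 500M input tokens and 100M output tokens per month.
for tier in PRICES:
    print(f"{tier}: ${monthly_cost(500_000_000, 100_000_000, tier):,.2f}")
```

For this assumed workload the three tiers come to $2,000, $3,000, and $5,000. Swap in your own token volumes, ideally recounted with the new tokenizer, before drawing conclusions.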
Five Breaking Changes You Cannot Skip
Do not just swap the model ID. Sonnet 5 has five breaking changes relative to Sonnet 4.6, and some of them will silently change your app’s behavior rather than throw an error. The full list is in Anthropic’s migration docs.
1. Adaptive Thinking Is On by Default
On Sonnet 4.6, omitting a `thinking` field meant no thinking ran. On Sonnet 5, the same request runs with adaptive thinking. If your app assumes no reasoning overhead, you need to explicitly disable it:
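```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-5",
    max_tokens=4096,
    thinking={"type": "disabled"},  # required if you want 4.6 behavior
    messages=[...],
)
```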
2. Manual Budget Tokens Gone — Use Effort Instead
The `thinking: {type: "enabled", budget_tokens: N}` pattern is no longer accepted. Anthropic replaced it with an `effort` parameter: `low`, `medium`, `high`, `max`, or `xhigh`. The model manages token allocation internally based on effort level.
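One way to handle this during migration is to rewrite legacy request arguments before they reach the client. The sketch below is only an illustration: `budget_to_effort`, `migrate_thinking`, and the token thresholds are hypothetical, and placing `effort` under `extra_body` simply mirrors the request examples in this article.

```python
import copy


def budget_to_effort(budget_tokens: int) -> str:
    """Map a legacy thinking budget to an effort level.

    The thresholds are illustrative assumptions, not published guidance.
    """
    if budget_tokens <= 2_000:
        return "low"
    if budget_tokens <= 8_000:
        return "medium"
    return "high"


def migrate_thinking(params: dict) -> dict:
    """Return a copy of request kwargs with budget_tokens replaced by effort."""
    migrated = copy.deepcopy(params)
    thinking = migrated.get("thinking")
    if isinstance(thinking, dict) and "budget_tokens" in thinking:
        del migrated["thinking"]
        effort = budget_to_effort(thinking["budget_tokens"])
        # Placement under extra_body follows the examples in this article.
        migrated.setdefault("extra_body", {})["effort"] = effort
    return migrated


legacy = {
    "model": "claude-sonnet-5",
    "max_tokens": 4096,
    "thinking": {"type": "enabled", "budget_tokens": 4000},
}
print(migrate_thinking(legacy))
```

Requests that set `thinking` to `{"type": "disabled"}` pass through unchanged, and the input dictionary is never mutated. Tune the thresholds against your own quality and cost measurements.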
3. Sampling Parameters Are Out
Non-default values for `temperature`, `top_p`, and `top_k` are no longer accepted. If your prompts relied on temperature tuning for creativity or determinism, that lever is gone. Audit your API calls before migrating.
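A small audit helper can do that check in code. This is a sketch under a conservative assumption: it removes the three sampling keys whenever they are present rather than working out which values count as defaults. `strip_sampling_params` is a hypothetical name.

```python
SAMPLING_KEYS = ("temperature", "top_p", "top_k")


def strip_sampling_params(params: dict) -> tuple[dict, list[str]]:
    """Return request kwargs without sampling parameters, plus the names removed."""
    removed = [key for key in SAMPLING_KEYS if key in params]
    cleaned = {key: value for key, value in params.items() if key not in SAMPLING_KEYS}
    return cleaned, removed


request = {"model": "claude-sonnet-5", "max_tokens": 1024, "temperature": 0.2, "top_k": 40}
cleaned, removed = strip_sampling_params(request)
print(cleaned)
print(removed)
```

Logging the `removed` list per call site gives you an inventory of prompts that depended on sampling settings and may need rewording instead.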
4. The Tokenizer Changed — Recount Everything
Sonnet 5 uses a new tokenizer. The same input text produces approximately 30% more tokens than on Sonnet 4.6. This affects your context window capacity, your cost estimates, and any `max_tokens` limits you tuned for the previous model. Re-run all token counts before going to production.
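For a first-pass estimate before you recount properly, you can scale existing counts by the quoted figure. The sketch below assumes the roughly 30% increase applies uniformly, which it may not for your content; the function names are hypothetical.

```python
TOKEN_INFLATION = 1.30  # approximate figure quoted above; measure on your own data


def estimate_new_tokens(old_tokens: int, inflation: float = TOKEN_INFLATION) -> int:
    """Estimate the Sonnet 5 token count for text that used old_tokens on Sonnet 4.6."""
    return round(old_tokens * inflation)


def fits_in_context(old_tokens: int, context_window: int = 1_000_000) -> bool:
    """Check whether the re-estimated prompt still fits the context window."""
    return estimate_new_tokens(old_tokens) <= context_window


print(estimate_new_tokens(150_000))
print(fits_in_context(700_000))
print(fits_in_context(800_000))
```

If the 30% figure holds for your text, the same input at standard pricing would cost roughly 30% more on Sonnet 5 than on Sonnet 4.6 even though the per-token price matches. Treat these numbers as planning estimates and confirm them with real token counts.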
5. Legacy Beta Headers
If you are migrating from Claude 4.1 or earlier (not just 4.6), remove legacy beta headers from your requests. Sonnet 5 also handles the `model_context_window_exceeded` stop reason differently than older models.
The Effort Level Trap
Here is where teams will bleed money without discipline. The `effort` parameter controls how much thinking budget the model allocates. Higher effort means better results, and more tokens.
At `effort: "xhigh"`, Sonnet 5’s performance approaches Opus 4.8 at medium-to-high effort. But running `xhigh` on Sonnet 5 can cost more than running Opus 4.8 at a comparable quality point. The cheaper model stops being cheaper if you dial it to maximum.
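The break-even point follows from the price ratio. As a rough sketch, assuming thinking tokens are billed at the output rate and that token usage is the only cost difference, the calculation looks like this (`breakeven_multiplier` is a hypothetical helper):

```python
def breakeven_multiplier(cheap_price: float, expensive_price: float) -> float:
    """Token multiple at which the cheaper model costs as much as the pricier one."""
    return expensive_price / cheap_price


# Output prices in USD per million tokens, from the figures quoted in this article.
print(round(breakeven_multiplier(15.00, 25.00), 2))  # Sonnet 5 standard vs Opus 4.8
print(round(breakeven_multiplier(10.00, 25.00), 2))  # Sonnet 5 introductory vs Opus 4.8
```

Under those assumptions, Sonnet 5 at standard pricing loses its cost advantage once it needs more than about 1.67 times the tokens Opus 4.8 uses for the same task, or 2.5 times during the introductory period.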
The practical default: start at `effort: "medium"` for most agentic tasks. Escalate to `high` only when you see quality degradation. Reserve `xhigh` for tasks where you have confirmed that medium and high are not cutting it, then benchmark whether Opus 4.8 would be more economical.
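```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Default for most agentic tasks.
response = client.messages.create(
    model="claude-sonnet-5",
    max_tokens=8000,
    extra_body={"effort": "medium"},
    messages=[...],
)

# Escalate only when medium shows quality degradation.
response = client.messages.create(
    model="claude-sonnet-5",
    max_tokens=16000,
    extra_body={"effort": "high"},
    messages=[...],
)
```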
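The escalation policy described above can be automated with a small loop. This is a sketch, not a recommended implementation: `call_model` and `is_good_enough` are hypothetical callables you would supply, the first wrapping your API request for a given effort level and the second encoding your own quality check.

```python
from typing import Callable, Sequence


def run_with_escalation(
    call_model: Callable[[str], str],
    is_good_enough: Callable[[str], bool],
    efforts: Sequence[str] = ("medium", "high"),
) -> tuple[str, str]:
    """Try each effort level in order and return (effort_used, output)."""
    if not efforts:
        raise ValueError("efforts must contain at least one level")
    output = ""
    for effort in efforts:
        output = call_model(effort)
        if is_good_enough(output):
            return effort, output
    # Nothing passed the check; return the last (highest-effort) attempt.
    return efforts[-1], output


# Stand-in callables for demonstration; replace them with real API calls and checks.
def fake_model(effort: str) -> str:
    return f"answer@{effort}"


print(run_with_escalation(fake_model, lambda out: out.endswith("high")))
```

Keeping `xhigh` out of the default ladder matches the advice above: add it only for tasks where you have measured that lower levels fall short, and compare the cost against Opus 4.8 when you do.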
When to Use Sonnet 5 vs Opus 4.8
Use Sonnet 5 for: high-volume agentic pipelines, terminal and shell automation, browser-based tasks, long-context document processing, anything running hundreds of times per day.
Keep Opus 4.8 for: correctness-critical one-shot coding tasks where the 6-point SWE-Bench Pro gap matters, and workflows where you have benchmarked the quality difference and it justifies the $5/$25 price.
The model-vs-effort decision is no longer simple. Sonnet 5 at high effort can match or beat Opus 4.8 at low effort on several benchmarks. Run your own tests on real workloads before locking in a migration path. The TechCrunch breakdown and MarkTechPost’s benchmark comparison both include detailed evaluation data worth reviewing before making that call.
Where It Is Available
Sonnet 5 is GA on the Claude API, AWS Bedrock, and Microsoft Foundry (GA July 1). Google Cloud Vertex AI and GitHub Copilot are also supported. It is available on OpenRouter. The model ID is `claude-sonnet-5`, with dated snapshot IDs available for pinning in production via the models overview page.
Bottom Line
Sonnet 5 is the right default model for most developer teams right now. The performance-to-price ratio at medium effort beats everything else Anthropic offers. The introductory pricing makes this month the right time to test. But do not migrate without handling the breaking changes — adaptive thinking on by default and the tokenizer shift are both silent breaks that will cost you before you notice them.