[ARTICLE · art-45741] src=artificialanalysis.ai ↗ pub= topic=large-language-models verified=true sentiment=· neutral

Claude Sonnet 5: strong agentic performance at a higher cost per task

Anthropic's Claude Sonnet 5 scores 53 on the Artificial Analysis Intelligence Index, matching GPT-5.5 with high reasoning, but costs $2.29 per task, about 15% more than Opus 4.8, because it uses more tokens. The model excels at agentic knowledge work, outperforming Opus 4.8 on AA-Briefcase and GDPval-AA, but trails larger models on heavy reasoning benchmarks.

Published June 30, 2026 · 3 min read

Claude Sonnet 5 achieves 53 on the Artificial Analysis Intelligence Index, but without promotional pricing will cost more per task than Opus 4.8

We worked with Anthropic to evaluate Claude Sonnet 5 ahead of release. With max effort, it improves 6 points over Sonnet 4.6 to reach the same Intelligence Index score as GPT-5.5 with high reasoning, but it remains behind Opus 4.7 and 4.8.

Key takeaways:

➤ Claude Sonnet 5 is the #5 model on the Artificial Analysis Intelligence Index, only 2-3 points behind GPT-5.5 (xhigh) and Opus 4.8 (max)

➤ With max effort, Sonnet 5 works harder than previous Anthropic models: it used ~40% more output tokens per Intelligence Index task than Sonnet 4.6, and ~3x the agentic turns in our knowledge work evaluations, AA-Briefcase and GDPval-AA. This behavior scales well with the ‘effort’ setting: max effort uses around 6x more turns than low effort on GDPval-AA

➤ Claude Sonnet 5 costs more per task than Opus 4.8 before accounting for promotional pricing: Claude Sonnet 5 costs $2.29 per task on the Intelligence Index, roughly 2x the cost of Sonnet 4.6 and ~15% more than Claude Opus 4.8. This is driven entirely by increased token usage. Sonnet 5 retains the same $3/$15 per 1M input/output token pricing as Sonnet 4.6 (compared to $5/$25 for Opus 4.8); however, Anthropic is offering a one-third reduction to $2/$10 until September 1. Our results use standard $3/$15 pricing

➤ Sonnet 5 matches or outperforms Opus 4.8 on agentic knowledge work tasks: on both AA-Briefcase and GDPval-AA, Claude Sonnet 5 sits just ahead of Opus 4.8, trailing only Claude Fable 5 (which is not currently generally available). These benchmarks test the ability of models to produce accurate and well-presented professional outputs using our open-source reference agent harness, Stirrup

➤ For reasoning and knowledge-heavy tasks, Sonnet 5 still sits behind its larger siblings: despite substantial gains across many evaluations, heavy reasoning and knowledge benchmarks still show Opus 4.8 ahead of Sonnet 5. On CritPt, a frontier physics reasoning benchmark developed by researchers at Argonne and UIUC, Sonnet 5 scores 17%, which is 14 points higher than its predecessor but behind GLM-5.2, Claude Opus and Fable, and GPT-5.5 (xhigh and Pro)

➤ Sonnet 5 also showed significant improvements over Sonnet 4.6 on Terminal-Bench v2.1 (+9 points), Humanity’s Last Exam (+10 points), and SciCode (+7 points), with relatively flat scores elsewhere
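
The cost comparison in the takeaways above comes down to simple arithmetic, sketched here in Python. The baseline costs for Sonnet 4.6 and Opus 4.8 are backed out from the rounded "~2x" and "~15%" ratios, so they are rough implications rather than published figures. The promotional estimate assumes the one-third discount applies uniformly to every billed token, which the article does not confirm for cache pricing. All variable names are illustrative.

```python
# Figures stated in the article, all at standard $3/$15 pricing.
SONNET_5_COST_PER_TASK = 2.29  # USD per Intelligence Index task
RATIO_VS_SONNET_4_6 = 2.0      # "~2x" (rounded in the source)
RATIO_VS_OPUS_4_8 = 1.15       # "~15% more" (rounded in the source)

# Per-token price ratio, Sonnet 5 vs Opus 4.8: $3/$5 input and $15/$25 output.
PRICE_RATIO_VS_OPUS_4_8 = 3 / 5

# Promotional pricing: $3/$15 -> $2/$10, i.e. one third off.
PROMO_DISCOUNT = 1 / 3

# Approximate baselines implied by the rounded ratios (not published figures).
implied_sonnet_4_6_cost = SONNET_5_COST_PER_TASK / RATIO_VS_SONNET_4_6
implied_opus_4_8_cost = SONNET_5_COST_PER_TASK / RATIO_VS_OPUS_4_8

# Assumption: cache pricing scales with list price in the same way for both
# models, so the cost ratio divided by the price ratio approximates how much
# more price-weighted token volume Sonnet 5 used per task than Opus 4.8.
implied_token_volume_ratio = RATIO_VS_OPUS_4_8 / PRICE_RATIO_VS_OPUS_4_8

# Assumption: the one-third discount applies uniformly to every billed token.
promo_sonnet_5_cost = SONNET_5_COST_PER_TASK * (1 - PROMO_DISCOUNT)

print(f"Implied Sonnet 4.6 cost per task: ${implied_sonnet_4_6_cost:.2f}")
print(f"Implied Opus 4.8 cost per task:   ${implied_opus_4_8_cost:.2f}")
print(f"Implied token volume vs Opus 4.8: {implied_token_volume_ratio:.2f}x")
print(f"Sonnet 5 at promotional pricing:  ${promo_sonnet_5_cost:.2f}")
```

Under these assumptions, Opus 4.8 comes out at roughly $1.99 per task and Sonnet 5 at roughly $1.53 under promotional pricing, which is consistent with the headline: Sonnet 5 is the more expensive of the two per task only at standard pricing.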

Other key model details:

➤ Context window of 1 million tokens (equivalent to Sonnet 4.6)

➤ Pricing of $3/$15 per 1M tokens of input/output (reduced to $2/$10 until September 1); cache pricing remains at a 25% premium for cache writes ($3.75 per million tokens) with a 5-minute time to live, and a 90% discount for cache hits ($0.30 per million tokens)

➤ Effort remains the recommended way of configuring model performance and latency. Sonnet 5 adds an ‘xhigh’ effort setting relative to Sonnet 4.6, matching the 5 effort levels available on Opus 4.8 (max, xhigh, high, medium, low)
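
The pricing details above can be turned into a small cost calculator. This is a minimal sketch, assuming cache writes and cache hits are billed relative to the input price at the stated 25% premium and 90% discount. The `Pricing` class, the `request_cost` function, and the token counts in the example are hypothetical and not part of any Anthropic SDK. Only standard pricing is modeled, because the article does not say how the promotion affects cache rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-million-token prices in USD (hypothetical helper, not an SDK type)."""
    input_per_m: float
    output_per_m: float
    cache_write_premium: float = 0.25  # 25% premium over the input price
    cache_hit_discount: float = 0.90   # 90% discount off the input price

    @property
    def cache_write_per_m(self) -> float:
        return self.input_per_m * (1 + self.cache_write_premium)

    @property
    def cache_hit_per_m(self) -> float:
        return self.input_per_m * (1 - self.cache_hit_discount)


def request_cost(
    pricing: Pricing,
    input_tokens: int,
    output_tokens: int,
    cache_write_tokens: int = 0,
    cache_hit_tokens: int = 0,
) -> float:
    """Return the USD cost of one request given token counts per billing class."""
    per_million = 1_000_000
    return (
        input_tokens * pricing.input_per_m
        + output_tokens * pricing.output_per_m
        + cache_write_tokens * pricing.cache_write_per_m
        + cache_hit_tokens * pricing.cache_hit_per_m
    ) / per_million


# Standard Sonnet 5 pricing from the article: $3 input, $15 output per 1M tokens.
SONNET_5_STANDARD = Pricing(input_per_m=3.0, output_per_m=15.0)

# Hypothetical token counts, chosen only to make the arithmetic easy to check.
example_cost = request_cost(
    SONNET_5_STANDARD, input_tokens=1_000_000, output_tokens=1_000_000
)
print(f"Cache write price: ${SONNET_5_STANDARD.cache_write_per_m:.2f} per 1M tokens")
print(f"Cache hit price:   ${SONNET_5_STANDARD.cache_hit_per_m:.2f} per 1M tokens")
print(f"Example request:   ${example_cost:.2f}")
```

The derived cache prices reproduce the $3.75 and $0.30 per million tokens quoted above, which is a quick check that the premium and discount are both applied to the input price.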

Compare Claude Sonnet 5 with other leading models at: https://artificialanalysis.ai/models/claude-sonnet-5

