Anthropic's Claude 3.7 Sonnet Improves Coding and Reasoning

wpnews.pro

cd /news/large-language-models/anthropic-s-claude-3-7-sonnet-improv… · home › topics › large-language-models › article

[ARTICLE · art-16597] src=letsdatascience.com ↗ pub=2026-05-28T15:37Z topic=large-language-models verified=true sentiment=↑ positive

Anthropic's Claude 3.7 Sonnet Improves Coding and Reasoning

Anthropic released Claude 3.7 Sonnet in February 2025, a mid-tier model that achieves 80.8% on the SWE-bench Verified benchmark for real-world GitHub bug fixes. The model adds an Extended Thinking mode and stronger multi-step reasoning, positioning it as a cost-effective alternative to flagship Opus models for code-heavy developer workflows.

read3 min views11 publishedMay 28, 2026

Anthropic's Claude 3.7 Sonnet, released in February 2025, is a mid-tier model that multiple outlets report as materially stronger on coding tasks than earlier Sonnet releases. SmashingApps reports Claude 3.7 Sonnet achieves 80.8% on SWE-bench Verified, a benchmark of real-world GitHub bug fixes; that article also attributes stronger multi-step reasoning and a new "Extended Thinking" mode to the release. LLM-stats and MorphLLM provide comparative data showing Opus-tier models generally outperform Sonnet on many benchmarks while Sonnet offers lower per-token pricing, per LLM-stats. Editorial analysis: For practitioners, the practical takeaway is that Sonnet-class models continue to trade cost for close-to-Opus performance on developer tasks, making them attractive for code-heavy workflows where price-performance matters.

What happened

Anthropic released the model family upgrade that public coverage identifies as Claude 3.7 Sonnet (release date reported as February 2025). SmashingApps reports that Claude 3.7 Sonnet achieves 80.8% on SWE-bench Verified, a benchmark that measures whether a model can correctly fix real GitHub issues with test validation. SmashingApps also describes the model as adding an Extended Thinking mode and stronger multi-step reasoning compared with prior Sonnet iterations. LLM-stats and MorphLLM publish comparative tables showing Sonnet's relative position inside Anthropic's tiering and versus competitors such as GPT-4o and Anthropic's Opus family.

Technical details

Per public benchmark aggregators cited in coverage, Sonnet sits in Anthropic's middle tier: providers and aggregators list three Claude tiers-Haiku (latency/volume), Sonnet (cost-performance), and Opus (flagship performance). LLM-stats reports context window and pricing differentials: for example, the Sonnet tier is reported as cheaper per input/output token than Opus, and Opus models are reported to provide larger context windows (LLM-stats comparison). MorphLLM's aggregation includes multiple SWE-bench Verified scores across Claude generations and flags contamination and self-reporting caveats on provider-published benchmark numbers.

Editorial analysis: Industry-pattern observations: Models in a middle "workhorse" tier often aim to maximize price-performance for engineering tasks. Observers compiling benchmark suites typically see Opus-equivalent architectures retain small performance edges while Sonnet-class models close much of the gap for routine developer workflows. These trade-offs matter when teams choose between latency/cost and top-end benchmark performance.

Context and significance

Editorial analysis: For practitioners: The combination of higher SWE-bench-like scores and an Extended Thinking mode, as reported, suggests Sonnet-class models are being positioned by public coverage as better suited for multi-file debugging and multi-step code edits where reasoning across several steps matters. Aggregators such as LLM-stats and MorphLLM show Opus models generally lead on benchmark suites while Sonnet remains materially cheaper per token, which changes the cost calculus for production systems that call models frequently for code generation or repair.

What to watch

Editorial analysis: Indicators an observer should monitor include independent third-party SWE-bench Pro or other contamination-mitigated evaluations, provider transparency on training-set overlap with benchmark corpora, and published pricing/context changes from Anthropic. Also watch head-to-head blind preference tests and developer-reported end-to-end metric changes (bug-fix rates, PR acceptance, engineer time saved) rather than isolated benchmark numbers.

Limitations of public data

What's reported in public aggregations varies: SmashingApps attributes a 80.8% SWE-bench Verified score to Claude 3.7 Sonnet, while MorphLLM and LLM-stats present slightly different score tables across Sonnet and Opus generations and explicitly note contamination and provider self-reporting caveats. Those discrepancies mean absolute rankings should be treated cautiously; relative trends across many tests are more robust than any single published score.

Editorial analysis: Practical recommendation for teams: evaluate Sonnet-class models on representative internal developer tasks and consider cost-per-fix metrics rather than relying solely on public leaderboard positions.

Scoring Rationale #

This is a notable model-tier update relevant to practitioners who run code-generation workloads: it refines price-performance trade-offs but is not a frontier paradigm shift. Aggregator discrepancies and contamination caveats reduce headline certainty.

Practice interview problems based on real data

1,500+ SQL & Python problems across 15 industry datasets — the exact type of data you work with.

Try 250 free problems

source & further reading

letsdatascience.com — original article Court Reprimands Lawyer for AI Hallucinations in Briefs Ghostcommit: PNG prompt-injection makes AI agents leak repository secrets Google Expands Gemini Ad Agents In India

~/api · this article 200

$curl api.wpnews.pro/v1/news/anthropic-s-claude-3-7-s…

Read original on letsdatascience.com → letsdatascience.com/news/anthropics-claude-37-so…

mentioned entities

Anthropic

Claude 3.7 Sonnet

SmashingApps

LLM-stats

MorphLLM

SWE-bench Verified

metadata

sluganthropic-s-claude-3-7-sonnet-improves-coding-and-reasoning

topic#large-language-models

secondary3 topics

sentimentpositive

canonicalletsdatascience.com

navigation

← prevGoogle Launches Gemini Omni To C…

next →Pope Leo XIV Urges Government Re…

── more in #large-language-models 4 stories · sorted by recency

economictimes.indiatimes.com · 12 Jul · #large-language-models

Anthropic extends Fable 5 access through July 19

dev.to · 12 Jul · #large-language-models

AI can't run your company yet. Here's the math, and what to automate instead.

lilting.ch · 12 Jul · #large-language-models

Court

support.claude.com · 12 Jul · #large-language-models

Claude Code get 50% more weekly limit

── more on @anthropic 3 stories trending now

wpnews · 23 May · #artificial-intelligence

AccessLens — a blind person's lanyard, powered by Gemma 4 on-device

wpnews · 30 May · #ai-safety

Nightcord Security Analysis Report - Threat Investigation

wpnews · 21 May · #developer-tools

Antigravity CLI: A Hands-On Guide to Google's Terminal Coding Agent

sponsored brought to you by zahid.host 4,200+ EU-deployed projects

reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main

→ Live at https://your-agent.zahid.host ✓

Get free account → Pricing

from €0/mo · no card required