Sonnet 5 review: I ran 64 generations to find out if it's worth it


Source: lennysnewsletter.com · Published: 2026-06-30T23:22Z


Anthropic released Sonnet 5, and Claire Vo tested it against four other frontier models using a custom eval harness built with Claude Code. The blind comparison across PRD quality, prototype generation, agentic tasks, and agent personality revealed unexpected results, with model recommendations varying by task.



I’ve been testing every major frontier model release since the start of the year, and when Anthropic dropped Sonnet 5, I wanted more than a vibe check. I got tired of one-off tests I couldn’t repeat or compare over time, so I built something better: the How I AI Bench, a repeatable eval harness I constructed live using Claude Code while recording this episode. I ran Sonnet 5 blind against four other frontier models (Sonnet 4.6, Opus 4.8, GPT-5.5, and Gemini 3 Pro) across PRD quality, prototype generation, agentic task completion, and agent personality. The results were not what I expected.

Listen or watch on YouTube, Spotify, or Apple Podcasts

What you’ll learn:

What Anthropic claims Sonnet 5 improves over Sonnet 4.6, and where the benchmark data actually backs that up

How I built the How I AI Bench in under 45 minutes using Claude Code, starting from my own stored session history

Why I combined human vibe scoring (70%) with LLM-as-judge scoring (30%) instead of trusting either alone

How to set up a local HTML scoring page so you can rate AI outputs on gut feel and export those scores as JSON

Which model I recommend for PRDs, which for complex prototypes, and which for chatting with an agent daily
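The 70/30 blend of human vibe scores and LLM-as-judge scores can be sketched roughly as below. This is a minimal illustration, not the actual How I AI Bench code: the function names, the JSON field names (`model`, `task`, `score`), the shared numeric scale, and the placeholder model labels are all assumptions made for the example.

```python
import json

# Weights taken from the episode description: 70% human vibe, 30% LLM judge.
HUMAN_WEIGHT = 0.7
JUDGE_WEIGHT = 0.3


def blended_score(human: float, judge: float) -> float:
    """Combine one human score and one judge score on the same scale."""
    return HUMAN_WEIGHT * human + JUDGE_WEIGHT * judge


def leaderboard(human_json: str, judge_scores: dict) -> list:
    """Average blended scores per model and sort best-first.

    human_json   -- hypothetical export from a local scoring page:
                    a JSON list of {"model", "task", "score"} objects.
    judge_scores -- hypothetical dict keyed by (model, task) tuples.
    """
    per_model = {}
    for row in json.loads(human_json):
        judge = judge_scores[(row["model"], row["task"])]
        per_model.setdefault(row["model"], []).append(
            blended_score(row["score"], judge)
        )
    averages = {m: sum(s) / len(s) for m, s in per_model.items()}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Placeholder data only; these are not results from the episode.
    exported = json.dumps([
        {"model": "model_a", "task": "prd", "score": 8},
        {"model": "model_b", "task": "prd", "score": 6},
    ])
    judged = {("model_a", "prd"): 6, ("model_b", "prd"): 10}
    for name, score in leaderboard(exported, judged):
        print(f"{name}: {score:.2f}")
```

Weighting the human score more heavily means a model the judge loves can still rank lower if it loses on gut feel, which is the trade-off the 70/30 split is meant to encode.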

Brought to you by:

**Runway**: The creative AI platform for images, video and more

**Hyperagent**: Deploy fleets of agents that handle real work

In this episode, we cover:

([00:00](https://www.youtube.com/watch?v=yJ-1LB2hF-Q)) Sonnet 5 is out

([01:55](https://www.youtube.com/watch?v=yJ-1LB2hF-Q&t=115s)) What Anthropic claims

([04:02](https://www.youtube.com/watch?v=yJ-1LB2hF-Q&t=242s)) Why I’m done with one-off vibe checks

([05:05](https://www.youtube.com/watch?v=yJ-1LB2hF-Q&t=305s)) Building the How I AI Bench live with Claude Code

([07:42](https://www.youtube.com/watch?v=yJ-1LB2hF-Q&t=462s)) The scoring system

([10:43](https://www.youtube.com/watch?v=yJ-1LB2hF-Q&t=643s)) Agent voice eval

([11:57](https://www.youtube.com/watch?v=yJ-1LB2hF-Q&t=717s)) Quick recap

([13:58](https://www.youtube.com/watch?v=yJ-1LB2hF-Q&t=838s)) Results: The How I AI index leaderboard

([21:21](https://www.youtube.com/watch?v=yJ-1LB2hF-Q&t=1281s)) What I’m improving for the next run

([22:16](https://www.youtube.com/watch?v=yJ-1LB2hF-Q&t=1336s)) Generating a Claire-weighted index

([23:53](https://www.youtube.com/watch?v=yJ-1LB2hF-Q&t=1433s)) Model-by-task recommendations

Tools referenced:

• Claude Sonnet 5: [https://www.anthropic.com/news/claude-sonnet-5](https://www.anthropic.com/news/claude-sonnet-5)

• Claude Opus 4.8: [https://www.anthropic.com/news/claude-opus-4-8](https://www.anthropic.com/news/claude-opus-4-8)

• GPT-5.5 (OpenAI): [https://openai.com/index/introducing-gpt-5-5/](https://openai.com/index/introducing-gpt-5-5/)

• Gemini 3 Pro (Google DeepMind): [https://deepmind.google/models/gemini/pro/](https://deepmind.google/models/gemini/pro/)

• Cursor: [https://www.cursor.com/](https://www.cursor.com/)

Other references:

• SWE-bench Pro (agentic coding benchmark referenced): [https://www.swebench.com/](https://www.swebench.com/)

Where to find Claire Vo:

ChatPRD: [https://www.chatprd.ai/](https://www.chatprd.ai/)

Website: [https://clairevo.com/](https://clairevo.com/)

LinkedIn: [https://www.linkedin.com/in/clairevo/](https://www.linkedin.com/in/clairevo/)

Production and marketing by [https://penname.co/](https://penname.co/).
