Sonnet 5 review: I ran 64 generations to find out if it's worth it

Anthropic released Sonnet 5, and Claire Vo tested it against four other frontier models using a custom eval harness built with Claude Code. The blind comparison across PRD quality, prototype generation, agentic tasks, and agent personality revealed unexpected results, with model recommendations varying by task.

I’ve been testing every major frontier model release since the start of the year, and when Anthropic dropped Sonnet 5, I wanted more than a vibe check. I got tired of one-off tests I couldn’t repeat or compare over time, so I built something better: the How I AI Bench, a repeatable eval harness I constructed live using Claude Code while recording this episode. I ran Sonnet 5 blind against four other frontier models (Sonnet 4.6, Opus 4.8, GPT-5.5, and Gemini 3 Pro) across PRD quality, prototype generation, agentic task completion, and agent personality. The results were not what I expected.

Listen or watch on YouTube, Spotify, or Apple Podcasts

What you’ll learn:
• What Anthropic claims Sonnet 5 improves over Sonnet 4.6, and where the benchmark data actually backs that up
• How I built the How I AI Bench in under 45 minutes using Claude Code, starting from my own stored session history
• Why I combined human vibe scoring (70%) with LLM-as-judge scoring (30%) instead of trusting either alone
• How to set up a local HTML scoring page so you can rate AI outputs on gut feel and export those scores as JSON
• Which model I recommend for PRDs, which for complex prototypes, and which for chatting with an agent daily

Brought to you by:
• Runway: The creative AI platform for images, video and more
• Hyperagent: Deploy fleets of agents that handle real work

In this episode, we cover:
00:00 Sonnet 5 is out: https://www.youtube.com/watch?v=yJ-1LB2hF-Q
01:55 What Anthropic claims: https://www.youtube.com/watch?v=yJ-1LB2hF-Q&t=115s
04:02 Why I’m done with one-off vibe checks: https://www.youtube.com/watch?v=yJ-1LB2hF-Q&t=242s
05:05 Building the How I AI Bench live with Claude Code: https://www.youtube.com/watch?v=yJ-1LB2hF-Q&t=305s
07:42 The scoring system: https://www.youtube.com/watch?v=yJ-1LB2hF-Q&t=462s
10:43 Agent voice eval: https://www.youtube.com/watch?v=yJ-1LB2hF-Q&t=643s
11:57 Quick recap: https://www.youtube.com/watch?v=yJ-1LB2hF-Q&t=717s
13:58 Results: The How I AI index leaderboard: https://www.youtube.com/watch?v=yJ-1LB2hF-Q&t=838s
21:21 What I’m improving for the next run: https://www.youtube.com/watch?v=yJ-1LB2hF-Q&t=1281s
22:16 Generating a Claire-weighted index: https://www.youtube.com/watch?v=yJ-1LB2hF-Q&t=1336s
23:53 Model-by-task recommendations: https://www.youtube.com/watch?v=yJ-1LB2hF-Q&t=1433s

Tools referenced:
• Claude Sonnet 5: https://www.anthropic.com/news/claude-sonnet-5
• Claude Opus 4.8: https://www.anthropic.com/news/claude-opus-4-8
• GPT-5.5 (OpenAI): https://openai.com/index/introducing-gpt-5-5/
• Gemini 3 Pro (Google DeepMind): https://deepmind.google/models/gemini/pro/
• Cursor: https://www.cursor.com/

Other references:
• SWE-bench Pro (agentic coding benchmark referenced): https://www.swebench.com/

Where to find Claire Vo:
• ChatPRD: https://www.chatprd.ai/
• Website: https://clairevo.com/
• LinkedIn: https://www.linkedin.com/in/clairevo/

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email [email protected].
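
For readers who want to try the 70/30 blend of human vibe scoring and LLM-as-judge scoring mentioned in the notes above, here is a minimal Python sketch. It is not the actual How I AI Bench code. The JSON field names (`model`, `human`, `judge`), the function names, the assumption that both scores share one scale, and the sample numbers are all hypothetical placeholders for illustration; only the 70% and 30% weights come from the episode notes.

```python
import json

HUMAN_WEIGHT = 0.7  # from the notes: human vibe scoring counts for 70%
JUDGE_WEIGHT = 0.3  # from the notes: LLM-as-judge scoring counts for 30%


def blended_score(human: float, judge: float) -> float:
    """Weighted blend of a human score and an LLM-judge score.

    Assumes both scores are on the same scale (an assumption, not from the source).
    """
    return HUMAN_WEIGHT * human + JUDGE_WEIGHT * judge


def leaderboard(exported_json: str) -> list[tuple[str, float]]:
    """Rank models by blended score from a JSON export.

    The schema (a list of objects with "model", "human", "judge") is hypothetical.
    """
    rows = json.loads(exported_json)
    ranked = [
        (row["model"], round(blended_score(row["human"], row["judge"]), 2))
        for row in rows
    ]
    # Highest blended score first
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Made-up placeholder scores, not results from the episode
sample = json.dumps([
    {"model": "model-a", "human": 8.0, "judge": 6.0},
    {"model": "model-b", "human": 6.0, "judge": 9.0},
])
print(leaderboard(sample))
```

With the weights set this way, a model the human rater prefers can outrank one the LLM judge prefers, which matches the stated goal of not trusting either signal alone.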