{"slug": "sonnet-5-review-i-ran-64-generations-to-find-out-if-it-s-worth-it", "title": "Sonnet 5 review: I ran 64 generations to find out if it's worth it", "summary": "Anthropic released Sonnet 5, and Claire Vo tested it against four other frontier models using a custom eval harness built with Claude Code. The blind comparison across PRD quality, prototype generation, agentic tasks, and agent personality revealed unexpected results, with model recommendations varying by task.", "body_md": "I’ve been testing every major frontier model release since the start of the year, and when Anthropic dropped Sonnet 5, I wanted more than a vibe check. I got tired of one-off tests I couldn’t repeat or compare over time, so I built something better: the How I AI Bench, a repeatable eval harness I constructed live using Claude Code while recording this episode. I ran Sonnet 5 blind against four other frontier models (Sonnet 4.6, Opus 4.8, GPT-5.5, and Gemini 3 Pro) across PRD quality, prototype generation, agentic task completion, and agent personality. 
The results were not what I expected.\n\n**Listen or watch on YouTube, Spotify, or Apple Podcasts**\n\n### What you’ll learn:\n\nWhat Anthropic claims Sonnet 5 improves over Sonnet 4.6, and where the benchmark data actually backs that up\n\nHow I built the How I AI Bench in under 45 minutes using Claude Code, starting from my own stored session history\n\nWhy I combined human vibe scoring (70%) with LLM-as-judge scoring (30%) instead of trusting either alone\n\nHow to set up a local HTML scoring page so you can rate AI outputs on gut feel and export those scores as JSON\n\nWhich model I recommend for PRDs, which for complex prototypes, and which for chatting with an agent daily\n\n### Brought to you by:\n\n**Runway**—The creative AI platform for images, video and more\n\n**Hyperagent**—Deploy fleets of agents that handle real work\n\n### In this episode, we cover:\n\n([00:00](https://www.youtube.com/watch?v=yJ-1LB2hF-Q)) Sonnet 5 is out\n\n([01:55](https://www.youtube.com/watch?v=yJ-1LB2hF-Q&t=115s)) What Anthropic claims\n\n([04:02](https://www.youtube.com/watch?v=yJ-1LB2hF-Q&t=242s)) Why I’m done with one-off vibe checks\n\n([05:05](https://www.youtube.com/watch?v=yJ-1LB2hF-Q&t=305s)) Building the How I AI Bench live with Claude Code\n\n([07:42](https://www.youtube.com/watch?v=yJ-1LB2hF-Q&t=462s)) The scoring system\n\n([10:43](https://www.youtube.com/watch?v=yJ-1LB2hF-Q&t=643s)) Agent voice eval\n\n([11:57](https://www.youtube.com/watch?v=yJ-1LB2hF-Q&t=717s)) Quick recap\n\n([13:58](https://www.youtube.com/watch?v=yJ-1LB2hF-Q&t=838s)) Results: The How I AI index leaderboard\n\n([21:21](https://www.youtube.com/watch?v=yJ-1LB2hF-Q&t=1281s)) What I’m improving for the next run\n\n([22:16](https://www.youtube.com/watch?v=yJ-1LB2hF-Q&t=1336s)) Generating a Claire-weighted index\n\n([23:53](https://www.youtube.com/watch?v=yJ-1LB2hF-Q&t=1433s)) Model-by-task recommendations\n\n### Tools referenced:\n\n• Claude Sonnet 5: 
[https://www.anthropic.com/news/claude-sonnet-5](https://www.anthropic.com/news/claude-sonnet-5)\n\n• Claude Opus 4.8: [https://www.anthropic.com/news/claude-opus-4-8](https://www.anthropic.com/news/claude-opus-4-8)\n\n• GPT-5.5 (OpenAI): [https://openai.com/index/introducing-gpt-5-5/](https://openai.com/index/introducing-gpt-5-5/)\n\n• Gemini 3 Pro (Google DeepMind): [https://deepmind.google/models/gemini/pro/](https://deepmind.google/models/gemini/pro/)\n\n• Cursor: [https://www.cursor.com/](https://www.cursor.com/)\n\n### Other references:\n\n• SWE-bench Pro (agentic coding benchmark referenced): [https://www.swebench.com/](https://www.swebench.com/)\n\n### Where to find Claire Vo:\n\nChatPRD: [https://www.chatprd.ai/](https://www.chatprd.ai/)\n\nWebsite: [https://clairevo.com/](https://clairevo.com/)\n\nLinkedIn: [https://www.linkedin.com/in/clairevo/](https://www.linkedin.com/in/clairevo/)\n\nProduction and marketing by [https://penname.co/](https://penname.co/). For inquiries about sponsoring the podcast, email [email protected].", "url": "https://wpnews.pro/news/sonnet-5-review-i-ran-64-generations-to-find-out-if-it-s-worth-it", "canonical_source": "https://www.lennysnewsletter.com/p/sonnet-5-review-i-ran-64-generations", "published_at": "2026-06-30 23:22:23+00:00", "updated_at": "2026-06-30 23:28:47.521798+00:00", "lang": "en", "topics": ["large-language-models", "ai-agents", "ai-products", "ai-tools", "ai-research"], "entities": ["Anthropic", "Claude Sonnet 5", "Claude Opus 4.8", "GPT-5.5", "Gemini 3 Pro", "OpenAI", "Google DeepMind", "Claire Vo"], "alternates": {"html": "https://wpnews.pro/news/sonnet-5-review-i-ran-64-generations-to-find-out-if-it-s-worth-it", "markdown": "https://wpnews.pro/news/sonnet-5-review-i-ran-64-generations-to-find-out-if-it-s-worth-it.md", "text": "https://wpnews.pro/news/sonnet-5-review-i-ran-64-generations-to-find-out-if-it-s-worth-it.txt", "jsonld": 
"https://wpnews.pro/news/sonnet-5-review-i-ran-64-generations-to-find-out-if-it-s-worth-it.jsonld"}}