Your weekly listens from How I AI, part of the Lenny's Podcast Network
Claude Fable 5 review: what the new Mythos model gets right (and very wrong)
Listen now on YouTube • Spotify • Apple Podcasts
Claire puts Claude Fable 5, Anthropic's first generally available Mythos-class model, through a series of real-world tests: product specs, agent workflows, design tasks, vision tasks, and multi-agent orchestration. She breaks down what Anthropic is claiming, where the model genuinely feels like a leap forward, and where it surprisingly falls short.
Biggest takeaways:
**Fable 5 is Anthropic's first "Mythos-class" model to reach general availability, and it's crushing benchmarks across the board.** It hit 80% on SWE-bench Pro, significantly outperforming Opus 4.8, GPT-4.5, and Gemini 3.1 Pro. Claire found the model excels in specific areas while falling short in others that matter for everyday product work.
The model is expensive by design: $10 per million input tokens and $50 per million output tokens. That's a new tier above Opus, and it consumes tokens at roughly twice the rate of other models. You need to be strategic about when to deploy this level of intelligence versus using cheaper models like Sonnet or Opus for simpler tasks.
Fable 5 works like a "seasoned engineer," which is both its superpower and its Achilles' heel. It's thorough, autonomous, and will investigate every corner of a problem to be 120% sure it's shipping the right thing. Sometimes you need a model that's a little less thorough, a little "dumber," to actually ship something useful quickly.
The model is exceptionally good at vision tasks, particularly document formatting and PDF parsing. Claire tested it on creating handwriting worksheets for her 7-year-old and found it dramatically outperformed Opus 4.8: better spacing, clearer layout, appropriate white space. This extends to other vision tasks where you want something to look good or need to parse complex documents.
The writing is nearly unreadable for specs and PRDs. Claire found that Fable 5 produces extremely detailed, technically complete documents that are almost impossible to parse. It gets wrapped around the axle on details, creates big blocks of dense paragraphs with internal references, and makes it hard to see the forest for the trees.
Design output is shockingly bad, at least for one-shot design tasks. When Claire asked Fable to design a skills registry, it produced fundamentally terrible design: gray, black, red, simple outlines. This was a real surprise given the model's benchmark performance.
The model is conservative on execution and takes "minimal" very literally. When Claire asked it to ship an MVP that would deliver customer value, Fable produced something extremely narrow and not actually that useful. This conservatism may stem from the safety guardrails built into the model.
Fable 5 includes specific safeguards for cybersecurity, biology, chemistry, and distillation tasks. Instead of blocking you entirely, it uses a new "fallback" concept: if you get classified into one of these categories, it gracefully falls back to Opus 4.8. Anthropic reports that 95% of sessions don't hit a fallback, and they maintain a 30-day retention policy solely to catch misuse.
Multi-agent orchestration is technically possible but not yet reliable. Claire tested the dynamic workflows and subagent capabilities extensively and had some successful multi-agent runs, but also encountered frequent stalls and errors. She walked away from her laptop and came back to find subagents had stalled after about three hours.
The key insight: match model intelligence to task complexity. Claire recommends using it for hard technical problems where extreme detail matters, long-horizon work, and vision tasks. But for front-end work, strategy, specs, and design, other models in the ecosystem will serve you better and cost less.
This is "baby Mythos," not the full Mythos model. Fable 5 has guardrails that the unrestricted Mythos model (available only to Project Glasswing partners) doesn't have. The underlying model is the same, but Fable is tuned for safety and general availability.
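To make the pricing takeaway concrete, here is a rough back-of-the-envelope sketch in Python. The $10 and $50 per million token prices are the ones quoted above. The token counts, the function name `estimate_cost`, and the way the "roughly twice the rate" figure is applied as a flat multiplier are all illustrative assumptions, not anything Anthropic publishes.

```python
# Prices quoted in the episode (USD per million tokens).
INPUT_PRICE_PER_M = 10.0
OUTPUT_PRICE_PER_M = 50.0


def estimate_cost(input_tokens: int, output_tokens: int,
                  token_multiplier: float = 1.0) -> float:
    """Return the estimated USD cost of one task.

    token_multiplier scales both token counts to model the "roughly twice
    the rate" consumption mentioned above (an assumption, not a spec).
    """
    scaled_in = input_tokens * token_multiplier
    scaled_out = output_tokens * token_multiplier
    return (scaled_in * INPUT_PRICE_PER_M
            + scaled_out * OUTPUT_PRICE_PER_M) / 1_000_000


# Hypothetical task: 200k input tokens and 40k output tokens,
# measured on some other model.
baseline = estimate_cost(200_000, 40_000)
doubled = estimate_cost(200_000, 40_000, token_multiplier=2.0)
print(f"${baseline:.2f} at face value, ${doubled:.2f} with 2x token usage")
```

Running the same calculation with a cheaper model's published prices is a quick way to decide whether a task really needs this tier.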
Blog from this episode:
How I AI: My Honest Review of Claude Fable 5: https://www.chatprd.ai/how-i-ai/claude-fable-5-review
How Braintrust uses AI agents, evals, and CI to ship better software | Ankur Goyal
Listen now on YouTube • Spotify • Apple Podcasts
Claire sits down with Ankur Goyal, the founder and CEO of Braintrust, to unpack how top engineering teams are using AI agents, evals, and CI to ship better software faster. They get into why agents are now capable of tackling hard infrastructure problems, how to decide what work sits "below the agent line," and why evals are quickly becoming the modern version of a PRD. Ankur's core message: the best teams won't just use AI to write more code; they'll build the feedback loops, benchmarks, and systems that let AI improve the quality of the product itself.
Biggest takeaways:
There's no staff engineer running as many rigorous benchmarks as someone using an agent. Ankur viscerally disagrees with engineers who say AI can't handle complicated problems. While models might not be perfect at writing highly concurrent code, they excel at running exhaustive experiments: testing every column store format, every execution engine, every optimization strategy. The baseline of rigor you get from agents is incredible, and there's simply no excuse anymore to skip benchmarks because they're tedious.
The agent line keeps going up, and you need to identify what's below it. Many interactions, decisions, and directions that feel like they need human judgment actually fit "below the agent line." If you took the information from a meeting and gave it to an agent, would it solve the same problem? Increasingly, the answer is yes. The best teams push this line higher by building smart skills and integrations that expand what agents can handle autonomously.
Practical quality beats theoretical quality every time. In theory, a human engineer with infinite time and focus might produce better code than an AI agent. In practice, humans lose context over days, have decaying attention spans on hard-but-tedious problems, and skip benchmarks they know they should run. AI agents maintain consistent focus, run every test, and can work on problems continuously for days or weeks. The practical quality of AI-assisted engineering is higher because of sustained rigor, not because the code is theoretically better.
You can now bite off much harder technical problems than before. Companies historically avoid major infrastructure changes because the cost of testing alternatives is prohibitively high and the unknown unknowns are risky. With AI agents, you can exhaustively test six different database solutions, run thousands of benchmarks on production-scale data, and make informed decisions about platform shifts that would have been impossible before. The business case for deep technical work becomes much easier when agents do the heavy lifting.
Run four to six foreground agents simultaneously; that's the human concurrency limit. Ankur runs different agents working on different problems. This matches the personal concurrency limit most people can manage; you can't effectively context switch between more than that. Some agents run locally, and others run remotely on cloud infrastructure with production-scale data. The key is isolation: each agent has its own environment, ports, and services.
Evals are the modern PRD: they define what success looks like, not how to achieve it. Machine learning shifts programming from defining implementation details to defining success criteria. Just like the best PRDs include user stories and examples, the best evals include concrete test cases and scoring functions. The difference is that evals quantify success in ways that can be automatically measured and improved. This lets you focus on outcomes while AI figures out the implementation.
Build a feedback loop that automatically turns real-world data into evals. For AI product teams, the #1 engineering priority isn't prompt engineering or picking an agent framework; it's building a pipeline that summons real-world data and converts it into evals. This is the same principle as investing in CI for traditional software: you're building the platform that lets agents do the work engineers used to do manually. Without this feedback loop, you're stuck in whack-a-mole mode, fixing individual cases without systematic improvement.
Quantify your designer's taste so it scales across your product. Ankur runs hundreds of evals to improve things quantitatively, then asks David (their tastemaker designer) for a vibe check every few days. When David destroys his work, Ankur captures the feedback ("David thinks it's OK to show both languages as long as . . .") and improves the scoring functions to encode David's palate. This doesn't replace David; it amplifies him. They're able to apply David's quality bar to more things than he could ever review manually.
**Product building is now carving, not constructing. It's extremely fast to create something with too many features, too many buttons, and too much code.** The hard part is removing stuff. When customers complain, Braintrust removes the thing causing confusion 90% of the time, making the system work better by eliminating complexity. This is the opposite of traditional product development, where you carefully add features one by one.
Invest in CI to earn the ability to move faster; it's the platform for AI-powered engineering. Every engineer is now building a platform upon which agents do the work engineers used to do manually. For traditional software, that platform is CI. If you feel constrained by velocity, don't ship crappy stuff faster. Instead, invest in and improve CI so you earn the ability to move faster safely. The same principle applies to AI products: build the eval pipeline first, then let agents optimize within that system.
When agents fail, close the session and improve the evals; don't yell or bribe. Ankur's back-pocket strategy is remarkably disciplined: he doesn't try to prompt his way out of problems. He closes the session, improves the evaluation criteria or success metrics, and starts fresh. Sometimes this means hand-writing code to better understand the problem (like when he spent a weekend hand-writing a 3,000-line eval that had become trash through vibe coding). The solution is always better evals, not better prompting.
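The "concrete test cases and scoring functions" idea from the evals takeaway can be sketched in a few lines of plain Python. This is not Braintrust's SDK or Ankur's actual setup; `EvalCase`, `exact_match`, `run_eval`, and the toy task are hypothetical names chosen for illustration, and a real eval would call a model or agent where the lookup table sits.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One concrete test case: an input and the output we consider correct."""
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the texts match, ignoring case and outer whitespace."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(task: Callable[[str], str],
             cases: list[EvalCase],
             scorer: Callable[[str, str], float]) -> float:
    """Run the task on every case and return the mean score."""
    scores = [scorer(task(case.input), case.expected) for case in cases]
    return sum(scores) / len(scores)


# Hypothetical task under test: a stand-in for a model or agent call.
def toy_translate(text: str) -> str:
    lookup = {"hello": "hola", "thanks": "gracias"}
    return lookup.get(text, "")


cases = [
    EvalCase("hello", "Hola"),
    EvalCase("thanks", "gracias"),
    EvalCase("goodbye", "adios"),  # not covered by the toy task, scores 0.0
]

score = run_eval(toy_translate, cases, exact_match)
print(f"mean score: {score:.2f}")
```

The success criterion lives in the cases and the scorer, so the task implementation can change freely while the number stays comparable. Encoding feedback like David's would mean adding cases or tightening the scorer, not rewriting prompts.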
Blog and detailed workflow walkthroughs from this episode:
**Blog:** Ankur Goyal's Playbook for Agent-Driven Benchmarking and AI Evals: https://www.chatprd.ai/how-i-ai/ankur-goyals-playbook-for-agent-driven-benchmarking-and-ai-evals
Workflows:
↳ How to Scale Expert Judgment in AI Systems with a Human Feedback Loop: https://www.chatprd.ai/how-i-ai/workflows/how-to-scale-expert-judgment-in-ai-systems-with-a-human-feedback-loop
↳ How to Use AI Coding Agents for Exhaustive Infrastructure Benchmarking: https://www.chatprd.ai/how-i-ai/workflows/how-to-use-ai-coding-agents-for-exhaustive-infrastructure-benchmarking
If you're enjoying these episodes, reply and let me know what you'd love to learn more about: AI workflows, hiring, growth, product strategy, anything.
Catch you next week,
Lenny
P.S. Want every new episode delivered the moment it drops? Hit "Follow" on your favorite podcast app.