🎙️ How I AI: Claude Fable 5 review & How Braintrust uses AI agents, evals, and CI to ship better software


Your weekly listens from How I AI, part of the Lenny’s Podcast Network

Claude Fable 5 review: what the new Mythos model gets right (and very wrong)

Listen now on YouTube • Spotify • Apple Podcasts

Claire puts Claude Fable 5, Anthropic’s first generally available Mythos-class model, through a series of real-world tests: product specs, agent workflows, design tasks, vision tasks, and multi-agent orchestration. She breaks down what Anthropic is claiming, where the model genuinely feels like a leap forward, and where it surprisingly falls short.

Biggest takeaways:

**Fable 5 is Anthropic’s first “Mythos-class” model to reach general availability, and it’s crushing benchmarks across the board.** It hit 80% on SWE-Bench Pro, significantly outperforming Opus 4.8, GPT-4.5, and Gemini 3.1 Pro. Claire found the model excels in specific areas while falling short in others that matter for everyday product work.

**The model is expensive by design: $10 per million input tokens and $50 per million output tokens.** That’s a new tier above Opus, and it consumes tokens at roughly twice the rate of other models. You need to be strategic about when to deploy this level of intelligence versus using cheaper models like Sonnet or Opus for simpler tasks.

**Fable 5 works like a “seasoned engineer,” which is both its superpower and its Achilles’ heel.** It’s thorough, autonomous, and will investigate every corner of a problem to be 120% sure it’s shipping the right thing. Sometimes you need a model that’s a little less thorough, a little “dumber,” to actually ship something useful quickly.

**The model is exceptionally good at vision tasks, particularly document formatting and PDF parsing.** Claire tested it on creating handwriting worksheets for her 7-year-old and found it dramatically outperformed Opus 4.8: better spacing, clearer layout, appropriate white space. This extends to other vision tasks where you want something to look good or need to parse complex documents.

**The writing is nearly unreadable for specs and PRDs.** Claire found that Fable 5 produces extremely detailed, technically complete documents that are almost impossible to parse. It gets wrapped around the axle on details, creates big blocks of dense paragraphs with internal references, and makes it hard to see the forest for the trees.

**Design output is shockingly bad, at least for one-shot design tasks.** When Claire asked Fable to design a skills registry, it produced a fundamentally terrible design: gray, black, red, simple outlines. This was a real surprise given the model’s benchmark performance.

**The model is conservative on execution and takes “minimal” very literally.** When Claire asked it to ship an MVP that would deliver customer value, Fable produced something extremely narrow and not actually that useful. This conservatism may stem from the safety guardrails built into the model.

**Fable 5 includes specific safeguards for cybersecurity, biology, chemistry, and distillation tasks.** Instead of blocking you entirely, it uses a new “fallback” concept: if your session is classified into one of these categories, it gracefully falls back to Opus 4.8. Anthropic reports that 95% of sessions don’t hit a fallback, and they maintain a 30-day retention policy solely to catch misuse.

**Multi-agent orchestration is technically possible but not yet reliable.** Claire tested the dynamic workflows and subagent capabilities extensively and had some successful multi-agent runs, but also encountered frequent stalls and errors. She walked away from her laptop and came back to find subagents had stalled after about three hours.

**The key insight: match model intelligence to task complexity.** Claire recommends using Fable 5 for hard technical problems where extreme detail matters, long-horizon work, and vision tasks. But for front-end work, strategy, specs, and design, other models in the ecosystem will serve you better and cost less.

**This is “baby Mythos,” not the full Mythos model.** Fable 5 has guardrails that the unrestricted Mythos model (available only to Project Glasswing partners) doesn’t have. The underlying model is the same, but Fable is tuned for safety and general availability.
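To make the pricing takeaway concrete, here is a rough back-of-the-envelope sketch in Python. The two prices come from the summary above; the function name, the token counts, and the idea of modeling "roughly twice the rate" as a single multiplier on both input and output tokens are illustrative assumptions, not anything Anthropic or Claire published.

```python
# Prices quoted in the takeaways above, in USD per million tokens.
INPUT_PRICE_PER_MTOK = 10.0
OUTPUT_PRICE_PER_MTOK = 50.0


def fable_cost_usd(input_tokens: int, output_tokens: int,
                   token_multiplier: float = 1.0) -> float:
    """Estimate the cost of one task.

    token_multiplier is a hypothetical knob: it scales both token counts
    to approximate a model that consumes more tokens for the same work.
    """
    scaled_in = input_tokens * token_multiplier
    scaled_out = output_tokens * token_multiplier
    total = scaled_in * INPUT_PRICE_PER_MTOK + scaled_out * OUTPUT_PRICE_PER_MTOK
    return total / 1_000_000


# Made-up workload: 200k input tokens and 40k output tokens.
baseline = fable_cost_usd(200_000, 40_000)
doubled = fable_cost_usd(200_000, 40_000, token_multiplier=2.0)
print(f"baseline: ${baseline:.2f}, at 2x token usage: ${doubled:.2f}")
```

With these made-up numbers the same task goes from $4.00 to $8.00 once the doubled token consumption is factored in, which is the kind of quick estimate that helps decide whether a task justifies the higher tier.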

Blog from this episode:

How I AI: My Honest Review of Claude Fable 5: https://www.chatprd.ai/how-i-ai/claude-fable-5-review

How Braintrust uses AI agents, evals, and CI to ship better software | Ankur Goyal

Listen now on YouTube • Spotify • Apple Podcasts


Claire sits down with Ankur Goyal, the founder and CEO of Braintrust, to unpack how top engineering teams are using AI agents, evals, and CI to ship better software faster. They get into why agents are now capable of tackling hard infrastructure problems, how to decide what work sits “below the agent line,” and why evals are quickly becoming the modern version of a PRD. Ankur’s core message: the best teams won’t just use AI to write more code; they’ll build the feedback loops, benchmarks, and systems that let AI improve the quality of the product itself.

Biggest takeaways:

**There’s no staff engineer running as many rigorous benchmarks as someone using an agent.** Ankur viscerally disagrees with engineers who say AI can’t handle complicated problems. While models might not be perfect at writing highly concurrent code, they excel at running exhaustive experiments: testing every column store format, every execution engine, every optimization strategy. The baseline of rigor you get from agents is incredible, and there’s simply no excuse anymore to skip benchmarks because they’re tedious.

**The agent line keeps going up, and you need to identify what’s below it.** Many interactions, decisions, and directions that feel like they need human judgment actually fit “below the agent line.” If you took the information from a meeting and gave it to an agent, would it solve the same problem? Increasingly, the answer is yes. The best teams push this line higher by building smart skills and integrations that expand what agents can handle autonomously.

**Practical quality beats theoretical quality every time.** In theory, a human engineer with infinite time and focus might produce better code than an AI agent. In practice, humans lose context over days, have decaying attention spans on hard-but-tedious problems, and skip benchmarks they know they should run. AI agents maintain consistent focus, run every test, and can work on problems continuously for days or weeks. The practical quality of AI-assisted engineering is higher because of sustained rigor, not because the code is theoretically better.

**You can now bite off much harder technical problems than before.** Companies historically avoid major infrastructure changes because the cost of testing alternatives is prohibitively high and the unknown unknowns are risky. With AI agents, you can exhaustively test six different database solutions, run thousands of benchmarks on production-scale data, and make informed decisions about platform shifts that would have been impossible before. The business case for deep technical work becomes much easier when agents do the heavy lifting.

**Run four to six foreground agents simultaneously; that’s the human concurrency limit.** Ankur runs four to six agents at once, each working on a different problem. This matches the personal concurrency limit most people can manage; you can’t effectively context switch between more than that. Some agents run locally, and others run remotely on cloud infrastructure with production-scale data. The key is isolation: each agent has its own environment, ports, and services.

**Evals are the modern PRD: they define what success looks like, not how to achieve it.** Machine learning shifts programming from defining implementation details to defining success criteria. Just like the best PRDs include user stories and examples, the best evals include concrete test cases and scoring functions. The difference is that evals quantify success in ways that can be automatically measured and improved. This lets you focus on outcomes while AI figures out the implementation.

**Build a feedback loop that automatically turns real-world data into evals.** For AI product teams, the #1 engineering priority isn’t prompt engineering or picking an agent framework; it’s building a pipeline that pulls in real-world data and converts it into evals. This is the same principle as investing in CI for traditional software: you’re building the platform that lets agents do the work engineers used to do manually. Without this feedback loop, you’re stuck in whack-a-mole mode, fixing individual cases without systematic improvement.

**Quantify your designer’s taste so it scales across your product.** Ankur runs hundreds of evals to improve things quantitatively, then asks David (their tastemaker designer) for a vibe check every few days. When David destroys his work, Ankur captures the feedback (“David thinks it’s OK to show both languages as long as . . .”) and improves the scoring functions to encode David’s palate. This doesn’t replace David; it amplifies him. They’re able to apply David’s quality bar to more things than he could ever review manually.

**Product building is now carving, not constructing.** It’s extremely fast to create something with too many features, too many buttons, and too much code. The hard part is removing stuff. When customers complain, Braintrust removes the thing causing confusion 90% of the time, making the system work better by eliminating complexity. This is the opposite of traditional product development, where you carefully add features one by one.

**Invest in CI to earn the ability to move faster; it’s the platform for AI-powered engineering.** Every engineer is now building a platform upon which agents do the work engineers used to do manually. For traditional software, that platform is CI. If you feel constrained by velocity, don’t ship crappy stuff faster. Instead, invest in and improve CI so you earn the ability to move faster safely. The same principle applies to AI products: build the eval pipeline first, then let agents optimize within that system.

**When agents fail, close the session and improve the evals; don’t yell or bribe.** Ankur’s back-pocket strategy is remarkably disciplined: he doesn’t try to prompt his way out of problems. He closes the session, improves the evaluation criteria or success metrics, and starts fresh. Sometimes this means hand-writing code to better understand the problem (like when he spent a weekend hand-writing a 3,000-line eval that had become trash through vibe coding). The solution is always better evals, not better prompting.
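For readers who have not written an eval before, here is a minimal Python sketch of the idea that evals pair concrete test cases with a scoring function, and that the resulting score can act as a CI gate. This is not Braintrust’s SDK: `EvalCase`, `exact_match`, `run_eval`, `passes_gate`, the toy task, and the 0.9 threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One concrete example: an input and the output we consider correct."""
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Scoring function: 1.0 if the trimmed strings are identical, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_eval(task: Callable[[str], str], cases: list[EvalCase],
             scorer: Callable[[str, str], float]) -> float:
    """Run the task on every case and return the mean score."""
    if not cases:
        raise ValueError("an eval needs at least one case")
    scores = [scorer(task(case.input), case.expected) for case in cases]
    return sum(scores) / len(scores)


def passes_gate(score: float, threshold: float) -> bool:
    """CI-style gate: the change is acceptable only if the score clears the bar."""
    return score >= threshold


def toy_task(text: str) -> str:
    """Stand-in for a model or agent call (assumption): title-case a language name."""
    return text.strip().title()


# The third case is the kind of real-world miss you would add after a complaint.
CASES = [
    EvalCase(" python ", "Python"),
    EvalCase("rust", "Rust"),
    EvalCase("typescript", "TypeScript"),
]

score = run_eval(toy_task, CASES, exact_match)
print(f"mean score: {score:.2f}, passes CI gate: {passes_gate(score, 0.9)}")
```

The toy task gets two of three cases right, so the gate fails. In the spirit of the takeaways above, the response is to keep that failing case in the eval set and improve the system until the score clears the bar, rather than patching the single output by hand.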

Blog and detailed workflow walkthroughs from this episode:

**Blog:** Ankur Goyal’s Playbook for Agent-Driven Benchmarking and AI Evals: https://www.chatprd.ai/how-i-ai/ankur-goyals-playbook-for-agent-driven-benchmarking-and-ai-evals

Workflows:

↳ How to Scale Expert Judgment in AI Systems with a Human Feedback Loop: https://www.chatprd.ai/how-i-ai/workflows/how-to-scale-expert-judgment-in-ai-systems-with-a-human-feedback-loop

↳ How to Use AI Coding Agents for Exhaustive Infrastructure Benchmarking: https://www.chatprd.ai/how-i-ai/workflows/how-to-use-ai-coding-agents-for-exhaustive-infrastructure-benchmarking

If you’re enjoying these episodes, reply and let me know what you’d love to learn more about: AI workflows, hiring, growth, product strategy—anything.

Catch you next week,

Lenny

P.S. Want every new episode delivered the moment it drops? Hit “Follow” on your favorite podcast app.

source & further reading

Original article: lennysnewsletter.com