How Braintrust uses AI agents, evals, and CI to ship better software | Ankur Goyal

Ankur Goyal, founder and CEO of Braintrust, explained how AI agents can run exhaustive benchmarks and perform deep technical work like database optimization, and argued that evals are essential for shipping better software. He demonstrated using Codex to run week-long experiments and shared frameworks like the 'agent line' for deciding when to delegate to agents.

In this episode, I sit down with Ankur Goyal, founder and CEO of Braintrust, the AI evals and observability platform used by teams like Notion, Stripe, Vercel, and Zapier. This one is for the senior engineers, staff engineers, VPs of engineering, and CTOs in my audience. We get into how coding agents can take on deeply technical architecture and infrastructure work that no single human engineer could tackle before, and then we demystify evals so you can use them to make your AI products better without touching the implementation.

Listen or watch on YouTube, Spotify, or Apple Podcasts

What you’ll learn:
• How Ankur uses Codex to run week-long benchmark experiments across database indexes, column store formats, and execution engines to speed up slow queries
• Why he argues there’s no excuse to skip rigorous benchmarking now that agents can run them tirelessly
• The “agent line” framework: how to decide which decisions, directions, and interactions you can hand off to an agent
• How I think about the practical vs. theoretical quality of AI on hard technical problems, and why human attention decays on tedious work
• Why evals are the modern version of a PRD, and how to encode “what good looks like” so a model can figure out the “how”
• How to build a scoring function live and let an agent improve your prompt inside a safe playground
• How Ankur turned his designer David’s taste into a repeatable eval so quality scales beyond one person
• Why fixing your CI is the highest-leverage way to speed up engineering velocity

Brought to you by:
• Guru: The AI layer of truth
• Persona: Trusted identity verification for any use case

In this episode, we cover:
• 00:00 Introduction to Ankur Goyal: https://www.youtube.com/watch?v=QE 1hRLsehM
• 03:00 Using AI agents for database optimization: https://www.youtube.com/watch?v=QE 1hRLsehM&t=180s
• 06:10 Running exhaustive benchmarks with coding agents: https://www.youtube.com/watch?v=QE 1hRLsehM&t=370s
• 09:03 Why staff engineers are wrong about AI limitations: https://www.youtube.com/watch?v=QE 1hRLsehM&t=543s
• 11:30 The “agent line” framework for delegation: https://www.youtube.com/watch?v=QE 1hRLsehM&t=690s
• 14:00 Ankur’s workflow: running 4 to 6 concurrent agents: https://www.youtube.com/watch?v=QE 1hRLsehM&t=840s
• 17:16 Technical setup: foreground agents, background agents, and cloud environments: https://www.youtube.com/watch?v=QE 1hRLsehM&t=1036s
• 20:32 Spending time with AI tools: https://www.youtube.com/watch?v=QE 1hRLsehM&t=1232s
• 23:06 Demystifying evals: https://www.youtube.com/watch?v=QE 1hRLsehM&t=1386s
• 26:02 Live demo: Building an eval for documentation answers: https://www.youtube.com/watch?v=QE 1hRLsehM&t=1562s
• 30:20 The alternative to evals: vibe checks and whack-a-mole: https://www.youtube.com/watch?v=QE 1hRLsehM&t=1820s
• 32:09 Capturing designer taste in scoring functions: https://www.youtube.com/watch?v=QE 1hRLsehM&t=1929s
• 33:13 Quick recap: https://www.youtube.com/watch?v=QE 1hRLsehM&t=1993s
• 33:44 Managing velocity and throughput: https://www.youtube.com/watch?v=QE 1hRLsehM&t=2024s
• 35:40 Why CI/CD investment is critical for AI-accelerated teams: https://www.youtube.com/watch?v=QE 1hRLsehM&t=2140s
• 37:30 Ankur’s prompting strategy when agents fail: https://www.youtube.com/watch?v=QE 1hRLsehM&t=2250s
• 39:10 Closing thoughts and how to connect: https://www.youtube.com/watch?v=QE 1hRLsehM&t=2350s

Tools referenced:
• Braintrust: https://www.braintrust.dev/
• Codex: https://openai.com/codex/
• GPT 5.4: https://developers.openai.com/api/docs/models/gpt-5.4
• Claude: https://claude.ai/

Other references:
• GPT 5.5 just did what no other model could: https://www.lennysnewsletter.com/p/gpt-55-just-did-what-no-other-model
• Paul Graham’s Maker vs. Manager Schedule: http://www.paulgraham.com/makersschedule.html
• tmux: https://github.com/tmux/tmux
• Chris Tate at Vercel: https://www.linkedin.com/in/ctatedev/

Where to find Ankur Goyal:
• LinkedIn: https://www.linkedin.com/in/ankrgyl/

Where to find Claire Vo:
• ChatPRD: https://www.chatprd.ai/
• Website: https://clairevo.com/
• LinkedIn: https://www.linkedin.com/in/clairevo/

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email [email protected].