How Braintrust uses AI agents, evals, and CI to ship better software | Ankur Goyal


Source: lennysnewsletter.com · Published: 2026-06-15T12:04Z · Topic: ai-agents


Ankur Goyal, founder and CEO of Braintrust, explained how AI agents can run exhaustive benchmarks and perform deep technical work like database optimization, and argued that evals are essential for shipping better software. He demonstrated using Codex to run week-long experiments and shared frameworks like the 'agent line' for deciding when to delegate to agents.


In this episode, I sit down with Ankur Goyal, founder and CEO of Braintrust, the AI evals and observability platform used by teams like Notion, Stripe, Vercel, and Zapier. This one is for the senior engineers, staff engineers, VPs of engineering, and CTOs in my audience. We get into how coding agents can take on deeply technical architecture and infrastructure work that no single human engineer could tackle before, and then we demystify evals so you can use them to make your AI products better without touching the implementation.

Listen or watch on YouTube, Spotify, or Apple Podcasts

What you’ll learn:

How Ankur uses Codex to run week-long benchmark experiments across database indexes, column store formats, and execution engines to speed up slow queries

Why he argues there’s no excuse to skip rigorous benchmarking now that agents can run them tirelessly

The “agent line” framework: how to decide which decisions, directions, and interactions you can hand off to an agent

How I think about the practical vs. theoretical quality of AI on hard technical problems, and why human attention decays on tedious work

Why evals are the modern version of a PRD, and how to encode “what good looks like” so a model can figure out the “how”

How to build a scoring function live and let an agent improve your prompt inside a safe playground

How Ankur turned his designer David’s taste into a repeatable eval so quality scales beyond one person

Why fixing your CI is the highest-leverage way to speed up engineering velocity
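To make the scoring-function idea above concrete, here is a minimal sketch in plain Python. It is not Braintrust's API or the scorer built in the episode; `Case`, `score_answer`, `run_eval`, and `toy_docs_bot` are hypothetical names, and the "required facts" check is an assumed, deliberately simple definition of "what good looks like" for documentation answers.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    question: str
    must_mention: list[str]  # facts a good answer should contain (assumed rubric)


def score_answer(answer: str, case: Case) -> float:
    """Return the fraction of required facts found in the answer (0.0 to 1.0)."""
    if not case.must_mention:
        return 1.0
    text = answer.lower()
    hits = sum(1 for fact in case.must_mention if fact.lower() in text)
    return hits / len(case.must_mention)


def run_eval(task: Callable[[str], str], cases: list[Case]) -> float:
    """Average the score of `task` over every case."""
    scores = [score_answer(task(case.question), case) for case in cases]
    return sum(scores) / len(scores)


def toy_docs_bot(question: str) -> str:
    """Hypothetical stand-in for a prompt plus model call."""
    return "Set the API key in your environment, then call init()."


CASES = [
    Case("How do I authenticate?", ["API key", "environment"]),
    Case("How do I start logging?", ["init()", "project name"]),
]

if __name__ == "__main__":
    print(f"average score: {run_eval(toy_docs_bot, CASES):.2f}")
```

Because the score only describes the desired output, the prompt or model behind `toy_docs_bot` can change freely while the same cases keep measuring quality.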

Brought to you by:

**Guru**: The AI layer of truth

**Persona**: Trusted identity verification for any use case

In this episode, we cover:

([00:00](https://www.youtube.com/watch?v=QE_1hRLsehM)) Introduction to Ankur Goyal

([03:00](https://www.youtube.com/watch?v=QE_1hRLsehM&t=180s)) Using AI agents for database optimization

([06:10](https://www.youtube.com/watch?v=QE_1hRLsehM&t=370s)) Running exhaustive benchmarks with coding agents

([09:03](https://www.youtube.com/watch?v=QE_1hRLsehM&t=543s)) Why staff engineers are wrong about AI limitations

([11:30](https://www.youtube.com/watch?v=QE_1hRLsehM&t=690s)) The “agent line” framework for delegation

([14:00](https://www.youtube.com/watch?v=QE_1hRLsehM&t=840s)) Ankur’s workflow: running 4 to 6 concurrent agents

([17:16](https://www.youtube.com/watch?v=QE_1hRLsehM&t=1036s)) Technical setup: foreground agents, background agents, and cloud environments

([20:32](https://www.youtube.com/watch?v=QE_1hRLsehM&t=1232s)) Spending time with AI tools

([23:06](https://www.youtube.com/watch?v=QE_1hRLsehM&t=1386s)) Demystifying evals

([26:02](https://www.youtube.com/watch?v=QE_1hRLsehM&t=1562s)) Live demo: Building an eval for documentation answers

([30:20](https://www.youtube.com/watch?v=QE_1hRLsehM&t=1820s)) The alternative to evals: vibe checks and whack-a-mole

([32:09](https://www.youtube.com/watch?v=QE_1hRLsehM&t=1929s)) Capturing designer taste in scoring functions

([33:13](https://www.youtube.com/watch?v=QE_1hRLsehM&t=1993s)) Quick recap

([33:44](https://www.youtube.com/watch?v=QE_1hRLsehM&t=2024s)) Managing velocity and throughput

([35:40](https://www.youtube.com/watch?v=QE_1hRLsehM&t=2140s)) Why CI/CD investment is critical for AI-accelerated teams

([37:30](https://www.youtube.com/watch?v=QE_1hRLsehM&t=2250s)) Ankur’s prompting strategy when agents fail

([39:10](https://www.youtube.com/watch?v=QE_1hRLsehM&t=2350s)) Closing thoughts and how to connect
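The benchmarking chapters listed above (03:00 and 06:10) describe sweeping many database configurations rather than testing one hunch. As a rough illustration only, the sketch below sweeps index choices for a single query on an in-memory SQLite table. SQLite, the `events` schema, the row count, and every function name here are assumptions made for the example; the episode's experiments also covered column store formats and execution engines, which this sketch does not attempt.

```python
import sqlite3
import time


def build_db(index_columns):
    """Create an in-memory table of 50,000 rows, optionally with one index."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER, user_id INTEGER, kind TEXT)")
    rows = [(i, i % 1000, "click" if i % 3 else "view") for i in range(50_000)]
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    if index_columns:
        conn.execute(f"CREATE INDEX idx ON events ({', '.join(index_columns)})")
    conn.commit()
    return conn


def time_query(conn, sql, params, repeats=20):
    """Return the best-of-N wall-clock time for one query, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        conn.execute(sql, params).fetchall()
        best = min(best, time.perf_counter() - start)
    return best


def sweep(configs, sql, params):
    """Time the same query under every index configuration."""
    results = {}
    for name, columns in configs.items():
        conn = build_db(columns)
        results[name] = time_query(conn, sql, params)
        conn.close()
    return results


# Hypothetical configurations to compare
CONFIGS = {
    "no_index": [],
    "user_id": ["user_id"],
    "user_id_kind": ["user_id", "kind"],
}
QUERY = "SELECT COUNT(*) FROM events WHERE user_id = ? AND kind = ?"

if __name__ == "__main__":
    timings = sweep(CONFIGS, QUERY, (42, "click"))
    for name, secs in sorted(timings.items(), key=lambda kv: kv[1]):
        print(f"{name:>14}: {secs * 1e6:.1f} µs")
```

The grid is small enough to run by hand here; the point made in the episode is that an agent can keep extending and rerunning a grid like this far longer than a person would.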

Tools referenced:

• Braintrust: [https://www.braintrust.dev/](https://www.braintrust.dev/)

• Codex: [https://openai.com/codex/](https://openai.com/codex/)

• GPT 5.4: [https://developers.openai.com/api/docs/models/gpt-5.4](https://developers.openai.com/api/docs/models/gpt-5.4)

• Claude: [https://claude.ai/](https://claude.ai/)

Other references:

• GPT 5.5 just did what no other model could: [https://www.lennysnewsletter.com/p/gpt-55-just-did-what-no-other-model](https://www.lennysnewsletter.com/p/gpt-55-just-did-what-no-other-model)

• Paul Graham’s Maker vs. Manager Schedule: [http://www.paulgraham.com/makersschedule.html](http://www.paulgraham.com/makersschedule.html)

• tmux: [https://github.com/tmux/tmux](https://github.com/tmux/tmux)

• Chris Tate at Vercel: [https://www.linkedin.com/in/ctatedev/](https://www.linkedin.com/in/ctatedev/)

Where to find Ankur Goyal:

LinkedIn: [https://www.linkedin.com/in/ankrgyl/](https://www.linkedin.com/in/ankrgyl/)

Where to find Claire Vo:

ChatPRD: [https://www.chatprd.ai/](https://www.chatprd.ai/)

Website: [https://clairevo.com/](https://clairevo.com/)

LinkedIn: [https://www.linkedin.com/in/clairevo/](https://www.linkedin.com/in/clairevo/)

Production and marketing by [https://penname.co/](https://penname.co/). For inquiries about sponsoring the podcast, email [email protected].
