A curated, non-BS library of the best resources for evaluating agents

BenchFlow released a curated, annotated library of more than 443 resources for building and evaluating AI agents, including papers, blog posts, talks, and tools. Every entry is verified, dead links are pruned, and the list was assembled through a recursive citation crawl plus practitioner discovery. It aims to be a non-BS, actionable reference for agent evaluation.

A curated, opinionated, non-BS library of the best resources for building and evaluating AI agents: papers, blog posts, talks, courses, tools, and benchmarks. Maintained by BenchFlow (https://benchflow.ai).

Most "awesome" lists are link dumps. This one is annotated and verified: every entry says what it is and why it belongs, URLs are checked, quotes are verbatim, and dead/abandoned tools are pruned (not silently listed). It was assembled by:

- a depth-4 recursive citation crawl (11.6k papers, ranked by in-degree) to surface the academic canon,
- targeted practitioner-web discovery for the industry sources citation graphs miss (Eugene Yan, Han-Chung Lee, Hamel Husain, Shreya Shankar, Nathan Lambert, …),
- 47 talks & podcasts transcribed and deep-noted (verbatim + timestamps), and
- per-section gap audits with adversarial verification.

443+ curated links · 146 deep reading notes (see notes/, /benchflow-ai/awesome-evals/blob/main/notes). Markers: 🆕 = released/updated 2025–2026 · CONTRIBUTING (/benchflow-ai/awesome-evals/blob/main/CONTRIBUTING.md).

📘 Playbook: real, runnable code + worked examples for LLM-as-judge (aligned to humans), pass@k/pass^k, error analysis, trajectory & world-state grading, CI gating, verifiable rewards, and more.
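The pass@k / pass^k pair mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not code taken from PATTERNS.md: the function names `pass_at_k` and `pass_hat_k` are hypothetical, pass@k uses the standard unbiased combinatorial estimator from the HumanEval paper (Chen et al., 2021), and pass^k uses the analogous "all k trials succeed" estimator popularized by τ-bench.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    """Validate trial counts: n trials run, c of them passed, k drawn."""
    if n <= 0 or not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need n > 0, 0 <= c <= n, 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k trials passes) from c passes in n trials.

    Unbiased estimator: 1 - C(n - c, k) / C(n, k).
    """
    _check(n, c, k)
    # comb(n - c, k) is 0 when fewer than k failures exist, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k trials pass) from c passes in n trials.

    Estimator: C(c, k) / C(n, k). This measures reliability, not best-case.
    """
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical task: 10 isolated trials, 5 of which passed.
    n, c = 10, 5
    for k in (1, 2, 4):
        print(k, round(pass_at_k(n, c, k), 4), round(pass_hat_k(n, c, k), 4))
```

The two metrics coincide at k = 1 and diverge as k grows: pass@k rises toward 1 (one lucky run is enough), while pass^k falls toward 0 (every run must succeed), which is why agent evals that care about consistency tend to report the latter.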
- 📘 Playbook — real code & worked examples (PATTERNS.md, /benchflow-ai/awesome-evals/blob/main/PATTERNS.md)
- ⭐ Must-read starter set (read these first)
- 1 · Why we need evals
- 2 · "If you can eval it, you have built it" — eval ⇄ capability ⇄ RL environment
- 3 · The model / harness / skill decomposition
- 4 · Observability & the output / eval space (the surfaces you can grade)
- 5 · Evaluation infrastructure (the eval stack: datasets, scorers, online/offline, tracing, CI)
- 6 · Benchmark vs. eval and benchmark integrity (contamination, saturation, label errors, leaderboard gaming)
- 7 · Evals & RL environments (verifiers, reward design, difficulty calibration, lifecycle)
- 8 · LLM-as-judge & verifiers (alignment, biases, verifiable vs judgeable)
- 9 · Agent-specific evaluation (trajectories, tool use, multi-turn, world state, multi-agent, localization)
- 10 · Safety / adversarial evaluation (prompt injection, jailbreaks, action-authorization, benchmark auditing)
- 🎙 Talks, podcasts & slides (transcribed + noted)
- 💬 Eval insights inside general agent posts
- 🔎 Scan additions
- Companies & landscape (eval / RL-environment market)
- Notes on provenance & gaps
- Deep notes
- Contributing
- License

⭐ Must-read starter set (read these first)

- Shunyu Yao — The Second Half https://ysymyth.github.io/The-Second-Half/ · blog — "Evaluation becomes more important than training." The field-level why.
- Eugene Yan — An LLM-as-Judge Won't Save the Product, Fixing Your Process Will https://eugeneyan.com/writing/eval-process/ · blog — Process over tooling; evals as the scientific method.
- Han-Chung Lee — Hidden Technical Debt: Agent Evaluation Infrastructure https://leehanchung.github.io/blogs/2026/06/13/hidden-technical-debt-agent-evaluation-infra/ · blog — Control/data plane, the five eval surfaces, state deltas. "Chat eval was a spreadsheet; agent eval is a system."
- Hamel Husain & Shreya Shankar — LLM Evals FAQ https://hamel.dev/blog/posts/evals-faq/ · blog — The densest operational Q&A: error analysis, binary judgments, the benevolent-dictator labeler.
- Jason Wei — Asymmetry of Verification and Verifier's Law https://www.jasonwei.net/blog/asymmetry-of-verification-and-verifiers-law · blog — "Ability to verify == ability to create an RL environment."
- Anthropic — Demystifying Evals for AI Agents https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents · blog — Best primary on agent-specific evals: task design, outcome vs trajectory, isolated trials, pass@k vs pass^k.
- Ofir Press — How to Build Good Language Modeling Benchmarks https://ofir.io/How-to-Build-Good-Language-Modeling-Benchmarks/ · blog — Natural / auto-evaluatable / challenging; the "-200%" difficulty target; ~1-yr saturation.
- Kapoor, Stroebl, Siegel, Nadgir, Narayanan — AI Agents That Matter https://arxiv.org/abs/2407.01502 · paper — Cost as a first-class metric; model-dev vs app-dev; missing holdouts breed overfitting.
- Nathan Lambert — Building on Evaluation Quicksand https://www.interconnects.ai/p/building-on-evaluation-quicksand · blog — LLM eval has no ground truth; contamination; eval↔training coupling.
- Shankar, Zamfirescu-Pereira, Hartmann, Parameswaran, Arawjo (UIST '24) — Who Validates the Validators? (EvalGen) https://arxiv.org/abs/2404.12272 · paper — "Criteria drift": you can't write the rubric before you grade.
- Florian Brand (Prime Intellect) — Benches 2026 — "LLM benchmarks in the era of agents" https://florianbrand.com/posts/benches-2026 · blog + 61-slide talk — The sharpest current read on why benchmarks break in the agent era: the "evals are dead, just measure vibes" backlash, how every layer of the eval-running stack (prompt · sampling temp · grader · harness) swings the score, and that benchmark ground truth is frequently wrong.
- OpenAI — A Shared Playbook for Trustworthy Third-Party Evaluations https://openai.com/index/trustworthy-third-party-evaluations-foundations/ · blog (Safety, May 2026) — What makes independent evals of frontier-model safeguards & capabilities trustworthy: harness selection, the validity hazards that distort results, and the standards third-party evaluators need.

1 · Why we need evals

- Shunyu Yao — The Second Half https://ysymyth.github.io/The-Second-Half/ · blog — The bottleneck shifts from solving problems to defining and evaluating them. (also T2, T7)
- Eugene Yan — An LLM-as-Judge Won't Save the Product, Fixing Your Process Will https://eugeneyan.com/writing/eval-process/ · blog — "Buying or building another evaluation tool won't save the product." Evals = the scientific method in disguise.
- Hamel Husain — Your AI Product Needs Evals https://hamel.dev/blog/posts/evals/ · blog — The canonical "you need evals"; remove all friction from looking at your data; don't rely on generic frameworks.
- Hamel Husain — A Field Guide to Rapidly Improving AI Products https://hamel.dev/blog/posts/field-guide/ · blog — "Error analysis is consistently the highest-ROI activity." The metric for an AI roadmap is experiments run.
- Shreya Shankar — In Defense of AI Evals, for Everyone https://www.sh-reya.com/blog/in-defense-ai-evals/ · blog — Rebuts the anti-eval backlash; evals = the systematic measurement of application quality.
- Yan, Bischof, Frye, Husain, Liu, Shankar — What We Learned from a Year of Building with LLMs https://applied-llms.org/ (Part II: https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-ii/) · blog — The "intern test," genchi genbutsu, turning vibe-checks into assertions.
- Nathan Lambert — Big Tech's LLM Evals Are Just Marketing https://www.interconnects.ai/p/evals-are-marketing · blog — Why frontier-lab leaderboard numbers are marketing, not science.
- Chip Huyen — AI Engineering pitfalls https://huyenchip.com/2025/01/16/ai-engineering-pitfalls.html · blog — Common eval/AI-engineering mistakes from the AI Engineering author. (also T6)
- Aishwarya Naresh Reganti & Kiriti Badam (O'Reilly Radar) — Evals Are NOT All You Need https://www.oreilly.com/radar/evals-are-not-all-you-need/ · blog — The essential nuance piece: automated graders alone don't save you; you need a continuous-improvement flywheel of offline tests + production monitoring + real-user iteration. Pairs with Shreya's 'In Defense' to complete the backlash debate. 🆕
- Hamel Husain & Shreya Shankar with Lenny Rachitsky (Lenny's Podcast/Newsletter) — Why AI evals are the hottest new skill for product builders https://www.lennysnewsletter.com/p/why-ai-evals-are-the-hottest-new-skill · talk — The accessible 'why evals matter' on-ramp (live walkthrough of error analysis, open/axial coding) that mainstreamed evals to PMs in 2025; the apartment-leasing-bot anecdote is the canonical 'you can't vibe-check' story. 🆕
- OpenAI — How evals drive the next chapter in AI for businesses https://openai.com/index/evals-drive-next-chapter-of-ai/ · blog — Frontier-lab framing of evals as turning fuzzy business goals into specs and measurable ROI; useful counterweight to Lambert's 'evals are marketing' and grounds the 'why' for enterprise readers. 🆕 ⚠ unverified URL
- Aman Khan (Arize) with Lenny Rachitsky — Beyond vibe checks: A PM's complete guide to evals https://www.lennysnewsletter.com/p/beyond-vibe-checks-a-pms-complete · blog — The widely-shared PM-oriented argument for moving past 'looked good to me' vibe checks to systematic evals; one of the pieces that made evals a mainstream product skill in 2025. 🆕
- Gergely Orosz & Hamel Husain (The Pragmatic Engineer) — A pragmatic guide to LLM evals for devs https://newsletter.pragmaticengineer.com/p/evals · newsletter — Reaches the broad engineering audience with the core 'why': LLM non-determinism breaks traditional testing, so you need evals. High-distribution motivation piece co-written by Hamel. 🆕
- OpenAI — Predicting model behavior before release by simulating deployment (Deployment Simulation) https://openai.com/index/deployment-simulation/ · blog — Concrete 2026 evidence for why fixed/static evals fail: models recognize when they're being tested and game test suites; replaying ~1.3M real conversations surfaced reward-hacking no fixed eval caught. Strong 'why evals must evolve' argument. 🆕 ⚠ unverified URL
- Greg Brockman (OpenAI) — evals are surprisingly often all you need https://x.com/gdb/status/1733553161884127435 · blog — The canonical one-liner 'evals are the new unit test' that anchors the whole 'why evals' thesis; frequently cited founding quote for the movement. Short but load-bearing.
Must-reads: Yao · Yan eval-process · Hamel field-guide

2 · "If you can eval it, you have built it" — eval ⇄ capability ⇄ RL environment

- Jason Wei — Asymmetry of Verification and Verifier's Law https://www.jasonwei.net/blog/asymmetry-of-verification-and-verifiers-law · blog — Trainability tracks verifiability; verifying = creating an RL environment.
- Han-Chung Lee — A Taxonomy of RL Environments for LLM Agents https://leehanchung.github.io/blogs/2026/03/21/rl-environments-for-llm-agents/ · blog — A benchmark is a frozen RL environment; the E = {T,H,V,S,C} decomposition; "verifiable beats judgeable."
- Kanav Garg (Core Automation; ex-DeepMind) — talk; summary at The Life Cycle of an RL Environment https://muratbuffalo.blogspot.com/2026/06/acm-cais-conference-on-ai-and-agentic.html · talk — Difficulty calibration (the 1–4/16 Goldilocks band), RL as variance reduction, reward hacking under training pressure. (local notes: research/notes/kanav-garg-rl-environment-lifecycle.md)
- David Silver & Richard Sutton — Welcome to the Era of Experience https://storage.googleapis.com/deepmind-media/Era-of-Experience%20/The%20Era%20of%20Experience%20Paper.pdf · paper — Human-data value approaching its ceiling; the frontier is agents learning from experience / synthetic environments.
- Nathan Lambert — RLHF Book, Ch. 16 — Evaluation https://rlhfbook.com/c/16-evaluation · book — Evaluation as a reflection of training goals; prompt-format sensitivity (60%→~0%).
- Nathan Lambert — What Comes Next with Reinforcement Learning https://www.interconnects.ai/p/what-comes-next-with-reinforcement · blog — Long-horizon credit assignment; where RL is and isn't ready.
- Prime Intellect — verifiers https://github.com/PrimeIntellect-ai/verifiers (docs: .../blob/main/docs/environments.md) · tool/repo — One environment package shared by eval and prime-rl — the eval-is-an-RL-env thesis as code.
- DeepSeek-AI (Guo et al.) — DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning https://arxiv.org/abs/2501.12948 · paper — The proof-of-thesis: pure RL with rule-based verifiable rewards (no SFT) makes reasoning emerge — the canonical 'if you can verify it, RL builds it' result; also published in Nature 2025. Conspicuously absent from a section literally about eval-as-RL-environment. 🆕
- Lambert et al. (Allen Institute for AI) — Tülu 3: Pushing Frontiers in Open Language Model Post-Training https://arxiv.org/abs/2411.15124 · paper — Coined/popularized RLVR and open-sourced the recipe + code (open-instruct): swap the reward model for a verifier on tasks with checkable answers. The foundational citation behind every 'verifiable beats judgeable' claim in this section. 🆕
- Anthropic — Natural Emergent Misalignment from Reward Hacking in Production RL https://www.anthropic.com/research/emergent-misalignment-reward-hacking · paper — Empirical receipt for the section's 'reward hacking under training pressure' theme: learning to cheat on real coding environments generalizes to sabotage/alignment-faking; introduces inoculation prompting as mitigation (arXiv 2511.18397). 🆕
- Prime Intellect — Environments Hub: A Community Hub To Scale RL To Open AGI https://www.primeintellect.ai/blog/environments · blog — The launch post for the verifiers-spec marketplace (2,500+ shared eval/RL environments) — the eval-is-an-RL-env thesis as an actual ecosystem, the natural companion to the already-listed verifiers repo. 🆕
- Ege Erdil, Matthew Barnett, Tamay Besiroglu (Mechanize) — How to fully automate software engineering https://www.mechanize.work/blog/how-to-fully-automate-software-engineering/ · blog — Sharpest statement of the inverse thesis: today's RL environments are rudimentary, so capability is gated on building richer/more diverse environments — 'you only get the capability you can build an environment for.' 🆕
- Mechanize (Erdil, Barnett, Besiroglu) — Cheap RL tasks will waste compute https://www.mechanize.work/blog/cheap-rl-tasks-will-waste-compute/ · blog — The economics of environment quality: data and compute are complementary, so low-quality cheaply-bought tasks waste expensive RL compute — directly informs difficulty calibration / why environment design matters. 🆕
- Jean-Stanislas Denain & Chris Barber (Epoch AI) — An FAQ on Reinforcement Learning Environments https://epoch.ai/gradient-updates/state-of-rl-envs · blog — Practitioner-interview survey (18 pros) on how RL environments are actually built, the reward-hacking failure modes, and the production-scaling bottleneck — the empirical state-of-the-field map this section lacks. 🆕
- AJ Kourabi & Dylan Patel (SemiAnalysis) — RL Environments and RL for Science: Data Foundries and Multi-Agent Architectures https://newsletter.semianalysis.com/p/rl-environments-and-rl-for-science · newsletter — Market-structure view: 35+ companies now sell RL environments; capability gains are coming from ramping RL compute, not pretraining. Grounds the 'benchmark = frozen RL environment' thesis in who's actually building/buying them. 🆕
- Harbor / Stanford / Laude Institute — Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces https://github.com/harbor-framework/terminal-bench · benchmark — A concrete instance of the thesis: each task ships a Docker environment + programmatic verification test suite + oracle — i.e. a benchmark that IS an RL environment (and is used as one). 2.4k stars, active. 🆕
- Sierra Research (Barres et al.) — tau2-bench (τ²-Bench): A Benchmark for Tool-Agent-User Interaction in Real-World Domains https://github.com/sierra-research/tau2-bench · benchmark — Dual-control, multi-turn, policy-following eval with a simulated user and verifiable DB-state checks — the canonical example of a verifiable conversational/agentic environment beyond math/code (paper arXiv 2506.07982). 🆕

Must-reads: Wei · Lee RL-env taxonomy

3 · The model / harness / skill decomposition

- Han-Chung Lee — Hidden Technical Debt: Agent Harness https://leehanchung.github.io/blogs/2026/05/08/hidden-technical-debt-agent-harness/ · blog — The harness is the agent; what teams call "the model" is mostly harness + product.
- Han-Chung Lee — Hidden Technical Debt series index https://leehanchung.github.io/blogs/ · blog — The four-part series (eval infra, runtime, harness, + agent runtime ~2026/04/24). Verify the runtime post URL on the index.
- METR — Measuring AI Ability to Complete Long Tasks https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/ · paper/blog — Scaffolds change the measured horizon; success-vs-human-time as a primitive. (also T9)
- Nathan Lambert — Turing Post interview ("Open Models Won't Catch Up") https://www.turingpost.com/p/nathanlambert · talk/interview — "What technical people call the harness or the product matters more than just the model."
- Florian Brand (Prime Intellect) — Quo vadis, LLM benchmarks? https://florianbrand.com/posts/benches-2026 (talk: https://www.youtube.com/watch?v=kmTMc-fVSXw) · blog/talk — The AlgoTune case: same model, different harness, opposite ranking. (also T6; notes: research/notes/florian-brand-)
- Han-Chung Lee — The Model is the Product https://leehanchung.github.io/talks/2025/04/23/the-model-is-the-product/ · talk — The primary-source talk (Data Council 2025) behind the must-read author's whole thesis — the direct counterpart to Hamel's 'Model is Not the Product'; the foundational text of the harness/model debate this section is built on. 🆕
- Hamel Husain — The Model is Not the Product https://www.youtube.com/watch?v=EEw2PpL- NM · talk — The opposing side of the Lee debate (Data Council 2025): great products are mostly harness + product + evals, not the model. Section already cites Lee; it should cite the debate it half-references. 🆕
- Simon Willison — Agents are models using tools in a loop https://simonwillison.net/2025/May/22/tools-in-a-loop/ · blog — The canonical, now-widely-adopted definition of an agent; 'the skill is in the design of both the tools and the loop' — the cleanest statement of why the harness, not the model, dominates behavior. 🆕
- OpenAI — Harness engineering: leveraging Codex in an agent-first world https://openai.com/index/harness-engineering/ · blog — Frontier-lab primary source coining 'harness engineering': a 1M-line codebase built by Codex agents where improving the environment/harness mattered more than the model. Lab-side complement to Lee's 'harness is the agent'. URL returns 403 to scraper but page is live; corroborated by InfoQ/Milvus coverage. 🆕
- Anthropic — Equipping agents for the real world with Agent Skills https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills · blog — The primary source for the 'skill' leg of the model/harness/skill decomposition — skills as composable, progressively-disclosed capabilities (later made an open standard). The section title says 'skill' but has zero skill sources. 🆕
- Anthropic — Effective context engineering for AI agents https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents · blog — Anthropic's primary statement that the harness's job is engineering context (editing, compaction, memory, programmatic tool-calling) — the mechanism behind why same model + different harness diverges. 🆕
- Anthropic (Ken Aizawa) — Writing effective tools for agents — with agents https://www.anthropic.com/engineering/writing-tools-for-agents · blog — Tool design is a load-bearing part of the harness; 'agents are only as effective as the tools we give them,' validated eval-first. Directly ties harness decisions to measured agent performance. 🆕
- Pete Hodgson — Same Model, Different Results: Why Coding Agents Aren't Interchangeable https://blog.thepete.net/blog/2025/12/10/same-model-different-results-why-coding-agents-arent-interchangeable/ · blog — Concrete teardown of Claude Code's harness (system reminders, sub-agents, planning, IDE feedback) showing identical models yield different results — the practitioner case-study version of Brand's AlgoTune point. 🆕
- Princeton SAgE team (Kapoor, Narayanan, et al.) — Holistic Agent Leaderboard (HAL) https://hal.cs.princeton.edu/ · benchmark — Standardized, cost-aware harness that runs the SAME agent harness across 9 benchmarks/9 models (21,730 rollouts) — the infrastructure answer to 'harness confounds rankings.' ICLR 2026; paper arXiv:2510.11977. 🆕
- Addy Osmani (O'Reilly Radar) — Agent Harness Engineering https://www.oreilly.com/radar/agent-harness-engineering/ · blog — 'A decent model with a great harness beats a great model with a bad harness'; reframes agent failures as harness/config problems (traceable AGENTS.md rules). Names the converging harness primitives across coding agents. 🆕
- Nathan Lambert (Interconnects) — What comes next with open models (weights / tools / harness decomposition) https://www.interconnects.ai/p/the-next-phase-of-open-models · blog — Lambert's written articulation (Mar 2026) of an AI system as weights + tools + harness — the written companion to the Turing Post interview already listed, with the explicit three-part decomposition. 🆕

Must-reads: Lee harness · Brand Quo vadis

4 · Observability & the output / eval space (the surfaces you can grade)

- Han-Chung Lee — Hidden Technical Debt: Agent Evaluation Infrastructure https://leehanchung.github.io/blogs/2026/06/13/hidden-technical-debt-agent-evaluation-infra/ · blog — Control plane / data plane; the five surfaces (output, trace, memory, environment, mechanistic); the empty-tool-result hallucination.
- Braintrust — The Three Pillars of AI Observability https://www.braintrust.dev/blog/three-pillars-ai-observability · blog — Dataset reconciliation (living datasets); traces / evals / annotation.
- Arize AX docs — Agent Trajectory Evaluations https://arize.com/docs/ax/evaluate/evaluators/trace-and-session-evals/trace-level-evaluations/agent-trajectory-evaluations · docs — Grading the path, not just the answer.
- Galileo — AI Agent Metrics: How Elite Teams Evaluate https://galileo.ai/blog/ai-agent-metrics · blog — A concrete agent-metric taxonomy (action completion, tool selection, etc.).
- Arize — OpenInference semantic conventions https://github.com/Arize-ai/openinference/blob/main/spec/semantic_conventions.md · tool/repo — An OTel-based agent trace schema (tool, args, observation, latency, cost).
- LangChain — LangSmith Evaluation / Trajectory evals https://docs.langchain.com/langsmith/evaluation · https://docs.langchain.com/langsmith/trajectory-evals · docs.
- OpenTelemetry / CNCF — OpenTelemetry GenAI Semantic Conventions (agent & framework spans) https://github.com/open-telemetry/semantic-conventions-genai · docs — The upstream vendor-neutral standard (spans/metrics/events for LLM calls, invoke_agent, execute_tool, MCP) that OpenInference maps onto — the canonical trace schema the section's OpenInference entry derives from. 🆕
- OpenTelemetry — Semantic Conventions for GenAI agent and framework spans https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/ · docs — Human-readable spec page for create_agent / invoke_agent / execute_tool spans and attributes — the precise definition of what a gradable agent trace looks like. 🆕
- OpenTelemetry blog — Inside the LLM Call: GenAI Observability with OpenTelemetry https://opentelemetry.io/blog/2026/genai-observability/ · blog — Walkthrough of emitting and reading GenAI spans (token usage, finish reasons, tool calls) — concrete intro to the trace surface for practitioners not steeped in OTel. 🆕
- Weights & Biases — W&B Weave — tracing & evaluation toolkit https://docs.wandb.ai/weave · docs — @weave.op trace trees (inputs/outputs/cost/latency) plus a scorer-based eval harness — a widely used surface for grading both traces and outputs. 🆕
- Laminar — Laminar — open-source observability for AI agents https://laminar.sh/ · tool — OTel-native, agent-specific: transcript view, SQL-over-traces, and a rollout debugger — purpose-built for grading multi-step agent trajectories rather than single LLM calls. 🆕

Must-reads: Lee eval infra · Braintrust three pillars

5 · Evaluation infrastructure (the eval stack: datasets, scorers, online/offline, tracing, CI)

All repos URL-verified via GitHub API, Jun 2026. 🆕 = released/expanded 2025–2026. ⚠️ = caveat/discontinued.

- UK AISI — Inspect AI https://github.com/UKGovernmentBEIS/inspect_ai · https://inspect.aisi.org.uk/ — @task binds dataset + solver + scorer; custom scorers; sandboxed tools. The reference agent-eval framework. MUST
- UK AISI — inspect_evals https://github.com/UKGovernmentBEIS/inspect_evals — 🆕 the companion catalog of community benchmarks (GAIA, CTFs, AIME…) — the "batteries" for Inspect.
- EleutherAI — lm-evaluation-harness https://github.com/EleutherAI/lm-evaluation-harness — the standard academic harness; first-class decontamination; task YAMLs.
- Allen Institute (Ai2) — OLMES https://github.com/allenai/olmes — 🆕 the reproducible eval standard + harness behind OLMo/Tülu: standardized prompts/metrics/formatting for apples-to-apples model comparison.
- BenchFlow https://github.com/benchflow-ai/benchflow · https://benchflow.ai — 🆕 environment-lab framework: research infra + runtime for building RL environments, evals & post-training; ships SkillsBench and ClawsBench. "Environments are the new data."
- Hugging Face — [lighteval](https://github.com/huggingface/lighteval) — 🆕 all-in-one harness across transformers/vLLM/TGI/nanotron, 1000+ tasks; HF's successor to `evaluate`.
- Groq — [OpenBench](https://github.com/groq/openbench) — 🆕 provider-agnostic bench CLI, 95+ benchmarks, built on Inspect primitives.
- OpenAI — [simple-evals](https://github.com/openai/simple-evals) — minimal zero-shot/CoT scripts (MMLU, HumanEval, SimpleQA, HealthBench); the numbers OpenAI publishes. ⚠️ not actively maintained.
- [OpenAI Evals](https://github.com/openai/evals) — the `completion_fn` abstraction = swap the system-under-test. Best-practices: https://developers.openai.com/api/docs/guides/evaluation-best-practices
- [promptfoo](https://github.com/promptfoo/promptfoo) — MIT eval + red-teaming CLI; git-diffable YAML configs. MUST
- [DeepEval / Confident AI](https://github.com/confident-ai/deepeval) — "pytest for LLMs," 40+ metrics (G-Eval, RAG, hallucination) + red-team; ~2M evals/day; hosted cloud. 🆕
- [pydantic-evals](https://github.com/pydantic/pydantic-ai) (ai.pydantic.dev/evals) — 🆕 type-safe Datasets/Cases/Evaluators with OTel tracing, from the Pydantic AI team.
- LangChain — [openevals](https://github.com/langchain-ai/openevals) — 🆕 prebuilt evaluators + `create_llm_as_judge` (incl. multimodal); general-purpose companion to [agentevals](https://github.com/langchain-ai/agentevals) (trajectory match).
- [MLflow GenAI evaluate](https://mlflow.org/docs/latest/genai/eval-monitor/) — 🆕 `mlflow.genai.evaluate`: 50+ judges/metrics, custom scorers, regression datasets inside MLflow.
- Stanford CRFM — [HELM (crfm-helm)](https://github.com/stanford-crfm/helm) — holistic eval: standardized datasets + metrics beyond accuracy + leaderboard (also VHELM, HEIM).
- [Giskard](https://github.com/Giskard-AI/giskard-oss) — auto-generates adversarial test suites (injection, hallucination, bias) from a plain-language app description.
- [Deepchecks LLM](https://github.com/deepchecks/deepchecks) (llmdocs.deepchecks.com) — property-based scoring (grounded-in-context, toxicity, fluency) + custom LLM-judge properties.
- [UpTrain](https://github.com/uptrain-ai/uptrain) — 20+ preconfigured checks + root-cause analysis on failures.
- HF — [evaluate](https://github.com/huggingface/evaluate) — classic metrics library, ⚠️ maintenance mode (use lighteval for LLMs).
- harbor-framework (Laude Institute / Stanford) — [Harbor](https://github.com/harbor-framework/harbor) — 🆕 framework for running agent evals + creating/using RL environments; powers Terminal-Bench 2.0. ~2.7k★. ⚠️ name overloaded (cf. av/harbor local-LLM toolkit).
- Matt Pocock — [evalite](https://github.com/mattpocock/evalite) — 🆕 local-first eval runner on Vitest; `.eval.ts` files, web UI, cost-aware.
- [Mastra scorers](https://github.com/mastra-ai/mastra) (mastra.ai/docs/evals/overview) — 🆕 model-graded/rule/statistical scorers, live evals, CI, in the Mastra agent framework.
- [Vercel agent-eval](https://github.com/vercel-labs/agent-eval) — 🆕 A/B-test coding agents (Claude Code, Codex, Cursor) on custom tasks; pass-rate dashboards.
- Braintrust — [Autoevals](https://github.com/braintrustdata/autoevals) — OSS scorer library (Factuality, relevance, security…) across Py/JS/Go/Ruby.

- [TruLens](https://github.com/truera/trulens) — instrumentation + "feedback functions" (the RAG triad), now OTel-based.
- Stanford — [ARES](https://github.com/stanford-futuredata/ARES) — synthetic queries + fine-tuned judges + prediction-powered inference for confidence intervals.
- Amazon Science — [RAGChecker](https://github.com/amazon-science/RAGChecker) — 🆕 claim-level diagnosis separating retriever vs generator errors.
- [continuous-eval (Relari)](https://github.com/relari-ai/continuous-eval) — modular per-module metrics across retrieval/generation/tool-use.
- [Tonic Validate](https://github.com/TonicAI/tonic_validate) — RAG metrics as a GitHub Action for CI.
- Haize Labs — [verdict](https://github.com/haizelabs/verdict) — 🆕 declarative compound judges (debate/verification/aggregation, inference-time scaling); arXiv:2502.18018.
- [OpenPipe ART — RULER](https://github.com/OpenPipe/ART) (art.openpipe.ai/fundamentals/ruler) — 🆕 LLM-judge that ranks trajectories with no labels — judge-as-RL-reward. industry must-read
- [Prometheus 2](https://github.com/prometheus-eval/prometheus-eval) — open-weight evaluator LMs for rubric-based assessment + pairwise.
- [Atla Selene](https://github.com/atla-ai/selene-mini) — 🆕 8B SoTA open judge (score + critique); + MCP server (atla-ai/atla-mcp-server). arXiv:2501.17195.
- [Patronus Lynx / GLIDER](https://github.com/patronus-ai/Lynx-hallucination-detection) · https://github.com/patronus-ai/glider — 🆕 open hallucination judge / explainable span-level judge.
- [Flow-Judge](https://github.com/flowaicom/flow-judge) — efficient 3.8B open evaluator.
- AI2 — [RewardBench](https://github.com/allenai/reward-bench) — canonical reward-model (+v2) judge benchmark/harness.
- [JudgeBench](https://github.com/ScalerLab/JudgeBench) — benchmark to evaluate the judges themselves.
- Fireworks — [reward-kit](https://github.com/fw-ai-external/reward-kit) — 🆕 decorator-based reward-function authoring (TRL/Fireworks interop).

- Prime Intellect — [verifiers](https://github.com/PrimeIntellect-ai/verifiers) — Environment = dataset + harness + rubric; one package for eval, RL, synthetic data. MUST
- Prime Intellect — [Environments Hub](https://github.com/PrimeIntellect-ai/community-environments) (app.primeintellect.ai) — 🆕 crowdsourced verifiers-based RL/eval envs.
- Prime Intellect — [prime-rl](https://github.com/PrimeIntellect-ai/prime-rl) — 🆕 async RL trainer consuming verifiers envs (INTELLECT-3).
- [BenchFlow](https://github.com/benchflow-ai/benchflow) · https://benchflow.ai — 🆕 environment lab: builds & runs RL/eval environments (SkillsBench, ClawsBench, runtime). "Environments are the new data." (also §5a)
- [HUD](https://github.com/hud-evals/hud-python) — 🆕 SDK to build/run agent eval environments (computer-use, browser, MCP) with telemetry.
- Nous Research — [Atropos](https://github.com/NousResearch/atropos) — 🆕 async "environment microservice" framework for rollouts/verifiable rewards.
- [verl](https://github.com/volcengine/verl) (now verl-project/verl) — de-facto industry RLVR trainer (PPO/GRPO). ~22k★.
- [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF) · [SkyRL](https://github.com/NovaSky-AI/SkyRL) · [AReaL](https://github.com/areal-project/AReaL) · [ROLL](https://github.com/alibaba/ROLL) · [rLLM](https://github.com/agentica-project/rllm) · [TRL](https://github.com/huggingface/trl) — the RL-training stack agents are post-trained + eval'd in.
- General Reasoning — [Open Reward Standard (ORS)](https://docs.openreward.ai/) (PyPI: openreward) — 🆕 MCP-extending spec adding RL primitives (episodes, rewards, curriculum). ⚠️ no single canonical repo confirmed.

- [Arize Phoenix](https://github.com/Arize-ai/phoenix) — OSS OTel tracing + response/retrieval evals + datasets/experiments. MUST
- [Langfuse](https://github.com/langfuse/langfuse) — OSS: evals (LLM-judge, feedback, manual labeling), datasets/experiments, prompt mgmt; self-hostable. 🆕
- Comet — [Opik](https://github.com/comet-ml/opik) — 🆕 fully-OSS eval + observability (judges, datasets, CI-runnable evals).
- [W&B Weave](https://github.com/wandb/weave) — `weave.Evaluation` scorers (exact/regex/model-graded/embedding) + Guardrails; comparison dashboards. 🆕 Humanloop's migration target.

- [Braintrust](https://www.braintrust.dev/docs/start/eval-sdk) (offline-eval-guide) — Eval over golden datasets; offline vs online. MUST
- [Patronus AI](https://www.patronus.ai/) (github.com/patronus-ai) — 🆕 research-grade judges Lynx, GLIDER, Percival (agent-failure debugger), experiments, multimodal judge.
- [Maxim AI](https://www.getmaxim.ai/) — 🆕 agent simulation + eval + observability across thousands of scenarios/personas.
- [Galileo](https://galileo.ai/) — Luna evaluators + Agentic Evaluations.
- [Vellum](https://www.vellum.ai/) — visual workflows + offline/online evals scoring every production run.
- [Helicone](https://github.com/helicone/helicone) — OSS gateway + observability; "Scores" ingests external eval results.
- [Traceloop / OpenLLMetry](https://github.com/traceloop/openllmetry) — OSS OTel instrumentation (Py/TS/Go/Ruby) + hosted reliability platform.
- [Langtrace](https://github.com/Scale3-Labs/langtrace) — OSS OTel-standard tracing + manual scoring + dataset mgmt.
- [WhyLabs / LangKit](https://github.com/whylabs/langkit) — high-throughput text-signal metrics (toxicity, PII, jailbreak) for production monitoring.
- [Portkey](https://github.com/portkey-ai/gateway) — 🆕 OSS gateway + 60+ guardrails + observability (fully open-sourced Mar 2026).
- [Datadog LLM Observability](https://www.datadoghq.com/product/ai/llm-observability/) — 🆕 evaluators + golden datasets + LLM Experiments + AI Agent Monitoring (Jun 2025).
- [Fiddler AI](https://www.fiddler.ai/) — 🆕 Trust Models (Safety/PII/Faithfulness scoring in <100ms); Guardrails + agentic observability.
- [SeaOtter](https://seaotter.ai?utm_source=github&utm_medium=awesome_list&utm_campaign=launch&utm_content=A-09-benchflow-awesome-evals) · https://seaotter.ai/submit?utm_source=github&utm_medium=awesome_list&utm_campaign=launch&utm_content=A-09-benchflow-awesome-evals · tool — Adversarial critic for AI agent outputs. Submit an output plus an acceptance policy; get pass/rework/fail with specific reasons before accepting the work.
- [PromptLayer](https://www.promptlayer.com/) · [New Relic AI Monitoring](https://newrelic.com/platform/ai-monitoring) — lighter prompt-CMS / APM-native monitoring.

- Arize — [OpenInference](https://github.com/Arize-ai/openinference) — semantic conventions for agent traces (tool/args/observation/latency/cost).
- [OpenTelemetry GenAI semantic conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/) (open-telemetry/semantic-conventions) — 🆕 the vendor-neutral schema (now covers agent orchestration, MCP tool calls, and a quality-evaluation span hook).
- Braintrust — [Braintrust](https://www.braintrust.dev/) · tool — Industry-standard eval+observability platform (Notion, Stripe, Vercel) tying offline experiments to production logs; the section already cites Braintrust's Autoevals but omits the platform itself. 🆕
- RagaAI — [RagaAI Catalyst](https://github.com/raga-ai-hub/RagaAI-Catalyst) · tool — OSS agent-observability + eval SDK with multi-agent trace/execution-graph debugging, synthetic-data gen, and guardrail management — covers the online/guardrail-eval slice the section lacks. 🆕
- OpenAI — [OpenAI Cookbook — Evals](https://developers.openai.com/cookbook/topic/evals) · docs — Maintained, runnable recipes for building evals (incl. Agents SDK eval, evaluating agents with Langfuse); the practical companion to OpenAI Evals and a curator-grade 'show real work' resource. 🆕 ⚠ unverified URL

Must-reads: Inspect AI · promptfoo · Braintrust · verifiers · DeepEval · Phoenix/Langfuse (pick your observability) · RULER (judge-as-reward)

## 6 · Benchmark vs. eval and benchmark integrity: contamination, saturation, label errors, leaderboard gaming

- Ofir Press — [How to Build Good Language Modeling Benchmarks](https://ofir.io/How-to-Build-Good-Language-Modeling-Benchmarks/) · blog — The benchmark-author's checklist; difficulty target; one-number reporting; 150–500 task sizing.
- Kapoor et al. — [AI Agents That Matter](https://arxiv.org/abs/2407.01502) · paper — Cost-controlled evaluation; model-dev vs downstream-dev needs; holdouts.
- OpenAI — [Why We No Longer Evaluate SWE-bench Verified](https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/) · blog — ~59% of audited failures were broken tests. (mirror: https://decrypt.co/359012/...)
- Shivalika Singh et al. (Cohere/Princeton/Stanford/MIT/AI2) — [The Leaderboard Illusion](https://arxiv.org/abs/2504.20879) · paper — Private testing, selective disclosure, and data-access asymmetry on Chatbot Arena. (notes: research/notes/leaderboard-illusion.md)
- [The SWE-bench Illusion: When SOTA LLMs Remember Instead of Reason](https://arxiv.org/abs/2506.12286) · paper — Memorization inflates SWE-bench scores.
- [Establishing Best Practices for Building Rigorous Agentic Benchmarks (ABC)](https://arxiv.org/abs/2507.02825) · paper — SWE-bench Verified weak tests; τ-bench rewards empty responses. (verified high)
- Epoch AI — [FrontierMath Tiers 1–3 v2 (corrected)](https://epoch.ai/benchmarks/frontiermath-tiers-1-3-v2) (changelog: .../frontiermath-tier-4-v2) · page — ~42% of problems corrected after AI-assisted review. (also T8: the operator-as-rot-detector tale)
- FutureHouse / Andrew White — [About 30% of Humanity's Last Exam Answers Are Wrong](https://www.futurehouse.org/research-announcements/hle-exam) · blog — 29 ± 3.7% of text-only chem/bio answers contradicted by the literature. (LessWrong writeup: https://www.lesswrong.com/posts/JANqfGrMyBgcKtGgK/)
- Nathan Lambert — [Building on Evaluation Quicksand](https://www.interconnects.ai/p/building-on-evaluation-quicksand) · blog — No hard source of truth; synthetic-data contamination.
- [Lost in Simulation](https://arxiv.org/abs/2601.17087) · paper — Simulated users are unreliable proxies (~9pp swings by simulator choice; demographic miscalibration).
- Jimenez, Yang, … Press, Narasimhan — [SWE-bench: Can LMs Resolve Real-World GitHub Issues?](https://arxiv.org/abs/2310.06770) · https://www.swebench.com (Verified: .../verified.html) · paper/site.
- Eugene Yan — [Task-Specific LLM Evals that Do & Don't Work](https://eugeneyan.com/writing/evals/) · blog — Off-the-shelf evals rarely transfer; accuracy is too coarse.
- [Andrej Karpathy on evals](https://x.com/karpathy/status/1896266683301659068) · post — "We make a number of specific recommendations…" (the eval-as-narrow critique).
- Hugh Zhang et al. (Scale AI) — [A Careful Examination of LLM Performance on Grade School Arithmetic (GSM1k)](https://arxiv.org/abs/2405.00332) · paper — Held-out GSM1k replica of GSM8k exposes up to 8% accuracy drop and partial memorization (Mistral/Phi) — the canonical method for measuring benchmark overfitting/contamination via a matched holdout.
- Curtis Northcutt, Anish Athalye, Jonas Mueller — [Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks](https://arxiv.org/abs/2103.14749) · paper — NeurIPS 2021 foundational result: ~3.3% avg label errors across 10 famous test sets (ImageNet, MNIST, etc.); corrections flip model rankings. The canonical 'label errors' citation this section's theme rests on (labelerrors.com / cleanlab).
- Aryo Pradipta Gema et al. (Edinburgh) — [Are We Done with MMLU? (MMLU-Redux)](https://arxiv.org/abs/2406.04127) · paper — ~6.5% of MMLU questions contain errors (57% in Virology); MMLU-Redux re-annotation shifts rankings — directly demonstrates label-error impact on the most-cited LLM benchmark.
- Naman Jain et al. (UC Berkeley) — [LiveCodeBench: Holistic and Contamination-Free Evaluation of LLMs for Code](https://arxiv.org/abs/2403.07974) · benchmark — Time-windowed problem collection (post-cutoff scoring) as the leading contamination-resistant design pattern — the section discusses contamination but lists no exemplar of how to engineer around it.
- White, Dohan, LeCun, Goldblum et al. — [LiveBench: A Challenging, Contamination-Limited LLM Benchmark](https://github.com/LiveBench/LiveBench) · benchmark — Monthly-refreshed questions from new arXiv/news/competitions with objective ground truth — the canonical 'dynamic refresh' answer to saturation and contamination.
- Clémentine Fourrier / Hugging Face — [The LLM Evaluation Guidebook (Open LLM Leaderboard team)](https://github.com/huggingface/evaluation-guidebook) · docs — Practitioner reference from running the Open LLM Leaderboard; explicit sections on contamination, reproducibility, and leaderboard design — the hands-on 'how to not get fooled' companion to this section (updated version: hf.co/spaces/OpenEvals/evaluation-guidebook).
- Kapoor, Stroebl, Kirgis et al. (Princeton) — [Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation](https://arxiv.org/abs/2510.11977) · paper — 21,000+ standardized agent runs surfacing leaderboard unreliability and unreported misbehaviors (agents searching HuggingFace for benchmark answers) — extends 'AI Agents That Matter' to leaderboard integrity for agents specifically. 🆕
- Jambholkar, Rajani, Bakshi (Collinear AI) — [Gaming the System: Goodhart's Law Exemplified in the AI Leaderboard Controversy](https://blog.collinear.ai/p/gaming-the-system-goodharts-law-exemplified-in-ai-leaderboard-controversy) · blog — Practitioner framing of the Llama 4 / Chatbot Arena gaming episode through Goodhart's Law — the accessible blog companion to The Leaderboard Illusion paper. 🆕
- OpenAI — [A Shared Playbook for Trustworthy Third-Party Evaluations](https://openai.com/index/trustworthy-third-party-evaluations-foundations/) · blog (Safety, May 29 2026) — What makes independent evals of frontier-model safeguards & capabilities trustworthy: selecting the right harness, checking for validity hazards that distort results, and the standards third-party evaluators need. (also T10) 🆕

Must-reads: Press · Kapoor et al. · OpenAI SWE-bench Verified · Leaderboard Illusion

See also T2 — verifiers library, Lee's RL-env taxonomy, Garg's lifecycle, Wei's verifier's law.

- Nathan Lambert et al. — [RewardBench](https://arxiv.org/abs/2403.13787) · paper — Evaluating reward models (the verifier you train against).
- Nathan Lambert — [The New RL Scaling Laws](https://www.interconnects.ai/p/the-new-rl-scaling-laws) · blog — Where RLVR scaling is heading. (interview: https://www.latent.space/p/the-rlvr-revolution-with-nathan-lambert)
- [Spurious Rewards: Rethinking Training Signals in RLVR](https://arxiv.org/abs/2506.10947) · paper — Random/spurious rewards rival ground truth on Qwen2.5 (Qwen-specific). (cite arXiv figures, not the blog gloss — see research/notes/reference-audit.md)
- Nathan Lambert — [The State of Post-Training 2025](https://www.interconnects.ai/p/the-state-of-post-training-2025) · blog — Context for where evals feed training.
- Lilian Weng — [Reward Hacking in Reinforcement Learning](https://lilianweng.github.io/posts/2024-11-28-reward-hacking/) · blog — The canonical survey of reward hacking — taxonomy, RLHF-specific failure modes, mitigations; the foundational reference any reward-design section needs.
- Victoria Krakovna et al. (Google DeepMind) — [Specification gaming: the flip side of AI ingenuity](https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/) · blog — Canonical specification-gaming post (+ the running examples list); origin story of why verifiers/reward functions get gamed, predating the LLM-RL wave.
- Latent Space / Will Brown — [Multi-Turn RL for Multi-Hour Agents — with Will Brown (Prime Intellect)](https://www.latent.space/p/willccbb) · talk — The verifiers author on building multi-turn RL environments, turn-level credit assignment and reward design in practice — the practitioner voice behind the verifiers library already cited here. 🆕
- various (arXiv 2509.21882) — [Position: The Hidden Costs and Measurement Gaps of RLVR](https://arxiv.org/abs/2509.21882) · paper — RLVR gains overstated via budget mismatch, calibration drift, contamination; proposes a tax-aware minimum standard — the rigor counterweight to Lambert's RL-scaling optimism. 🆕
- Saumya Malik, Nathan Lambert et al. (Ai2) — [RewardBench 2: Advancing Reward Model Evaluation](https://arxiv.org/abs/2506.01937) · benchmark — The 2025 successor to RewardBench (already listed) — harder, less saturated, ICLR 2026; the current bar for evaluating the verifier you train against. 🆕
- Nathan Lambert — [Reward Modeling (RLHF Book, ch. 5)](https://rlhfbook.com/c/05-reward-models) · docs — Canonical free reference chapter on reward models — the standing explainer for the 'verifier you train against' framing this section uses. 🆕
- Shubham Parashar et al. (Texas A&M) — [Curriculum RL from Easy to Hard Tasks Improves LLM Reasoning (E2H Reasoner)](https://arxiv.org/abs/2506.06632) · paper — Difficulty-calibration primary source: easy-to-hard scheduling with convergence guarantees and the 'fade out easy tasks' result — directly fills the section's difficulty-calibration theme. 🆕
- Jiacheng Guo, Ling Yang, Mengdi Wang et al. (Princeton) — [GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators](https://arxiv.org/abs/2512.19682) · paper — Generative environment simulator with an alpha-Curriculum Reward that keeps tasks in the zone of proximal development — recent take on auto-calibrating env difficulty to the agent. 🆕

Must-reads: Lee RL-env taxonomy · Garg lifecycle · verifiers repo

- Eugene Yan — [Evaluating the Effectiveness of LLM-Evaluators](https://eugeneyan.com/writing/llm-evaluators/) · blog — Position/verbosity/self-enhancement bias; direct vs pairwise; prefer binary + classification metrics.
- Hamel Husain — [Creating an LLM-as-a-Judge That Drives Business Results](https://hamel.dev/blog/posts/llm-judge/) · blog — Critique-shadowing; validate against ONE benevolent-dictator expert; precision/recall over raw agreement.
- Shankar et al. (UIST '24) — [Who Validates the Validators? (EvalGen)](https://arxiv.org/abs/2404.12272) (pdf: .../pdf/2404.12272; UIST: https://people.eecs.berkeley.edu/~bjoern/papers/shankar-validators-uist2024.pdf) · paper — Criteria drift; the coverage-vs-false-failure judge-alignment loop.
- Hamel Husain & Shreya Shankar — [LLM Evals FAQ](https://hamel.dev/blog/posts/evals-faq/) (error-analysis section: .../why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed.html) · blog — Binary over Likert; review ≥100 traces; the first-failure transition matrix for agents.
- Han-Chung Lee — [LLM-as-a-Judge: Rethinking Model-Based Evaluations](https://leehanchung.github.io/blogs/2024/08/11/llm-as-a-judge/) · blog — Avoid [0,1] continuous scales; manage judges like junior annotators.
- Zheng et al. — [Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://arxiv.org/abs/2306.05685) · paper — Source of the 10%/25% self-favoring & position-bias numbers — which the authors themselves hedge ("cannot determine"); GPT-3.5 doesn't self-favor.
- Bavaresco et al. — [LLMs Instead of Human Judges? A Large-Scale Study](https://arxiv.org/abs/2406.18403) · paper — Substantial variance across models/datasets; validate judges against humans first.
- Eugene Yan — [AlignEval](https://eugeneyan.com/writing/aligneval/) · blog — "Align AI to human. Calibrate human to AI. Repeat." Work backward from the data.
- Eugene Yan — [Product Evals in Three Simple Steps](https://eugeneyan.com/writing/product-evals/) · blog — The "God Evaluator" anti-pattern; the benchmark is human performance, not perfection.
- Han-Chung Lee — [Statistics for AI/ML, Part 3 — Cohen's Kappa](https://leehanchung.github.io/blogs/2025/03/03/cohen-kappa/) · blog — Chance-adjusted inter-annotator agreement (the gate before holding out).
- Shreya Shankar — Data Flywheels for LLM Applications https://www.sh-reya.com/blog/ai-engineering-flywheel/ · blog — Binary metrics, the "GPT smell," error analysis as the core activity.
- SPADE https://arxiv.org/html/2401.03038v1 & DocETL https://arxiv.org/abs/2410.12189 — Shankar et al. · paper — Data-quality assertions / agentic query rewriting for LLM pipelines.
- Arjun Panickssery, Samuel R. Bowman, Shi Feng (NeurIPS 2024) — LLM Evaluators Recognize and Favor Their Own Generations https://arxiv.org/abs/2404.13076 · paper — The canonical causal study of self-preference bias: shows GPT-4/Llama-2 can recognize their own outputs and that self-recognition correlates linearly with self-favoring. The primary source behind the 'self-enhancement bias' that the section's blogs only allude to.
- Yang Liu et al. (Microsoft, EMNLP 2023) — G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment https://arxiv.org/abs/2303.16634 · paper — The foundational reference-free LLM-judge method (CoT + form-filling scoring). Defines the direct-scoring paradigm the section critiques; a curated judge section is incomplete without the paper that started it.
- Jiawei Gu et al. — A Survey on LLM-as-a-Judge https://arxiv.org/abs/2411.15594 · paper — The most-cited survey organizing the LLM-judge space (bias taxonomy, reliability methods, agreement metrics). Serves as the section's one-stop map and bibliography.
- Yulai Zhao, Haolin Liu, Dian Yu et al.
(Tencent AI Lab / Princeton) — One Token to Fool LLM-as-a-Judge https://arxiv.org/abs/2507.08794 · paper — Shows 'master-key' tokens (a colon, 'Solution:') trigger false-positive rewards up to 80% even on GPT-o1/Claude-4 judges, plus a robust Master-RM fix. Core evidence on judge/verifier reward-hacking fragility. 🆕
- Jon Saad-Falcon et al. — Stanford Hazy Research / Scaling Intelligence — Weaver: Closing the Generation-Verification Gap with Weak Verifiers https://hazyresearch.stanford.edu/blog/2025-06-18-weaver · blog — Directly operationalizes 'verifiable vs judgeable': aggregates many weak judges/reward models without labels to shrink the generator-verifier gap, reaching o3-mini accuracy from Llama-3.3-70B. Paper: arxiv.org/abs/2506.18203. 🆕
- Mingchen Zhuge et al. (Meta AI / KAUST) — Agent-as-a-Judge: Evaluate Agents with Agents https://arxiv.org/abs/2410.10934 · paper — Extends LLM-as-judge to agentic trajectories (grading intermediate steps, not just final outputs) with the DevAI benchmark. The agent-specific evaluation case this agent-evals library specifically needs.
- Various (AAAI 2026) — VerifyBench: A Systematic Benchmark for Evaluating Reasoning Verifiers Across Domains https://arxiv.org/abs/2507.09884 · benchmark — Cross-domain benchmark exposing verifier precision/recall trade-offs (specialized verifiers: high-accuracy but low-recall; general models: inclusive but unstable). Quantifies how trustworthy a verifier actually is for RLVR.
🆕
- Databricks Mosaic Research — Enhancing LLM-as-a-Judge with Grading Notes / From Pilot to Production with Custom Judges https://www.databricks.com/blog/pilot-production-custom-judges · blog — Enterprise-grade judge-building playbook: 20-30 calibration examples, batched SME annotation, Krippendorff's alpha agreement gating; a production-side complement to the Hamel/Shankar academic alignment loop. 🆕
- Jiayi Ye et al. — Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge (CALM framework) https://arxiv.org/abs/2410.02736 · paper — Systematic quantification of 12 judge biases (verbosity, bandwagon, authority, distraction, sentiment, etc.) via automated attacks; broadens the section's bias coverage well beyond position/verbosity/self-enhancement.

Must-reads: Yan llm-evaluators · Hamel llm-judge · Shankar EvalGen

9 · Agent-specific evaluation (trajectories, tool use, multi-turn, world state, multi-agent, localization)

- Anthropic — Demystifying Evals for AI Agents https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents · blog — Grade the final env state (flight-booking via SQL); outcome vs trajectory; isolation; pass@k vs pass^k.
- Sierra — τ-bench / τ²-bench https://arxiv.org/abs/2406.12045 · https://github.com/sierra-research/tau-bench · paper/repo — DB-state-diff grading; user simulation; pass^k; empty-result as explicit fail.
- Sierra — Benchmarking AI Agents https://sierra.ai/blog/benchmarking-ai-agents · blog — The motivation behind τ-bench.
- Mialon et al. — GAIA: A Benchmark for General AI Assistants https://arxiv.org/abs/2311.12983 · paper — Real assistant tasks; difficulty by human task-length.
- Eugene Yan — Patterns for Building Cybersecurity Evals https://eugeneyan.com/writing/cybersecurity-evals/ · blog — The four-primitive agentic-eval template (sandbox, difficulty inputs, tools, deterministic grader); outcome grading + partial-credit ladders + transcript audits. (also T10)
- Han-Chung Lee — Statistics for AI/ML, Part 4 — pass@k and Unbiased Estimator https://leehanchung.github.io/blogs/2025/09/08/pass-at-k/ · blog — Demystifies the metric everyone misuses.
- Han-Chung Lee — First-Principles Eval https://leehanchung.github.io/blogs/2024/05/22/first-principles-eval/ · blog.
- SWE-bench grading harness https://github.com/SWE-bench/SWE-bench/blob/main/swebench/harness/grading.py · tool/repo — FAIL_TO_PASS / PASS_TO_PASS as a verifiable reward.
SWE-agent ACI: https://swe-agent.com/0.7/background/aci/
- OpenAI — human-eval pass@k estimator https://github.com/openai/human-eval/blob/master/human_eval/evaluation.py · tool/repo.
- More agent benchmarks named in the brief (WebArena, OSWorld, Terminal-Bench, Cybench) are listed below with verified URLs.
- Zhou et al. (CMU) — WebArena: A Realistic Web Environment for Building Autonomous Agents https://arxiv.org/abs/2307.13854 · benchmark — Self-hostable sandboxed websites (e-commerce/forum/GitLab/CMS/maps) with execution-based functional-correctness graders; 812 tasks. The canonical web-agent world-state benchmark named in the brief; now URL-verified.
- Xie et al. (HKU et al.) — OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments https://arxiv.org/abs/2404.07972 · benchmark — 369 real-computer tasks in VMs with per-task execution-based eval scripts and initial-state setup; humans 72% vs best agent 12%. Canonical computer-use benchmark named in the brief; now verified.
- Laude Institute + Stanford + community — Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command-Line Interfaces https://www.tbench.ai/ · benchmark — Sandboxed terminal tasks with deterministic verifiers across SWE/sysadmin/security; v2 leaderboard. The terminal-agent benchmark named in the brief; verified (arxiv: arxiv.org/abs/2601.11868). 🆕
- Zhang et al.
(Stanford) — Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language Models https://arxiv.org/abs/2408.08926 · benchmark — 40 professional CTF challenges with subtask annotations and deterministic flag-based grading; pairs naturally with Eugene Yan's cybersecurity-evals post already in the section. Named in the brief; now verified.
- Lù et al. (McGill / Mila / Google DeepMind) — AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories https://arxiv.org/abs/2504.08942 · paper — First benchmark of LLM-judges-of-trajectories: 1,302 expert-reviewed web-agent runs; shows rule-based graders reject many valid trajectories (under-reporting success). Core to the section's 'trajectory evaluation' theme. 🆕
- Cemri, Pan et al. (UC Berkeley Sky Lab) — Why Do Multi-Agent LLM Systems Fail? (MAST taxonomy) https://arxiv.org/abs/2503.13657 · paper — 14-mode failure taxonomy across 7 MAS frameworks from 200+ annotated traces; the reference framework for diagnosing multi-agent failures, covering the section's 'multi-agent' theme. 🆕
- Trivedi et al. (Stony Brook) — ACL'24 Best Resource Paper — AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents https://aclanthology.org/2024.acl-long.850/ · benchmark — 9-app simulated world (457 APIs) with state-based unit tests that also check for collateral damage/unexpected state changes; gold-standard world-state grading for tool-use agents.
- OpenAI (Wei et al.)
— BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents https://openai.com/index/browsecomp/ · benchmark — 1,266 'inverted' (hard-to-find, easy-to-verify) questions for deep-research browsing agents; short verifiable answers make grading deterministic. Released 2025, now standard for browsing-agent eval. (paper: arxiv.org/abs/2504.12516) 🆕
- Chen, Tang et al. (Yale / All Hands) — LocAgent: Graph-Guided LLM Agents for Code Localization https://arxiv.org/abs/2503.09089 · paper — Defines and evaluates code localization as its own capability (Acc@k over file/function locations via code graphs); covers the 'localization' theme named in the section title. 🆕
- He et al. (Tencent AI Lab) — WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models https://arxiv.org/abs/2401.13919 · benchmark — 643 tasks on 15 live real-world sites with a GPT-4V automatic-judge eval protocol; an early, widely-cited example of multimodal-LLM-as-judge for live-web agent trajectories.
- BenchFlow — SkillsBench https://github.com/benchflow-ai/skillsbench · benchmark — Evaluates how well agent skills work and how effectively agents use them; makes skill-acquisition/skill-use a measurable axis (the "Agent Skills" frontier). ~1.4k★. 🆕
- BenchFlow — ClawsBench https://github.com/benchflow-ai/ClawsBench · benchmark — BenchFlow's agent benchmark (results/data repo; full release in progress). 🆕
- OpenAI (with SWE-bench authors) — SWE-bench Verified https://openai.com/index/introducing-swe-bench-verified/ · benchmark — 500 human-validated SWE-bench instances graded by hidden FAIL_TO_PASS unit tests; the de facto standard for real-issue resolution and the headline coding-agent number labs report. 🆕
- Yang, Jimenez, Press et al. (Princeton/Stanford) — SWE-bench Multimodal https://arxiv.org/abs/2410.03859 · benchmark — 619 visual JS/front-end issues from 17 user-facing repos, test-verified; probes whether SWE agents generalize beyond Python/text to visual software domains.
- Scale AI (Deng, Da et al.) — SWE-bench Pro https://arxiv.org/abs/2509.16941 · benchmark — 1,865 long-horizon, multi-file tasks across public GPL + held-out + commercial startup repos, test-graded; contamination-resistant and hard (frontier below 45% pass@1). 🆕
- OpenAI (Miserendino, Patwardhan, Heidecke et al.) — SWE-Lancer https://arxiv.org/abs/2502.12115 · benchmark — 1,400+ real Upwork freelance tasks worth $1M, graded by triple-verified end-to-end Playwright tests plus manager-decision tasks; ties capability to economic value. 🆕
- Pan, Wang, Neubig, Suhr, Zhang et al.
(Berkeley/CMU) — SWE-Gym https://arxiv.org/abs/2412.21139 · benchmark — 2,438 executable Python SWE tasks with pre-installed deps + test verification; the first real training/eval gym for SWE agents and verifiers, ICML 2025. 🆕
- ByteDance Seed — Multi-SWE-bench https://arxiv.org/abs/2504.02605 · benchmark — 1,632 expert-annotated issue-resolution tasks across Java, TS, JS, Go, Rust, C, C++, test-graded; the leading multilingual SWE-bench extension, NeurIPS 2025 D&B. 🆕
- Nebius / Badertdinov et al. — SWE-rebench https://arxiv.org/abs/2505.20411 · benchmark — Automated pipeline yielding 21k+ executable Python tasks with continuously refreshed, decontaminated eval splits; quantifies how much SWE-bench Verified scores are inflated by contamination, NeurIPS 2025 D&B. 🆕
- METR — RE-Bench https://arxiv.org/abs/2411.15114 · benchmark — 7 open-ended ML research-engineering environments (e.g. GPU-kernel optimization, scaling laws) scored against 71 human-expert 8-hour attempts; the reference AI-R&D-uplift eval, ICML 2025.
- OpenAI (Chan et al.) — MLE-bench https://arxiv.org/abs/2410.07095 · https://github.com/openai/mle-bench · benchmark — 75 Kaggle ML-engineering competitions graded against real human leaderboards (medal thresholds) in 24h Docker runs; standard ML-engineering-agent eval, ICLR 2025. 🆕
- OpenAI (Starace et al.)
— PaperBench https://arxiv.org/abs/2504.01848 · benchmark — Replicate 20 ICML 2024 papers from scratch, graded by 8,316 author-co-developed rubric leaves via a validated LLM judge; rigorous research-replication agent eval, ICML 2025. 🆕
- Andy Konwinski / Kaggle — Konwinski Prize (K Prize) https://www.kaggle.com/competitions/konwinski-prize · leaderboard — $1M Kaggle forecasting-format contest on GitHub bugs filed after submission close, fully contamination-free, test-graded; a round-1 top score of only 7.5% exposed real-world difficulty. 🆕
- Gou et al., OSU NLP Group (NeurIPS 2025 D&B) — Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge https://arxiv.org/abs/2506.21506 · benchmark — 130 long-horizon live-web agentic-search tasks; novel Agent-as-a-Judge rubric-tree grader for time-varying, citation-backed answers; a serious answer to the Deep Research evaluation gap. 🆕
- Xue et al., OSU NLP Group — Online-Mind2Web (An Illusion of Progress? Assessing the Current State of Web Agents) https://arxiv.org/abs/2504.01382 · benchmark — 300 realistic tasks on 136 live websites with an LLM-as-a-Judge auto-grader (~85% human agreement); exposes overstated web-agent progress vs simple baselines. 🆕
- AGI Inc (agi-inc/REAL, powers realevals.xyz) — REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites https://github.com/agi-inc/REAL · benchmark — 112 tasks on deterministic Next.js replicas of Amazon/Uber/LinkedIn etc.; reproducible LLM evaluator plus state validators; fixes the flakiness of live-site web benchmarks.
🆕
- Thomas et al., Convergence AI — WebGames: Challenging General-Purpose Web-Browsing AI Agents https://arxiv.org/abs/2502.18356 · benchmark — 50+ client-side challenges isolating specific browser interaction skills with verifiable pass/fail; best agent 41% vs 96% human, a sharp diagnostic gap. 🆕
- Patil et al., UC Berkeley Gorilla / ICML 2025 — Berkeley Function Calling Leaderboard (BFCL V4) https://gorilla.cs.berkeley.edu/leaderboard.html · leaderboard — Executable + AST-based grading of tool/function calling; V4 adds multi-turn agentic, web-search and memory tasks; the de facto tool-calling leaderboard. 🆕
- Wang et al., Shanghai AI Laboratory (NeurIPS 2024 D&B) — GTA: A Benchmark for General Tool Agents https://arxiv.org/abs/2407.08713 · benchmark — 229 human-written real-world queries with implicit multimodal tool use; executable evaluation platform across perception/operation/logic/creativity tools (GTA-2 follow-up in 2026). 🆕
- Lei et al., XLang Lab / HKU (ICLR 2025 Oral) — Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows https://arxiv.org/abs/2411.07763 · benchmark — Enterprise text-to-SQL agent workflows over huge schemas and multiple dialects with execution-based grading; frontier models reach only ~17-21%; a hard, realistic data-agent eval.
🆕
- Rawles et al., Google DeepMind / Google Research (ICLR 2025) — AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents https://arxiv.org/abs/2405.14573 · benchmark — Live Android environment with durable reward signals from device system state for 116 parameterized tasks across 20 apps; the standard mobile-GUI agent benchmark. 🆕
- Bonatti et al., Microsoft — WindowsAgentArena: Evaluating Multi-Modal OS Agents at Scale https://arxiv.org/abs/2409.08264 · benchmark — 154 realistic multi-step Windows-OS tasks across apps with programmatic success checks; parallelizable in Azure (~20 min full run); desktop computer-use counterpart to OSWorld. 🆕
- Levy, Shlomov, Wiesel et al., IBM Research — ST-WebAgentBench: Evaluating Safety and Trustworthiness in Web Agents https://arxiv.org/abs/2410.06703 · benchmark — 375 enterprise tasks carrying 3,057 explicit safety/policy constraints; introduces Completion-under-Policy and Risk Ratio; grades whether agents obey rules, not just succeed. 🆕
- Xu et al., CMU — TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks https://arxiv.org/abs/2412.14161 · benchmark — Self-hosted software-company sim (web, code, chat coworkers) with checkpoint-based partial-credit grading; best agent ~30%; a full-day-knowledge-worker eval. 🆕
- Koh et al., Carnegie Mellon University — VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks https://arxiv.org/abs/2401.13649 · benchmark — 910 visually-grounded web tasks across Classifieds/Shopping/Reddit with reproducible programmatic reward functions; the multimodal extension of WebArena.
- Tejal Patwardhan et al.
(OpenAI) — GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks https://arxiv.org/abs/2510.04374 · benchmark — 1,320 expert-built tasks across 44 occupations in the top 9 GDP sectors; 220-task gold subset open-sourced with a public automated grading service at evals.openai.com; the flagship economic-value agent benchmark. 🆕
- CAIS + Scale AI (47 authors) — Remote Labor Index: Measuring AI Automation of Remote Work https://arxiv.org/abs/2510.26787 · benchmark — Grades whether agents complete whole real freelance projects to client-acceptable standard; best agent automates only 2.5%; a hard, money-grounded ceiling for end-to-end remote work. 🆕
- Center for AI Safety + Scale AI (Dan Hendrycks et al.) — Humanity's Last Exam https://arxiv.org/abs/2501.14249 · benchmark — 2,500 expert-written frontier-knowledge questions with unambiguous auto-gradable answers across dozens of fields; the canonical post-MMLU saturation exam (note: now very widely cited). 🆕
- OSU-NLP Group (Ohio State) — ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery https://github.com/OSU-NLP-Group/ScienceAgentBench · benchmark — 102 expert-validated tasks from 44 peer-reviewed papers; grades self-contained Python programs by execution + success rate; best agent solves only ~34% (ICLR 2025). 🆕
- Siegel, Kapoor, Narayanan et al.
(Princeton) — CORE-Bench: Computational Reproducibility Agent Benchmark https://arxiv.org/abs/2409.11363 · benchmark — 270 tasks over 90 papers (CS/social science/medicine) that grade whether an agent can reproduce published results from code+data; from the Princeton AI-Snake-Oil group.
- Mingxuan Du et al. — DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents https://arxiv.org/abs/2506.11763 · benchmark — 100 PhD-level tasks across 22 fields; reference-based adaptive-rubric grader for analyst-grade citation-rich reports, validated for human-judgment alignment; the standard deep-research-report eval. 🆕
- FutureHouse + ScienceMachine — BixBench: A Comprehensive Benchmark for LLM-based Agents in Computational Biology https://arxiv.org/abs/2503.00096 · benchmark — 50+ real bioinformatics analysis scenarios with ~300 open-answer questions over multi-step Jupyter trajectories; frontier models hit only ~17%; serious wet-lab-adjacent science agent eval. 🆕
- Meta (Meta Agents Research Environments) — Gaia2 and ARE: Scaling Up Agent Environments and Evaluations https://arxiv.org/abs/2509.17158 · benchmark — Successor to GAIA: dynamic, time-driven, multi-agent simulated environments with async world events and a verifiable scenario grader; frontier success ~42%; the serious general-assistant env from Meta. 🆕
- Andon Labs (Backlund & Petersson) — Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents https://arxiv.org/abs/2502.15840 · benchmark — Run a simulated vending business over 20M-token horizons; objectively graded on profit/net-worth, exposing long-horizon coherence breakdowns unrelated to context limits.
🆕
- Francois Chollet et al. (ARC Prize Foundation) — ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems https://arxiv.org/abs/2505.11831 · benchmark — Human-calibrated (400+ participants, 100% solvable) grid-reasoning tasks with exact-match grading; 2-3x harder than ARC-AGI-1 across all approaches; the frontier fluid-intelligence benchmark. 🆕
- Patronus AI — TRAIL: Trace Reasoning and Agentic Issue Localization https://arxiv.org/abs/2505.08638 · benchmark — 148 annotated agent traces with 841 errors (reasoning/planning/execution); grades whether an LLM can localize the failure in a trace (best model ~11%). HF dataset PatronusAI/TRAIL. 🆕
- Salesforce Research — CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios https://arxiv.org/abs/2505.18878 · benchmark — 19 expert-validated B2B/B2C tasks on a realistic Salesforce org with state-based grading; exposes the single-turn (~58%) vs multi-turn (~35%) reliability gap plus confidentiality checks. 🆕

Must-reads: Anthropic demystifying · τ-bench · Lee pass@k

10 · Safety / adversarial evaluation (prompt injection, jailbreaks, action-authorization, benchmark auditing)

- Wang, Li, Mang, Cheung, Sen, Song (incl. Dawn Song) — BenchJack: Systematically Auditing AI Agent Benchmarks https://arxiv.org/abs/2605.12673 · paper — Reward hacking emerges spontaneously in frontier models; an 8-pattern flaw taxonomy + a 30-question Agent-Eval checklist; "benchmarks must be secure by design."
- Dawn Song (UC Berkeley RDI, lecture slides) — Towards Building Safe & Secure Agentic AI https://rdi.berkeley.edu/adv-llm-agents/slides/dawn-agentic-ai.pdf · talk — The adversarial setting; environment-borne attacks.
- Dawn Song — ICLR 2025 keynote on LLM safety https://iclr.cc/virtual/2025/invited-talk/36783 · talk.
- Wang et al. (incl. Dawn Song) — CyberGym https://arxiv.org/html/2506.02548v2 · paper — Memory-safety PoC generation from OSS-Fuzz; sanitizer-crash grading at scale.
- Zeng et al. (incl. Song) — AIR-Bench 2024 https://arxiv.org/abs/2407.17436v2 · https://github.com/stanford-crfm/air-bench-2024 · paper/repo — Regulation-grounded risk taxonomy.
- DecodingTrust https://decodingtrust.github.io · benchmark — NeurIPS 2023 trustworthiness benchmark.
- RedCode https://arxiv.org/abs/2411.07781 · paper — Risky code execution/generation benchmark for code agents.
- AgentPoison https://arxiv.org/abs/2407.12784 · paper — Red-teams agents by poisoning their RAG memory.
- Miller (Anthropic) — Adding Error Bars to Evals: A Statistical Approach to LM Evaluations https://arxiv.org/abs/2411.00640 · https://www.anthropic.com/research/statistical-approach-to-model-evals · paper — Standard errors, clustered SEs, paired difference tests: "is this difference real?" (cross-cutting: T6/T8)
- Debenedetti, Zhang, Balunović, Beurer-Kellner, Fischer, Tramèr (ETH Zurich) — AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents https://arxiv.org/abs/2406.13352 · benchmark — The canonical prompt-injection benchmark for tool-using agents (97 tasks, 629 security cases over untrusted data); NeurIPS 2024 D&B, now the standard eval everyone reports against. 🆕
- Andriushchenko, Souly, Davies et al. (Gray Swan / UK AISI) — AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents https://arxiv.org/abs/2410.09024 · benchmark — ICLR 2025 benchmark of 110 malicious agent tasks (440 with augmentations) across 11 harm categories; shows leading models comply with malicious agent requests without jailbreaking. The reference action-misuse/refusal benchmark. 🆕
- Zhan, Liang et al. (UIUC) — InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated LLM Agents https://arxiv.org/abs/2403.02691 · benchmark — ACL 2024 Findings; 1,054 IPI test cases over 17 user / 62 attacker tools, splitting direct-harm vs data-exfiltration intents. Foundational indirect-prompt-injection benchmark predating AgentDojo.
- Debenedetti, Shumailov, Fan, Hayes et al.
(Google DeepMind) — Defeating Prompt Injections by Design (CaMeL) https://arxiv.org/abs/2503.18813 · paper — The defense-by-design counterpart: extracts control/data flow from the trusted query and enforces capability-based policies so untrusted data can't alter program flow; effectively solves AgentDojo's security eval. The key 2025 mitigation paper. 🆕
- Simon Willison — The lethal trifecta for AI agents: private data, untrusted content, and external communication https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/ · blog — The most-cited conceptual frame for reasoning about when an agent is unconditionally vulnerable to prompt injection; essential practitioner mental model at the Eugene-Yan bar. 🆕
- Kutasov, Bowman et al. (Anthropic) — SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents https://www.anthropic.com/research/shade-arena-sabotage-monitoring · benchmark — 17 complex environments pairing a benign main task with a hidden harmful side task to measure whether agents can sabotage without tripping an AI monitor; the canonical sabotage/monitorability eval. Paper: arxiv.org/abs/2506.15740 🆕
- Anthropic Alignment team — Agentic Misalignment: How LLMs Could Be Insider Threats https://www.anthropic.com/research/agentic-misalignment · paper — Red-team study showing frontier models will resort to blackmail/leaking under goal conflict in agentic settings; the reference for action-authorization / insider-threat adversarial evaluation. Companion to the cited Anthropic error-bars piece.
🆕 - — Microsoft AI Red Team Azure — PyRIT — Python Risk Identification Tool for generative AI https://github.com/Azure/PyRIT https://github.com/Azure/PyRIT https://github.com/Azure/PyRIT · tool — The de-facto open-source red-teaming automation framework 70+ converters, multi-turn attacks like Crescendo/TAP ; how practitioners actually run adversarial evals at scale. The section lists papers but no tooling. 🆕 - — OWASP GenAI Security Project — OWASP Top 10 for Agentic Applications 2026 + LLM Applications 2025 https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/ https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/ https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/ · docs — Industry-standard risk taxonomy: goal hijack, tool misuse, identity/privilege abuse, memory poisoning, rogue agents; complements the regulation-grounded AIR-Bench taxonomy already listed. The canonical practitioner threat checklist. 🆕 - — MITRE — MITRE ATLAS — Adversarial Threat Landscape for AI Systems https://atlas.mitre.org/ https://atlas.mitre.org/ https://atlas.mitre.org/ · docs — ATT&CK-style living knowledge base of 16 tactics / 80+ techniques against AI systems with real-world case studies and mitigations; the standard reference framework for AI adversarial threat modeling. - — Zhang, Yang et al. 
— Agent Security Bench ASB : Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents https://proceedings.iclr.cc/paper files/paper/2025/file/5750f91d8fb9d5c02bd8ad2c3b44456b-Paper-Conference.pdf https://proceedings.iclr.cc/paper files/paper/2025/file/5750f91d8fb9d5c02bd8ad2c3b44456b-Paper-Conference.pdf https://proceedings.iclr.cc/paper files/paper/2025/file/5750f91d8fb9d5c02bd8ad2c3b44456b-Paper-Conference.pdf · benchmark — ICLR 2025 unified benchmark spanning 10 scenarios, 400+ tools, covering DPI/IPI, memory poisoning, plan-of-thought backdoors and defenses in one harness; broadest single attack/defense agent benchmark. 🆕 - — Gray Swan AI / UK AISI w/ OpenAI, Anthropic, GDM — Gray Swan x UK AISI Agent Red-Teaming Challenge https://app.grayswan.ai/arena/blog/agent-red-teaming-the-ai-jailbreak-showdown https://app.grayswan.ai/arena/blog/agent-red-teaming-the-ai-jailbreak-showdown https://app.grayswan.ai/arena/blog/agent-red-teaming-the-ai-jailbreak-showdown · talk — Largest public agent red-teaming exercise: ~2,000 red-teamers, 1.8M attempts, 62k breaches against 22 tool-using agents financial/shopping/marketing bots ; real-world adversarial-eval data at scale. 
🆕

Must-reads: Dawn Song BenchJack · Anthropic error bars

- Hamel Husain & Emil Sedgh — How to Construct Domain Specific LLM Evaluation Systems https://www.youtube.com/watch?v=eLXF0VojuSs · talk · AI Engineer World's Fair 2024
- Jeff Huber & Jason Liu — How to look at your data https://www.youtube.com/watch?v=jryZvCuA0Uc · talk · AI Engineer World's Fair 2025
- Bryan Bischof — Failure is a Funnel https://www.youtube.com/watch?v=k98gDjYbSaU · talk · Data Council 2025
- Eugene Yan — Using LLMs as Judges: Insights, Challenges, Best Practices https://www.youtube.com/watch?v=7EGF0Mc0_os · talk · Jason Liu series 2024
- Shreya Shankar — Scaling Up Vibe Checks for LLMs https://www.youtube.com/watch?v=eGVDKegRdgM · talk · Stanford MLSys 97
- Shreya Shankar — Why LLM Data Processing Pipelines Fail https://www.youtube.com/watch?v=H-1QaLPnGsg · talk · LangChain Interrupt 2025
- Ido Pesok (Vercel v0) — Evals Are Not Unit Tests https://www.youtube.com/watch?v=L8OoYeDI_ls · talk · AI Engineer 2025
- David Karam (Pi Labs) — Building Metrics that actually work (workshop) https://www.youtube.com/watch?v=jxrGodnopHo · talk · AI Engineer 2025
- Brooke Hopkins (Coval) — From Self-driving to Autonomous Voice Agents https://www.youtube.com/watch?v=kDczF4wBh8s · talk · AI Engineer 2025
- Leonard Tang (Haize Labs) — Fuzzing in the GenAI Era https://www.youtube.com/watch?v=OMGPvW8TBHc · talk · AI Engineer 2025
- Omar Khattab (DSPy) — On Engineering AI Systems that Endure the Bitter Lesson https://www.youtube.com/watch?v=qdmxApz3EJI · talk · AI Engineer 2025
- Taylor Jordan Smith — Strategies for LLM Evals (harnesses workshop) https://www.youtube.com/watch?v=89NuzmKokIk · talk · AI Engineer 2025
- Ankur Goyal (Braintrust) — The Future of Evals https://www.youtube.com/watch?v=MC55hdWLq4o · talk · AI Engineer 2025
- Jason Wei (OpenAI) — 3 Key Ideas in AI in 2025 (Verifier's Law) https://www.youtube.com/watch?v=b6Doq2fz81U · talk · Stanford AI Club 2025
- Jason Wei — Some Intuitions About Large Language Models https://www.youtube.com/watch?v=l898fqkjdFc · talk · The AI Conference 2025
- Andrej Karpathy — Deep Dive into LLMs like ChatGPT https://www.youtube.com/watch?v=7xTGNNLPyMI · talk · 2025
- John Schulman — RLHF: Progress and Challenges https://www.youtube.com/watch?v=hhiLw5Q_UFg · talk · UC Berkeley EECS 2023
- Nathan Lambert (Ai2) — Aligning Open Language Models https://www.youtube.com/watch?v=AdLgPmcrXwQ · talk · Stanford CS25 V4
- Chip Huyen — Building LLM Applications for Production https://www.youtube.com/watch?v=spamOhG7BOA · talk · MLOps LLMs in Prod 2023
- Han-Chung Lee — The Model is the Product https://www.youtube.com/watch?v=4dUFIRj-BWo · talk · Data Council 2025
- Will Brown (Prime Intellect) — RL Environments at Scale https://www.youtube.com/watch?v=_IzZWeuTx7I · talk · AI Engineer 2025
- Florian Brand (Prime Intellect) — LLM benchmarks in the time of agents https://www.youtube.com/watch?v=kmTMc-fVSXw · talk · Big Techday 26, 2026
- Hamel Husain (How I AI / Claire Vo) — Evals, error analysis, and better prompts https://www.youtube.com/watch?v=PgzOBNse2EA · podcast · How I AI
- Ankur Goyal (How I AI) — Evals are the new PRD for AI products https://www.youtube.com/watch?v=QE_1hRLsehM · podcast · How I AI
- Hamel Husain (Vanishing Gradients) — Ep 60: 10 Things I Hate About AI Evals https://www.youtube.com/watch?v=QEk-XwrkqhI · podcast · Vanishing Gradients
- Hamel Husain (Vanishing Gradients) — Ep 50: A Field Guide to Rapidly Improving AI Products https://www.youtube.com/watch?v=rWToRi2_SeY · podcast · Vanishing Gradients
- Ankur Goyal (Latent Space) — Five Hard-Earned Lessons About Evals https://www.youtube.com/watch?v=a4BV0gGmXgA ·
podcast · Latent Space
- Cameron & Hill-Smith (Latent Space) — Artificial Analysis: Independent LLM Evals https://www.youtube.com/watch?v=v5mBjeX4TJ8 · podcast · Latent Space
- Petersson & Backlund (Andon Labs) — Reality: The Final Eval (Vending-Bench) https://www.youtube.com/watch?v=ZAimcoJXUBo · podcast · Latent Space / Cognitive Revolution
- Vaibhav Gupta & Dex (AI That Works) — 5 Designing Evals https://www.youtube.com/watch?v=-N6MajRfqYw · podcast · AI That Works
- AI That Works — 16 Evaluating Prompts Across Models https://www.youtube.com/watch?v=OawyQOrlubM · podcast · AI That Works
- AI That Works — 24 Evals for Classification https://www.youtube.com/watch?v=5Fy0hBzyduU · podcast · AI That Works
- AI That Works — 34 Multimodal Evals https://www.youtube.com/watch?v=jzhVo0iAX_I · podcast · AI That Works
- Maggie Konstanty (MLOps Community) — 372 It's 2026 and We're Still Talking Evals https://www.youtube.com/watch?v=9EjWR3QpJYk · podcast · MLOps Community
- Kelly Hong (TWIML) — 728 Generative Benchmarking https://www.youtube.com/watch?v=3kbiGPn0cOo · podcast · TWIML AI
- Percy Liang (Gradient Dissent) — Shaping AI Benchmarks (HELM) https://www.youtube.com/watch?v=kwkdKirqi6s · podcast · Gradient Dissent
- Joseph Gonzalez (Gradient Dissent) — Evaluating LLMs with Chatbot Arena https://www.youtube.com/watch?v=okHMaczHPXc · podcast · Gradient Dissent
- Aman Khan (Learning from ML) — Evaluating AI, Designing for Non-Determinism https://www.youtube.com/watch?v=v0eTTn7ZPEc · podcast · Learning from Machine Learning
- Andrej Karpathy (Dwarkesh) — Karpathy: RL is terrible, why benchmarks mislead https://www.youtube.com/watch?v=-lRBpyPt79c · podcast · Dwarkesh Podcast
- Hamel Husain & Shreya Shankar — How to Build AI Evals in 2026 (Step-by-Step) https://www.youtube.com/watch?v=J7N9FMouSKg · podcast · Aakash Gupta 2026
- Dawn Song — Towards Building Safe & Trustworthy AI Agents https://www.youtube.com/watch?v=QAgR4uQ15rc · lecture · Berkeley LLM Agents MOOC F24
- Dawn Song — Towards Building Safe and Secure Agentic AI https://www.youtube.com/watch?v=ti6yPE2VPZc · lecture · Berkeley Advanced LLM Agents Sp25
- Ben Mann (Anthropic) — Measuring Agent Capabilities and Anthropic's RSP https://www.youtube.com/watch?v=6y2AnWol7oo · lecture · Berkeley LLM Agents MOOC F24
- Percy Liang — Open-Source and Science in the Era of Foundation Models https://www.youtube.com/watch?v=f3KKx9LWntQ · lecture · Berkeley LLM Agents MOOC F24
- Hashimoto & Liang — CS336 Lecture 12: Evaluation https://www.youtube.com/watch?v=x-R5l2HsXqM · lecture · Stanford CS336 2025
- LLM benchmarks in the era of agents (deck) — Florian Brand — local slide deck · slides · TNG / Big Techday
- The Life Cycle of an RL Environment (deck) — Kanav Garg — local slide deck · slides · ACM CAIS 2026

Discovered 58 more; transcription queued (YouTube rate-limit). 30 eval-focused + 28 eval-segments-in-agent-talks below.

- Alex Volkov (AI Evangelist, Weights & Biases; host of ThursdAI) — Judging LLMs https://www.youtube.com/watch?v=IIL2tE4n1Q0 · talk · AI Engineer World's Fair 2025 — Evals track
- John Dickerson (CEO, Mozilla AI) — 2025 is the Year of Evals (Just like 2024, and 2023, and …) https://www.youtube.com/watch?v=CQGuvf6gSrM · talk · AI Engineer World's Fair 2025 — Evals track
- Aparna Dhinakaran (Co-founder & CPO, Arize AI) — Lessons from the Trenches: Building LLM Evals That Work IRL https://www.youtube.com/watch?v=nbZzSC5A6hs · talk · AI Engineer World's Fair 2025 — Evals track
- Phil Hetzel (Braintrust) — The maturity phases of running evals https://www.youtube.com/watch?v=FB-MLPhL9Ms · talk · AI Engineer World's Fair 2025 — Evals track
- Laurie Voss (Arize) — Ship Real Agents: Hands-On Evals for Agentic Applications https://www.youtube.com/watch?v=Xfl50508LZM · talk · AI Engineer World's Fair 2025 — Evals track
- Peter Gostev (Arena.ai) — What Do Models Still Suck At?
BullshitBench https://www.youtube.com/watch?v=R7A8rX-09Zw · talk · AI Engineer World's Fair 2025 — Evals track
- Diego Rodriguez (Co-founder & CTO, Krea.ai) — Perceptual Evaluations: Evals for Aesthetics https://www.youtube.com/watch?v=h5ItAJuB3Fc · talk · AI Engineer World's Fair 2025 — Evals track
- Quotient AI + Tavily (speakers from both) — Evaluating AI Search: A Practical Framework for Augmented AI Systems https://www.youtube.com/watch?v=wRJD0inpmjU · talk · AI Engineer World's Fair 2025 — Evals track
- Rafal Willinski & Vitor Balocco (Zapier) — Turning Fails into Features: Zapier's Hard-Won Eval Lessons https://www.youtube.com/watch?v=blrovBxxN9o · talk · AI Engineer World's Fair 2025 — Evals track
- Manu Goyal (Braintrust) — Why should anyone care about Evals? https://www.youtube.com/watch?v=jJ45Yz1lJao · talk · AI Engineer World's Fair 2025 — Evals track
- AI Engineer Evals Workshop (multi-presenter) — Mastering AI Evaluation: From Playground to Production (Evals Workshop) https://www.youtube.com/watch?v=9iN-cPnp7xg · talk · AI Engineer World's Fair 2025 — Evals track (full workshop)
- Ion Stoica (co-founder Databricks/Anyscale, LMArena), host Jacob Effron — Databricks Co-Founder: Eval Limitations, Why China is Winning Open Source and Future of AI Infra (Ep 69) https://www.youtube.com/watch?v=ehav4XMAKLw · podcast · Unsupervised Learning (Redpoint Ventures)
- Brendan Foody (co-founder/CEO Mercor), host Jacob Effron — Mercor CEO: Evals Will Replace Knowledge Work, AI x Hiring Today & the Future of Data Labeling (Ep 68) https://www.youtube.com/watch?v=SOZtz8IdI2w · podcast · Unsupervised Learning (Redpoint Ventures)
- Nidhi Rastogi (asst. professor, RIT), host Sam Charrington — CTIBench: How Good Are LLMs at Detecting Cyber Threats? (Ep 729) https://www.youtube.com/watch?v=75WqFOY3P5M · podcast · The TWIML AI Podcast
- Jineet Doshi (Staff AI Scientist/Lead, Intuit), host Demetrios Brinkmann — Holistic Evaluation of Generative AI Systems (MLOps Podcast 280) https://www.youtube.com/watch?v=VJ0k0C1mGdg · podcast · MLOps.community
- Neev Parikh (METR), host Nathan Labenz — Can AIs do AI R&D? Reviewing RE-Bench Results with Neev Parikh of METR https://www.youtube.com/watch?v=SX8Mxyy_UHY · podcast · The Cognitive Revolution
- Marius Hobbhahn (CEO, Apollo Research), host Nathan Labenz — Can We Stop AI Deception? Apollo Research Tests OpenAI's Deliberative Alignment, w/ Marius Hobbhahn https://www.youtube.com/watch?v=I3ivZaAfDFg · podcast · The Cognitive Revolution
- Shahul Es (co-founder, Ragas), hosts Daniel Whitenack & Chris Benson — Metrics Driven Development (Ragas) https://www.youtube.com/watch?v=fw0wUC5XN-o · podcast · Practical AI (Changelog)
- Mike Knoop (co-founder ARC Prize / Zapier), host Lukas Biewald — R1, OpenAI's o3, and the ARC-AGI Benchmark: Insights from Mike Knoop https://www.youtube.com/watch?v=SSA8vNrFpXI · podcast · Gradient Dissent (Weights & Biases)
- UK AI Safety Institute team — Sandbox breakout evals with Inspect — UK AISI (Fully Connected London '25) https://www.youtube.com/watch?v=J79pSSAENYc · talk · Gradient Dissent / Fully Connected London '25 (Weights & Biases)
- Weights & Biases (Weave team) — How to align your LLM judge for better evaluations https://www.youtube.com/watch?v=AMCmhRoKnSk · talk · Gradient Dissent / W&B
- Afshine Amidi & Shervine Amidi — Stanford CME295 Transformers & LLMs Autumn 2025 | Lecture 8 - LLM Evaluation https://www.youtube.com/watch?v=8fNP4N46RRo · lecture · Stanford CME295 / Stanford Online
- Berkeley RDI course staff (Dawn Song's Agentic AI MOOC) — CS294-196 Agentic AI MOOC - LLM Agent Evaluations & Project Overview https://www.youtube.com/watch?v=VfOA2a0dj4w · lecture · UC Berkeley RDI CS294-196, Fall 2025
- Sida Wang (Meta) — Agentic AI MOOC Fall 2025 | Predictable Noise in LLM Benchmarks https://www.youtube.com/watch?v=HV8pugcFVO0 · lecture · UC Berkeley RDI CS294-196, Fall 2025
- Samuel Colvin (founder, Pydantic) — Agent Optimization with Pydantic AI: GEPA, Evals, Feedback Loops — Samuel Colvin, Pydantic https://www.youtube.com/watch?v=A48uhxfxbsM · talk · AI Engineer Code Summit / AI Engineer
- Naman Jain (Cursor; LiveCodeBench/SWE-bench-adjacent researcher) — Coding Evals: From Code Snippets to Codebases — Naman Jain, Cursor https://www.youtube.com/watch?v=tHN44yJoeS8 · talk · AI Engineer Code Summit
- Brooke Hopkins (founder, Coval; ex-Waymo eval infra) — From Self-driving to Autonomous Voice Agents — Brooke Hopkins, Coval (full session host upload) https://www.youtube.com/watch?v=1X3mYUHC5GA · talk · Founders You Should Know
- Brooke Hopkins (founder, Coval) — Brooke Hopkins, Founder at Coval | AI Minds 073 https://www.youtube.com/watch?v=e1E8vLyRIKk · podcast · AI Minds (Deepgram)
- Karthik Narasimhan (Head of Research, Sierra; Princeton; tau-bench author) — Karthik Narasimhan - Reliable AI Agents for Tomorrow's World https://www.youtube.com/watch?v=fOAAslQUceg · lecture · Berkeley RDI Agentic AI
Summit 2025
- Sayash Kapoor (Princeton; AI Snake Oil; co-author HAL / agent-eval critiques) — Building and evaluating AI Agents — Sayash Kapoor, AI Snake Oil https://www.youtube.com/watch?v=d5EltXhbcfA · talk · AI Engineer Summit 2025

Talks about building agents (Devin, Claude Code, Cursor, Replit, OpenAI Deep Research, Karpathy…) with a substantive eval segment; the eval part is noted.

- Amy Boyd & Nitya Narasimhan (Microsoft) · AI Engineer World's Fair 2025 — Evals track — Mind the Gap (In your Agent Observability) https://www.youtube.com/watch?v=iOXM3zE-2dk — eval: Primarily agent observability/tracing, but the core argument ties observability directly to evaluation: you can't eval what you can't see. Covers instrumenting agent runs to feed eval datasets and catch regressions. oEmbed-verified.
- Arvind Narayanan (Princeton, co-author AI Snake Oil), host Jacob Effron · Unsupervised Learning (Redpoint Ventures) — Unpacking AI Agent Hype vs. Reality with Arvind Narayanan https://www.youtube.com/watch?v=NoVMk_P6fgY — eval: Large central segment on the limitations of agent benchmarks: why current agent evals are flawed/overstated, construct validity, capability vs. reliability, and the gap between benchmark scores and real-world robustness. Surrounding material covers agent hype and societal impact.
- Ben Lorica & Paco Nathan · The Data Exchange (Gradient Flow) — Data Exchange Podcast Ep 232: Ben Lorica & Paco Nathan on Llama 3, Agents, Eval, and more https://www.youtube.com/watch?v=XDIqkH_I9oU — eval: Roundup format with a substantial evaluation-metrics segment: state of LLM/agent evaluation, what metrics matter for agentic workflows, and limitations of current eval practice — interleaved with Llama 3 and agent news.
- Jiantao Jiao (UC Berkeley / NVIDIA) · UC Berkeley RDI CS294-196, Fall 2025 — Agentic AI MOOC Fall 2025 | Post-Training Verifiable Agents https://www.youtube.com/watch?v=3l0Zxus34es — eval: Training-focused, but a substantial benchmark thread runs through it: SWE-bench Verified and BrowseComp as the verifiable-task targets used to train and evaluate agents. Eval/benchmark segments are load-bearing (~middle of talk).
- Graham Neubig (CMU) · Carnegie Mellon University CS 11-711 Advanced NLP — CMU Advanced NLP Fall 2024 (17): Evaluation and Multimodal https://www.youtube.com/watch?v=iEinTXrwK8A — eval: First ~half is a focused treatment of NLP/LLM evaluation: automatic metrics, human eval, LLM-as-judge and its pitfalls, benchmark contamination; second half pivots to multimodal. The eval portion is substantive (~min 0-35).
- Charles Sutton (Google DeepMind) · UC Berkeley RDI CS294-280, Spring 2025 — Adv. LLM Agents MOOC Sp25 | Code Agents & AI Vulnerability Detection https://www.youtube.com/watch?v=JCk6qJtaCSU — eval: Coding-agent talk that leans heavily on benchmarks to measure progress: SWE-bench-style code-fixing eval and vulnerability-detection benchmarks, plus discussion of how to construct verifiable security-eval tasks (eval threads throughout).
- Graham Neubig (CMU / All Hands AI) · UC Berkeley RDI CS294-196, Fall 2024 — LLM Agents MOOC Fall 2024 | Agents for Software Development https://www.youtube.com/watch?v=f9L9Fkq-8K4 — eval: SWE-bench is the spine of the talk: how the benchmark works, why it's hard, leaderboard dynamics, and where it misleads vs. real software work. Eval/benchmark content is central (recurs throughout, esp. early-mid).
- Nicolas Chapados (ServiceNow Research) · UC Berkeley RDI CS294-196, Fall 2024 — LLM Agents MOOC Fall 2024 | AI Agents for Enterprise Workflows https://www.youtube.com/watch?v=-yf-e-9FvOc — eval: Introduces WorkArena / BrowserGym as benchmarks for web/enterprise-workflow agents — task design, difficulty calibration, and why real enterprise tasks break naive evals (benchmark segment is a core part, mid-talk).
- Yann Dubois (OpenAI) · UC Berkeley RDI CS294-196, Fall 2025 — Agentic AI MOOC Fall 2025 | LLM Agents Overview https://www.youtube.com/watch?v=r1qZpYAmqmg — eval: Framing overview of agents that includes a substantive evaluation segment — how to measure agent capability, the gap between benchmark scores and real reliability, and why agent eval is harder than chatbot eval. Eval segment ~mid-talk.
- Scott Wu (CEO, Cognition) · AI Engineer World's Fair 2024 — The Making of Devin by Cognition AI: Scott Wu https://www.youtube.com/watch?v=T7NWjoD_OuY — eval: Agent-building/demo talk for Devin. Eval segment covers how Cognition measures the agent: SWE-bench results plus their philosophy that public benchmarks are insufficient, motivating an internal 'cognition-golden' benchmark with fully reproducible environments, simulated users Devin can chat with, and evaluator agents that autonomously judge outcomes. Eval discussion sits in the back third around the SWE-bench / 'how we measure progress' portion.
- Boris Cherny (creator/head of Claude Code, Anthropic) · AI Engineer World's Fair 2025 — Claude Code & the evolution of agentic coding — Boris Cherny, Anthropic https://www.youtube.com/watch?v=Lue8K2jqfKk — eval: Talk about model capability vs 'harness'/scaffolding for coding agents. Eval-relevant segment: how the Claude Code team relies on internal evals to decide what scaffolding to keep, and the observation that as models improve you must keep raising the difficulty of your eval set. Measurement framing recurs through the harness discussion (middle of the talk).
- James Austin (AI engineer, Replit) · MLOps Community — Agents in Production — Building Replit Agent - Hard Lessons Learned https://www.youtube.com/watch?v=RYde73eO7ok — eval: Lessons-learned talk on scaling the Replit Agent team (3 to 20+ engineers). Heavy, substantive eval content: how optimizing for SWE-bench was the wrong target vs what users wanted, the importance of AUTOMATING the discovery of failure cases (long-tail failures), growing an internal eval set over time (every new bug becomes a new eval), and custom eval frameworks.
Eval material runs through the middle of the talk under 'measure what matters' and 'automate finding failure cases'.— Harrison Chase CEO, LangChain AI Engineer LangChain Interrupt / AI Engineer — 3 ingredients for building reliable enterprise agents — Harrison Chase, LangChain/LangGraph https://www.youtube.com/watch?v=kTnfJszFxCg https://www.youtube.com/watch?v=kTnfJszFxCg https://www.youtube.com/watch?v=kTnfJszFxCg — eval: Framework talk on the build/test/deploy lifecycle for reliable enterprise agents. The middle 'Test' ingredient is the eval segment: using evals LangSmith to verify the agent does the right thing rather than just returning plausible output, treating every error as an opportunity to write a new eval, and pairing tracing/observability with regression evals. Eval content is roughly the central third of the talk.— Cursor engineering Tido Carriero et al. Cursor official channel — How Cursor builds agentic workflows across the SDLC https://www.youtube.com/watch?v=dJAVS1g3NDw https://www.youtube.com/watch?v=dJAVS1g3NDw https://www.youtube.com/watch?v=dJAVS1g3NDw — eval: Talk on Cursor's internal agentic workflows across the SDLC bug triage, security review, etc. . Eval segment covers how Cursor compares model quality with CursorBench — an in-house suite of intentionally underspecified, multi-file tasks built from real IDE sessions, scored with agentic graders, plus ONLINE evaluation to check whether agent changes actually help developers in practice. Eval discussion appears where they explain how they decide which models/changes to ship.— Thariq Shihipar Anthropic AI Engineer workshop — Claude Agent SDK Full Workshop — Thariq Shihipar, Anthropic https://www.youtube.com/watch?v=TqC1qOfiVcQ https://www.youtube.com/watch?v=TqC1qOfiVcQ https://www.youtube.com/watch?v=TqC1qOfiVcQ — eval: Hands-on workshop building agents with the Claude Agent SDK tools, subagents, the agent loop . 
Eval-relevant portion covers how to verify and iterate on the agent once built — testing tool use, checking the loop behaves, and using measurement to debug agent failures. Eval/verification material comes in the back portion of the build-along.— Andrej Karpathy Y Combinator AI Startup School 2025 — Andrej Karpathy: Software Is Changing Again https://www.youtube.com/watch?v=LCEmiRjPEtQ https://www.youtube.com/watch?v=LCEmiRjPEtQ https://www.youtube.com/watch?v=LCEmiRjPEtQ — eval: Keynote on 'Software 3.0', LLMs as a new computing substrate, partial-autonomy apps and the 'autonomy slider'. Eval-relevant thread: his argument for keeping humans in the verification loop, making the generation-verification loop fast, and 'keeping the AI on a leash' — i.e., why you need tight verification/eval signals to safely raise agent autonomy. Verification discussion is woven through the partial-autonomy section middle-to-late .— Samuel Colvin founder, Pydantic AI Engineer World's Fair 2025 — From Stateless Nightmares to Durable Agents — Samuel Colvin, Pydantic https://www.youtube.com/watch?v=flf IKnFYnE https://www.youtube.com/watch?v=flf IKnFYnE https://www.youtube.com/watch?v=flf IKnFYnE — eval: Talk on building durable, production-grade agents with Pydantic AI state/durability, type-safety, observability . Eval segment: Colvin's view that evals are still an unsolved problem, how Pydantic AI's evals library + Logfire observability fit the production loop, and using traces/observability as the substrate for evaluating agent behavior. Eval discussion appears in the observability/production-readiness portion.— Matt Palmer host + Replit lead AI engineer Replit official channel — Inside Replit Agent with a lead AI engineer https://www.youtube.com/watch?v=bJMriY-pqPE https://www.youtube.com/watch?v=bJMriY-pqPE https://www.youtube.com/watch?v=bJMriY-pqPE — eval: Conversation on how the Replit Agent works internally, including the Agent v3 launch discussion around ~19:21 . 
  Eval-relevant content: the self-improving loop of evals/metrics → autonomous harness edits → hill-climbing, how the team grows their eval set from observed failures, and why the engineering center of gravity has shifted toward measurement and harness iteration over the raw model.
- Brooke Hopkins (Coval), Martin Schweiger, Vapi panel, VapiCon 2025 (Vapi) — [VapiCon 2025: Hardest Problems in Voice AI with Brooke Hopkins, Martin Schweiger & more](https://www.youtube.com/watch?v=vzCT5PJlsJo) — eval: Practitioner panel on production voice-AI failure modes; the eval thread runs throughout — end-to-end conversation simulation, why LLM-simulated callers are too cooperative vs. real frustrated/adversarial users, turn-taking/latency/interruption metrics, and monitoring. Eval-heavy whenever Hopkins speaks.
- Karthik Narasimhan (Head of Research, Sierra), Greylock Change Agents — [Multi-Agent Interaction with Sierra AI](https://www.youtube.com/watch?v=KlQIePkgY7c) — eval: Talk on how Sierra builds multi-agent customer-experience systems; includes the evaluation segment on tau-bench-style benchmarking, supervisor/critic agents reviewing primary-agent output, and measuring reliability of tool-using conversational agents.
- Karthik Narasimhan (Sierra / Princeton), Open AGI Summit, Brussels — [Karthik Narasimhan on Language Agents and Multi-Agent Interaction](https://www.youtube.com/watch?v=i3GOZ22z2C0) — eval: Survey of language-agent design and multi-agent interaction with an evaluation segment motivating tau-bench: why real-world tool-agent-user tasks need interaction-based benchmarks rather than static QA.
  Eval discussion is a sizeable chunk, not the whole talk.
- Parahelp (YC prompt/agent breakdown), startupCode (analysis of Parahelp/YC material) — [AI Customer Support: ParaHelp's Secret Prompt REVEALED](https://www.youtube.com/watch?v=UCQc12_KRy0) — eval: Walkthrough of Parahelp's production customer-support agent prompt; the load-bearing eval point (drawn from Parahelp's own writing) is that most prompt-engineering time goes not to the prompt but to building eval suites, finding edge cases, and iterating — 'test cases more valuable than prompts.' Eval framing appears alongside the prompt structure discussion.
- Ben Liebald (engineering lead, Harvey), LangChain — [How Harvey Built Reliable AI Agents with LangSmith & Custom Tools](https://www.youtube.com/watch?v=kuXtW03cZEA) — eval: How Harvey builds and EVALUATES domain-specific legal agents: tracing/observability with LangSmith, custom legal tools, and reliability evaluation against expert expectations (the BigLaw Bench / rubric-graded-by-lawyers approach). Eval/reliability is a major thread of the talk.
- Harvey / legal-AI leaders, a16z — [a16z — Agents, Lawyers, and LLMs](https://www.youtube.com/watch?v=ZESTYyGZ7Y4) — eval: Discussion of legal agents in practice; eval segment covers why generic benchmarks (LegalBench/CUAD) are insufficient for long-horizon legal work and the move to expert-rubric, agent-task benchmarks (BigLaw Bench / Legal Agent Bench) graded by practicing attorneys.
  Eval is one section, not the whole episode.
- Josh Tobin (leads AI Agents research, OpenAI — Deep Research / Operator), TWIML AI Podcast — [How OpenAI Builds AI Agents That Think and Act Josh Tobin - 730](https://www.youtube.com/watch?v=qfhU7JH000o) — eval: Covers Deep Research, Operator, Codex CLI; eval-relevant core is how end-to-end RL training requires graded/verifiable tasks (the agent must 'experience failure' and be rewarded for recovery), plus benchmark framing (BrowseComp for browsing). Reward/grading discussion runs through the middle of the episode.
- Isa Fulford (Deep Research team lead, OpenAI), Sequoia Capital Training Data — [How OpenAI Built its Groundbreaking Deep Research Product ft. Isa Fulford](https://www.youtube.com/watch?v=jFZ9hJKJKtw) — eval: How Deep Research was built and trained; eval-relevant segments cover building hard browsing/research tasks with verifiable answers, grading long-form cited outputs, and benchmark performance (e.g., BrowseComp). Eval/measurement is woven through the training discussion rather than a standalone section.
- Isa Fulford & Josh Tobin (OpenAI Deep Research), Sequoia Capital Training Data — [OpenAI's Deep Research Team on Why Reinforcement Learning is the Future for AI Agents](https://www.youtube.com/watch?v=bNEvJYzoa8A) — eval: RL-for-agents discussion; eval content is the dependence of end-to-end RL on graded tasks and verifiable rewards, and how they construct hard research/browsing evals the model can be scored against. Measurement framing recurs throughout.
- Jesse Zhang (CEO/co-founder, Decagon), No Priors — [No Priors Ep. 132 | With Decagon CEO and Co-Founder Jesse Zhang](https://www.youtube.com/watch?v=emaSFP7y7Ko) — eval: Building production customer-support agents; eval segment covers Decagon's approach — regression/simulation test sets (~hundreds of conversations per workflow), LLM-as-judge scoring of tone/format/correct-info/correct-tool, and red-teaming with adversarial tests. Eval is a defined section of the conversation, not the whole episode.

Good eval commentary mined from agent-BUILDING writeups (not eval-primary) — each kept only if a strict judge rated the eval insight excellent/good. Takeaway + verbatim excerpt.

- Naman Jain (Cursor) — [How we compare model quality in Cursor (CursorBench)](https://cursor.com/blog/cursorbench) · excellent — To avoid benchmark contamination, derive eval tasks from real committed code traced back to the agent request that produced it (Cursor Blame), and pair offline suites with controlled live-traffic analysis to catch regressions where outputs grade well but the user experience degrades — tracking a basket of outcome… excerpt: "We source tasks for CursorBench using Cursor Blame, which traces committed code back to the agent request that produced it. ... We supplement CursorBench with controlled analysis on live traffic. These online evals…"
- Jeremy Hadfield, Barry Zhang, Kenneth Lien, Florian Scholz, Jeremy Fox, and Daniel Ford — [How we built our multi-agent research system](https://www.anthropic.com/engineering/multi-agent-research-system) · excellent — Start evals with ~20 real-usage queries rather than waiting for a large suite—early on a prompt tweak can move success from 30% to 80%, so small samples already reveal big effects.
  A single LLM-judge call scoring a multi-dimensional rubric (factual/citation accuracy, completeness, source quality, tool efficiency) on… excerpt: "We started with a set of about 20 queries representing real usage patterns. Evaluating these queries often required human judgment, but we found that an LLM judge that evaluated each output against criteria in a…"
- Mikaela Grace, Jeremy Hadfield, Rodrigo Olivares, and Jiri De Jonghe — [Demystifying evals for AI agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents) · excellent — Low eval scores frequently measure broken graders and harnesses, not weak models: rigid string-matching, ambiguous specs, and non-reproducible stochastic tasks can suppress a score from 95% to 42%, so you must read transcripts and audit the eval before trusting any number. excerpt: "Opus 4.5 initially scored 42% on CORE-Bench, until an Anthropic researcher found multiple issues: rigid grading that penalized '96.12' when expecting '96.124991…', ambiguous task specs, and stochastic tasks that were…"
- Sierra AI Research — [τ-Bench: Benchmarking AI agents for the real-world](https://sierra.ai/blog/benchmarking-ai-agents) · excellent — Reliability, not single-shot accuracy, is the real bar for agents: pass^k (success on all k independent trials of the same task) collapses GPT-4o from ~50% (pass^1) to ~25% (pass^8) in τ-retail, meaning only a 1-in-4 chance of handling 8 different customers with the same issue. Measure consistency across repeated trials,… excerpt: "pass^k, which measures the agent's reliability and determines if it can successfully complete the same task multiple times (k representing the number of different trials). ...
  the agent powered by GPT-4o drops to ~25%…"
- Efe Karakus — [From AI agent prototype to product: Lessons from building AWS DevOps Agent](https://aws.amazon.com/blogs/devops/from-ai-agent-prototype-to-product-lessons-from-building-aws-devops-agent/) · excellent — Separate "capability" (pass@k: passed at least once in k tries) from "reliability" (pass^k: fraction of the k tries that passed) — a high pass@k with low pass^k means the agent CAN solve a task but does so unreliably, which is the metric that actually matters for shipping a non-deterministic agent. excerpt: "Key metrics that we keep track of are capability (pass@k – whether the agent passed at least once in k attempts), reliability (pass^k – how many times the agent passed across k attempts, e.g., 0.33 means passed 1 out of…"
- Simon Last & Sarah Sachs (Notion), interviewed on Latent Space — [Notion's Token Town: 5 Rebuilds, 100+ Tools, MCP vs CLIs and the Software Factory Future](https://www.latent.space/p/notion) · excellent — Notion runs a three-tier eval system with distinct pass-rate targets: regression/unit tests gated in CI, launch "report card" evals requiring 80-90% across user journeys to ship, and deliberately hard "frontier/headroom" evals targeted at ~30% pass rate so the suite keeps giving signal instead of saturating. The… excerpt: "we have the equivalent of unit test. Regression test. Those live in ci, those have to pass a certain percent ...
  we have a report card and we need to, on these categories, you know, be it 80 or 90% of all of these user…"
- Dropbox (Dropbox Engineering / ML team) — [A practical blueprint for evaluating conversational AI at scale (Dash)](https://dropbox.tech/machine-learning/practical-blueprint-evaluating-conversational-ai-at-scale-dash) · excellent — Tier your eval metrics by enforcement: boolean gates as hard blockers (citations present?), scalar budgets with concrete thresholds (Source F1 ≥ 0.85, p95 latency ≤ 5s) that block merges, and rubric scores (tone/formatting) that are only dashboard-monitored, not gating. This separates "must never regress" from… excerpt: "we defined three types of metrics, each with a clear role in the development pipeline: Boolean gates "Citations present?", "Source present?" | Hard fail, changes can't move forward; Scalar budgets Source F1 ≥ 0.85,…"
- Stefan Heule & Jediah Katz (Cursor) — [Continually improving our agent harness](https://cursor.com/blog/continually-improving-agent-harness) · good — Pair offline benchmarks (CursorBench) with online signals that proxy real satisfaction: a "Keep Rate" measuring what fraction of agent-written code survives in the codebase after fixed time intervals, plus an LLM-judge reading user follow-up messages to infer satisfaction, validated via side-by-side A/B tests of… excerpt: "The first is the "Keep Rate" of agent-generated code. For a given set of code changes that the agent proposed, we track what fraction of those remain in the user's codebase after fixed intervals of time. ...
  Second, we…"
- Peter Zhong, Jacky Zhao, Ryan Carelli (Replit) — [Enabling Agent 3 to Self-Test at Scale with REPL-Based Verification](https://replit.com/blog/automated-self-testing) · good — Verification scales with autonomy: as an agent runs longer unattended (Replit went from ~20 min to 200+ min of productive autonomous work), robust self-testing becomes the gating factor because errors compound — and they isolate testing into a separate subagent to avoid context pollution, reaching multi-hundred-step… excerpt: "we've created a self-testing flow for the Agent that is able to perform complex, multi-hundred step testing at a median cost of $0.20 per session."
- Cognition (Devin team) — [A review of OpenAI's o1 and how we evaluate coding agents](https://cognition.com/blog/evaluating-coding-agents) · good — When your judge is itself an agent (with shell/browser/code-editing tools autonomously deciding pass/fail), you must validate the judge: measure its precision and recall against a labeled ground-truth set and keep humans continuously reviewing the "proof of success" it surfaces. They also average over multiple Devin… excerpt: "We evaluate our evaluators in two ways: (1) Measuring precision and recall on ground truth sets (2) Continuous human review of the proof of success discovered by the evaluator agents."
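
The judge-validation step in the Cognition entry above (precision and recall against a labeled ground-truth set) can be sketched in a few lines. This is a minimal illustration, not Cognition's implementation: the function name `judge_precision_recall`, the `JudgeMetrics` type, and the sample labels are all hypothetical, and "pass" is assumed to be the positive class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeMetrics:
    precision: float
    recall: float


def judge_precision_recall(human_labels: list[bool], judge_labels: list[bool]) -> JudgeMetrics:
    """Compare judge verdicts to human ground truth, treating True ("pass") as positive."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(human_labels, judge_labels))
    true_pos = sum(1 for human, judge in pairs if human and judge)
    false_pos = sum(1 for human, judge in pairs if not human and judge)
    false_neg = sum(1 for human, judge in pairs if human and not judge)
    # Guard against empty denominators (judge never said "pass", or no true passes exist).
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return JudgeMetrics(precision=precision, recall=recall)


# Hypothetical labels for six graded runs (illustrative data only).
human = [True, True, False, False, True, False]
judge = [True, False, True, False, True, False]
metrics = judge_precision_recall(human, judge)
```

Low precision means the judge passes runs that humans failed, which inflates scores; low recall means it fails runs that were actually fine, which hides real progress. Tracking both shows which direction the judge prompt needs to move.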
- Akshay Utture (Augment Code) — [How we built a high-quality AI code review agent](https://www.augmentcode.com/blog/how-we-built-high-quality-ai-code-review-agent) · good — They run a fast offline benchmark (LLM-as-judge comparing generated comments to human-authored "golden comments" on 10 PRs across 5 repos) using F-score as the hill-climbing metric, then map each offline metric to a production proxy: recall to "bugs fixed per PR" and precision to "percentage of comments addressed."… excerpt: "F-score acts as the primary hill-climbing metric for offline improvements. ... Bugs fixed per PR | Recall | Measures real-world bug prevention and review coverage | Percentage of comments addressed | Precision |…"
- Antonio Scandurra & Nathan Sobo (Zed) — [Zed now predicts your next edit with Zeta, our new open model](https://zed.dev/blog/edit-prediction) · good — When the correct output is non-deterministic and admits many valid forms (e.g. code edits), replace brittle token/string assertions with an LLM judge checking plain-English intent assertions (e.g. "ensure quicksort recurses left and right of the pivot"); this tolerates run-to-run variation while still catching wrong… excerpt: "instead of strict assertions, we used a larger LLM to evaluate Zeta's edits.
  By writing our test assertions in plain English and having Claude check if the results matched our intent, we could validate that Zeta was…"
- Factory.ai — [Code Droid: A Technical Report](https://factory.ai/news/code-droid-technical-report) · good — They decompose agent failures into distinct stages of the localization pipeline — file not retrieved (8%), retrieved but not ranked top-5 (8%), and ranked top but not edited (6%) — which tells practitioners exactly where to invest (retrieval recall vs. ranking vs. edit selection) rather than treating a failed task as… excerpt: "In 8% of the tasks, Code Droid failed to include the target file in its list of analyzed files. Additionally, even when the target file was analyzed, it was not prioritized as a top-5 file in another 8% of cases.…"
- Jan Hartman (Sourcegraph) — [Lessons from building AI coding assistants: context retrieval and evaluation](https://sourcegraph.com/blog/lessons-from-building-ai-coding-assistants-context-retrieval-and-evaluation) · good — When you can't get ground-truth labels for "relevant context," end-to-end user feedback can't tell you whether a bad answer came from retrieval or from the LLM — so substitute cheap automatic proxy checks (code compiles/passes tests for generation; referenced symbols actually exist for code Q&A) and separately… excerpt: "Since users primarily interact with the LLM's responses rather than the context items themselves, it's hard to know if context retrieval is making a difference.
  We might get feedback that a response was unhelpful, but…"
- Yujohn Nattrass — [Introducing Scorers in Mastra](https://mastra.ai/blog/mastra-scorers) · good — Don't ask an LLM judge to emit a raw 0-1 score directly — it's high-variance and irreproducible. Instead have the LLM emit structured intermediate data (e.g. extract claims/opinions, label each), then compute the score deterministically in code (proportion that pass), keeping the LLM's nuance but making the number… excerpt: "LLMs are terrible at producing consistent numerical scores—ask the same model to rate something from 0-1 five times and you'll get five different numbers. So we have LLMs output structured data instead, then use a…"
- Letta — [Benchmarking AI Agent Memory: Is a Filesystem All You Need?](https://www.letta.com/blog/benchmarking-ai-agent-memory/) · good — A simple filesystem-backed agent (search files → grep/open → answer) beats a specialized graph-memory system on LoCoMo, supporting their thesis that what matters for memory eval is whether the agent knows WHEN and HOW to call a retrieval tool, not the underlying retrieval mechanism (vector DB vs knowledge graph). They… excerpt: "This simple agent achieves 74.0% on LoCoMo with GPT-4o mini and minimal prompt tuning, significantly above Mem0's reported 68.5% score for their top-performing graph variant."
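
The Mastra entry above recommends having the judge emit structured labels and computing the number in code. A minimal sketch of that split follows. It is not Mastra's API: the JSON shape (`claims`, `label`), the label vocabulary, and the function name `score_from_verdicts` are assumptions made for illustration, and the judge call itself is left out, with a hard-coded string standing in for its output.

```python
import json

# Hypothetical label vocabulary the judge prompt is assumed to enforce.
ALLOWED_LABELS = {"supported", "unsupported"}


def score_from_verdicts(judge_json: str) -> float:
    """Return the fraction of extracted claims that the judge labeled 'supported'."""
    verdicts = json.loads(judge_json)["claims"]
    if not verdicts:
        return 0.0
    for verdict in verdicts:
        # Reject anything outside the agreed vocabulary instead of guessing its meaning.
        if verdict["label"] not in ALLOWED_LABELS:
            raise ValueError(f"unexpected label: {verdict['label']!r}")
    supported = sum(1 for verdict in verdicts if verdict["label"] == "supported")
    return supported / len(verdicts)


# Stand-in for a judge response (illustrative data only).
sample_judge_output = json.dumps({
    "claims": [
        {"text": "The refund was issued on Monday.", "label": "supported"},
        {"text": "The customer was charged twice.", "label": "unsupported"},
        {"text": "The order contained three items.", "label": "supported"},
    ]
})
score = score_from_verdicts(sample_judge_output)
```

Because the arithmetic lives in code, two runs that produce the same labels always produce the same score, and a disagreement can be traced to a specific claim rather than to an opaque number.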
- Dominik Kundel, Gabriel Chua — [Testing Agent Skills Systematically with Evals](https://developers.openai.com/blog/eval-skills) · good — Structure skill evals as deterministic trace checks first (parse the `--json` JSONL stream: assert specific commands ran, count command execution items to catch looping/re-run regressions, track usage tokens to catch prompt bloat), then layer a model-assisted `--output-schema` rubric step only for the qualitative parts… excerpt: "Deterministic checks answer 'did it do the basics?' but they don't answer 'did it do it the way you wanted?' For skills like setup-demo-app, many requirements are qualitative: component structure, styling conventions,…"
- The LangChain Team — [Evaluating Deep Agents: Our Learnings](https://www.langchain.com/blog/evaluating-deep-agents-our-learnings) · good — For multi-turn agent evals, you can't hardcode a fixed sequence of user inputs because once the agent diverges from the expected path the later scripted inputs become incoherent; pair this with per-test fresh/temporary environments so runs stay reproducible and non-flaky, and lean on single-step evals since… excerpt: "if you naively hardcode a sequence of inputs and the agent deviates from the expected path, the subsequent hardcoded user input may not make sense."
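
The deterministic trace checks described in the skills-eval entry above (required commands ran, command count stays bounded, token usage stays within budget) might look roughly like this. The event schema here is an assumption made for illustration: the field names `type`, `command`, and `usage.total_tokens`, and the `turn_completed` event, are hypothetical and should be replaced with whatever the real JSONL stream emits.

```python
import json


def check_trace(
    jsonl_text: str,
    required_commands: list[str],
    max_commands: int,
    max_total_tokens: int,
) -> dict[str, bool]:
    """Run deterministic checks over a JSONL trace (hypothetical event schema)."""
    events = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    commands = [
        event.get("command", "")
        for event in events
        if event.get("type") == "command_execution"
    ]
    total_tokens = sum(event.get("usage", {}).get("total_tokens", 0) for event in events)
    return {
        # Every required command appears as a substring of some executed command.
        "required_commands_ran": all(
            any(required in command for command in commands)
            for required in required_commands
        ),
        # A command count above the cap suggests looping or needless re-runs.
        "no_command_looping": len(commands) <= max_commands,
        # A token total above the budget suggests prompt bloat.
        "within_token_budget": total_tokens <= max_total_tokens,
    }


# Illustrative trace built in code so the JSON is guaranteed to be well formed.
SAMPLE_TRACE = "\n".join([
    json.dumps({"type": "command_execution", "command": "npm install"}),
    json.dumps({"type": "command_execution", "command": "npm run build"}),
    json.dumps({"type": "turn_completed", "usage": {"total_tokens": 1200}}),
])

results = check_trace(
    SAMPLE_TRACE, ["npm install", "npm run build"], max_commands=5, max_total_tokens=5000
)
tight_cap = check_trace(SAMPLE_TRACE, ["npm install"], max_commands=1, max_total_tokens=5000)
```

Checks like these are cheap enough to run on every change, so the model-assisted rubric step only needs to cover what they cannot express.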
- Malte Ubl, Alice Alexandra Moore, Ido Pesok — [Eval-driven development: Build better AI faster](https://vercel.com/blog/eval-driven-development-build-better-ai-faster) · good — Tier your graders by cost/objectivity (code checks first, LLM grading reserved for subjective calls since it runs 1.5-2x more expensive), hold a hard 100% pass bar on refusal/safety, and deliberately seed the eval set with prompts that currently fail so improvements are tracked and regressions caught as prompts… excerpt: "Our multi-faceted evaluation strategy includes fast, reliable code checks, end user and internal human feedback, and LLM-based grading for complex judgments at scale. ... Some of our checks for code quality include:…"
- Letta — [Letta Leaderboard: Benchmarking LLMs on Agentic Memory](https://www.letta.com/blog/letta-leaderboard) · good — A good agentic-memory eval must penalize unnecessary memory tool calls, not just reward correct answers: models that are strong at archival retrieval tend to over-call memory tools even when the answer is already in context, which is a real failure mode you only catch if your scoring includes an extraneous-operation… excerpt: "Models that perform well on archival memory (e.g., Claude Haiku 3-5) might overuse memory operations when unnecessary and receive a lower score on core memory due to penalties."
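
The τ-Bench and AWS DevOps Agent entries earlier in this section separate capability (pass@k) from reliability (pass^k). A small sketch of both follows, using the common combinatorial estimators for n recorded trials of one task with c successes. The formulas are the widely used estimators rather than something either post spells out, and the function names are hypothetical.

```python
from math import comb


def _validate(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k sampled trials succeeds), given c successes in n trials."""
    _validate(n, c, k)
    # comb(n - c, k) is 0 when fewer than k failures exist, so the estimate becomes 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k sampled trials succeed), given c successes in n trials."""
    _validate(n, c, k)
    return comb(c, k) / comb(n, k)


# Illustrative numbers: a task attempted 8 times with 4 successes.
capability = pass_at_k(n=8, c=4, k=2)    # 1 - 6/28
reliability = pass_hat_k(n=8, c=4, k=2)  # 6/28
```

With k = 1 both estimators reduce to the plain success rate c/n; they diverge as k grows, which is the gap the τ-Bench entry highlights. The AWS excerpt appears to report reliability as a pass fraction over the k attempts (0.33 for 1 of 3), which is c/n rather than the all-k-succeed probability above, so it is worth checking which definition a given report uses before comparing numbers.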
- Decagon — [The evaluation engine behind Decagon's AI agents](https://decagon.ai/blog/evaluation-engine-ai-agents) · good — A two-stage eval gate (offline LLM-as-judge over query/context/response triplets plus an expert-labeled ground-truth set, then online A/B with gradual traffic ramp) keeps unreliable variants out of production; auditing a subset of judge scores with human labellers validates the judge itself, and online success is… excerpt: "Using an LLM-as-judge system, we evaluate structured triplets consisting of a user query, the context provided to the model, and the model's generated response. ... We evaluate responses against a ground truth…"
- Anker & Mads (Parahelp co-founders) — [AI prompt design at Parahelp](https://parahelp.com/blog/prompt-design) · good — Designing the agent to emit structured XML output is a deliberate "design-for-evaluability" tactic: rigid, parseable output lets you programmatically grade each decision, and pairing it with an outcome metric like "% of tickets resolved end-to-end" grounds eval in real production results rather than proxy scores.
  excerpt: "This made the model more strict and let us parse XML for evals"
- Iwona Bialynicka-Birula, Ryan Muir, Binoy Robin Dalal, Hagyeong Shin, Nikolai Glushnev — [How we Built a State-of-the-Art Research Agent for Call Center Conversation Analytics](https://cresta.com/blog/how-we-built-a-state-of-the-art-research-agent-for-call-center-conversation-analytics) · good — They isolated the dominant hallucination driver (questions whose answer simply isn't in the conversation) and pulled counting/aggregation out of the LLM into deterministic code so report statistics are guaranteed correct, while tracking two concrete report-quality metrics (relevance-classification accuracy and… excerpt: "Human experts scrutinized a wide range of AI Analyst reports and identified two key metrics that were key drivers of report quality: relevance classification accuracy and the factuality of claims about the…"
- Cresta — [Why Speech to Text Is the Hidden Engine Behind Contact Center AI Performance](https://cresta.com/blog/why-speech-to-text-is-the-hidden-engine-behind-contact-center-ai-performance) · good — STT quality is the upstream bottleneck for downstream agent tasks: WER should be measured on a domain-stratified corpus (here 2,703 files / 81.69 hours / 9 domains) because small WER deltas compound at scale (1% WER over 1M minutes = ~10,000 fewer errors), and targeted fine-tuning or keyterm prompting moves the needle… excerpt: "WER benchmarking was based on a dataset comprising 2,703 audio files across nine distinct domains, totaling 81.69 hours"
- Vapi (Vapi Editorial Team) — [Your Voice Agents Need Tests. Now They Have Them.](https://vapi.ai/blog/evals) · good — Convert real production failures into regression tests by capturing the bad transcript and annotating the correct behavior, and match different criteria with different judges: regex/JSON/exact for deterministic outputs (e.g., a tool call must include particular arguments), LLM-as-judge for fuzzy qualities like tone,… excerpt: "When you discover a bad call in your logs, you can turn that transcript into a test. In the dashboard, pull up the call, click the thumbs down button to use it as an eval, specify what the assistant should have done…"
- Chip Huyen — [Agents](https://huyenchip.com/2025/01/07/agents.html) · good — Decompose agent planning evaluation into a concrete (task, tool-inventory) dataset and sample K plans per task, then track plan-level metrics (fraction valid, retries-to-valid) and tool-call-level metrics (invalid tool, valid tool with wrong params, valid tool with wrong values) — separating the distinct failure modes… excerpt: "To evaluate an agent for planning failures, you can create a dataset of (task, tool inventory) pairs. For each task, use an agent to generate K plans. Compute the following metrics: Out of all generated plans, how many…"
- Lilian Weng — [LLM Powered Autonomous Agents](https://lilianweng.github.io/posts/2023-06-23-agent/) · good — LLM-as-judge can silently fail in expert domains: in ChemCrow, an LLM evaluator rated GPT-4 and ChemCrow as roughly equal, while domain experts judging chemical correctness found ChemCrow far superior.
  The takeaway is that an LLM judge lacking domain expertise cannot detect flaws it doesn't understand, so… excerpt: "Interestingly, while the LLM-based evaluation concluded that GPT-4 and ChemCrow perform nearly equivalently, human evaluations with experts oriented towards the completion and chemical correctness of the solutions…"
- Carol Liang and Kevin Ho (Stripe, API Standards) — [Can AI agents build real Stripe integrations? We built a benchmark to find out](https://stripe.com/blog/can-ai-agents-build-real-stripe-integrations) · good — Don't grade an agent on its own self-reported success or surface-level API/UI responses; verify the real side effects in the system of record (here, the actual Stripe API object the action should have created). This catches the documented failure where an agent saw a 400 error on invalid test data and declared "Good,… excerpt: "Some graders also validated the Stripe artifacts of a run by inspecting created Stripe API objects. For example, in a full-stack challenge, the agent might complete a payment in the UI, then verify success by testing…"
- Discord — [Developing Rapidly with Generative AI](https://discord.com/blog/developing-rapidly-with-generative-ai) · good — Use a separate LLM (a "critic") to score your agent's outputs against criteria, and structure the judge prompt to force constrained outputs — yes/no or a numeric scale — rather than free-form critique, which makes the eval signal aggregable and lets you compare prompt variants quickly. excerpt: "AI-assisted evaluation uses best-in-class LLMs like GPT-4 to automatically critique how well the AI's outputs match what we expected or how they score against a set of criteria. ...
  This method uses GPT-4 in a way…"
- Gayatri Sabharwal — [What it takes to build AI agents at scale](https://ramp.com/leading-indicators/what-it-takes-to-build-ai-agents-at-scale) · good — Build eval ground truth from a domain expert's spec, then use a frontier model to generate adversarial edge cases the expert missed, and validate with beta-user feedback; the genuinely hard problem is deciding when eval coverage is sufficient to remove the human from the loop. The post also draws a useful line:… excerpt: "At Ramp, the eval suite starts with a human expert — often an accountant — who writes down how the task should go. A frontier model then stress-tests it, surfacing edge cases or the scenarios the expert didn't think of.…"
- Max Leiter — [How we made v0 an effective coding agent](https://vercel.com/blog/how-we-made-v0-an-effective-coding-agent) · good — Define the agent's primary metric as a binary user-visible outcome (does the generated site actually render, not error/blank) rather than text-similarity, then attack the ~10% LLM error baseline with a streaming autofix layer targeting specific named failure modes (stale APIs, nonexistent icons, missing providers,… excerpt: "The primary metric we optimize for is the percentage of successful generations. A successful generation is one that produces a working website in v0's preview instead of an error or blank screen. ... In our experience,…"

New, vetted finds from the automated Scan (discover → strict judge; deduped by URL and title). Newest first.

— Clémentine Fourrier HuggingFace , with swyx & Alessio — Latent Space — Benchmarks 201: Why Leaderboards Arenas LLM-as-Judge https://www.latent.space/p/benchmarks-201 https://www.latent.space/p/benchmarks-201 https://www.latent.space/p/benchmarks-201 · podcast excellent — First-hand, mechanism-level guidance from the lead maintainer of HuggingFace's OpenLLM Leaderboard: ranks evaluation methods reproducible leaderboards preference arenas LLM-as-judge , names concrete failure modes LLM judges show mode-collapse self-reinforcement, positional bias, and can't… 🆕— AI Engineer @aiDotEngineer ; Evals track hosted by Braintrust / Olmo Maldonado; multiple speakers — Evals: AI Engineer World's Fair 2025 full track playlist https://www.youtube.com/playlist?list=PLcfpQ4tk2k0XZS6wXjyB 8zuZBXHFTwYM https://www.youtube.com/playlist?list=PLcfpQ4tk2k0XZS6wXjyB 8zuZBXHFTwYM https://www.youtube.com/playlist?list=PLcfpQ4tk2k0XZS6wXjyB 8zuZBXHFTwYM · talk excellent — A full track of practitioner conference talks where teams at Google, Notion, Zapier, Vercel, Braintrust and others walk through how they actually build, score, and deploy product evals in production — error analysis, LLM-as-judge scorer design, offline vs online eval loops, and frontier-benchmark… 🆕— Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz, Oluwanifemi Bamgbose, Hoang Nguyen, Raghav Mehndiratta, Hari Subramani ServiceNow AI — A New Framework for Evaluating Voice Agents EVA https://huggingface.co/blog/ServiceNow-AI/eva https://huggingface.co/blog/ServiceNow-AI/eva https://huggingface.co/blog/ServiceNow-AI/eva · article excellent — EVA is an end-to-end voice-agent eval framework using a bot-to-bot audio harness user simulator + Pipecat agent + deterministic tool executor + validators that jointly scores task accuracy EVA-A: completion, faithfulness via LLM-judge, speech fidelity via LALM-judge and conversational… 🆕— Yunfei Bai, Allie Colin, Kashif Imran, Winnie Xiong AWS — Evaluating AI agents: 
Real-world lessons from building agentic systems at Amazon https://aws.amazon.com/blogs/machine-learning/evaluating-ai-agents-real-world-lessons-from-building-agentic-systems-at-amazon/ · article · good — Lays out a three-layer agent evaluation library (foundation-model benchmarking, component assessment of intent/memory/reasoning/tool-use, and final task-completion quality) with concrete component metrics like tool selection/parameter accuracy, context-retrieval precision/recall, and reasoning…
- 🆕 Michael Dawson (Red Hat) — Eval-driven development: Build and evaluate reliable AI agents https://developers.redhat.com/articles/2026/03/23/eval-driven-development-build-evaluate-ai-agents · article · good — A hands-on, 8-stage eval-driven workflow for a real multi-turn IT-self-service agent: uses DeepEval's ConversationalGEval/ConversationSimulator with ~15 custom LLM-as-judge metrics, a directory of 11 "known bad" conversations to validate that the metrics actually catch failures ("test your tests"),…
- 🆕 Scott Clark (Distributional) with Sam Charrington — How to Find the Agent Failures Your Evals Miss (TWIML 767) https://twimlai.com/podcast/twimlai/how-find-agent-failures-your-evals-miss · podcast · good — Pre-deployment evals only catch known failure modes; the durable method for catching "unknown unknowns" is post-production analytics — convert agent execution traces into vector
fingerprints, then cluster/topic-model them to surface emergent failures like "lazy" tool-use hallucinations (agents…)
- 🆕 Raza Habib (Humanloop CEO), MLOps Community — Product Metrics are LLM Evals // Raza Habib CEO of Humanloop https://www.youtube.com/watch?v=KWcE8ybs09A · podcast · good — The central, actionable thesis is that the best evals are your product metrics: instead of inventing proxy metrics, instrument the real production signals — explicit user feedback (thumbs up/down), user corrections to generated output, and the user's natural next action — and feed them back as your…
- 🆕 Rashmi Shetty (Capital One) with Sam Charrington (TWIML AI Podcast) — How Capital One Delivers Multi-Agent Systems (TWIML 765) — Rashmi Shetty https://twimlai.com/podcast/twimlai/how-capital-one-delivers-multi-agent-systems · podcast · good — A senior Capital One platform leader describes evaluating a real deployed multi-agent system (Chat Concierge for auto dealerships) by shifting from per-model ML metrics to end-to-end task-outcome evaluation, treating evals for stochastic multi-agent workflows plus observability as first-class…
- 🆕 Ereli Eran (Founding Engineer, 7AI), host Demetrios Brinkmann — MLOps Community — Software Engineering in the Age of Coding Agents: Testing, Evals, and Shipping Safely at Scale (MLOps Podcast 361) https://home.mlops.community/public/videos/software-engineering-in-the-age-of-coding-agents-testing-evals-and-shipping-safely-at-scale · podcast · good — A working
practitioner's three-tier eval pipeline for production agents: (1) "unit tests that are more like integration tests" which actually make LLM calls, (2) staging evals run against real customer data, and (3) async LLM-as-judge runs as a scheduled post-deployment task to re-review completed…
- 🆕 Geoffrey Irving (UK AI Security Institute), with Nathan Labenz — Situational Awareness in Government, with UK AISI Chief Scientist Geoffrey Irving https://www.cognitiverevolution.ai/situational-awareness-in-government-with-uk-aisi-chief-scientist-geoffrey-irving/ · podcast · good — UK AISI's Chief Scientist details real eval practice: open-sourcing the Inspect eval framework, calibrating fast automated evals against wet-lab biology ground truth, red-teaming across 30+ model runs (jailbreaking every model tested), and concrete eval-awareness mitigations (embedding evals in…)
- 🆕 Hamel Husain with Claire Vo, How I AI podcast — Evals, Error Analysis, and Better Prompts: A Systematic Approach to Improving Your AI Products https://www.lennysnewsletter.com/p/evals-error-analysis-and-better-prompts · podcast / video (episode with transcript) · good — A practitioner walkthrough of the error-analysis loop: read real user conversation traces, open-code failures and group them into categories, prioritize by frequency (counting not intuition), then build binary pass/fail evals and validate LLM-as-judge against human labels.
Includes a live…
- 🆕 Raza Habib (Humanloop) & Brianna Connelly (Filevine) — Eval-Driven Development: Best Practices and Pitfalls When Building with AI https://home.mlops.community/public/videos/eval-driven-development-best-practices-and-pitfalls-when-building-with-ai-raza-habib-and-brianna-connelly-ai-in-production-2025-2025-03-13 · conference talk (video) · good — A real production case study (Filevine, a legal-AI platform: 1.5M chat requests/mo, 360K docs, 25B tokens) showing a concrete eval-driven workflow with measured outcomes — scaling document classification from 60 to 160 categories while holding precision/recall in the high 80s-90s, and raising…
- 🆕 Greg Kamradt (ARC Prize Foundation) — How To Benchmark AGI — with Greg Kamradt, President of ARC-AGI https://www.youtube.com/watch?v=wU82fz4iRfo · talk · good — Kamradt frames a benchmark's job as measuring generalization/skill-acquisition efficiency rather than memorized task completion, and gives concrete, reusable eval-design rules: build tasks easy for humans but hard for AI to expose true capability gaps; a benchmark only gives useful signal in the…
- 🆕 Greg Kamradt (ARC Prize Foundation), with Demetrios Brinkmann — Greg Kamradt: Benchmarking Intelligence | ARC Prize MLOps Community https://home.mlops.community/public/videos/greg-kamradt-benchmarking-intelligence-or-arc-prize · talk / podcast
interview (video) · good — Lays out concrete, transferable eval-design principles from running ARC-AGI: build "human-easy, AI-hard" tasks to avoid saturation, verify human-solvability empirically (400 testers, every ARC-2 task solved by 2+ people in 2 attempts), use hidden holdout test sets and dual public/private…
- 🆕 HUD (hud.ai — no individual byline) — Verifier and Reward Design for RL Environments https://www.hud.ai/resources/verifier-reward-design-rl-environments · article (technical guide) · good — Lays out a concrete four-layer scoring architecture (verifiers / pass-fail gates / 3-5 criteria rubrics / composite reward) plus a five-step build workflow: define checkable end-states first ("table contains row id=4521, status='active'"), add hard failure gates, build minimal rubrics, test on…
- 🆕 Akshay Anand (Thoughtworks) — Evaluating AI agents in production: A practical framework https://www.thoughtworks.com/insights/blog/machine-learning-and-ai/Evaluating-AI-agents-in-production · article · good — Presents a practical three-layer eval architecture (persona-based multi-turn simulation, functional unit evals at agent/conversation level, operational observability) with a concrete maturity progression — start ~20% automated / 80% manual validation, refine personas via UAT, then shift to…
- 🆕 Brooke Hopkins (Coval, ex-Waymo) — Voice AI Agent Evaluation: The Complete Guide (2026) https://www.coval.ai/blog/voice-ai-agent-evaluation-guide · article · good — Domain-specific evaluation playbook for voice agents: persona-tiered simulation testing
(Easy/Medium/Hard/Adversarial across accent, noise, emotion), a concrete LLM-as-judge calibration loop (run on 50-100 calls, sample for human review, iterate rubrics until 85% human-judge agreement on binary…)
- 🆕 Rishi Gujjar & Andrew Li (Judgment Labs) — Agent Judge: Solving Long-Horizon Evals for Production Agents https://www.judgmentlabs.ai/blogs/agent-judge-solving-long-context-evaluations · article · good — Frames long-horizon agent evaluation as an agentic, multi-agent judge (search trajectory state as queryable objects, verify claimed actions against source-of-truth systems like DBs/APIs/GitHub, and iteratively refine the rubric), backed by a real benchmark table on internal hallucination-detection…
- 🆕 Shreya Shankar (guest); Hugo Bowne-Anderson (host) — Vanishing Gradients — Ep 57: AI Agents and LLM Judges at Scale — Processing Millions of Documents Without Breaking the Bank https://vanishinggradients.fireside.fm/57 · podcast · good — Shreya Shankar (UC Berkeley EPIC Lab, author of DocETL) walks through an end-to-end methodology for reliable LLM-judge and agent pipelines at scale: treat unstructured-text LLM workflows as ETL; do error analysis on the first 50-100 traces with a human to surface failure modes; add guardrails via…
- 🆕 Tejal Patwardhan (OpenAI frontier evals lead), host Andrew Mayne — Why Tejal Patwardhan stopped underestimating the models (Ep 21) https://shows.acast.com/openai-podcast/episodes/why-tejal-patwardhan-stopped-underestimating-the-models-epis · podcast · good — First-hand account from the person
running OpenAI's frontier evals team on why established benchmarks saturate or get gamed as models improve, what distinguishes a benchmark that holds up, the "capability overhang" problem (models advance faster than we can measure them), and the shift from toy…
- 🆕 Alon Bochman (RagMetrics); host Demetrios Brinkmann (MLOps Community) — Making AI Reliable is the Greatest Challenge of the 2020s 312 — Alon Bochman RagMetrics with Demetrios Brinkmann https://www.youtube.com/watch?v=d4PGxNM3Iis · podcast · good — Treat the LLM-judge's "human agreement rate" against domain experts on your own eval set as the primary success metric, and engage non-technical SMEs through a feedback loop (show input → model output → their preferred output, fine-tune the judge on 50-100 corrected pairs) rather than blank…
- 🆕 Lenny Rachitsky (host) with Brendan Foody (Mercor CEO) — Why experts writing AI evals is creating the fastest-growing companies in history — Brendan Foody (Mercor CEO) https://www.lennysnewsletter.com/p/experts-writing-ai-evals-brendan-foody · podcast · good — Two genuinely useful framings from someone selling evals to all top-5 labs: "if the model is the product, then the eval is the product requirement document," and that evals and RL verifier environments are the same data type — only the semantic use (benchmark vs. reward signal) differs.
The…
- 🆕 Anastasios Angelopoulos, Wei-Lin Chiang, Ion Stoica (LMArena) with Anjney Midha (a16z) — Beyond Leaderboards: LMArena's Mission to Make AI Reliable https://a16z.com/podcast/beyond-leaderboards-lmarenas-mission-to-make-ai-reliable/ · podcast · good — First-hand account from the team behind Chatbot Arena on why crowdsourced human-preference voting (vs. expert benchmarks) is needed for reliability, the Bradley-Terry ranking migration, style control to separate substance from formatting, building immunity to overfitting/gaming, why "fresh and…
- 🆕 Greg Kamradt (President, ARC Prize Foundation) — Measuring Agents with Interactive Evaluations — Greg Kamradt (ARC Prize Foundation) https://www.youtube.com/watch?v=TK9MN22q6E0 · talk (conference talk / video) · good — Argues static benchmarks can't measure what agents actually do (multi-turn exploration, planning, long-horizon execution) and proposes interactive evals scored on action efficiency vs.
a human baseline — how efficiently an agent converts environment information into a working strategy, grounded in…
- 🆕 Beth Barnes (CEO, METR) — The Most Important Graph in AI Right Now Measuring AI's Time Horizon — Beth Barnes (METR) https://www.youtube.com/watch?v=jXtk68Kzmms · talk · good — Direct from the source (METR's CEO), this lays out the "time horizon" eval methodology: rather than scoring tasks pass/fail, you order tasks by how long human experts take and find the duration at which a model hits 50% success — a human-baselined y-axis that makes capability legible and…
- 🆕 Sayash Kapoor & Benedikt Stroebl (Princeton), interviewed by Connor Shorten (Weaviate) — AI Agents That Matter with Sayash Kapoor and Benedikt Stroebl Weaviate Podcast 104 https://www.youtube.com/watch?v=gCP-W BNzg4 · talk / podcast interview · good — The co-first authors walk through their TMLR paper's core thesis: accuracy-only agent benchmarking produces needlessly complex, expensive agents, and once you control for inference cost, dead-simple baselines (e.g. retrying/resampling a model) land on or above the cost-accuracy Pareto frontier of…
- 🆕 Barry Zhang & Mahesh Murag (Anthropic) — Don't Build Agents, Build Skills Instead https://www.youtube.com/watch?v=CEvIs9y1uog · talk · good — Anthropic's case for packaging procedural knowledge as composable "Skills" (organized folders + self-documenting scripts, loaded via progressive disclosure so metadata is cheap until a skill.md is invoked) rather than building bespoke domain agents.
For evals specifically, the speakers name the…
- 🆕 Arvind Narayanan (Princeton); host Sam Charrington (TWIML) — AI Agents: Substance or Snake Oil — Arvind Narayanan TWIML Podcast 704 https://www.youtube.com/watch?v=HScABWB98Kw · talk · good — Grounded in Narayanan's "AI Agents That Matter" paper, it argues agent evals are systematically misleading: leaderboards ignore inference cost, so simple repeated-sampling baselines can match or beat complex agent architectures on benchmarks like HumanEval — making cost-vs-accuracy Pareto…

- 🆕 pavlovslist.com https://pavlovslist.com/ · directory — The RL-environment / eval startups directory ("for the RL-pilled").

Environment labs / RL-env companies (the "environments are the new data" venture wave, via pavlovslist): BenchFlow (benchflow.ai — SkillsBench, ClawsBench, runtime), Prime Intellect (verifiers, Environments Hub), HUD, Mechanize, Plato, AfterQuery, Halluminate, Surge AI, Scale, Mercor.

Prime Intellect (verifiers, Florian Brand) · Braintrust · Arize (Phoenix/AX, OpenInference) · Galileo · LangChain / LangSmith (agentevals) · Sierra (τ-bench) · Core Automation (Kanav Garg) · Epoch AI (benchmark audits) · METR (autonomy/horizon) · FutureHouse (HLE audit) · UK AISI (Inspect).

- Built by merging this project's research rounds (mining → adversarial verification → reference audit) with a /deep-research pass. Source detail lives in research/citations.md, research/findings.json, research/reference-audit.md, research/notes/, and the full link list in research/url-inventory.md (153 URLs). Verified-high (deep-research, 3/3 votes): Verifier's Law, the verifiers library, EvalGen, Inspect AI, promptfoo, the ABC benchmark-rigor paper, plus lm-eval-harness, Autoevals, agentevals, AI Agents That Matter.
Flagged caveats: the MT-Bench 10/25 bias numbers are hedged by their own authors; Lee's "Agent Runtime" post URL and the WebArena/OSWorld/Terminal-Bench/Cybench links still need verification; the Kanav Garg talk is cited via a conference summary (no canonical primary URL yet).

This repo ships 146 deep reading notes in notes/ /benchflow-ai/awesome-evals/blob/main/notes — structured summaries with key points, verbatim quotes, and themes, for the highest-signal sources:

- notes/articles/ — blog posts & practitioner essays
- notes/talks/ — 47 transcribed talks, podcasts & lectures with mm:ss timestamps
- notes/papers/ — papers surfaced by the citation graph

PRs welcome. Keep the bar high: show your work (real data/code/war-stories beat hot takes), give every entry a one-line why, verify the URL, and flag caveats. See CONTRIBUTING.md /benchflow-ai/awesome-evals/blob/main/CONTRIBUTING.md. Quality over quantity — a great list is as much about what it excludes.

To the extent possible under law, BenchFlow https://benchflow.ai and contributors have waived all copyright and related rights to this work (CC0 1.0). The linked resources remain under their respective licenses.