A curated, opinionated, non-BS library of the best resources for building and evaluating AI agents: papers, blog posts, talks, courses, tools, and benchmarks.
Maintained by BenchFlow.

Most "awesome" lists are link dumps. This one is annotated and verified: every entry says what it is and why it belongs, URLs are checked, quotes are verbatim, and dead/abandoned tools are pruned (not silently listed). It was assembled by:
- a depth-4 recursive citation crawl (11.6k papers, ranked by in-degree) to surface the academic canon,
- targeted practitioner-web discovery for the industry sources citation graphs miss (Eugene Yan, Han-Chung Lee, Hamel Husain, Shreya Shankar, Nathan Lambert, …),
- 47 talks & podcasts transcribed and deep-noted (verbatim + timestamps), and
- **per-section gap audits** with adversarial verification.
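The citation-crawl step above can be pictured as a depth-limited breadth-first walk over a citation graph, followed by a sort on in-degree (how many crawled papers cite each paper). The sketch below is only an illustration of that idea: the function name `crawl_and_rank`, the `references` mapping, and the toy paper IDs are hypothetical and are not the maintainers' actual pipeline.

```python
from collections import Counter, deque


def crawl_and_rank(seeds, references, max_depth=4):
    """Depth-limited BFS over a citation graph, then rank by in-degree.

    seeds      -- iterable of unique starting paper IDs (hypothetical input)
    references -- dict mapping a paper ID to the IDs it cites (hypothetical input)
    max_depth  -- how many citation hops to follow from the seeds
    """
    depth = {seed: 0 for seed in seeds}
    queue = deque(depth)
    in_degree = Counter()

    while queue:
        paper = queue.popleft()
        if depth[paper] >= max_depth:
            continue  # reached the crawl horizon: do not expand further
        for cited in references.get(paper, []):
            in_degree[cited] += 1  # one more crawled paper cites `cited`
            if cited not in depth:
                depth[cited] = depth[paper] + 1
                queue.append(cited)

    # Highest in-degree first; ties broken by ID for a stable ordering.
    return sorted(in_degree.items(), key=lambda item: (-item[1], item[0]))


toy_graph = {"A": ["C", "D"], "B": ["C"], "C": ["D"], "D": []}
ranking = crawl_and_rank(["A", "B"], toy_graph, max_depth=4)
```

In-degree is a crude proxy for canon status, which is presumably why the list pairs it with practitioner-web discovery for sources that citation graphs miss.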
**443+ curated links · 146 deep reading notes** (see [notes/](/benchflow-ai/awesome-evals/blob/main/notes)). Markers: 🆕 = released/updated 2025–2026 · [CONTRIBUTING](/benchflow-ai/awesome-evals/blob/main/CONTRIBUTING.md).

Playbook (PATTERNS.md): real, runnable code + worked examples for LLM-as-judge (aligned to humans), pass@k/pass^k, error analysis, trajectory & world-state grading, CI gating, verifiable rewards, and more.
- Playbook: real code & worked examples (PATTERNS.md)
- Must-read starter set (read these first)
- 1 · Why we need evals
- 2 · "If you can eval it, you have built it": eval ↔ capability ↔ RL environment
- 3 · The model / harness / skill decomposition
- 4 · Observability & the output / eval space (the surfaces you can grade)
- 5 · Evaluation infrastructure (the eval stack: datasets, scorers, online/offline, tracing, CI)
- 6 · Benchmark vs. eval (and benchmark integrity: contamination, saturation, label errors, leaderboard gaming)
- 7 · Evals & RL environments (verifiers, reward design, difficulty calibration, lifecycle)
- 8 · LLM-as-judge & verifiers (alignment, biases, verifiable vs judgeable)
- 9 · Agent-specific evaluation (trajectories, tool use, multi-turn, world state, multi-agent, localization)
- 10 · Safety / adversarial evaluation (prompt injection, jailbreaks, action-authorization, benchmark auditing)
- Talks, podcasts & slides (transcribed + noted)
- 💬 Eval insights inside general agent posts
- Scan additions
- Companies & landscape (eval / RL-environment market)
- Notes on provenance & gaps
- Deep notes
- Contributing
- License
Must-read starter set (read these first)

- Shunyu Yao: [The Second Half](https://ysymyth.github.io/The-Second-Half/) · blog. "Evaluation becomes more important than training." The field-level why.
- Eugene Yan: [An LLM-as-Judge Won't Save the Product, Fixing Your Process Will](https://eugeneyan.com/writing/eval-process/) · blog. Process over tooling; evals as the scientific method.
- Han-Chung Lee: [Hidden Technical Debt: Agent Evaluation Infrastructure](https://leehanchung.github.io/blogs/2026/06/13/hidden-technical-debt-agent-evaluation-infra/) · blog. Control/data plane, the five eval surfaces, state deltas. "Chat eval was a spreadsheet; agent eval is a system."
- Hamel Husain & Shreya Shankar: [LLM Evals FAQ](https://hamel.dev/blog/posts/evals-faq/) · blog. The densest operational Q&A: error analysis, binary judgments, the benevolent-dictator labeler.
- Jason Wei: [Asymmetry of Verification and Verifier's Law](https://www.jasonwei.net/blog/asymmetry-of-verification-and-verifiers-law) · blog. "Ability to verify == ability to create an RL environment."
- Anthropic: [Demystifying Evals for AI Agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents) · blog. Best primary on agent-specific evals: task design, outcome vs trajectory, isolated trials, pass@k vs pass^k.
- Ofir Press: [How to Build Good Language Modeling Benchmarks](https://ofir.io/How-to-Build-Good-Language-Modeling-Benchmarks/) · blog. Natural / auto-evaluatable / challenging; the "-200%" difficulty target; ~1-yr saturation.
- Kapoor, Stroebl, Siegel, Nadgir, Narayanan: [AI Agents That Matter](https://arxiv.org/abs/2407.01502) · paper. Cost as a first-class metric; model-dev vs app-dev; missing holdouts breed overfitting.
- Nathan Lambert: [Building on Evaluation Quicksand](https://www.interconnects.ai/p/building-on-evaluation-quicksand) · blog. LLM eval has no ground truth; contamination; eval-training coupling.
- Shankar, Zamfirescu-Pereira, Hartmann, Parameswaran, Arawjo (UIST '24): [Who Validates the Validators? (EvalGen)](https://arxiv.org/abs/2404.12272) · paper. "Criteria drift": you can't write the rubric before you grade.
- Florian Brand (Prime Intellect): [Benches 2026: "LLM benchmarks in the era of agents"](https://florianbrand.com/posts/benches-2026) · blog + 61-slide talk. The sharpest current read on why benchmarks break in the agent era: the "evals are dead, just measure vibes" backlash, how every layer of the eval-running stack (prompt · sampling temp · grader · harness) swings the score, and that benchmark ground truth is frequently wrong.
- OpenAI: [A Shared Playbook for Trustworthy Third-Party Evaluations](https://openai.com/index/trustworthy-third-party-evaluations-foundations/) · blog (Safety, May 2026). What makes independent evals of frontier-model safeguards & capabilities trustworthy: harness selection, the validity hazards that distort results, and the standards third-party evaluators need.
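The Anthropic entry above contrasts pass@k with pass^k. As a rough sketch of the difference, assuming the commonly used combinatorial estimators (n independent trials per task, c of them successful): pass@k estimates the chance that at least one of k sampled trials succeeds, while pass^k estimates the chance that all k succeed. The function names here are illustrative and are not taken from the playbook.

```python
from math import comb


def _check(n, c, k):
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n, c, k):
    """Estimate P(at least one of k trials passes) from n trials with c passes."""
    _check(n, c, k)
    # comb(n - c, k) counts the k-subsets made up only of failing trials.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n, c, k):
    """Estimate P(all k trials pass) from n trials with c passes (pass^k)."""
    _check(n, c, k)
    # comb(c, k) counts the k-subsets made up only of passing trials.
    return comb(c, k) / comb(n, k)


# One task, 10 trials, 5 successes: the two metrics diverge sharply at k = 2.
optimistic = pass_at_k(10, 5, 2)    # 35/45, about 0.78
reliability = pass_hat_k(10, 5, 2)  # 10/45, about 0.22
```

At k = 1 both reduce to the plain success rate c/n; as k grows, pass@k rises toward 1 while pass^k falls toward 0, which is why the latter is the more useful number when an agent has to succeed every time.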
1 · Why we need evals

- Shunyu Yao: [The Second Half](https://ysymyth.github.io/The-Second-Half/) · blog. The bottleneck shifts from solving problems to defining and evaluating them. (also T2, T7)
- Eugene Yan: [An LLM-as-Judge Won't Save the Product, Fixing Your Process Will](https://eugeneyan.com/writing/eval-process/) · blog. "Buying or building another evaluation tool won't save the product." Evals = the scientific method in disguise.
- Hamel Husain: [Your AI Product Needs Evals](https://hamel.dev/blog/posts/evals/) · blog. The canonical "you need evals"; remove all friction from looking at your data; don't rely on generic frameworks.
- Hamel Husain: [A Field Guide to Rapidly Improving AI Products](https://hamel.dev/blog/posts/field-guide/) · blog. "Error analysis is consistently the highest-ROI activity." The metric for an AI roadmap is experiments run.
- Shreya Shankar: [In Defense of AI Evals, for Everyone](https://www.sh-reya.com/blog/in-defense-ai-evals/) · blog. Rebuts the anti-eval backlash; evals = the systematic measurement of application quality.
- Yan, Bischof, Frye, Husain, Liu, Shankar: [What We Learned from a Year of Building with LLMs](https://applied-llms.org/) (Part II: https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-ii/) · blog. The "intern test," genchi genbutsu, turning vibe-checks into assertions.
- Nathan Lambert: [Big Tech's LLM Evals Are Just Marketing](https://www.interconnects.ai/p/evals-are-marketing) · blog. Why frontier-lab leaderboard numbers are marketing, not science.
- Chip Huyen: [AI Engineering pitfalls](https://huyenchip.com/2025/01/16/ai-engineering-pitfalls.html) · blog. Common eval/AI-engineering mistakes from the AI Engineering author. (also T6)
- Aishwarya Naresh Reganti & Kiriti Badam (O'Reilly Radar): [Evals Are NOT All You Need](https://www.oreilly.com/radar/evals-are-not-all-you-need/) · blog. The essential nuance piece: automated graders alone don't save you; you need a continuous-improvement flywheel of offline tests + production monitoring + real-user iteration. Pairs with Shreya's 'In Defense' to complete the backlash debate. 🆕
- Hamel Husain & Shreya Shankar with Lenny Rachitsky (Lenny's Podcast/Newsletter): [Why AI evals are the hottest new skill for product builders](https://www.lennysnewsletter.com/p/why-ai-evals-are-the-hottest-new-skill) · talk. The accessible 'why evals matter' on-ramp (live walkthrough of error analysis, open/axial coding) that mainstreamed evals to PMs in 2025; the apartment-leasing-bot anecdote is the canonical 'you can't vibe-check' story. 🆕
- OpenAI: [How evals drive the next chapter in AI for businesses](https://openai.com/index/evals-drive-next-chapter-of-ai/) · blog. Frontier-lab framing of evals as turning fuzzy business goals into specs and measurable ROI; useful counterweight to Lambert's 'evals are marketing' and grounds the 'why' for enterprise readers. 🆕 (unverified URL)
- Aman Khan (Arize) with Lenny Rachitsky: [Beyond vibe checks: A PM's complete guide to evals](https://www.lennysnewsletter.com/p/beyond-vibe-checks-a-pms-complete) · blog. The widely-shared PM-oriented argument for moving past 'looked good to me' vibe checks to systematic evals; one of the pieces that made evals a mainstream product skill in 2025. 🆕
- Gergely Orosz & Hamel Husain (The Pragmatic Engineer): [A pragmatic guide to LLM evals for devs](https://newsletter.pragmaticengineer.com/p/evals) · newsletter. Reaches the broad engineering audience with the core 'why': LLM non-determinism breaks traditional testing, so you need evals. High-distribution motivation piece co-written by Hamel. 🆕
- OpenAI: [Predicting model behavior before release by simulating deployment (Deployment Simulation)](https://openai.com/index/deployment-simulation/) · blog. Concrete 2026 evidence for why fixed/static evals fail: models recognize when they're being tested and game test suites; replaying ~1.3M real conversations surfaced reward-hacking no fixed eval caught. Strong 'why evals must evolve' argument. 🆕 (unverified URL)
- Greg Brockman (OpenAI): [evals are surprisingly often all you need](https://x.com/gdb/status/1733553161884127435) · blog. The canonical one-liner ('evals are the new unit test') that anchors the whole 'why evals' thesis; frequently cited founding quote for the movement. Short but load-bearing.
Must-reads: Yao · Yan (eval-process) · Hamel (field-guide)
2 · "If you can eval it, you have built it": eval ↔ capability ↔ RL environment

- Jason Wei: [Asymmetry of Verification and Verifier's Law](https://www.jasonwei.net/blog/asymmetry-of-verification-and-verifiers-law) · blog. Trainability tracks verifiability; verifying = creating an RL environment.
- Han-Chung Lee: [A Taxonomy of RL Environments for LLM Agents](https://leehanchung.github.io/blogs/2026/03/21/rl-environments-for-llm-agents/) · blog. A benchmark is a frozen RL environment; the E = {T,H,V,S,C} decomposition; "verifiable beats judgeable."
- Kanav Garg (Core Automation; ex-DeepMind): talk; summary at [The Life Cycle of an RL Environment](https://muratbuffalo.blogspot.com/2026/06/acm-cais-conference-on-ai-and-agentic.html) · talk. Difficulty calibration (the 1–4/16 Goldilocks band), RL as variance reduction, reward hacking under training pressure. (local notes: `research/notes/kanav-garg-rl-environment-lifecycle.md`)
- David Silver & Richard Sutton: [Welcome to the Era of Experience](https://storage.googleapis.com/deepmind-media/Era-of-Experience%20/The%20Era%20of%20Experience%20Paper.pdf) · paper. Human-data value approaching its ceiling; the frontier is agents learning from experience / synthetic environments.
- Nathan Lambert: [RLHF Book, Ch. 16: Evaluation](https://rlhfbook.com/c/16-evaluation) · book. Evaluation as a reflection of training goals; prompt-format sensitivity (60% → ~0%).
- Nathan Lambert: [What Comes Next with Reinforcement Learning](https://www.interconnects.ai/p/what-comes-next-with-reinforcement) · blog. Long-horizon credit assignment; where RL is and isn't ready.
- Prime Intellect: [verifiers](https://github.com/PrimeIntellect-ai/verifiers) (docs: `.../blob/main/docs/environments.md`) · tool/repo. One environment package shared by eval and `prime-rl`: the eval-is-an-RL-env thesis as code.
- DeepSeek-AI (Guo et al.): [DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://arxiv.org/abs/2501.12948) · paper. The proof-of-thesis: pure RL with rule-based verifiable rewards (no SFT) makes reasoning emerge; the canonical 'if you can verify it, RL builds it' result; also published in Nature 2025. 🆕
- Lambert et al. (Allen Institute for AI): [Tülu 3: Pushing Frontiers in Open Language Model Post-Training](https://arxiv.org/abs/2411.15124) · paper. Coined/popularized RLVR and open-sourced the recipe + code (open-instruct): swap the reward model for a verifier on tasks with checkable answers. The foundational citation behind every 'verifiable beats judgeable' claim in this section. 🆕
- Anthropic: [Natural Emergent Misalignment from Reward Hacking in Production RL](https://www.anthropic.com/research/emergent-misalignment-reward-hacking) · paper. Empirical receipt for the section's 'reward hacking under training pressure' theme: learning to cheat on real coding environments generalizes to sabotage/alignment-faking; introduces inoculation prompting as mitigation (arXiv 2511.18397). 🆕
- Prime Intellect: [Environments Hub: A Community Hub To Scale RL To Open AGI](https://www.primeintellect.ai/blog/environments) · blog. The launch post for the verifiers-spec marketplace (2,500+ shared eval/RL environments): the eval-is-an-RL-env thesis as an actual ecosystem, the natural companion to the verifiers repo listed above. 🆕
- Ege Erdil, Matthew Barnett, Tamay Besiroglu (Mechanize): [How to fully automate software engineering](https://www.mechanize.work/blog/how-to-fully-automate-software-engineering/) · blog. Sharpest statement of the inverse thesis: today's RL environments are rudimentary, so capability is gated on building richer/more diverse environments; 'you only get the capability you can build an environment for.' 🆕
- Mechanize (Erdil, Barnett, Besiroglu): [Cheap RL tasks will waste compute](https://www.mechanize.work/blog/cheap-rl-tasks-will-waste-compute/) · blog. The economics of environment quality: data and compute are complementary, so low-quality (cheaply-bought) tasks waste expensive RL compute; directly informs difficulty calibration and why environment design matters. 🆕
- Jean-Stanislas Denain & Chris Barber (Epoch AI): [An FAQ on Reinforcement Learning Environments](https://epoch.ai/gradient-updates/state-of-rl-envs) · blog. Practitioner-interview survey (18 pros) on how RL environments are actually built, the reward-hacking failure modes, and the production-scaling bottleneck; an empirical state-of-the-field map. 🆕
- AJ Kourabi & Dylan Patel (SemiAnalysis): [RL Environments and RL for Science: Data Foundries and Multi-Agent Architectures](https://newsletter.semianalysis.com/p/rl-environments-and-rl-for-science) · newsletter. Market-structure view: 35+ companies now sell RL environments; capability gains are coming from ramping RL compute, not pretraining. Grounds the 'benchmark = frozen RL environment' thesis in who's actually building/buying them. 🆕
- Harbor / Stanford / Laude Institute: [Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces](https://github.com/harbor-framework/terminal-bench) · benchmark. A concrete instance of the thesis: each task ships a Docker environment + programmatic verification test suite + oracle, i.e. a benchmark that IS an RL environment (and is used as one). 2.4k stars, active. 🆕
- Sierra Research (Barres et al.): [tau2-bench (τ²-Bench): A Benchmark for Tool-Agent-User Interaction in Real-World Domains](https://github.com/sierra-research/tau2-bench) · benchmark. Dual-control, multi-turn, policy-following eval with a simulated user and verifiable DB-state checks; the canonical example of a verifiable conversational/agentic environment beyond math/code (paper arXiv 2506.07982). 🆕
Must-reads: Wei · Lee (RL-env taxonomy)
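The "1–4/16 Goldilocks band" mentioned in the Garg entry can be read as a difficulty filter: roll out each task 16 times and keep it for training only if the current policy solves it on 1 to 4 of those attempts. That reading is an assumption drawn from the one-line annotation, not from the talk itself, and the function names and sample task IDs below are hypothetical.

```python
def classify_task(successes, rollouts=16, low=1, high=4):
    """Bucket one task by how many of its rollouts succeeded.

    Assumed reading of the band: 0 successes carries no learning signal
    (too hard), more than `high` is already mostly solved (too easy).
    """
    if not 0 <= successes <= rollouts:
        raise ValueError("successes must lie between 0 and rollouts")
    if successes < low:
        return "too_hard"
    if successes > high:
        return "too_easy"
    return "keep"


def calibrate(results, low=1, high=4):
    """Split tasks into buckets given {task_id: [bool, ...]} rollout outcomes."""
    buckets = {"too_hard": [], "keep": [], "too_easy": []}
    for task_id, outcomes in results.items():
        label = classify_task(sum(outcomes), len(outcomes), low, high)
        buckets[label].append(task_id)
    return buckets


rollout_results = {
    "never-solved": [False] * 16,
    "sometimes-solved": [True] * 3 + [False] * 13,
    "always-solved": [True] * 16,
}
buckets = calibrate(rollout_results)
```

Because the band is defined relative to the current policy, a task's bucket changes as training progresses, so the filter would need to be re-run periodically rather than applied once.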
3 · The model / harness / skill decomposition

- Han-Chung Lee: [Hidden Technical Debt: Agent Harness](https://leehanchung.github.io/blogs/2026/05/08/hidden-technical-debt-agent-harness/) · blog. The harness is the agent; what teams call "the model" is mostly harness + product.
- Han-Chung Lee: [Hidden Technical Debt series (index)](https://leehanchung.github.io/blogs/) · blog. The four-part series (eval infra, runtime, harness, + agent runtime ~2026/04/24). (verify the runtime post URL on the index.)
- METR: [Measuring AI Ability to Complete Long Tasks](https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/) · paper/blog. Scaffolds change the measured horizon; success-vs-human-time as a primitive. (also T9)
- Nathan Lambert: [Turing Post interview ("Open Models Won't Catch Up")](https://www.turingpost.com/p/nathanlambert) · talk/interview. "What technical people call the harness or the product matters more than just the model."
- Florian Brand (Prime Intellect): [Quo vadis, LLM benchmarks?](https://florianbrand.com/posts/benches-2026) (talk: https://www.youtube.com/watch?v=kmTMc-fVSXw) · blog/talk. The AlgoTune case: same model, different harness, opposite ranking. (also T6) (notes: `research/notes/florian-brand-*`)
- Han-Chung Lee: [The Model is the Product](https://leehanchung.github.io/talks/2025/04/23/the-model-is-the-product/) · talk. The primary-source talk (Data Council 2025) behind the must-read author's whole thesis; the direct counterpart to Hamel's 'Model is Not the Product' and the foundational text of the harness/model debate this section is built on. 🆕
- Hamel Husain: [The Model is Not the Product](https://www.youtube.com/watch?v=EEw2PpL-_NM) · talk. The opposing side of the Lee debate (Data Council 2025): great products are mostly harness + product + evals, not the model. 🆕
- Simon Willison: [Agents are models using tools in a loop](https://simonwillison.net/2025/May/22/tools-in-a-loop/) · blog. The canonical, now-widely-adopted definition of an agent; 'the skill is in the design of both the tools and the loop' is the cleanest statement of why the harness, not the model, dominates behavior. 🆕
- OpenAI: [Harness engineering: leveraging Codex in an agent-first world](https://openai.com/index/harness-engineering/) · blog. Frontier-lab primary source coining 'harness engineering': a 1M-line codebase built by Codex agents where improving the environment/harness mattered more than the model. Lab-side complement to Lee's 'harness is the agent'. (URL returns 403 to scraper but page is live; corroborated by InfoQ/Milvus coverage.) 🆕
- Anthropic: [Equipping agents for the real world with Agent Skills](https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills) · blog. The primary source for the 'skill' leg of the model/harness/skill decomposition: skills as composable, progressively-disclosed capabilities (later made an open standard). 🆕
- Anthropic: [Effective context engineering for AI agents](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents) · blog. Anthropic's primary statement that the harness's job is engineering context (editing, compaction, memory, programmatic tool-calling); the mechanism behind why same model + different harness diverges. 🆕
- Anthropic (Ken Aizawa): [Writing effective tools for agents, with agents](https://www.anthropic.com/engineering/writing-tools-for-agents) · blog. Tool design is a load-bearing part of the harness; 'agents are only as effective as the tools we give them,' validated eval-first. Directly ties harness decisions to measured agent performance. 🆕
- Pete Hodgson: [Same Model, Different Results: Why Coding Agents Aren't Interchangeable](https://blog.thepete.net/blog/2025/12/10/same-model-different-results-why-coding-agents-arent-interchangeable/) · blog. Concrete teardown of Claude Code's harness (system reminders, sub-agents, planning, IDE feedback) showing identical models yield different results; the practitioner case-study version of Brand's AlgoTune point. 🆕
- Princeton SAgE team (Kapoor, Narayanan, et al.): [Holistic Agent Leaderboard (HAL)](https://hal.cs.princeton.edu/) · benchmark. Standardized, cost-aware harness that runs the SAME agent harness across 9 benchmarks/9 models (21,730 rollouts); the infrastructure answer to 'harness confounds rankings.' ICLR 2026; paper arXiv:2510.11977. 🆕
- Addy Osmani (O'Reilly Radar): [Agent Harness Engineering](https://www.oreilly.com/radar/agent-harness-engineering/) · blog. 'A decent model with a great harness beats a great model with a bad harness'; reframes agent failures as harness/config problems (traceable AGENTS.md rules). Names the converging harness primitives across coding agents. 🆕
- Nathan Lambert (Interconnects): [What comes next with open models (weights / tools / harness decomposition)](https://www.interconnects.ai/p/the-next-phase-of-open-models) · blog. Lambert's written articulation (Mar 2026) of an AI system as weights + tools + harness; the written companion to the Turing Post interview listed above, with the explicit three-part decomposition. 🆕
Must-reads: Lee (harness) · Brand (Quo vadis)
4 · Observability & the output / eval space (the surfaces you can grade)

- Han-Chung Lee: [Hidden Technical Debt: Agent Evaluation Infrastructure](https://leehanchung.github.io/blogs/2026/06/13/hidden-technical-debt-agent-evaluation-infra/) · blog. Control plane / data plane; the five surfaces (output, trace, memory, environment, mechanistic); the empty-tool-result hallucination.
- Braintrust: [The Three Pillars of AI Observability](https://www.braintrust.dev/blog/three-pillars-ai-observability) · blog. Dataset reconciliation (living datasets); traces / evals / annotation.
- Arize (AX docs): [Agent Trajectory Evaluations](https://arize.com/docs/ax/evaluate/evaluators/trace-and-session-evals/trace-level-evaluations/agent-trajectory-evaluations) · docs. Grading the path, not just the answer.
- Galileo: [AI Agent Metrics: How Elite Teams Evaluate](https://galileo.ai/blog/ai-agent-metrics) · blog. A concrete agent-metric taxonomy (action completion, tool selection, etc.).
- Arize: [OpenInference semantic conventions](https://github.com/Arize-ai/openinference/blob/main/spec/semantic_conventions.md) · tool/repo. An OTel-based agent trace schema (tool, args, observation, latency, cost).
- LangChain: LangSmith Evaluation / Trajectory evals · https://docs.langchain.com/langsmith/evaluation · https://docs.langchain.com/langsmith/trajectory-evals · docs.
- OpenTelemetry / CNCF: [OpenTelemetry GenAI Semantic Conventions (agent & framework spans)](https://github.com/open-telemetry/semantic-conventions-genai) · docs. The upstream vendor-neutral standard (spans/metrics/events for LLM calls, invoke_agent, execute_tool, MCP) that OpenInference maps onto; the canonical trace schema the section's OpenInference entry derives from. 🆕
- OpenTelemetry: [Semantic Conventions for GenAI agent and framework spans](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/) · docs. Human-readable spec page for create_agent / invoke_agent / execute_tool spans and attributes; the precise definition of what a gradable agent trace looks like. 🆕
- OpenTelemetry (blog): [Inside the LLM Call: GenAI Observability with OpenTelemetry](https://opentelemetry.io/blog/2026/genai-observability/) · blog. Walkthrough of emitting and reading GenAI spans (token usage, finish reasons, tool calls); concrete intro to the trace surface for practitioners not steeped in OTel. 🆕
- Weights & Biases: [W&B Weave: tracing & evaluation toolkit](https://docs.wandb.ai/weave) · docs. `@weave.op` trace trees (inputs/outputs/cost/latency) plus a scorer-based eval harness; a widely used surface for grading both traces and outputs. 🆕
- Laminar: [Laminar: open-source observability for AI agents](https://laminar.sh/) · tool. OTel-native, agent-specific: transcript view, SQL-over-traces, and a rollout debugger; purpose-built for grading multi-step agent trajectories rather than single LLM calls. 🆕
Must-reads: Lee (eval infra) · Braintrust (three pillars)

5 · Evaluation infrastructure (the eval stack: datasets, scorers, online/offline, tracing, CI)

(All repos URL-verified via GitHub API, Jun 2026. 🆕 = released/expanded 2025–2026. ⚠️ = caveat/discontinued.)
- UK AISI: [Inspect AI](https://github.com/UKGovernmentBEIS/inspect_ai) · https://inspect.aisi.org.uk/. `@task` binds dataset + solver + scorer; custom scorers; sandboxed tools. The reference agent-eval framework. (MUST)
- UK AISI: [inspect_evals](https://github.com/UKGovernmentBEIS/inspect_evals). 🆕 The companion catalog of community benchmarks (GAIA, CTFs, AIME…); the "batteries" for Inspect.
- EleutherAI: [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). The standard academic harness; first-class decontamination; task YAMLs.
- Allen Institute (Ai2): [OLMES](https://github.com/allenai/olmes). 🆕 The reproducible eval standard + harness behind OLMo/Tülu: standardized prompts/metrics/formatting for apples-to-apples model comparison.
- [BenchFlow](https://github.com/benchflow-ai/benchflow) · https://benchflow.ai. 🆕 Environment-lab framework: research infra + runtime for building RL environments, evals & post-training; ships SkillsBench and ClawsBench. ("Environments are the new data.")
- Hugging Face: [lighteval](https://github.com/huggingface/lighteval). 🆕 All-in-one harness across transformers/vLLM/TGI/nanotron, 1000+ tasks; HF's successor to `evaluate`.
- Groq: [OpenBench](https://github.com/groq/openbench). 🆕 Provider-agnostic `bench` CLI, 95+ benchmarks, built on Inspect primitives.
- OpenAI: [simple-evals](https://github.com/openai/simple-evals). Minimal zero-shot/CoT scripts (MMLU, HumanEval, SimpleQA, HealthBench); the numbers OpenAI publishes. ⚠️ not actively maintained.
- [OpenAI Evals](https://github.com/openai/evals). The `completion_fn` abstraction = swap the system-under-test. (Best-practices: https://developers.openai.com/api/docs/guides/evaluation-best-practices)
- [promptfoo](https://github.com/promptfoo/promptfoo). MIT eval + red-teaming CLI; git-diffable YAML configs. (MUST)
- [DeepEval / Confident AI](https://github.com/confident-ai/deepeval). "pytest for LLMs," 40+ metrics (G-Eval, RAG, hallucination) + red-team; ~2M evals/day; hosted cloud. 🆕
- [pydantic-evals](https://github.com/pydantic/pydantic-ai) (`ai.pydantic.dev/evals`). 🆕 Type-safe Datasets/Cases/Evaluators with OTel tracing, from the Pydantic AI team.
- LangChain: [openevals](https://github.com/langchain-ai/openevals). 🆕 Prebuilt evaluators + `create_llm_as_judge` (incl. multimodal); general-purpose companion to agentevals (https://github.com/langchain-ai/agentevals, trajectory match).
- [MLflow GenAI evaluate](https://mlflow.org/docs/latest/genai/eval-monitor/). 🆕 `mlflow.genai.evaluate`: 50+ judges/metrics, custom scorers, regression datasets inside MLflow.
- Stanford CRFM: [HELM (crfm-helm)](https://github.com/stanford-crfm/helm). Holistic eval: standardized datasets + metrics beyond accuracy + leaderboard (also VHELM, HEIM).
- [Giskard](https://github.com/Giskard-AI/giskard-oss). Auto-generates adversarial test suites (injection, hallucination, bias) from a plain-language app description.
- [Deepchecks LLM](https://github.com/deepchecks/deepchecks) (`llmdocs.deepchecks.com`). Property-based scoring (grounded-in-context, toxicity, fluency) + custom LLM-judge properties.
- [UpTrain](https://github.com/uptrain-ai/uptrain). 20+ preconfigured checks + root-cause analysis on failures.
- [HF `evaluate`](https://github.com/huggingface/evaluate). Classic metrics library, ⚠️ maintenance mode (use lighteval for LLMs).
- harbor-framework (Laude Institute / Stanford): [Harbor](https://github.com/harbor-framework/harbor). 🆕 Framework for running agent evals + creating/using RL environments; powers Terminal-Bench 2.0. ~2.7k★. ⚠️ name overloaded (cf. `av/harbor` local-LLM toolkit).
- Matt Pocock: [evalite](https://github.com/mattpocock/evalite). 🆕 Local-first eval runner on Vitest; `.eval.ts` files, web UI, cost-aware.
- [Mastra scorers](https://github.com/mastra-ai/mastra) (`mastra.ai/docs/evals/overview`). 🆕 Model-graded/rule/statistical scorers, live evals, CI, in the Mastra agent framework.
- [Vercel agent-eval](https://github.com/vercel-labs/agent-eval). 🆕 A/B-test coding agents (Claude Code, Codex, Cursor) on custom tasks; pass-rate dashboards.
- Braintrust: [Autoevals](https://github.com/braintrustdata/autoevals). OSS scorer library (Factuality, relevance, security…) across Py/JS/Go/Ruby.
- [TruLens](https://github.com/truera/trulens). Instrumentation + "feedback functions" (the RAG triad), now OTel-based.
- Stanford: [ARES](https://github.com/stanford-futuredata/ARES). Synthetic queries + fine-tuned judges + prediction-powered inference for confidence intervals.
- Amazon Science: [RAGChecker](https://github.com/amazon-science/RAGChecker). 🆕 Claim-level diagnosis separating retriever vs generator errors.
- [continuous-eval (Relari)](https://github.com/relari-ai/continuous-eval). Modular per-module metrics across retrieval/generation/tool-use.
- [Tonic Validate](https://github.com/TonicAI/tonic_validate). RAG metrics as a GitHub Action for CI.
- Haize Labs: [verdict](https://github.com/haizelabs/verdict). 🆕 Declarative compound judges (debate/verification/aggregation, inference-time scaling); arXiv:2502.18018.
- OpenPipe (ART): [RULER](https://github.com/OpenPipe/ART) (`art.openpipe.ai/fundamentals/ruler`). 🆕 LLM-judge that ranks trajectories with no labels; judge-as-RL-reward. (industry must-read)
- [Prometheus 2](https://github.com/prometheus-eval/prometheus-eval). Open-weight evaluator LMs for rubric-based assessment + pairwise.
- [Atla Selene](https://github.com/atla-ai/selene-mini). 🆕 8B SoTA open judge (score + critique); + MCP server `atla-ai/atla-mcp-server`. arXiv:2501.17195.
- Patronus Lynx / GLIDER: https://github.com/patronus-ai/Lynx-hallucination-detection · https://github.com/patronus-ai/glider. 🆕 Open hallucination judge / explainable span-level judge.
- [Flow-Judge](https://github.com/flowaicom/flow-judge). Efficient 3.8B open evaluator.
- AI2: [RewardBench](https://github.com/allenai/reward-bench). Canonical reward-model (+v2 judge) benchmark/harness.
- [JudgeBench](https://github.com/ScalerLab/JudgeBench). Benchmark to evaluate the judges themselves.
- Fireworks: [reward-kit](https://github.com/fw-ai-external/reward-kit). 🆕 Decorator-based reward-function authoring (TRL/Fireworks interop).
- Prime Intellect: [verifiers](https://github.com/PrimeIntellect-ai/verifiers). Environment = dataset + harness + rubric; one package for eval, RL, synthetic data. (MUST)
- Prime Intellect: [Environments Hub](https://github.com/PrimeIntellect-ai/community-environments) (app.primeintellect.ai). 🆕 Crowdsourced verifiers-based RL/eval envs.
- Prime Intellect: [prime-rl](https://github.com/PrimeIntellect-ai/prime-rl). 🆕 Async RL trainer consuming verifiers envs (INTELLECT-3).
- [BenchFlow](https://github.com/benchflow-ai/benchflow) · https://benchflow.ai. 🆕 Environment lab: builds & runs RL/eval environments (SkillsBench, ClawsBench, runtime). "Environments are the new data." (also §5a)
- [HUD](https://github.com/hud-evals/hud-python). 🆕 SDK to build/run agent eval environments (computer-use, browser, MCP) with telemetry.
- Nous Research: [Atropos](https://github.com/NousResearch/atropos). 🆕 Async "environment microservice" framework for rollouts/verifiable rewards.
- [verl](https://github.com/volcengine/verl) (now `verl-project/verl`). De-facto industry RLVR trainer (PPO/GRPO). ~22k★.
- [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF) · [SkyRL](https://github.com/NovaSky-AI/SkyRL) · [AReaL](https://github.com/areal-project/AReaL) · [ROLL](https://github.com/alibaba/ROLL) · [rLLM](https://github.com/agentica-project/rllm) · [TRL](https://github.com/huggingface/trl). The RL-training stack agents are post-trained + eval'd in.
- General Reasoning: [Open Reward Standard (ORS)](https://docs.openreward.ai/) (PyPI `openreward`). 🆕 MCP-extending spec adding RL primitives (episodes, rewards, curriculum). ⚠️ no single canonical repo confirmed.
- [Arize Phoenix](https://github.com/Arize-ai/phoenix). OSS OTel tracing + response/retrieval evals + datasets/experiments. (MUST)
- [Langfuse](https://github.com/langfuse/langfuse). OSS: evals (LLM-judge, feedback, manual labeling), datasets/experiments, prompt mgmt; self-hostable. 🆕
- Comet: [Opik](https://github.com/comet-ml/opik). 🆕 Fully-OSS eval + observability (judges, datasets, CI-runnable evals).
- [W&B Weave](https://github.com/wandb/weave). `weave.Evaluation` scorers (exact/regex/model-graded/embedding) + Guardrails; comparison dashboards. 🆕 (Humanloop's migration target.)
- [Braintrust](https://www.braintrust.dev/docs/start/eval-sdk) (offline-eval-guide). `Eval()` over golden datasets; offline vs online. (MUST)
- [Patronus AI](https://www.patronus.ai/) (`github.com/patronus-ai`). 🆕 Research-grade judges (Lynx, GLIDER, Percival agent-failure debugger), experiments, multimodal judge.
- [Maxim AI](https://www.getmaxim.ai/). 🆕 Agent simulation + eval + observability across thousands of scenarios/personas.
- [Galileo](https://galileo.ai/). Luna evaluators + Agentic Evaluations.
- [Vellum](https://www.vellum.ai/). Visual workflows + offline/online evals scoring every production run.
- [Helicone](https://github.com/helicone/helicone). OSS gateway + observability; "Scores" ingests external eval results.
- [Traceloop / OpenLLMetry](https://github.com/traceloop/openllmetry). OSS OTel instrumentation (Py/TS/Go/Ruby) + hosted reliability platform.
- [Langtrace](https://github.com/Scale3-Labs/langtrace). OSS OTel-standard tracing + manual scoring + dataset mgmt.
- [WhyLabs / LangKit](https://github.com/whylabs/langkit). High-throughput text-signal metrics (toxicity, PII, jailbreak) for production monitoring.
- [Portkey](https://github.com/portkey-ai/gateway). 🆕 OSS gateway + 60+ guardrails + observability (fully open-sourced Mar 2026).
- [Datadog LLM Observability](https://www.datadoghq.com/product/ai/llm-observability/). 🆕 Evaluators + golden datasets + LLM Experiments + AI Agent Monitoring (Jun 2025).
- [Fiddler AI](https://www.fiddler.ai/). 🆕 Trust Models (Safety/PII/Faithfulness) scoring in <100ms; Guardrails + agentic observability.
- [SeaOtter](https://seaotter.ai/submit?utm_source=github&utm_medium=awesome_list&utm_campaign=launch&utm_content=A-09-benchflow-awesome-evals) · tool. Adversarial critic for AI agent outputs. Submit an output plus an acceptance policy; get pass/rework/fail with specific reasons before accepting the work.
- [PromptLayer](https://www.promptlayer.com/) · [New Relic AI Monitoring](https://newrelic.com/platform/ai-monitoring). Lighter prompt-CMS / APM-native monitoring.
- Arize - [OpenInference](https://github.com/Arize-ai/openinference) - semantic conventions for agent traces (tool/args/observation/latency/cost).
- [OpenTelemetry GenAI semantic conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/) (open-telemetry/semantic-conventions) - 🆕 the vendor-neutral schema (now covers agent orchestration, MCP tool calls, and a quality-evaluation span hook).
- Braintrust - [Braintrust](https://www.braintrust.dev/) · *tool* - Industry-standard eval+observability platform (Notion, Stripe, Vercel) tying offline experiments to production logs; the section already cites Braintrust's Autoevals but omits the platform itself. 🆕
- RagaAI - [RagaAI Catalyst](https://github.com/raga-ai-hub/RagaAI-Catalyst) · *tool* - OSS agent-observability + eval SDK with multi-agent trace/execution-graph debugging, synthetic-data gen, and guardrail management - covers the online/guardrail-eval slice the section lacks. 🆕
- OpenAI - [OpenAI Cookbook - Evals](https://developers.openai.com/cookbook/topic/evals) · *docs* - Maintained, runnable recipes for building evals (incl. Agents SDK eval, evaluating agents with Langfuse); the practical companion to OpenAI Evals and a curator-grade 'show real work' resource. 🆕 (unverified URL)
Must-reads: Inspect AI · promptfoo · Braintrust · verifiers · DeepEval · Phoenix/Langfuse (pick your observability) · RULER (judge-as-reward)
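The offline-eval pattern most tools in this section share (a golden dataset, deterministic scorers, and a pass-rate threshold used as a CI gate) can be sketched in a few lines. This is a minimal illustration, not any vendor's SDK: `run_offline_eval`, `gate`, `toy_task`, and the row shape (`input` / `expected`) are hypothetical names chosen for the example.

```python
import re
from typing import Callable, Dict, List

Row = Dict[str, str]  # hypothetical golden-row shape: {"input": ..., "expected": ...}


def exact_match(output: str, expected: str) -> bool:
    """Deterministic scorer: string equality, ignoring surrounding whitespace."""
    return output.strip() == expected.strip()


def regex_match(output: str, pattern: str) -> bool:
    """Deterministic scorer: the expected field is a regex the output must contain."""
    return re.search(pattern, output) is not None


def run_offline_eval(
    task: Callable[[str], str],
    dataset: List[Row],
    scorer: Callable[[str, str], bool],
) -> float:
    """Run `task` over every golden row and return the pass rate in [0, 1]."""
    if not dataset:
        raise ValueError("golden dataset is empty")
    passed = sum(scorer(task(row["input"]), row["expected"]) for row in dataset)
    return passed / len(dataset)


def gate(score: float, threshold: float) -> bool:
    """CI gate: True means the change may ship."""
    return score >= threshold


def toy_task(prompt: str) -> str:
    """Stand-in for the system under test (hypothetical)."""
    return prompt.upper()


golden = [
    {"input": "ok", "expected": "OK"},
    {"input": "fail", "expected": "FAIL"},
    {"input": "maybe", "expected": "perhaps"},
]
score = run_offline_eval(toy_task, golden, exact_match)
print(f"pass rate = {score:.2f}, gate(0.5) = {gate(score, 0.5)}")
```

In a real pipeline the threshold would usually be set relative to the last accepted baseline rather than hard-coded, and model-graded scorers would sit alongside the deterministic ones.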
6 · Benchmark vs. eval (and benchmark integrity: contamination, saturation, label errors, leaderboard gaming)
- Ofir Press - [How to Build Good Language Modeling Benchmarks](https://ofir.io/How-to-Build-Good-Language-Modeling-Benchmarks/) · *blog* - The benchmark-author's checklist; difficulty target; one-number reporting; 150–500 task sizing.
- Kapoor et al. - [AI Agents That Matter](https://arxiv.org/abs/2407.01502) · *paper* - Cost-controlled evaluation; model-dev vs downstream-dev needs; holdouts.
- OpenAI - [Why We No Longer Evaluate SWE-bench Verified](https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/) · *blog* - ~59% of audited failures were broken tests. (mirror: https://decrypt.co/359012/...)
- Shivalika Singh et al. (Cohere/Princeton/Stanford/MIT/AI2) - [The Leaderboard Illusion](https://arxiv.org/abs/2504.20879) · *paper* - Private testing, selective disclosure, and data-access asymmetry on Chatbot Arena. (notes: research/notes/leaderboard-illusion.md)
- [The SWE-bench Illusion: When SOTA LLMs Remember Instead of Reason](https://arxiv.org/abs/2506.12286) · *paper* - Memorization inflates SWE-bench scores.
- [Establishing Best Practices for Building Rigorous Agentic Benchmarks (ABC)](https://arxiv.org/abs/2507.02825) · *paper* - SWE-bench Verified weak tests; τ-bench rewards empty responses. (verified high)
- Epoch AI - [FrontierMath Tiers 1–3 v2 (corrected)](https://epoch.ai/benchmarks/frontiermath-tiers-1-3-v2) (changelog: .../frontiermath-tier-4-v2) · *page* - ~42% of problems corrected after AI-assisted review. (also T8: the operator-as-rot-detector tale)
- FutureHouse / Andrew White - [About 30% of Humanity's Last Exam Answers Are Wrong](https://www.futurehouse.org/research-announcements/hle-exam) · *blog* - 29 ± 3.7% of text-only chem/bio answers contradicted by the literature. (LessWrong writeup: https://www.lesswrong.com/posts/JANqfGrMyBgcKtGgK/)
- Nathan Lambert - [Building on Evaluation Quicksand](https://www.interconnects.ai/p/building-on-evaluation-quicksand) · *blog* - No hard source of truth; synthetic-data contamination.
- [Lost in Simulation](https://arxiv.org/abs/2601.17087) · *paper* - Simulated users are unreliable proxies (~9pp swings by simulator choice; demographic miscalibration).
- Jimenez, Yang, … Press, Narasimhan - [SWE-bench: Can LMs Resolve Real-World GitHub Issues?](https://arxiv.org/abs/2310.06770) · https://www.swebench.com (Verified: .../verified.html) · *paper/site*.
- Eugene Yan - [Task-Specific LLM Evals that Do & Don't Work](https://eugeneyan.com/writing/evals/) · *blog* - Off-the-shelf evals rarely transfer; accuracy is too coarse.
- [Andrej Karpathy on evals](https://x.com/karpathy/status/1896266683301659068) · *post* - "We make a number of specific recommendations…" (the eval-as-narrow critique).
- Hugh Zhang et al. (Scale AI) - [A Careful Examination of LLM Performance on Grade School Arithmetic (GSM1k)](https://arxiv.org/abs/2405.00332) · *paper* - Held-out GSM1k replica of GSM8k exposes up to 8% accuracy drop and partial memorization (Mistral/Phi) - the canonical method for measuring benchmark overfitting/contamination via a matched holdout.
- Curtis Northcutt, Anish Athalye, Jonas Mueller - [Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks](https://arxiv.org/abs/2103.14749) · *paper* - NeurIPS 2021 foundational result: ~3.3% avg label errors across 10 famous test sets (ImageNet, MNIST, etc.); corrections flip model rankings. The canonical 'label errors' citation this section's theme rests on (labelerrors.com / cleanlab).
- Aryo Pradipta Gema et al. (Edinburgh) - [Are We Done with MMLU? (MMLU-Redux)](https://arxiv.org/abs/2406.04127) · *paper* - ~6.5% of MMLU questions contain errors (57% in Virology); MMLU-Redux re-annotation shifts rankings - directly demonstrates label-error impact on the most-cited LLM benchmark.
- Naman Jain et al. (UC Berkeley) - [LiveCodeBench: Holistic and Contamination-Free Evaluation of LLMs for Code](https://arxiv.org/abs/2403.07974) · *benchmark* - Time-windowed problem collection (post-cutoff scoring) as the leading contamination-resistant design pattern - the section discusses contamination but lists no exemplar of how to engineer around it.
- White, Dohan, LeCun, Goldblum et al. - [LiveBench: A Challenging, Contamination-Limited LLM Benchmark](https://github.com/LiveBench/LiveBench) · *benchmark* - Monthly-refreshed questions from new arXiv/news/competitions with objective ground truth - the canonical 'dynamic refresh' answer to saturation and contamination.
- Clémentine Fourrier / Hugging Face - [The LLM Evaluation Guidebook (Open LLM Leaderboard team)](https://github.com/huggingface/evaluation-guidebook) · *docs* - Practitioner reference from running the Open LLM Leaderboard; explicit sections on contamination, reproducibility, and leaderboard design - the hands-on 'how to not get fooled' companion to this section (updated version: hf.co/spaces/OpenEvals/evaluation-guidebook).
- Kapoor, Stroebl, Kirgis et al. (Princeton) - [Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation](https://arxiv.org/abs/2510.11977) · *paper* - 21,000+ standardized agent runs surfacing leaderboard unreliability and unreported misbehaviors (agents searching HuggingFace for benchmark answers) - extends 'AI Agents That Matter' to leaderboard integrity for agents specifically. 🆕
- Jambholkar, Rajani, Bakshi (Collinear AI) - [Gaming the System: Goodhart's Law Exemplified in the AI Leaderboard Controversy](https://blog.collinear.ai/p/gaming-the-system-goodharts-law-exemplified-in-ai-leaderboard-controversy) · *blog* - Practitioner framing of the Llama 4 / Chatbot Arena gaming episode through Goodhart's Law - the accessible blog companion to The Leaderboard Illusion paper. 🆕
- OpenAI - [A Shared Playbook for Trustworthy Third-Party Evaluations](https://openai.com/index/trustworthy-third-party-evaluations-foundations/) · *blog (Safety, May 29 2026)* - What makes independent evals of frontier-model safeguards & capabilities trustworthy: selecting the right harness, checking for validity hazards that distort results, and the standards third-party evaluators need. (also T10) 🆕
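The time-windowed idea behind contamination-resistant benchmarks can be illustrated with a small sketch: split per-problem results at a model's training cutoff and compare pass rates on either side. This is not LiveCodeBench's or LiveBench's code; the record fields (`released`, `passed`) and function names are assumptions made for the example.

```python
from datetime import date
from typing import Dict, List, Tuple

Result = Dict[str, object]  # hypothetical shape: {"released": date, "passed": bool}


def split_by_cutoff(results: List[Result], cutoff: date) -> Tuple[List[Result], List[Result]]:
    """Partition results into problems released on/before vs after the cutoff."""
    before = [r for r in results if r["released"] <= cutoff]
    after = [r for r in results if r["released"] > cutoff]
    return before, after


def pass_rate(rows: List[Result]) -> float:
    """Fraction of rows marked as passed."""
    if not rows:
        raise ValueError("no problems in this window")
    return sum(bool(r["passed"]) for r in rows) / len(rows)


def contamination_gap(results: List[Result], cutoff: date) -> float:
    """Pre-cutoff pass rate minus post-cutoff pass rate.

    A large positive gap is a warning sign, not proof: newer problems may
    also simply be harder.
    """
    before, after = split_by_cutoff(results, cutoff)
    return pass_rate(before) - pass_rate(after)


results = [
    {"released": date(2023, 3, 1), "passed": True},
    {"released": date(2023, 6, 1), "passed": True},
    {"released": date(2024, 2, 1), "passed": True},
    {"released": date(2024, 5, 1), "passed": False},
]
gap = contamination_gap(results, cutoff=date(2023, 12, 31))
print(f"pre/post-cutoff gap = {gap:.2f}")
```

Reporting only the post-cutoff pass rate is the stricter choice when the cutoff is known; the gap is mainly useful as a diagnostic.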
Must-reads: Press · Kapoor et al. · OpenAI (SWE-bench Verified) · Leaderboard Illusion
7 · Evals & RL environments (verifiers, reward design, difficulty calibration, lifecycle)
(See also T2 - verifiers library, Lee's RL-env taxonomy, Garg's lifecycle, Wei's verifier's law.)
- Nathan Lambert et al. - [RewardBench](https://arxiv.org/abs/2403.13787) · *paper* - Evaluating reward models (the verifier you train against).
- Nathan Lambert - [The New RL Scaling Laws](https://www.interconnects.ai/p/the-new-rl-scaling-laws) · *blog* - Where RLVR scaling is heading. (interview: https://www.latent.space/p/the-rlvr-revolution-with-nathan-lambert)
- [Spurious Rewards: Rethinking Training Signals in RLVR](https://arxiv.org/abs/2506.10947) · *paper* - Random/spurious rewards rival ground truth on Qwen2.5 (Qwen-specific). (cite arXiv figures, not the blog gloss - see research/notes/reference-audit.md)
- Nathan Lambert - [The State of Post-Training 2025](https://www.interconnects.ai/p/the-state-of-post-training-2025) · *blog* - Context for where evals feed training.
- Lilian Weng - [Reward Hacking in Reinforcement Learning](https://lilianweng.github.io/posts/2024-11-28-reward-hacking/) · *blog* - The canonical survey of reward hacking - taxonomy, RLHF-specific failure modes, mitigations; the foundational reference any reward-design section needs.
- Victoria Krakovna et al. (Google DeepMind) - [Specification gaming: the flip side of AI ingenuity](https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/) · *blog* - Canonical specification-gaming post (+the running examples list); origin story of why verifiers/reward functions get gamed, predating the LLM-RL wave.
- Latent Space / Will Brown - [Multi-Turn RL for Multi-Hour Agents - with Will Brown (Prime Intellect)](https://www.latent.space/p/willccbb) · *talk* - The verifiers author on building multi-turn RL environments, turn-level credit assignment and reward design in practice - the practitioner voice behind the verifiers library already cited here. 🆕
- various (arXiv 2509.21882) - [Position: The Hidden Costs and Measurement Gaps of RLVR](https://arxiv.org/abs/2509.21882) · *paper* - RLVR gains overstated via budget mismatch, calibration drift, contamination; proposes a tax-aware minimum standard - the rigor counterweight to Lambert's RL-scaling optimism. 🆕
- Saumya Malik, Nathan Lambert et al. (Ai2) - [RewardBench 2: Advancing Reward Model Evaluation](https://arxiv.org/abs/2506.01937) · *benchmark* - The 2025 successor to RewardBench (already listed) - harder, less saturated, ICLR 2026; the current bar for evaluating the verifier you train against. 🆕
- Nathan Lambert - [Reward Modeling (RLHF Book, ch. 5)](https://rlhfbook.com/c/05-reward-models) · *docs* - Canonical free reference chapter on reward models - the standing explainer for the 'verifier you train against' framing this section uses. 🆕
- Shubham Parashar et al. (Texas A&M) - [Curriculum RL from Easy to Hard Tasks Improves LLM Reasoning (E2H Reasoner)](https://arxiv.org/abs/2506.06632) · *paper* - Difficulty-calibration primary source: easy-to-hard scheduling with convergence guarantees and the 'fade out easy tasks' result - directly fills the section's difficulty-calibration theme. 🆕
- Jiacheng Guo, Ling Yang, Mengdi Wang et al. (Princeton) - [GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators](https://arxiv.org/abs/2512.19682) · *paper* - Generative environment simulator with an alpha-Curriculum Reward that keeps tasks in the zone of proximal development - recent take on auto-calibrating env difficulty to the agent. 🆕
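One simple reading of difficulty calibration is a pass-rate band filter: estimate each task's empirical pass rate from a few rollouts, then keep only tasks that are neither trivially solved nor never solved. The sketch below is an assumption-laden illustration of that idea, not the E2H Reasoner schedule or GenEnv's alpha-Curriculum Reward; the 0.2 / 0.8 thresholds and all names are hypothetical.

```python
from typing import Dict, List


def empirical_pass_rate(outcomes: List[bool]) -> float:
    """Fraction of rollouts that passed the verifier."""
    if not outcomes:
        raise ValueError("need at least one rollout per task")
    return sum(outcomes) / len(outcomes)


def calibrate(
    task_outcomes: Dict[str, List[bool]],
    low: float = 0.2,   # hypothetical lower bound of the useful band
    high: float = 0.8,  # hypothetical upper bound of the useful band
) -> Dict[str, List[str]]:
    """Bucket task ids by where their pass rate falls relative to [low, high]."""
    buckets: Dict[str, List[str]] = {"too_hard": [], "keep": [], "too_easy": []}
    for task_id, outcomes in task_outcomes.items():
        rate = empirical_pass_rate(outcomes)
        if rate < low:
            buckets["too_hard"].append(task_id)
        elif rate > high:
            buckets["too_easy"].append(task_id)
        else:
            buckets["keep"].append(task_id)
    return buckets


rollouts = {
    "task_a": [True] * 8,                   # always solved
    "task_b": [True, False, False, True],   # solved half the time
    "task_c": [False] * 8,                  # never solved
}
buckets = calibrate(rollouts)
print(buckets)
```

With few rollouts per task the estimates are noisy, so a task near a threshold can flip buckets between runs; re-estimating as the policy improves is what turns this filter into a curriculum.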
Must-reads: Lee (RL-env taxonomy) · Garg (lifecycle) · verifiers (repo)
8 · LLM-as-judge & verifiers (alignment, biases, verifiable vs judgeable)
- Eugene Yan - [Evaluating the Effectiveness of LLM-Evaluators](https://eugeneyan.com/writing/llm-evaluators/) · *blog* - Position/verbosity/self-enhancement bias; direct vs pairwise; prefer binary + classification metrics.
- Hamel Husain - [Creating an LLM-as-a-Judge That Drives Business Results](https://hamel.dev/blog/posts/llm-judge/) · *blog* - Critique-shadowing; validate against ONE benevolent-dictator expert; precision/recall over raw agreement.
- Shankar et al. (UIST '24) - [Who Validates the Validators? (EvalGen)](https://arxiv.org/abs/2404.12272) (pdf: .../pdf/2404.12272; UIST: https://people.eecs.berkeley.edu/~bjoern/papers/shankar-validators-uist2024.pdf) · *paper* - Criteria drift; the coverage-vs-false-failure judge-alignment loop.
- Hamel Husain & Shreya Shankar - [LLM Evals FAQ](https://hamel.dev/blog/posts/evals-faq/) (error-analysis section: .../why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed.html) · *blog* - Binary over Likert; review ≥100 traces; the first-failure transition matrix for agents.
- Han-Chung Lee - [LLM-as-a-Judge: Rethinking Model-Based Evaluations](https://leehanchung.github.io/blogs/2024/08/11/llm-as-a-judge/) · *blog* - Avoid [0,1] continuous scales; manage judges like junior annotators.
- Zheng et al. - [Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://arxiv.org/abs/2306.05685) · *paper* - Source of the 10%/25% self-favoring & position-bias numbers - which the authors themselves hedge ("cannot determine"); GPT-3.5 doesn't self-favor.
- Bavaresco et al. - [LLMs Instead of Human Judges? A Large-Scale Study](https://arxiv.org/abs/2406.18403) · *paper* - Substantial variance across models/datasets; validate judges against humans first.
- Eugene Yan - [AlignEval](https://eugeneyan.com/writing/aligneval/) · *blog* - "Align AI to human. Calibrate human to AI. Repeat." Work backward from the data.
- Eugene Yan - [Product Evals in Three Simple Steps](https://eugeneyan.com/writing/product-evals/) · *blog* - The "God Evaluator" anti-pattern; the benchmark is human performance, not perfection.
- Han-Chung Lee - [Statistics for AI/ML, Part 3 - Cohen's Kappa](https://leehanchung.github.io/blogs/2025/03/03/cohen-kappa/) · *blog* - Chance-adjusted inter-annotator agreement (the gate before holding out).
- Shreya Shankar - [Data Flywheels for LLM Applications](https://www.sh-reya.com/blog/ai-engineering-flywheel/) · *blog* - Binary metrics, the "GPT smell," error analysis as the core activity.
- [SPADE](https://arxiv.org/html/2401.03038v1) & [DocETL](https://arxiv.org/abs/2410.12189) - Shankar et al. · *paper* - Data-quality assertions / agentic query rewriting for LLM pipelines.
- Arjun Panickssery, Samuel R. Bowman, Shi Feng (NeurIPS 2024) - [LLM Evaluators Recognize and Favor Their Own Generations](https://arxiv.org/abs/2404.13076) · *paper* - The canonical causal study of self-preference bias: shows GPT-4/Llama-2 can recognize their own outputs and that self-recognition correlates linearly with self-favoring. This is the primary source behind 'self-enhancement bias' that the section's blogs only allude to.
- Yang Liu et al. (Microsoft, EMNLP 2023) - [G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment](https://arxiv.org/abs/2303.16634) · *paper* - The foundational reference-free LLM-judge method (CoT + form-filling scoring). Defines the direct-scoring paradigm the section critiques; a curated judge section is incomplete without the paper that started it.
- Jiawei Gu et al. - [A Survey on LLM-as-a-Judge](https://arxiv.org/abs/2411.15594) · *paper* - The most-cited survey organizing the LLM-judge space (bias taxonomy, reliability methods, agreement metrics). Serves as the one-stop map/bibliography the section currently lacks.
- Yulai Zhao, Haolin Liu, Dian Yu et al. (Tencent AI Lab / Princeton) - [One Token to Fool LLM-as-a-Judge](https://arxiv.org/abs/2507.08794) · *paper* - Shows 'master-key' tokens (a colon, 'Solution:') trigger false-positive rewards up to 80% even on GPT-o1/Claude-4 judges, plus a robust Master-RM fix. Core evidence on judge/verifier reward-hacking fragility. 🆕
- Jon Saad-Falcon et al. (Stanford Hazy Research / Scaling Intelligence) - [Weaver: Closing the Generation-Verification Gap with Weak Verifiers](https://hazyresearch.stanford.edu/blog/2025-06-18-weaver) · *blog* - Directly operationalizes 'verifiable vs judgeable': aggregates many weak judges/reward models (unlabeled) to shrink the generator-verifier gap, reaching o3-mini accuracy from Llama-3.3-70B. Paper: arxiv.org/abs/2506.18203. 🆕
- Mingchen Zhuge et al. (Meta AI / KAUST) - [Agent-as-a-Judge: Evaluate Agents with Agents](https://arxiv.org/abs/2410.10934) · *paper* - Extends LLM-as-judge to agentic trajectories (grading intermediate steps, not just final outputs) with the DevAI benchmark. The agent-specific evaluation case this agent-evals library specifically needs.
- Various (AAAI 2026) - [VerifyBench: A Systematic Benchmark for Evaluating Reasoning Verifiers Across Domains](https://arxiv.org/abs/2507.09884) · *benchmark* - Cross-domain benchmark exposing verifier precision/recall trade-offs (specialized verifiers high-accuracy but low-recall; general models inclusive but unstable). Quantifies how trustworthy a verifier actually is for RLVR. 🆕
- Databricks (Mosaic Research) - [Enhancing LLM-as-a-Judge with Grading Notes / From Pilot to Production with Custom Judges](https://www.databricks.com/blog/pilot-production-custom-judges) · *blog* - Enterprise-grade judge-building playbook: 20-30 calibration examples, batched SME annotation, Krippendorff's alpha agreement gating - a production-side complement to the Hamel/Shankar academic alignment loop. 🆕
- Jiayi Ye et al. - [Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge (CALM framework)](https://arxiv.org/abs/2410.02736) · *paper* - Systematic quantification of 12 judge biases (verbosity, bandwagon, authority, distraction, sentiment, etc.) via automated attacks - broadens the section's bias coverage well beyond position/verbosity/self-enhancement.
Must-reads: Yan (llm-evaluators) · Hamel (llm-judge) · Shankar (EvalGen)
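Several entries in this section recommend validating a binary judge against human labels with precision/recall and a chance-adjusted agreement statistic instead of raw agreement. A minimal sketch of that check follows; it assumes paired boolean labels where `True` means "pass", and the function names and toy labels are illustrative only.

```python
from typing import List, Tuple


def precision_recall(human: List[bool], judge: List[bool]) -> Tuple[float, float]:
    """Precision and recall of the judge's 'pass' verdicts, with humans as ground truth."""
    tp = sum(h and j for h, j in zip(human, judge))
    fp = sum((not h) and j for h, j in zip(human, judge))
    fn = sum(h and (not j) for h, j in zip(human, judge))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def cohens_kappa(human: List[bool], judge: List[bool]) -> float:
    """Cohen's kappa for two binary raters: (p_o - p_e) / (1 - p_e)."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    p_o = sum(h == j for h, j in zip(human, judge)) / n  # observed agreement
    p_h = sum(human) / n                                  # human 'pass' rate
    p_j = sum(judge) / n                                  # judge 'pass' rate
    p_e = p_h * p_j + (1 - p_h) * (1 - p_j)               # agreement expected by chance
    if p_e == 1.0:
        return 1.0  # both raters constant and identical; kappa is otherwise undefined
    return (p_o - p_e) / (1 - p_e)


human = [True, True, True, False, False, True, False, True]
judge = [True, True, False, False, True, True, False, True]
precision, recall = precision_recall(human, judge)
kappa = cohens_kappa(human, judge)
print(f"precision={precision:.2f} recall={recall:.2f} kappa={kappa:.3f}")
```

Here raw agreement is 75%, but kappa is noticeably lower because both raters say "pass" most of the time; that gap is the reason the chance-adjusted number is the more honest gate.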
9 · Agent-specific evaluation (trajectories, tool use, multi-turn, world state, multi-agent, localization)
- Anthropic - [Demystifying Evals for AI Agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents) · *blog* - Grade the final env state (flight-booking via SQL); outcome vs trajectory; isolation; pass@k vs pass^k.
- Sierra - [τ-bench / τ²-bench](https://arxiv.org/abs/2406.12045) · https://github.com/sierra-research/tau-bench · *paper/repo* - DB-state-diff grading; user simulation; pass^k; empty-result as explicit fail.
- Sierra - [Benchmarking AI Agents](https://sierra.ai/blog/benchmarking-ai-agents) · *blog* - The motivation behind τ-bench.
- Mialon et al. - [GAIA: A Benchmark for General AI Assistants](https://arxiv.org/abs/2311.12983) · *paper* - Real assistant tasks; difficulty by human task-length.
- Eugene Yan - [Patterns for Building Cybersecurity Evals](https://eugeneyan.com/writing/cybersecurity-evals/) · *blog* - The four-primitive agentic-eval template (sandbox, difficulty inputs, tools, deterministic grader); outcome grading + partial-credit ladders + transcript audits. (also T10)
- Han-Chung Lee - [Statistics for AI/ML, Part 4 - pass@k and Unbiased Estimator](https://leehanchung.github.io/blogs/2025/09/08/pass-at-k/) · *blog* - Demystifies the metric everyone misuses.
- Han-Chung Lee - [First-Principles Eval](https://leehanchung.github.io/blogs/2024/05/22/first-principles-eval/) · *blog*.
- [SWE-bench grading harness](https://github.com/SWE-bench/SWE-bench/blob/main/swebench/harness/grading.py) · *tool/repo* - FAIL_TO_PASS / PASS_TO_PASS as a verifiable reward. (SWE-agent ACI: https://swe-agent.com/0.7/background/aci/)
- OpenAI - [human-eval (pass@k estimator)](https://github.com/openai/human-eval/blob/master/human_eval/evaluation.py) · *tool/repo*.
- More agent benchmarks to add *(named in the brief; URLs not yet verified in this corpus - verify before use):* WebArena, OSWorld, Terminal-Bench, Cybench.
- Zhou et al. (CMU) - [WebArena: A Realistic Web Environment for Building Autonomous Agents](https://arxiv.org/abs/2307.13854) · *benchmark* - Self-hostable sandboxed websites (e-commerce/forum/GitLab/CMS/maps) with execution-based functional-correctness graders; 812 tasks. The canonical web-agent world-state benchmark named in the brief - now URL-verified.
- Xie et al. (HKU et al.) - [OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments](https://arxiv.org/abs/2404.07972) · *benchmark* - 369 real-computer tasks in VMs with per-task execution-based eval scripts and initial-state setup; humans 72% vs best agent 12%. Canonical computer-use benchmark named in the brief - now verified.
- Laude Institute + Stanford + community - [Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command-Line Interfaces](https://www.tbench.ai/) · *benchmark* - Sandboxed terminal tasks with deterministic verifiers across SWE/sysadmin/security; v2 leaderboard. The terminal-agent benchmark named in the brief - verified (arxiv: arxiv.org/abs/2601.11868). 🆕
- Zhang et al. (Stanford) - [Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language Models](https://arxiv.org/abs/2408.08926) · *benchmark* - 40 professional CTF challenges with subtask annotations and deterministic flag-based grading; pairs naturally with Eugene Yan's cybersecurity-evals post already in the section. Named in the brief - now verified.
- Lù et al. (McGill / Mila / Google DeepMind) - [AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories](https://arxiv.org/abs/2504.08942) · *paper* - First benchmark of LLM-judges-of-trajectories: 1302 expert-reviewed web-agent runs; shows rule-based graders reject many valid trajectories (under-reporting success). Core to the 'trajectory evaluation' theme the section currently lacks. 🆕
- Cemri, Pan et al. (UC Berkeley Sky Lab) - [Why Do Multi-Agent LLM Systems Fail? (MAST taxonomy)](https://arxiv.org/abs/2503.13657) · *paper* - 14-mode failure taxonomy across 7 MAS frameworks from 200+ annotated traces; the reference framework for diagnosing multi-agent failures - directly fills the 'multi-agent' gap. 🆕
- Trivedi et al. (Stony Brook, ACL'24 Best Resource Paper) - [AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents](https://aclanthology.org/2024.acl-long.850/) · *benchmark* - 9-app simulated world (457 APIs) with state-based unit tests that also check for collateral damage/unexpected state changes - gold-standard world-state grading for tool-use agents.
- OpenAI (Wei et al.) - [BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents](https://openai.com/index/browsecomp/) · *benchmark* - 1,266 'inverted' hard-to-find/easy-to-verify questions for deep-research browsing agents; short verifiable answers make grading deterministic. Released 2025, now standard for browsing-agent eval. (paper: arxiv.org/abs/2504.12516) 🆕
- Chen, Tang et al. (Yale / All Hands) - [LocAgent: Graph-Guided LLM Agents for Code Localization](https://arxiv.org/abs/2503.09089) · *paper* - Defines and evaluates code localization as its own capability (Acc@k over file/function locations via code graphs) - directly fills the 'localization' theme named in the section title but currently unlisted. 🆕
- He et al. (Tencent AI Lab) - [WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models](https://arxiv.org/abs/2401.13919) · *benchmark* - 643 tasks on 15 live real-world sites with a GPT-4V automatic-judge eval protocol - an early, widely-cited example of multimodal-LLM-as-judge for live-web agent trajectories.
- BenchFlow - [SkillsBench](https://github.com/benchflow-ai/skillsbench) · *benchmark* - 🆕 evaluates how well agent skills work and how effectively agents use them - makes skill-acquisition/skill-use a measurable axis (the "Agent Skills" frontier). ~1.4k★.
- BenchFlow - [ClawsBench](https://github.com/benchflow-ai/ClawsBench) · *benchmark* - 🆕 BenchFlow's agent benchmark (results/data repo; full release in progress).
- OpenAI (with SWE-bench authors) - [SWE-bench Verified](https://openai.com/index/introducing-swe-bench-verified/) · *benchmark* - 500 human-validated SWE-bench instances graded by hidden FAIL_TO_PASS unit tests; the de facto standard for real-issue resolution and the headline coding-agent number labs report. 🆕
- Yang, Jimenez, Press et al. (Princeton/Stanford) - [SWE-bench Multimodal](https://arxiv.org/abs/2410.03859) · *benchmark* - 619 visual JS/front-end issues from 17 user-facing repos, test-verified; probes whether SWE agents generalize beyond Python/text to visual software domains.
- Scale AI (Deng, Da et al.) - [SWE-bench Pro](https://arxiv.org/abs/2509.16941) · *benchmark* - 1,865 long-horizon, multi-file tasks across public GPL + held-out + commercial startup repos, test-graded; contamination-resistant and hard (frontier <45% pass@1). 🆕
- OpenAI (Miserendino, Patwardhan, Heidecke et al.) - [SWE-Lancer](https://arxiv.org/abs/2502.12115) · *benchmark* - 1,400+ real Upwork freelance tasks worth $1M, graded by triple-verified end-to-end Playwright tests plus manager-decision tasks; ties capability to economic value. 🆕
- Pan, Wang, Neubig, Suhr, Zhang et al. (Berkeley/CMU) - [SWE-Gym](https://arxiv.org/abs/2412.21139) · *benchmark* - 2,438 executable Python SWE tasks with pre-installed deps + test verification; the first real training/eval gym for SWE agents and verifiers, ICML 2025. 🆕
- ByteDance Seed - [Multi-SWE-bench](https://arxiv.org/abs/2504.02605) · *benchmark* - 1,632 expert-annotated issue-resolution tasks across Java, TS, JS, Go, Rust, C, C++, test-graded; the leading multilingual SWE-bench extension, NeurIPS 2025 D&B. 🆕
- Nebius / Badertdinov et al. - [SWE-rebench](https://arxiv.org/abs/2505.20411) · *benchmark* - Automated pipeline yielding 21k+ executable Python tasks with continuously refreshed, decontaminated eval splits; quantifies how much SWE-bench Verified scores are inflated by contamination, NeurIPS 2025 D&B. 🆕
- METR - [RE-Bench](https://arxiv.org/abs/2411.15114) · *benchmark* - 7 open-ended ML research-engineering environments (e.g. GPU-kernel optimization, scaling laws) scored against 71 human-expert 8-hour attempts; the reference AI-R&D-uplift eval, ICML 2025.
- OpenAI (Chan et al.) - [MLE-bench](https://arxiv.org/abs/2410.07095) · https://github.com/openai/mle-bench · *benchmark* - 75 Kaggle ML-engineering competitions graded against real human leaderboards (medal thresholds) in 24h Docker runs; standard ML-engineering-agent eval, ICLR 2025. 🆕
- OpenAI (Starace et al.) - [PaperBench](https://arxiv.org/abs/2504.01848) · *benchmark* - Replicate 20 ICML 2024 papers from scratch, graded by 8,316 author-co-developed rubric leaves via a validated LLM judge; rigorous research-replication agent eval, ICML 2025. 🆕
- Andy Konwinski / Kaggle - [Konwinski Prize (K Prize)](https://www.kaggle.com/competitions/konwinski-prize) · *leaderboard* - $1M Kaggle forecasting-format contest on GitHub bugs filed after submission close, fully contamination-free, test-graded; round-1 top score only 7.5% exposed real-world difficulty. 🆕
- Gou et al., OSU NLP Group (NeurIPS 2025 D&B) - [Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge](https://arxiv.org/abs/2506.21506) · *benchmark* - 130 long-horizon live-web agentic-search tasks; novel Agent-as-a-Judge rubric-tree grader for time-varying, citation-backed answers - a serious answer to the Deep Research evaluation gap. 🆕
- Xue et al., OSU NLP Group - [Online-Mind2Web (An Illusion of Progress? Assessing the Current State of Web Agents)](https://arxiv.org/abs/2504.01382) · *benchmark* - 300 realistic tasks on 136 live websites with an LLM-as-a-Judge auto-grader (~85% human agreement); exposes overstated web-agent progress vs simple baselines. 🆕
- AGI Inc (agi-inc/REAL), powers realevals.xyz - [REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites](https://github.com/agi-inc/REAL) · *benchmark* - 112 tasks on deterministic Next.js replicas of Amazon/Uber/LinkedIn etc.; reproducible LLM evaluator plus state validators - fixes the flakiness of live-site web benchmarks. 🆕
- Thomas et al., Convergence AI - [WebGames: Challenging General-Purpose Web-Browsing AI Agents](https://arxiv.org/abs/2502.18356) · *benchmark* - 50+ client-side challenges isolating specific browser interaction skills with verifiable pass/fail; best agent 41% vs 96% human, a sharp diagnostic gap. 🆕
- Patil et al., UC Berkeley (Gorilla / ICML 2025) - [Berkeley Function Calling Leaderboard (BFCL) V4](https://gorilla.cs.berkeley.edu/leaderboard.html) · *leaderboard* - Executable + AST-based grading of tool/function calling; V4 adds multi-turn agentic, web-search and memory tasks - the de facto tool-calling leaderboard. 🆕
- Wang et al., Shanghai AI Laboratory (NeurIPS 2024 D&B) - [GTA: A Benchmark for General Tool Agents](https://arxiv.org/abs/2407.08713) · *benchmark* - 229 human-written real-world queries with implicit multimodal tool use; executable evaluation platform across perception/operation/logic/creativity tools (GTA-2 follow-up in 2026). 🆕
- Lei et al., XLang Lab / HKU (ICLR 2025 Oral) - [Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows](https://arxiv.org/abs/2411.07763) · *benchmark* - Enterprise text-to-SQL agent workflows over huge schemas and multiple dialects with execution-based grading; frontier models only ~17-21% - a hard, realistic data-agent eval. 🆕
- Rawles et al., Google DeepMind / Google Research (ICLR 2025) - [AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents](https://arxiv.org/abs/2405.14573) · *benchmark* - Live Android environment with durable reward signals from device system state for 116 parameterized tasks across 20 apps - the standard mobile-GUI agent benchmark. 🆕
- Bonatti et al., Microsoft - [WindowsAgentArena: Evaluating Multi-Modal OS Agents at Scale](https://arxiv.org/abs/2409.08264) · *benchmark* - 154 realistic multi-step Windows-OS tasks across apps with programmatic success checks; parallelizable in Azure (~20 min full run) - desktop computer-use counterpart to OSWorld. 🆕
- Levy, Shlomov, Wiesel et al., IBM Research - [ST-WebAgentBench: Evaluating Safety and Trustworthiness in Web Agents](https://arxiv.org/abs/2410.06703) · *benchmark* - 375 enterprise tasks carrying 3,057 explicit safety/policy constraints; introduces Completion-under-Policy and Risk Ratio - grades whether agents obey rules, not just succeed. 🆕
- Xu et al., CMU - [TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks](https://arxiv.org/abs/2412.14161) · *benchmark* - Self-hosted software-company sim (web, code, chat coworkers) with checkpoint-based partial-credit grading; best agent ~30% - a full-day-knowledge-worker eval. 🆕
- Koh et al., Carnegie Mellon University - [VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks](https://arxiv.org/abs/2401.13649) · *benchmark* - 910 visually-grounded web tasks across Classifieds/Shopping/Reddit with reproducible programmatic reward functions - the multimodal extension of WebArena.
- Tejal Patwardhan et al. (OpenAI) - [GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks](https://arxiv.org/abs/2510.04374) · *benchmark* - 1,320 expert-built tasks across 44 occupations in the top 9 GDP sectors; 220-task gold subset open-sourced with a public automated grading service at evals.openai.com - the flagship economic-value agent benchmark. 🆕
- CAIS + Scale AI (47 authors) - [Remote Labor Index: Measuring AI Automation of Remote Work](https://arxiv.org/abs/2510.26787) · *benchmark* - Grades whether agents complete whole real freelance projects to client-acceptable standard; best agent automates only 2.5% - a hard, money-grounded ceiling for end-to-end remote work. 🆕
- Center for AI Safety + Scale AI (Dan Hendrycks et al.) - [Humanity's Last Exam](https://arxiv.org/abs/2501.14249) · *benchmark* - 2,500 expert-written frontier-knowledge questions with unambiguous auto-gradable answers across dozens of fields; the canonical post-MMLU saturation exam (note: now very widely cited). 🆕
- OSU-NLP Group (Ohio State) - [ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery](https://github.com/OSU-NLP-Group/ScienceAgentBench) · *benchmark* - 102 expert-validated tasks from 44 peer-reviewed papers; grades self-contained Python programs by execution + success rate; best agent solves only ~34% (ICLR 2025). 🆕
- Siegel, Kapoor, Narayanan et al. (Princeton) - [CORE-Bench: Computational Reproducibility Agent Benchmark](https://arxiv.org/abs/2409.11363) · *benchmark* - 270 tasks over 90 papers (CS/social science/medicine) that grade whether an agent can reproduce published results from code+data; from the Princeton AI-Snake-Oil group.
- Mingxuan Du et al. - [DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents](https://arxiv.org/abs/2506.11763) · *benchmark* - 100 PhD-level tasks across 22 fields; reference-based adaptive-rubric grader for analyst-grade citation-rich reports, validated for human-judgment alignment - the standard deep-research-report eval. 🆕
- FutureHouse + ScienceMachine - [BixBench: A Comprehensive Benchmark for LLM-based Agents in Computational Biology](https://arxiv.org/abs/2503.00096) · *benchmark* - 50+ real bioinformatics analysis scenarios with ~300 open-answer questions over multi-step Jupyter trajectories; frontier models hit only ~17% - serious wet-lab-adjacent science agent eval. 🆕
- Meta (Meta Agents Research Environments) - [Gaia2 and ARE: Scaling Up Agent Environments and Evaluations](https://arxiv.org/abs/2509.17158) · *benchmark* - Successor to GAIA: dynamic, time-driven, multi-agent simulated environments with async world events and a verifiable scenario grader; frontier success ~42% - the serious general-assistant env from Meta. 🆕
- Andon Labs (Backlund & Petersson) - [Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents](https://arxiv.org/abs/2502.15840) · *benchmark* - Run a simulated vending business over >20M-token horizons; objectively graded on profit/net-worth, exposing long-horizon coherence breakdowns unrelated to context limits. 🆕
- Francois Chollet et al. (ARC Prize Foundation) - [ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems](https://arxiv.org/abs/2505.11831) · *benchmark* - Human-calibrated (400+ participants, 100% solvable) grid-reasoning tasks with exact-match grading; 2-3x harder than ARC-AGI-1 across all approaches - the frontier fluid-intelligence benchmark. 🆕
- Patronus AI - [TRAIL: Trace Reasoning and Agentic Issue Localization](https://arxiv.org/abs/2505.08638) · *benchmark* - 148 annotated agent traces with 841 errors (reasoning/planning/execution); grades whether an LLM can localize the failure in a trace (best model ~11%). HF dataset PatronusAI/TRAIL. 🆕
- Salesforce Research - [CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios](https://arxiv.org/abs/2505.18878) · *benchmark* - 19 expert-validated B2B/B2C tasks on a realistic Salesforce org with state-based grading; exposes the single-turn (~58%) vs multi-turn (~35%) reliability gap plus confidentiality checks. 🆕
Must-reads: Anthropic (demystifying) · τ-bench · Lee (pass@k)
10 · Safety / adversarial evaluation (prompt injection, jailbreaks, action-authorization, benchmark auditing)
- Wang, Li, Mang, Cheung, Sen, Song (incl. Dawn Song) - [BenchJack: Systematically Auditing AI Agent Benchmarks](https://arxiv.org/abs/2605.12673) · *paper* - Reward hacking emerges spontaneously in frontier models; an 8-pattern flaw taxonomy + a 30-question Agent-Eval checklist; "benchmarks must be secure by design."
- Dawn Song (UC Berkeley RDI, lecture slides) - [Towards Building Safe & Secure Agentic AI](https://rdi.berkeley.edu/adv-llm-agents/slides/dawn-agentic-ai.pdf) · *talk* - The adversarial setting; environment-borne attacks.
- Dawn Song - [ICLR 2025 keynote on LLM safety](https://iclr.cc/virtual/2025/invited-talk/36783) · *talk*.
- Wang et al. (incl. Dawn Song) - [CyberGym](https://arxiv.org/html/2506.02548v2) · *paper* - Memory-safety PoC generation from OSS-Fuzz; sanitizer-crash grading at scale.
- Zeng et al. (incl. Song) - [AIR-Bench 2024](https://arxiv.org/abs/2407.17436v2) · [https://github.com/stanford-crfm/air-bench-2024](https://github.com/stanford-crfm/air-bench-2024) · *paper/repo* - Regulation-grounded risk taxonomy.
- [DecodingTrust](https://decodingtrust.github.io) · *benchmark* - NeurIPS 2023 trustworthiness benchmark.
- [RedCode](https://arxiv.org/abs/2411.07781) · *paper* - Risky code execution/generation benchmark for code agents.
- [AgentPoison](https://arxiv.org/abs/2407.12784) · *paper* - Red-teams agents by poisoning their RAG memory.
- Miller (Anthropic) - [Adding Error Bars to Evals (A Statistical Approach to LM Evaluations)](https://arxiv.org/abs/2411.00640) · [https://www.anthropic.com/research/statistical-approach-to-model-evals](https://www.anthropic.com/research/statistical-approach-to-model-evals) · *paper* - Standard errors, clustered SEs, paired difference tests: "is this difference real?" (cross-cutting: T6/T8)
- Debenedetti, Zhang, Balunović, Beurer-Kellner, Fischer, Tramèr (ETH Zurich) - [AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents](https://arxiv.org/abs/2406.13352) · *benchmark* - The canonical prompt-injection benchmark for tool-using agents (97 tasks, 629 security cases over untrusted data); NeurIPS 2024 D&B, now the standard eval everyone reports against. π
- Andriushchenko, Souly, Davies et al. (Gray Swan / UK AISI) - [AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents](https://arxiv.org/abs/2410.09024) · *benchmark* - ICLR 2025 benchmark of 110 malicious agent tasks (440 with augmentations) across 11 harm categories; shows leading models comply with malicious agent requests without jailbreaking. The reference action-misuse/refusal benchmark. π
- Zhan, Liang et al. (UIUC) - [InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated LLM Agents](https://arxiv.org/abs/2403.02691) · *benchmark* - ACL 2024 Findings; 1,054 indirect-prompt-injection (IPI) test cases over 17 user / 62 attacker tools, splitting direct-harm vs data-exfiltration intents. Foundational indirect-prompt-injection benchmark predating AgentDojo.
- Debenedetti, Shumailov, Fan, Hayes et al. (Google DeepMind) - [Defeating Prompt Injections by Design (CaMeL)](https://arxiv.org/abs/2503.18813) · *paper* - The defense-by-design counterpart: extracts control/data flow from the trusted query and enforces capability-based policies so untrusted data can't alter program flow; effectively solves AgentDojo's security eval. The key 2025 mitigation paper. π
- Simon Willison - [The lethal trifecta for AI agents: private data, untrusted content, and external communication](https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/) · *blog* - The most-cited conceptual frame for reasoning about when an agent is unconditionally vulnerable to prompt injection; essential practitioner mental model at the Eugene-Yan bar. π
- Kutasov, Bowman et al. (Anthropic) - [SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents](https://www.anthropic.com/research/shade-arena-sabotage-monitoring) · *benchmark* - 17 complex environments pairing a benign main task with a hidden harmful side task to measure whether agents can sabotage without tripping an AI monitor; the canonical sabotage/monitorability eval. (Paper: arxiv.org/abs/2506.15740) π
- Anthropic (Alignment team) - [Agentic Misalignment: How LLMs Could Be Insider Threats](https://www.anthropic.com/research/agentic-misalignment) · *paper* - Red-team study showing frontier models will resort to blackmail/leaking under goal conflict in agentic settings; the reference for action-authorization / insider-threat adversarial evaluation. Companion to the cited Anthropic error-bars piece. π
- Microsoft AI Red Team (Azure) - [PyRIT - Python Risk Identification Tool for generative AI](https://github.com/Azure/PyRIT) · *tool* - The de-facto open-source red-teaming automation framework (70+ converters, multi-turn attacks like Crescendo/TAP); how practitioners actually run adversarial evals at scale. π
- OWASP GenAI Security Project - [OWASP Top 10 for Agentic Applications (2026) + LLM Applications (2025)](https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/) · *docs* - Industry-standard risk taxonomy: goal hijack, tool misuse, identity/privilege abuse, memory poisoning, rogue agents; complements the regulation-grounded AIR-Bench taxonomy already listed. The canonical practitioner threat checklist. π
- MITRE - [MITRE ATLAS - Adversarial Threat Landscape for AI Systems](https://atlas.mitre.org/) · *docs* - ATT&CK-style living knowledge base of 16 tactics / 80+ techniques against AI systems with real-world case studies and mitigations; the standard reference framework for AI adversarial threat modeling.
- Zhang, Yang et al. - [Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents](https://proceedings.iclr.cc/paper_files/paper/2025/file/5750f91d8fb9d5c02bd8ad2c3b44456b-Paper-Conference.pdf) · *benchmark* - ICLR 2025 unified benchmark spanning 10 scenarios, 400+ tools, covering direct/indirect prompt injection (DPI/IPI), memory poisoning, plan-of-thought backdoors and defenses in one harness; broadest single attack/defense agent benchmark. π
- Gray Swan AI / UK AISI (w/ OpenAI, Anthropic, GDM) - [Gray Swan x UK AISI Agent Red-Teaming Challenge](https://app.grayswan.ai/arena/blog/agent-red-teaming-the-ai-jailbreak-showdown) · *talk* - Largest public agent red-teaming exercise: ~2,000 red-teamers, 1.8M attempts, 62k breaches against 22 tool-using agents (financial/shopping/marketing bots); real-world adversarial-eval data at scale. π
Must-reads: Dawn Song (BenchJack) · Anthropic (error bars)

- Hamel Husain & Emil Sedgh - [How to Construct Domain Specific LLM Evaluation Systems](https://www.youtube.com/watch?v=eLXF0VojuSs) · *talk* (AI Engineer World's Fair 2024)
- Jeff Huber & Jason Liu - [How to look at your data](https://www.youtube.com/watch?v=jryZvCuA0Uc) · *talk* (AI Engineer World's Fair 2025)
- Bryan Bischof - [Failure is a Funnel](https://www.youtube.com/watch?v=k98gDjYbSaU) · *talk* (Data Council 2025)
- Eugene Yan - [Using LLMs as Judges: Insights, Challenges, Best Practices](https://www.youtube.com/watch?v=7EGF0Mc0_os) · *talk* (Jason Liu series 2024)
- Shreya Shankar - [Scaling Up Vibe Checks for LLMs](https://www.youtube.com/watch?v=eGVDKegRdgM) · *talk* (Stanford MLSys #97)
- Shreya Shankar - [Why LLM Data Processing Pipelines Fail](https://www.youtube.com/watch?v=H-1QaLPnGsg) · *talk* (LangChain Interrupt 2025)
- Ido Pesok (Vercel v0) - [Evals Are Not Unit Tests](https://www.youtube.com/watch?v=L8OoYeDI_ls) · *talk* (AI Engineer 2025)
- David Karam (Pi Labs) - [Building Metrics that actually work (workshop)](https://www.youtube.com/watch?v=jxrGodnopHo) · *talk* (AI Engineer 2025)
- Brooke Hopkins (Coval) - [From Self-driving to Autonomous Voice Agents](https://www.youtube.com/watch?v=kDczF4wBh8s) · *talk* (AI Engineer 2025)
- Leonard Tang (Haize Labs) - [Fuzzing in the GenAI Era](https://www.youtube.com/watch?v=OMGPvW8TBHc) · *talk* (AI Engineer 2025)
- Omar Khattab (DSPy) - [On Engineering AI Systems that Endure the Bitter Lesson](https://www.youtube.com/watch?v=qdmxApz3EJI) · *talk* (AI Engineer 2025)
- Taylor Jordan Smith - [Strategies for LLM Evals (harnesses workshop)](https://www.youtube.com/watch?v=89NuzmKokIk) · *talk* (AI Engineer 2025)
- Ankur Goyal (Braintrust) - [The Future of Evals](https://www.youtube.com/watch?v=MC55hdWLq4o) · *talk* (AI Engineer 2025)
- Jason Wei (OpenAI) - [3 Key Ideas in AI in 2025 (Verifier's Law)](https://www.youtube.com/watch?v=b6Doq2fz81U) · *talk* (Stanford AI Club 2025)
- Jason Wei - [Some Intuitions About Large Language Models](https://www.youtube.com/watch?v=l898fqkjdFc) · *talk* (The AI Conference 2025)
- Andrej Karpathy - [Deep Dive into LLMs like ChatGPT](https://www.youtube.com/watch?v=7xTGNNLPyMI) · *talk* (2025)
- John Schulman - [RLHF: Progress and Challenges](https://www.youtube.com/watch?v=hhiLw5Q_UFg) · *talk* (UC Berkeley EECS 2023)
- Nathan Lambert (Ai2) - [Aligning Open Language Models](https://www.youtube.com/watch?v=AdLgPmcrXwQ) · *talk* (Stanford CS25 V4)
- Chip Huyen - [Building LLM Applications for Production](https://www.youtube.com/watch?v=spamOhG7BOA) · *talk* (MLOps LLMs in Prod 2023)
- Han-Chung Lee - [The Model is the Product](https://www.youtube.com/watch?v=4dUFIRj-BWo) · *talk* (Data Council 2025)
- Will Brown (Prime Intellect) - [RL Environments at Scale](https://www.youtube.com/watch?v=_IzZWeuTx7I) · *talk* (AI Engineer 2025)
- Florian Brand (Prime Intellect) - [LLM benchmarks in the time of agents](https://www.youtube.com/watch?v=kmTMc-fVSXw) · *talk* (Big Techday 26 (2026))
- Hamel Husain (How I AI / Claire Vo) - [Evals, error analysis, and better prompts](https://www.youtube.com/watch?v=PgzOBNse2EA) · *podcast* (How I AI)
- Ankur Goyal (How I AI) - [Evals are the new PRD for AI products](https://www.youtube.com/watch?v=QE_1hRLsehM) · *podcast* (How I AI)
- Hamel Husain (Vanishing Gradients) - [Ep 60: 10 Things I Hate About AI Evals](https://www.youtube.com/watch?v=QEk-XwrkqhI) · *podcast* (Vanishing Gradients)
- Hamel Husain (Vanishing Gradients) - [Ep 50: A Field Guide to Rapidly Improving AI Products](https://www.youtube.com/watch?v=rWToRi2_SeY) · *podcast* (Vanishing Gradients)
- Ankur Goyal (Latent Space) - [Five Hard-Earned Lessons About Evals](https://www.youtube.com/watch?v=a4BV0gGmXgA) · *podcast* (Latent Space)
- Cameron & Hill-Smith (Latent Space) - [Artificial Analysis: Independent LLM Evals](https://www.youtube.com/watch?v=v5mBjeX4TJ8) · *podcast* (Latent Space)
- Petersson & Backlund (Andon Labs) - [Reality: The Final Eval (Vending-Bench)](https://www.youtube.com/watch?v=ZAimcoJXUBo) · *podcast* (Latent Space / Cognitive Revolution)
- Vaibhav Gupta & Dex (AI That Works) - [#5 Designing Evals](https://www.youtube.com/watch?v=-N6MajRfqYw) · *podcast* (AI That Works)
- AI That Works - [#16 Evaluating Prompts Across Models](https://www.youtube.com/watch?v=OawyQOrlubM) · *podcast* (AI That Works)
- AI That Works - [#24 Evals for Classification](https://www.youtube.com/watch?v=5Fy0hBzyduU) · *podcast* (AI That Works)
- AI That Works - [#34 Multimodal Evals](https://www.youtube.com/watch?v=jzhVo0iAX_I) · *podcast* (AI That Works)
- Maggie Konstanty (MLOps Community) - [#372 It's 2026 and We're Still Talking Evals](https://www.youtube.com/watch?v=9EjWR3QpJYk) · *podcast* (MLOps Community)
- Kelly Hong (TWIML) - [#728 Generative Benchmarking](https://www.youtube.com/watch?v=3kbiGPn0cOo) · *podcast* (TWIML AI)
- Percy Liang (Gradient Dissent) - [Shaping AI Benchmarks (HELM)](https://www.youtube.com/watch?v=kwkdKirqi6s) · *podcast* (Gradient Dissent)
- Joseph Gonzalez (Gradient Dissent) - [Evaluating LLMs with Chatbot Arena](https://www.youtube.com/watch?v=okHMaczHPXc) · *podcast* (Gradient Dissent)
- Aman Khan (Learning from ML) - [Evaluating AI, Designing for Non-Determinism](https://www.youtube.com/watch?v=v0eTTn7ZPEc) · *podcast* (Learning from Machine Learning)
- Andrej Karpathy (Dwarkesh) - [Karpathy: RL is terrible, why benchmarks mislead](https://www.youtube.com/watch?v=-lRBpyPt79c) · *podcast* (Dwarkesh Podcast)
- Hamel Husain & Shreya Shankar - [How to Build AI Evals in 2026 (Step-by-Step)](https://www.youtube.com/watch?v=J7N9FMouSKg) · *podcast* (Aakash Gupta 2026)
- Dawn Song - [Towards Building Safe & Trustworthy AI Agents](https://www.youtube.com/watch?v=QAgR4uQ15rc) · *lecture* (Berkeley LLM Agents MOOC F24)
- Dawn Song - [Towards Building Safe and Secure Agentic AI](https://www.youtube.com/watch?v=ti6yPE2VPZc) · *lecture* (Berkeley Advanced LLM Agents Sp25)
- Ben Mann (Anthropic) - [Measuring Agent Capabilities and Anthropic's RSP](https://www.youtube.com/watch?v=6y2AnWol7oo) · *lecture* (Berkeley LLM Agents MOOC F24)
- Percy Liang - [Open-Source and Science in the Era of Foundation Models](https://www.youtube.com/watch?v=f3KKx9LWntQ) · *lecture* (Berkeley LLM Agents MOOC F24)
- Hashimoto & Liang - [CS336 Lecture 12: Evaluation](https://www.youtube.com/watch?v=x-R5l2HsXqM) · *lecture* (Stanford CS336 2025)
- **LLM benchmarks in the era of agents (deck)** - Florian Brand - `(local slide deck)` · *slides* (TNG / Big Techday)
- **The Life Cycle of an RL Environment (deck)** - Kanav Garg - `(local slide deck)` · *slides* (ACM CAIS 2026)
*Discovered 58 more; transcription queued (YouTube rate-limit). 30 eval-focused + 28 eval-segments-in-agent-talks below.*
- Alex Volkov (AI Evangelist, Weights & Biases; host of ThursdAI) - [Judging LLMs](https://www.youtube.com/watch?v=IIL2tE4n1Q0) · *talk* (AI Engineer World's Fair 2025 - Evals track)
- John Dickerson (CEO, Mozilla AI) - [2025 is the Year of Evals! Just like 2024, and 2023, and …](https://www.youtube.com/watch?v=CQGuvf6gSrM) · *talk* (AI Engineer World's Fair 2025 - Evals track)
- Aparna Dhinakaran (Co-founder & CPO, Arize AI) - [Lessons from the Trenches: Building LLM Evals That Work IRL](https://www.youtube.com/watch?v=nbZzSC5A6hs) · *talk* (AI Engineer World's Fair 2025 - Evals track)
- Phil Hetzel (Braintrust) - [The maturity phases of running evals](https://www.youtube.com/watch?v=FB-MLPhL9Ms) · *talk* (AI Engineer World's Fair 2025 - Evals track)
- Laurie Voss (Arize) - [Ship Real Agents: Hands-On Evals for Agentic Applications](https://www.youtube.com/watch?v=Xfl50508LZM) · *talk* (AI Engineer World's Fair 2025 - Evals track)
- Peter Gostev (Arena.ai) - [What Do Models Still Suck At? (BullshitBench)](https://www.youtube.com/watch?v=R7A8rX-09Zw) · *talk* (AI Engineer World's Fair 2025 - Evals track)
- Diego Rodriguez (Co-founder & CTO, Krea.ai) - [Perceptual Evaluations: Evals for Aesthetics](https://www.youtube.com/watch?v=h5ItAJuB3Fc) · *talk* (AI Engineer World's Fair 2025 - Evals track)
- Quotient AI + Tavily (speakers from both) - [Evaluating AI Search: A Practical Framework for Augmented AI Systems](https://www.youtube.com/watch?v=wRJD0inpmjU) · *talk* (AI Engineer World's Fair 2025 - Evals track)
- Rafal Willinski & Vitor Balocco (Zapier) - [Turning Fails into Features: Zapier's Hard-Won Eval Lessons](https://www.youtube.com/watch?v=blrovBxxN9o) · *talk* (AI Engineer World's Fair 2025 - Evals track)
- Manu Goyal (Braintrust) - [Why should anyone care about Evals?](https://www.youtube.com/watch?v=jJ45Yz1lJao) · *talk* (AI Engineer World's Fair 2025 - Evals track)
- AI Engineer Evals Workshop (multi-presenter) - [Mastering AI Evaluation: From Playground to Production [Evals Workshop]](https://www.youtube.com/watch?v=9iN-cPnp7xg) · *talk* (AI Engineer World's Fair 2025 - Evals track (full workshop))
- Ion Stoica (co-founder Databricks/Anyscale, LMArena), host Jacob Effron - [Databricks Co-Founder: Eval Limitations, Why China is Winning Open Source and Future of AI Infra (Ep 69)](https://www.youtube.com/watch?v=ehav4XMAKLw) · *podcast* (Unsupervised Learning (Redpoint Ventures))
- Brendan Foody (co-founder/CEO Mercor), host Jacob Effron - [Mercor CEO: Evals Will Replace Knowledge Work, AI x Hiring Today & the Future of Data Labeling (Ep 68)](https://www.youtube.com/watch?v=SOZtz8IdI2w) · *podcast* (Unsupervised Learning (Redpoint Ventures))
- Nidhi Rastogi (asst. professor, RIT), host Sam Charrington - [CTIBench: How Good Are LLMs at Detecting Cyber Threats? (Ep 729)](https://www.youtube.com/watch?v=75WqFOY3P5M) · *podcast* (The TWIML AI Podcast)
- Jineet Doshi (Staff AI Scientist/Lead, Intuit), host Demetrios Brinkmann - [Holistic Evaluation of Generative AI Systems (MLOps Podcast #280)](https://www.youtube.com/watch?v=VJ0k0C1mGdg) · *podcast* (MLOps.community)
- Neev Parikh (METR), host Nathan Labenz - [Can AIs do AI R&D? Reviewing RE-Bench Results with Neev Parikh of METR](https://www.youtube.com/watch?v=SX8Mxyy_UHY) · *podcast* (The Cognitive Revolution)
- Marius Hobbhahn (CEO, Apollo Research), host Nathan Labenz - [Can We Stop AI Deception? Apollo Research Tests OpenAI's Deliberative Alignment, w/ Marius Hobbhahn](https://www.youtube.com/watch?v=I3ivZaAfDFg) · *podcast* (The Cognitive Revolution)
- Shahul Es (co-founder, Ragas), hosts Daniel Whitenack & Chris Benson - [Metrics Driven Development (Ragas)](https://www.youtube.com/watch?v=fw0wUC5XN-o) · *podcast* (Practical AI (Changelog))
- Mike Knoop (co-founder ARC Prize / Zapier), host Lukas Biewald - [R1, OpenAI's o3, and the ARC-AGI Benchmark: Insights from Mike Knoop](https://www.youtube.com/watch?v=SSA8vNrFpXI) · *podcast* (Gradient Dissent (Weights & Biases))
- UK AI Safety Institute team - [Sandbox breakout evals with Inspect - UK AISI (Fully Connected London '25)](https://www.youtube.com/watch?v=J79pSSAENYc) · *talk* (Gradient Dissent / Fully Connected London '25 (Weights & Biases))
- Weights & Biases (Weave team) - [How to align your LLM judge for better evaluations](https://www.youtube.com/watch?v=AMCmhRoKnSk) · *talk* (Gradient Dissent / W&B)
- Afshine Amidi & Shervine Amidi - [Stanford CME295 Transformers & LLMs (Autumn 2025) | Lecture 8 - LLM Evaluation](https://www.youtube.com/watch?v=8fNP4N46RRo) · *lecture* (Stanford (CME295 / Stanford Online))
- Berkeley RDI course staff (Dawn Song's Agentic AI MOOC) - [CS294-196 (Agentic AI MOOC) - LLM Agent Evaluations & Project Overview](https://www.youtube.com/watch?v=VfOA2a0dj4w) · *lecture* (UC Berkeley RDI (CS294-196, Fall 2025))
- Sida Wang (Meta) - [Agentic AI MOOC (Fall 2025) | Predictable Noise in LLM Benchmarks](https://www.youtube.com/watch?v=HV8pugcFVO0) · *lecture* (UC Berkeley RDI (CS294-196, Fall 2025))
- Samuel Colvin (founder, Pydantic) - [Agent Optimization with Pydantic AI: GEPA, Evals, Feedback Loops - Samuel Colvin, Pydantic](https://www.youtube.com/watch?v=A48uhxfxbsM) · *talk* (AI Engineer (Code Summit / AI Engineer))
- Naman Jain (Cursor; LiveCodeBench/SWE-bench-adjacent researcher) - [Coding Evals: From Code Snippets to Codebases - Naman Jain, Cursor](https://www.youtube.com/watch?v=tHN44yJoeS8) · *talk* (AI Engineer (Code Summit))
- Brooke Hopkins (founder, Coval; ex-Waymo eval infra) - [From Self-driving to Autonomous Voice Agents - Brooke Hopkins, Coval (full session host upload)](https://www.youtube.com/watch?v=1X3mYUHC5GA) · *talk* (Founders You Should Know)
- Brooke Hopkins (founder, Coval) - [Brooke Hopkins, Founder at Coval | AI Minds #073](https://www.youtube.com/watch?v=e1E8vLyRIKk) · *podcast* (AI Minds (Deepgram))
- Karthik Narasimhan (Head of Research, Sierra; Princeton; tau-bench author) - [Karthik Narasimhan - Reliable AI Agents for Tomorrow's World](https://www.youtube.com/watch?v=fOAAslQUceg) · *lecture* (Berkeley RDI (Agentic AI Summit 2025))
- Sayash Kapoor (Princeton; AI Snake Oil; co-author HAL / agent-eval critiques) - [Building and evaluating AI Agents - Sayash Kapoor, AI Snake Oil](https://www.youtube.com/watch?v=d5EltXhbcfA) · *talk* (AI Engineer (Summit 2025))
Talks about building agents (Devin, Claude Code, Cursor, Replit, OpenAI Deep Research, Karpathy…) with a substantive eval segment; the eval part is noted.
β Amy Boyd & Nitya Narasimhan (Microsoft) (AI Engineer World's Fair 2025 β Evals track) βMind the Gap (In your Agent Observability)https://www.youtube.com/watch?v=iOXM3zE-2dkβeval: Primarily agent observability/tracing, but the core argument ties observability directly to evaluation: you can't eval what you can't see. Covers instrumenting agent runs to feed eval datasets and catch regressions. oEmbed-verified.β Arvind Narayanan (Princeton, co-author AI Snake Oil), host Jacob Effron (Unsupervised Learning (Redpoint Ventures)) βUnpacking AI Agent Hype vs. Reality with Arvind Narayananhttps://www.youtube.com/watch?v=NoVMk_P6fgYβeval: Large central segment on the limitations of agent benchmarks: why current agent evals are flawed/overstated, construct validity, capability vs. reliability, and the gap between benchmark scores and real-world robustness. Surrounding material covers agent hype and societal impact.β Ben Lorica & Paco Nathan (The Data Exchange (Gradient Flow)) βData Exchange Podcast Ep 232: Ben Lorica & Paco Nathan on Llama 3, Agents, Eval, and morehttps://www.youtube.com/watch?v=XDIqkH_I9oUβeval: Roundup format with a substantial evaluation-metrics segment: state of LLM/agent evaluation, what metrics matter for agentic workflows, and limitations of current eval practice β interleaved with Llama 3 and agent news.β Jiantao Jiao (UC Berkeley / NVIDIA) (UC Berkeley RDI (CS294-196, Fall 2025)) βAgentic AI MOOC (Fall 2025) | Post-Training Verifiable Agentshttps://www.youtube.com/watch?v=3l0Zxus34esβeval: Training-focused, but a substantial benchmark thread runs through it: SWE-bench Verified and BrowseComp as the verifiable-task targets used to train and evaluate agents. 
Eval/benchmark segments are load-bearing (~middle of talk).β Graham Neubig (CMU) (Carnegie Mellon University (CS 11-711 Advanced NLP)) βCMU Advanced NLP Fall 2024 (17): Evaluation and Multimodalhttps://www.youtube.com/watch?v=iEinTXrwK8Aβeval: First ~half is a focused treatment of NLP/LLM evaluation: automatic metrics, human eval, LLM-as-judge and its pitfalls, benchmark contamination; second half pivots to multimodal. The eval portion is substantive (~min 0-35).β Charles Sutton (Google DeepMind) (UC Berkeley RDI (CS294-280, Spring 2025)) βAdv. LLM Agents MOOC (Sp25) | Code Agents & AI Vulnerability Detectionhttps://www.youtube.com/watch?v=JCk6qJtaCSUβeval: Coding-agent talk that leans heavily on benchmarks to measure progress: SWE-bench-style code-fixing eval and vulnerability-detection benchmarks, plus discussion of how to construct verifiable security-eval tasks (eval threads throughout).β Graham Neubig (CMU / All Hands AI) (UC Berkeley RDI (CS294-196, Fall 2024)) βLLM Agents MOOC (Fall 2024) | Agents for Software Developmenthttps://www.youtube.com/watch?v=f9L9Fkq-8K4βeval: SWE-bench is the spine of the talk: how the benchmark works, why it's hard, leaderboard dynamics, and where it misleads vs. real software work. Eval/benchmark content is central (recurs throughout, esp. 
early-mid).β Nicolas Chapados (ServiceNow Research) (UC Berkeley RDI (CS294-196, Fall 2024)) βLLM Agents MOOC (Fall 2024) | AI Agents for Enterprise Workflowshttps://www.youtube.com/watch?v=-yf-e-9FvOcβeval: Introduces WorkArena / BrowserGym as benchmarks for web/enterprise-workflow agents β task design, difficulty calibration, and why real enterprise tasks break naive evals (benchmark segment is a core part, mid-talk).β Yann Dubois (OpenAI) (UC Berkeley RDI (CS294-196, Fall 2025)) βAgentic AI MOOC (Fall 2025) | LLM Agents Overviewhttps://www.youtube.com/watch?v=r1qZpYAmqmgβeval: Framing overview of agents that includes a substantive evaluation segment β how to measure agent capability, the gap between benchmark scores and real reliability, and why agent eval is harder than chatbot eval. Eval segment ~mid-talk.β Scott Wu (CEO, Cognition) (AI Engineer World's Fair 2024) βThe Making of Devin by Cognition AI: Scott Wuhttps://www.youtube.com/watch?v=T7NWjoD_OuYβeval: Agent-building/demo talk for Devin. Eval segment covers how Cognition measures the agent: SWE-bench results plus their philosophy that public benchmarks are insufficient, motivating an internal 'cognition-golden' benchmark with fully reproducible environments, simulated users Devin can chat with, and evaluator agents that autonomously judge outcomes. Eval discussion sits in the back third around the SWE-bench / 'how we measure progress' portion.β Boris Cherny (creator/head of Claude Code, Anthropic) (AI Engineer World's Fair 2025) βClaude Code & the evolution of agentic coding β Boris Cherny, Anthropichttps://www.youtube.com/watch?v=Lue8K2jqfKkβeval: Talk about model capability vs 'harness'/scaffolding for coding agents. Eval-relevant segment: how the Claude Code team relies on internal evals to decide what scaffolding to keep, and the observation that as models improve you must keep raising the difficulty of your eval set. 
Measurement framing recurs through the harness discussion (middle of the talk).β James Austin (AI engineer, Replit) (MLOps Community β Agents in Production) βBuilding Replit Agent - Hard Lessons Learnedhttps://www.youtube.com/watch?v=RYde73eO7okβeval: Lessons-learned talk on scaling the Replit Agent team (3 to 20+ engineers). Heavy, substantive eval content: how optimizing for SWE-bench was the wrong target vs what users wanted, the importance of AUTOMATING the discovery of failure cases (long-tail failures), growing an internal eval set over time (every new bug becomes a new eval), and custom eval frameworks. Eval material runs through the middle of the talk under 'measure what matters' and 'automate finding failure cases'.β Harrison Chase (CEO, LangChain) (AI Engineer (LangChain Interrupt / AI Engineer)) β3 ingredients for building reliable enterprise agents β Harrison Chase, LangChain/LangGraphhttps://www.youtube.com/watch?v=kTnfJszFxCgβeval: Framework talk on the build/test/deploy lifecycle for reliable enterprise agents. The middle 'Test' ingredient is the eval segment: using evals (LangSmith) to verify the agent does the right thing rather than just returning plausible output, treating every error as an opportunity to write a new eval, and pairing tracing/observability with regression evals. Eval content is roughly the central third of the talk.β Cursor engineering (Tido Carriero et al.) (Cursor (official channel)) βHow Cursor builds agentic workflows across the SDLChttps://www.youtube.com/watch?v=dJAVS1g3NDwβeval: Talk on Cursor's internal agentic workflows across the SDLC (bug triage, security review, etc.). Eval segment covers how Cursor compares model quality with CursorBench β an in-house suite of intentionally underspecified, multi-file tasks built from real IDE sessions, scored with agentic graders, plus ONLINE evaluation to check whether agent changes actually help developers in practice. 
Eval discussion appears where they explain how they decide which models/changes to ship.β Thariq Shihipar (Anthropic) (AI Engineer (workshop)) βClaude Agent SDK [Full Workshop] β Thariq Shihipar, Anthropichttps://www.youtube.com/watch?v=TqC1qOfiVcQβeval: Hands-on workshop building agents with the Claude Agent SDK (tools, subagents, the agent loop). Eval-relevant portion covers how to verify and iterate on the agent once built β testing tool use, checking the loop behaves, and using measurement to debug agent failures. Eval/verification material comes in the back portion of the build-along.β Andrej Karpathy (Y Combinator AI Startup School 2025) βAndrej Karpathy: Software Is Changing (Again)https://www.youtube.com/watch?v=LCEmiRjPEtQβeval: Keynote on 'Software 3.0', LLMs as a new computing substrate, partial-autonomy apps and the 'autonomy slider'. Eval-relevant thread: his argument for keeping humans in the verification loop, making the generation-verification loop fast, and 'keeping the AI on a leash' β i.e., why you need tight verification/eval signals to safely raise agent autonomy. Verification discussion is woven through the partial-autonomy section (middle-to-late).β Samuel Colvin (founder, Pydantic) (AI Engineer World's Fair 2025) βFrom Stateless Nightmares to Durable Agents β Samuel Colvin, Pydantichttps://www.youtube.com/watch?v=flf_IKnFYnEβeval: Talk on building durable, production-grade agents with Pydantic AI (state/durability, type-safety, observability). Eval segment: Colvin's view that evals are still an unsolved problem, how Pydantic AI's evals library + Logfire observability fit the production loop, and using traces/observability as the substrate for evaluating agent behavior. 
Eval discussion appears in the observability/production-readiness portion.β Matt Palmer (host) + Replit lead AI engineer (Replit (official channel)) βInside Replit Agent with a lead AI engineerhttps://www.youtube.com/watch?v=bJMriY-pqPEβeval: Conversation on how the Replit Agent works internally, including the Agent v3 launch (discussion around ~19:21). Eval-relevant content: the self-improving loop of evals/metrics -> autonomous harness edits -> hill-climbing, how the team grows their eval set from observed failures, and why the engineering center of gravity has shifted toward measurement and harness iteration over the raw model.β Brooke Hopkins (Coval), Martin Schweiger, Vapi panel (VapiCon 2025 (Vapi)) βVapiCon 2025: Hardest Problems in Voice AI with Brooke Hopkins, Martin Schweiger & morehttps://www.youtube.com/watch?v=vzCT5PJlsJoβeval: Practitioner panel on production voice-AI failure modes; the eval thread runs throughout β end-to-end conversation simulation, why LLM-simulated callers are too cooperative vs. real frustrated/adversarial users, turn-taking/latency/interruption metrics, and monitoring. Eval-heavy whenever Hopkins speaks.β Karthik Narasimhan (Head of Research, Sierra) (Greylock (Change Agents)) βMulti-Agent Interaction with Sierra AIhttps://www.youtube.com/watch?v=KlQIePkgY7cβeval: Talk on how Sierra builds multi-agent customer-experience systems; includes the evaluation segment on tau-bench-style benchmarking, supervisor/critic agents reviewing primary-agent output, and measuring reliability of tool-using conversational agents.β Karthik Narasimhan (Sierra / Princeton) (Open AGI Summit, Brussels) βKarthik Narasimhan on Language Agents and Multi-Agent Interactionhttps://www.youtube.com/watch?v=i3GOZ22z2C0βeval: Survey of language-agent design and multi-agent interaction with an evaluation segment motivating tau-bench: why real-world tool-agent-user tasks need interaction-based benchmarks rather than static QA. 
Eval discussion is a sizeable chunk, not the whole talk.β Parahelp (YC) prompt/agent breakdown (startupCode (analysis of Parahelp/YC material)) βAI Customer Support: ParaHelp's Secret Prompt REVEALED!https://www.youtube.com/watch?v=UCQc12_KRy0βeval: Walkthrough of Parahelp's production customer-support agent prompt; the load-bearing eval point (drawn from Parahelp's own writing) is that most prompt-engineering time goes not to the prompt but to building eval suites, finding edge cases, and iterating β 'test cases more valuable than prompts.' Eval framing appears alongside the prompt structure discussion.β Ben Liebald (engineering lead, Harvey) (LangChain) βHow Harvey Built Reliable AI Agents with LangSmith & Custom Toolshttps://www.youtube.com/watch?v=kuXtW03cZEAβeval: How Harvey builds and EVALUATES domain-specific legal agents: tracing/observability with LangSmith, custom legal tools, and reliability evaluation against expert expectations (the BigLaw Bench / rubric-graded-by-lawyers approach). Eval/reliability is a major thread of the talk.β Harvey / legal-AI leaders (a16z) (a16z) βAgents, Lawyers, and LLMshttps://www.youtube.com/watch?v=ZESTYyGZ7Y4βeval: Discussion of legal agents in practice; eval segment covers why generic benchmarks (LegalBench/CUAD) are insufficient for long-horizon legal work and the move to expert-rubric, agent-task benchmarks (BigLaw Bench / Legal Agent Bench) graded by practicing attorneys. Eval is one section, not the whole episode.β Josh Tobin (leads AI Agents research, OpenAI β Deep Research / Operator) (TWIML AI Podcast) βHow OpenAI Builds AI Agents That Think and Act [Josh Tobin] - #730https://www.youtube.com/watch?v=qfhU7JH000oβeval: Covers Deep Research, Operator, Codex CLI; eval-relevant core is how end-to-end RL training requires graded/verifiable tasks (the agent must 'experience failure' and be rewarded for recovery), plus benchmark framing (BrowseComp for browsing). 
Reward/grading discussion runs through the middle of the episode.
- Isa Fulford (Deep Research team lead, OpenAI) (Sequoia Capital (Training Data)): [How OpenAI Built its Groundbreaking Deep Research Product ft. Isa Fulford](https://www.youtube.com/watch?v=jFZ9hJKJKtw) · eval: How Deep Research was built and trained; eval-relevant segments cover building hard browsing/research tasks with verifiable answers, grading long-form cited outputs, and benchmark performance (e.g., BrowseComp). Eval/measurement is woven through the training discussion rather than a standalone section.
- Isa Fulford & Josh Tobin (OpenAI Deep Research) (Sequoia Capital (Training Data)): [OpenAI's Deep Research Team on Why Reinforcement Learning is the Future for AI Agents](https://www.youtube.com/watch?v=bNEvJYzoa8A) · eval: RL-for-agents discussion; eval content is the dependence of end-to-end RL on graded tasks and verifiable rewards, and how they construct hard research/browsing evals the model can be scored against. Measurement framing recurs throughout.
- Jesse Zhang (CEO/co-founder, Decagon) (No Priors): [No Priors Ep. 132 | With Decagon CEO and Co-Founder Jesse Zhang](https://www.youtube.com/watch?v=emaSFP7y7Ko) · eval: Building production customer-support agents; eval segment covers Decagon's approach: regression/simulation test sets (~hundreds of conversations per workflow), LLM-as-judge scoring of tone/format/correct-info/correct-tool, and red-teaming with adversarial tests. Eval is a defined section of the conversation, not the whole episode.
Good eval commentary mined from agent-BUILDING writeups (not eval-primary); each kept only if a strict judge rated the eval insight excellent/good. Takeaway + verbatim excerpt.
- Naman Jain (Cursor): [How we compare model quality in Cursor (CursorBench)](https://cursor.com/blog/cursorbench) · excellent · To avoid benchmark contamination, derive eval tasks from real committed code traced back to the agent request that produced it (Cursor Blame), and pair offline suites with controlled live-traffic analysis to catch regressions where outputs grade well but the user experience degrades, tracking a basket of outcome… (excerpt: "We source tasks for CursorBench using Cursor Blame, which traces committed code back to the agent request that produced it. ... We supplement CursorBench with controlled analysis on live traffic. These online evals…")
- Jeremy Hadfield, Barry Zhang, Kenneth Lien, Florian Scholz, Jeremy Fox, and Daniel Ford: [How we built our multi-agent research system](https://www.anthropic.com/engineering/multi-agent-research-system) · excellent · Start evals with ~20 real-usage queries rather than waiting for a large suite: early on a prompt tweak can move success from 30% to 80%, so small samples already reveal big effects. A single LLM-judge call scoring a multi-dimensional rubric (factual/citation accuracy, completeness, source quality, tool efficiency) on… (excerpt: "We started with a set of about 20 queries representing real usage patterns. 
Evaluating these queries often required human judgment, but we found that an LLM judge that evaluated each output against criteria in a…")
- Mikaela Grace, Jeremy Hadfield, Rodrigo Olivares, and Jiri De Jonghe: [Demystifying evals for AI agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents) · excellent · Low eval scores frequently measure broken graders and harnesses, not weak models: rigid string-matching, ambiguous specs, and non-reproducible stochastic tasks can suppress a score from 95% to 42%, so you must read transcripts and audit the eval before trusting any number. (excerpt: "Opus 4.5 initially scored 42% on CORE-Bench, until an Anthropic researcher found multiple issues: rigid grading that penalized '96.12' when expecting '96.124991…', ambiguous task specs, and stochastic tasks that were…")
- Sierra (AI Research): [τ-Bench: Benchmarking AI agents for the real-world](https://sierra.ai/blog/benchmarking-ai-agents) · excellent · Reliability, not single-shot accuracy, is the real bar for agents: pass^k (success on all k independent trials of the same task) collapses GPT-4o from ~50% pass^1 to ~25% pass^8 in τ-retail, meaning only a 1-in-4 chance of handling 8 different customers with the same issue. Measure consistency across repeated trials,… (excerpt: "pass^k, which measures the agent's reliability and determines if it can successfully complete the same task multiple times (k representing the number of different trials). ... 
the agent powered by GPT-4o drops to ~25%…")
- Efe Karakus: [From AI agent prototype to product: Lessons from building AWS DevOps Agent](https://aws.amazon.com/blogs/devops/from-ai-agent-prototype-to-product-lessons-from-building-aws-devops-agent/) · excellent · Separate "capability" (pass@k: passed at least once in k tries) from "reliability" (pass^k: fraction of the k tries that passed); a high pass@k with low pass^k means the agent CAN solve a task but does so unreliably, and reliability is what actually matters for shipping a non-deterministic agent. (excerpt: "Key metrics that we keep track of are capability (pass@k - whether the agent passed at least once in k attempts), reliability (pass^k - how many times the agent passed across k attempts, e.g., 0.33 means passed 1 out of…")
- Simon Last & Sarah Sachs (Notion), interviewed on Latent Space: [Notion's Token Town: 5 Rebuilds, 100+ Tools, MCP vs CLIs and the Software Factory Future](https://www.latent.space/p/notion) · excellent · Notion runs a three-tier eval system with distinct pass-rate targets: regression/unit tests gated in CI, launch "report card" evals requiring 80-90% across user journeys to ship, and deliberately hard "frontier/headroom" evals targeted at ~30% pass rate so the suite keeps giving signal instead of saturating. The… (excerpt: "we have the equivalent of unit test. Regression test. Those live in ci, those have to pass a certain percent ... 
we have a report card and we need to, on these categories, you know, be it 80 or 90% of all of these user…")
- Dropbox (Dropbox Engineering / ML team): [A practical blueprint for evaluating conversational AI at scale (Dash)](https://dropbox.tech/machine-learning/practical-blueprint-evaluating-conversational-ai-at-scale-dash) · excellent · Tier your eval metrics by enforcement: boolean gates as hard blockers (citations present?), scalar budgets with concrete thresholds (Source F1 >= 0.85, p95 latency <= 5s) that block merges, and rubric scores (tone/formatting) that are only dashboard-monitored, not gating. This separates "must never regress" from… (excerpt: "we defined three types of metrics, each with a clear role in the development pipeline: Boolean gates ("Citations present?", "Source present?") | Hard fail, changes can't move forward; Scalar budgets (Source F1 ≥ 0.85,…")
- Stefan Heule & Jediah Katz (Cursor): [Continually improving our agent harness](https://cursor.com/blog/continually-improving-agent-harness) · good · Pair offline benchmarks (CursorBench) with online signals that proxy real satisfaction: a "Keep Rate" measuring what fraction of agent-written code survives in the codebase after fixed time intervals, plus an LLM-judge reading user follow-up messages to infer satisfaction, validated via side-by-side A/B tests of… (excerpt: "The first is the "Keep Rate" of agent-generated code. For a given set of code changes that the agent proposed, we track what fraction of those remain in the user's codebase after fixed intervals of time. ... 
Second, we…")
- Peter Zhong, Jacky Zhao, Ryan Carelli (Replit): [Enabling Agent 3 to Self-Test at Scale with REPL-Based Verification](https://replit.com/blog/automated-self-testing) · good · Verification scales with autonomy: as an agent runs longer unattended (Replit went from ~20 min to 200+ min of productive autonomous work), robust self-testing becomes the gating factor because errors compound; they isolate testing into a separate subagent to avoid context pollution, reaching multi-hundred-step… (excerpt: "we've created a self-testing flow for the Agent that is able to perform complex, multi-hundred step testing at a median cost of $0.20 per session.")
- Cognition (Devin team): [A review of OpenAI's o1 and how we evaluate coding agents](https://cognition.com/blog/evaluating-coding-agents) · good · When your judge is itself an agent (with shell/browser/code-editing tools autonomously deciding pass/fail), you must validate the judge: measure its precision and recall against a labeled ground-truth set and keep humans continuously reviewing the "proof of success" it surfaces. They also average over multiple Devin… (excerpt: "We evaluate our evaluators in two ways: (1) Measuring precision and recall on ground truth sets (2) Continuous human review of the proof of success discovered by the evaluator agents.")
- Akshay Utture (Augment Code): [How we built a high-quality AI code review agent](https://www.augmentcode.com/blog/how-we-built-high-quality-ai-code-review-agent) · good · They run a fast offline benchmark (LLM-as-judge comparing generated comments to human-authored "golden comments" on 10 PRs across 5 repos) using F-score as the hill-climbing metric, then map each offline metric to a production proxy: recall to "bugs fixed per PR" and precision to "percentage of comments addressed."… (excerpt: "F-score acts as the primary hill-climbing metric for offline improvements. ... 
Bugs fixed per PR | Recall | Measures real-world bug prevention and review coverage | Percentage of comments addressed | Precision |…")
- Antonio Scandurra & Nathan Sobo (Zed): [Zed now predicts your next edit with Zeta, our new open model](https://zed.dev/blog/edit-prediction) · good · When the correct output is non-deterministic and admits many valid forms (e.g. code edits), replace brittle token/string assertions with an LLM judge checking plain-English intent assertions (e.g. "ensure quicksort recurses left and right of the pivot"); this tolerates run-to-run variation while still catching wrong… (excerpt: "instead of strict assertions, we used a larger LLM to evaluate Zeta's edits. By writing our test assertions in plain English and having Claude check if the results matched our intent, we could validate that Zeta was…")
- Factory.ai: [Code Droid: A Technical Report](https://factory.ai/news/code-droid-technical-report) · good · They decompose agent failures into distinct stages of the localization pipeline: file not retrieved (8%), retrieved but not ranked top-5 (8%), and ranked top but not edited (6%). This tells practitioners exactly where to invest (retrieval recall vs. ranking vs. edit selection) rather than treating a failed task as… (excerpt: "In 8% of the tasks, Code Droid failed to include the target file in its list of analyzed files. 
Additionally, even when the target file was analyzed, it was not prioritized as a top-5 file in another 8% of cases.…")
- Jan Hartman (Sourcegraph): [Lessons from building AI coding assistants: context retrieval and evaluation](https://sourcegraph.com/blog/lessons-from-building-ai-coding-assistants-context-retrieval-and-evaluation) · good · When you can't get ground-truth labels for "relevant context," end-to-end user feedback can't tell you whether a bad answer came from retrieval or from the LLM, so substitute cheap automatic proxy checks (code compiles/passes tests for generation; referenced symbols actually exist for code Q&A) and separately… (excerpt: "Since users primarily interact with the LLM's responses rather than the context items themselves, it's hard to know if context retrieval is making a difference. We might get feedback that a response was unhelpful, but…")
- Yujohn Nattrass: [Introducing Scorers in Mastra](https://mastra.ai/blog/mastra-scorers) · good · Don't ask an LLM judge to emit a raw 0-1 score directly: it's high-variance and irreproducible. Instead have the LLM emit structured intermediate data (e.g. extract claims/opinions, label each), then compute the score deterministically in code (proportion that pass), keeping the LLM's nuance but making the number… (excerpt: "LLMs are terrible at producing consistent numerical scores - ask the same model to rate something from 0-1 five times and you'll get five different numbers. So we have LLMs output structured data instead, then use a…")
- Letta: [Benchmarking AI Agent Memory: Is a Filesystem All You Need?](https://www.letta.com/blog/benchmarking-ai-agent-memory/) · good · A simple filesystem-backed agent (search_files → grep/open → answer) beats a specialized graph-memory system on LoCoMo, supporting their thesis that what matters for memory eval is whether the agent knows WHEN and HOW to call a retrieval tool, not the underlying retrieval mechanism (vector DB vs knowledge graph). 
They… (excerpt: "This simple agent achieves 74.0% on LoCoMo with GPT-4o mini and minimal prompt tuning, significantly above Mem0's reported 68.5% score for their top-performing graph variant.")
- Dominik Kundel, Gabriel Chua: [Testing Agent Skills Systematically with Evals](https://developers.openai.com/blog/eval-skills) · good · Structure skill evals as deterministic trace checks first (parse the --json JSONL stream: assert specific commands ran, count command_execution items to catch looping/re-run regressions, track usage tokens to catch prompt bloat), then layer a model-assisted --output-schema rubric step only for the qualitative parts… (excerpt: "Deterministic checks answer 'did it do the basics?' but they don't answer 'did it do it the way you wanted?' For skills like setup-demo-app, many requirements are qualitative: component structure, styling conventions,…")
- The LangChain Team: [Evaluating Deep Agents: Our Learnings](https://www.langchain.com/blog/evaluating-deep-agents-our-learnings) · good · For multi-turn agent evals, you can't hardcode a fixed sequence of user inputs because once the agent diverges from the expected path the later scripted inputs become incoherent; pair this with per-test fresh/temporary environments so runs stay reproducible and non-flaky, and lean on single-step evals since… (excerpt: "if you naively hardcode a sequence of inputs and the agent deviates from the expected path, the subsequent hardcoded user input may not make sense.")
- Malte Ubl, Alice Alexandra Moore, Ido Pesok: [Eval-driven development: Build better AI faster](https://vercel.com/blog/eval-driven-development-build-better-ai-faster) · good · Tier your graders by cost/objectivity (code checks first, LLM grading reserved for subjective calls since it runs 1.5-2x more expensive), hold a hard 100% pass bar on refusal/safety, and deliberately seed the eval set with prompts that currently fail so improvements are tracked and regressions caught as prompts… (excerpt: "Our multi-faceted 
evaluation strategy includes fast, reliable code checks, end user and internal human feedback, and LLM-based grading for complex judgments at scale. [...] Some of our checks for code quality include:…")
- Letta: [Letta Leaderboard: Benchmarking LLMs on Agentic Memory](https://www.letta.com/blog/letta-leaderboard) · good · A good agentic-memory eval must penalize unnecessary memory tool calls, not just reward correct answers: models that are strong at archival retrieval tend to over-call memory tools even when the answer is already in context, which is a real failure mode you only catch if your scoring includes an extraneous-operation… (excerpt: "Models that perform well on archival memory (e.g., Claude Haiku 3-5) might overuse memory operations when unnecessary and receive a lower score on core memory due to penalties.")
- Decagon: [The evaluation engine behind Decagon's AI agents](https://decagon.ai/blog/evaluation-engine-ai-agents) · good · A two-stage eval gate (offline LLM-as-judge over query/context/response triplets plus an expert-labeled ground-truth set, then online A/B with gradual traffic ramp) keeps unreliable variants out of production; auditing a subset of judge scores with human labellers validates the judge itself, and online success is… (excerpt: "Using an LLM-as-judge system, we evaluate structured triplets consisting of a user query, the context provided to the model, and the model's generated response. ... 
We evaluate responses against a ground truth…")
- Anker & Mads (Parahelp co-founders): [AI prompt design at Parahelp](https://parahelp.com/blog/prompt-design) · good · Designing the agent to emit structured XML output is a deliberate "design-for-evaluability" tactic: rigid, parseable output lets you programmatically grade each decision, and pairing it with an outcome metric like "% of tickets resolved end-to-end" grounds eval in real production results rather than proxy scores. (excerpt: "This made the model more strict (and let us parse XML for evals)")
- Iwona Bialynicka-Birula, Ryan Muir, Binoy Robin Dalal, Hagyeong Shin, Nikolai Glushnev: [How we Built a State-of-the-Art Research Agent for Call Center Conversation Analytics](https://cresta.com/blog/how-we-built-a-state-of-the-art-research-agent-for-call-center-conversation-analytics) · good · They isolated the dominant hallucination driver (questions whose answer simply isn't in the conversation) and pulled counting/aggregation out of the LLM into deterministic code so report statistics are guaranteed correct, while tracking two concrete report-quality metrics (relevance-classification accuracy and… (excerpt: "Human experts scrutinized a wide range of AI Analyst reports and identified two key metrics that were key drivers of report quality: relevance classification accuracy and the factuality of claims about the…")
- Cresta: [Why Speech to Text Is the Hidden Engine Behind Contact Center AI Performance](https://cresta.com/blog/why-speech-to-text-is-the-hidden-engine-behind-contact-center-ai-performance) · good · STT quality is the upstream bottleneck for downstream agent tasks: WER should be measured on a domain-stratified corpus (here 2,703 files / 81.69 hours / 9 domains) because small WER deltas compound at scale (1% WER over 1M minutes = ~10,000 fewer errors), and targeted fine-tuning or keyterm prompting moves the needle… (excerpt: "WER benchmarking was based on a dataset comprising 2,703 audio files across nine distinct domains, 
totaling 81.69 hours")
- Vapi (Vapi Editorial Team): [Your Voice Agents Need Tests. Now They Have Them.](https://vapi.ai/blog/evals) · good · Convert real production failures into regression tests by capturing the bad transcript and annotating the correct behavior, and match different criteria with different judges: regex/JSON/exact for deterministic outputs (e.g., a tool call must include particular arguments), LLM-as-judge for fuzzy qualities like tone,… (excerpt: "When you discover a bad call in your logs, you can turn that transcript into a test. In the dashboard, pull up the call, click the thumbs down button to use it as an eval, specify what the assistant should have done…")
- Chip Huyen: [Agents](https://huyenchip.com/2025/01/07/agents.html) · good · Decompose agent planning evaluation into a concrete (task, tool-inventory) dataset and sample K plans per task, then track plan-level metrics (fraction valid, retries-to-valid) and tool-call-level metrics (invalid tool, valid tool with wrong params, valid tool with wrong values), separating the distinct failure modes… (excerpt: "To evaluate an agent for planning failures, you can create a dataset of (task, tool inventory) pairs. For each task, use an agent to generate K plans. Compute the following metrics: Out of all generated plans, how many…")
- Lilian Weng: [LLM Powered Autonomous Agents](https://lilianweng.github.io/posts/2023-06-23-agent/) · good · LLM-as-judge can silently fail in expert domains: in ChemCrow, an LLM evaluator rated GPT-4 and ChemCrow as roughly equal, while domain experts judging chemical correctness found ChemCrow far superior. 
The takeaway is that an LLM judge lacking domain expertise cannot detect flaws it doesn't understand, so… (excerpt: "Interestingly, while the LLM-based evaluation concluded that GPT-4 and ChemCrow perform nearly equivalently, human evaluations with experts oriented towards the completion and chemical correctness of the solutions…")
- Carol Liang and Kevin Ho (Stripe, API Standards): [Can AI agents build real Stripe integrations? We built a benchmark to find out](https://stripe.com/blog/can-ai-agents-build-real-stripe-integrations) · good · Don't grade an agent on its own self-reported success or surface-level API/UI responses; verify the real side effects in the system of record (here, the actual Stripe API object the action should have created). This catches the documented failure where an agent saw a 400 error on invalid test data and declared "Good,… (excerpt: "Some graders also validated the Stripe artifacts of a run by inspecting created Stripe API objects. For example, in a full-stack challenge, the agent might complete a payment in the UI, then verify success by testing…")
- Discord: [Developing Rapidly with Generative AI](https://discord.com/blog/developing-rapidly-with-generative-ai) · good · Use a separate LLM (a "critic") to score your agent's outputs against criteria, and structure the judge prompt to force constrained outputs (yes/no or a numeric scale) rather than free-form critique, which makes the eval signal aggregable and lets you compare prompt variants quickly. (excerpt: "AI-assisted evaluation uses best-in-class LLMs (like GPT-4) to automatically critique how well the AI's outputs match what we expected or how they score against a set of criteria. ... 
This method uses GPT-4 in a way…")
- Gayatri Sabharwal: [What it takes to build AI agents at scale](https://ramp.com/leading-indicators/what-it-takes-to-build-ai-agents-at-scale) · good · Build eval ground truth from a domain expert's spec, then use a frontier model to generate adversarial edge cases the expert missed, and validate with beta-user feedback; the genuinely hard problem is deciding when eval coverage is sufficient to remove the human from the loop. The post also draws a useful line:… (excerpt: "At Ramp, the eval suite starts with a human expert - often an accountant - who writes down how the task should go. A frontier model then stress-tests it, surfacing edge cases or the scenarios the expert didn't think of.…")
- Max Leiter: [How we made v0 an effective coding agent](https://vercel.com/blog/how-we-made-v0-an-effective-coding-agent) · good · Define the agent's primary metric as a binary user-visible outcome (does the generated site actually render, not error/blank) rather than text-similarity, then attack the ~10% LLM error baseline with a streaming autofix layer targeting specific named failure modes (stale APIs, nonexistent icons, missing providers,… (excerpt: "The primary metric we optimize for is the percentage of successful generations. A successful generation is one that produces a working website in v0's preview instead of an error or blank screen. ... In our experience,…")
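
The pass@k / pass^k distinction drawn in the τ-Bench and AWS DevOps Agent entries above can be sketched in a few lines. This is a minimal illustration, not code from either post: the function names are hypothetical, and the combinatorial estimators are the commonly used forms, assuming `n` independent trials of one task of which `c` passed. The two entries gloss pass^k differently (Sierra: all k trials succeed; the AWS excerpt: the share of attempts that passed), so both readings are shown.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    """Validate trial counts: c passes out of n trials, sampling k of them."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Capability: chance that at least one of k sampled trials passed."""
    _check(n, c, k)
    # 1 minus the chance that all k sampled trials come from the n - c failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Reliability (all-k reading): chance that all k sampled trials passed."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


def pass_fraction(n: int, c: int) -> float:
    """Reliability (fraction reading): share of attempts that passed."""
    if n <= 0 or not (0 <= c <= n):
        raise ValueError("need n > 0 and 0 <= c <= n")
    return c / n


if __name__ == "__main__":
    n, c = 8, 4  # illustrative numbers: 4 passes in 8 trials of one task
    for k in (1, 2, 4, 8):
        print(k, round(pass_at_k(n, c, k), 3), round(pass_hat_k(n, c, k), 3))
```

With 4 passes in 8 trials, `pass_at_k(8, 4, 2)` is about 0.79 while `pass_hat_k(8, 4, 2)` is about 0.21: the same task looks solvable and unreliable at once, which is the gap both entries warn about.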
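
The three enforcement tiers in the Dropbox Dash entry above (boolean gates, scalar budgets, monitored-only rubric scores) map naturally onto a CI check. A minimal sketch, assuming a hypothetical `EvalRun` record and `gate` function; only the two thresholds (Source F1 >= 0.85, p95 latency <= 5s) and the two boolean gates come from the entry, everything else is illustrative.

```python
from dataclasses import dataclass


@dataclass
class EvalRun:
    """Hypothetical summary of one eval run; field names are illustrative."""
    citations_present: bool   # boolean gate
    source_present: bool      # boolean gate
    source_f1: float          # scalar budget, must be >= 0.85
    p95_latency_s: float      # scalar budget, must be <= 5 seconds
    tone_score: float         # rubric score, dashboard-only


def gate(run: EvalRun) -> tuple[bool, list[str]]:
    """Return (ok_to_merge, blocking_failures) for one run."""
    failures: list[str] = []
    # Tier 1: boolean gates are hard failures.
    if not run.citations_present:
        failures.append("boolean gate: citations missing")
    if not run.source_present:
        failures.append("boolean gate: source missing")
    # Tier 2: scalar budgets block the merge when a threshold is missed.
    if run.source_f1 < 0.85:
        failures.append(f"scalar budget: source F1 {run.source_f1:.2f} < 0.85")
    if run.p95_latency_s > 5.0:
        failures.append(f"scalar budget: p95 latency {run.p95_latency_s:.1f}s > 5s")
    # Tier 3: rubric scores such as tone are reported elsewhere, never gating.
    return (not failures, failures)


if __name__ == "__main__":
    ok, why = gate(EvalRun(True, True, 0.80, 4.2, 0.95))
    print(ok, why)
```

Keeping the rubric score out of `gate` entirely is the point of the tiering: a noisy subjective metric can still be tracked without making merges flaky.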
New, vetted finds from the automated Scan (discover → strict judge; deduped by URL and title). Newest first.
- Clémentine Fourrier (HuggingFace), with swyx & Alessio (Latent Space): [Benchmarks 201: Why Leaderboards > Arenas >> LLM-as-Judge](https://www.latent.space/p/benchmarks-201) · podcast (excellent) · First-hand, mechanism-level guidance from the lead maintainer of HuggingFace's OpenLLM Leaderboard: ranks evaluation methods (reproducible leaderboards > preference arenas >> LLM-as-judge), names concrete failure modes (LLM judges show mode-collapse self-reinforcement, positional bias, and can't… 🆕
- AI Engineer (@aiDotEngineer); Evals track hosted by Braintrust / Olmo Maldonado; multiple speakers: [Evals: AI Engineer World's Fair 2025 (full track playlist)](https://www.youtube.com/playlist?list=PLcfpQ4tk2k0XZS6wXjyB_8zuZBXHFTwYM) · talk (excellent) · A full track of practitioner conference talks where teams at Google, Notion, Zapier, Vercel, Braintrust and others walk through how they actually build, score, and deploy product evals in production: error analysis, LLM-as-judge scorer design, offline vs online eval loops, and frontier-benchmark… 🆕
- Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz, Oluwanifemi Bamgbose, Hoang Nguyen, Raghav Mehndiratta, Hari Subramani (ServiceNow AI): [A New Framework for Evaluating Voice Agents (EVA)](https://huggingface.co/blog/ServiceNow-AI/eva) · article (excellent) · EVA is an end-to-end voice-agent eval framework using a bot-to-bot audio harness (user simulator + Pipecat agent + deterministic tool executor + validators) that jointly scores task accuracy (EVA-A: completion, faithfulness via LLM-judge, speech fidelity via LALM-judge) and conversational… 🆕
- Yunfei Bai, Allie Colin, Kashif Imran, Winnie Xiong (AWS): [Evaluating AI agents: Real-world lessons from building agentic systems at Amazon](https://aws.amazon.com/blogs/machine-learning/evaluating-ai-agents-real-world-lessons-from-building-agentic-systems-at-amazon/) · article (good) · Lays out a three-layer agent evaluation library (foundation-model benchmarking, component 
assessment of intent/memory/reasoning/tool-use, and final task-completion quality) with concrete component metrics like tool selection/parameter accuracy, context-retrieval precision/recall, and reasoning… 🆕
- Michael Dawson (Red Hat): [Eval-driven development: Build and evaluate reliable AI agents](https://developers.redhat.com/articles/2026/03/23/eval-driven-development-build-evaluate-ai-agents) · article (good) · A hands-on, 8-stage eval-driven workflow for a real multi-turn IT-self-service agent: uses DeepEval's ConversationalGEval/ConversationSimulator with ~15 custom LLM-as-judge metrics, a directory of 11 "known bad" conversations to validate that the metrics actually catch failures ("test your tests"),… 🆕
- Scott Clark (Distributional) with Sam Charrington: [How to Find the Agent Failures Your Evals Miss (TWIML #767)](https://twimlai.com/podcast/twimlai/how-find-agent-failures-your-evals-miss) · podcast (good) · Pre-deployment evals only catch known failure modes; the durable method for catching "unknown unknowns" is post-production analytics: convert agent execution traces into vector fingerprints, then cluster/topic-model them to surface emergent failures like "lazy" tool-use hallucinations (agents… 🆕
- Raza Habib (Humanloop CEO), MLOps Community: [Product Metrics are LLM Evals // Raza Habib CEO of Humanloop](https://www.youtube.com/watch?v=KWcE8ybs09A) · podcast (good) · The central, actionable thesis is that the best evals are your product metrics: instead of inventing proxy metrics, instrument the real production signals, namely explicit user feedback (thumbs up/down), user corrections to generated output, and the user's natural next action, and feed them back as your… 🆕
- Rashmi Shetty (Capital One) with Sam Charrington (TWIML AI Podcast): [How Capital One Delivers Multi-Agent Systems (TWIML #765) - Rashmi Shetty](https://twimlai.com/podcast/twimlai/how-capital-one-delivers-multi-agent-systems) · podcast (good) · A senior Capital One platform leader describes evaluating a real 
deployed multi-agent system (Chat Concierge for auto dealerships) by shifting from per-model ML metrics to end-to-end task-outcome evaluation, treating evals for stochastic multi-agent workflows plus observability as first-class… 🆕
- Ereli Eran (Founding Engineer, 7AI), host Demetrios Brinkmann (MLOps Community): [Software Engineering in the Age of Coding Agents: Testing, Evals, and Shipping Safely at Scale (MLOps Podcast #361)](https://home.mlops.community/public/videos/software-engineering-in-the-age-of-coding-agents-testing-evals-and-shipping-safely-at-scale) · podcast (good) · A working practitioner's three-tier eval pipeline for production agents: (1) "unit tests that are more like integration tests" which actually make LLM calls, (2) staging evals run against real customer data, and (3) async LLM-as-judge runs as a scheduled post-deployment task to re-review completed… 🆕
- Geoffrey Irving (UK AI Security Institute), with Nathan Labenz: [Situational Awareness in Government, with UK AISI Chief Scientist Geoffrey Irving](https://www.cognitiverevolution.ai/situational-awareness-in-government-with-uk-aisi-chief-scientist-geoffrey-irving/) · podcast (good) · UK AISI's Chief Scientist details real eval practice: open-sourcing the Inspect eval framework, calibrating fast automated evals against wet-lab biology ground truth, red-teaming across 30+ model runs (jailbreaking every model tested), and concrete eval-awareness mitigations (embedding evals in… 🆕
- Hamel Husain (with Claire Vo, How I AI podcast): [Evals, Error Analysis, and Better Prompts: A Systematic Approach to Improving Your AI Products](https://www.lennysnewsletter.com/p/evals-error-analysis-and-better-prompts) · podcast / video episode (with transcript) (good) · A practitioner walkthrough of the error-analysis loop: read real user conversation traces, open-code failures and group them into categories, prioritize by frequency counting (not intuition), then build binary pass/fail evals and validate LLM-as-judge against 
human labels. Includes a live… 🆕
- Raza Habib (Humanloop) & Brianna Connelly (Filevine): [Eval-Driven Development: Best Practices and Pitfalls When Building with AI](https://home.mlops.community/public/videos/eval-driven-development-best-practices-and-pitfalls-when-building-with-ai-raza-habib-and-brianna-connelly-ai-in-production-2025-2025-03-13) · conference talk (video) (good) · A real production case study (Filevine, a legal-AI platform: 1.5M chat requests/mo, 360K docs, 25B tokens) showing a concrete eval-driven workflow with measured outcomes: scaling document classification from 60 to 160 categories while holding precision/recall in the high 80s-90s, and raising… 🆕
- Greg Kamradt (ARC Prize Foundation): [How To Benchmark AGI - with Greg Kamradt, President of ARC-AGI](https://www.youtube.com/watch?v=wU82fz4iRfo) · talk (good) · Kamradt frames a benchmark's job as measuring generalization/skill-acquisition efficiency rather than memorized task completion, and gives concrete, reusable eval-design rules: build tasks easy for humans but hard for AI to expose true capability gaps; a benchmark only gives useful signal in the… 🆕
- Greg Kamradt (ARC Prize Foundation), with Demetrios Brinkmann: [Greg Kamradt: Benchmarking Intelligence | ARC Prize (MLOps Community)](https://home.mlops.community/public/videos/greg-kamradt-benchmarking-intelligence-or-arc-prize) · talk / podcast interview (video) (good) · Lays out concrete, transferable eval-design principles from running ARC-AGI: build "human-easy, AI-hard" tasks to avoid saturation, verify human-solvability empirically (400 testers, every ARC-2 task solved by 2+ people in 2 attempts), use hidden holdout test sets and dual public/private… 🆕
- HUD (hud.ai), no individual byline: [Verifier and Reward Design for RL Environments](https://www.hud.ai/resources/verifier-reward-design-rl-environments) · article (technical guide) (good) · Lays out a concrete four-layer scoring architecture (verifiers / pass-fail gates / 3-5 criteria rubrics / 
composite reward) plus a five-step build workflow: define checkable end-states first ("table contains row id=4521, status='active'"), add hard failure gates, build minimal rubrics, test on… 🆕
- Akshay Anand (Thoughtworks): [Evaluating AI agents in production: A practical framework](https://www.thoughtworks.com/insights/blog/machine-learning-and-ai/Evaluating-AI-agents-in-production) · article (good) · Presents a practical three-layer eval architecture (persona-based multi-turn simulation, functional unit evals at agent/conversation level, operational observability) with a concrete maturity progression: start ~20% automated / 80% manual validation, refine personas via UAT, then shift to… 🆕
- Brooke Hopkins (Coval, ex-Waymo): [Voice AI Agent Evaluation: The Complete Guide (2026)](https://www.coval.ai/blog/voice-ai-agent-evaluation-guide) · article (good) · Domain-specific evaluation playbook for voice agents: persona-tiered simulation testing (Easy/Medium/Hard/Adversarial across accent, noise, emotion), a concrete LLM-as-judge calibration loop (run on 50-100 calls, sample for human review, iterate rubrics until >85% human-judge agreement on binary… 🆕
- Rishi Gujjar & Andrew Li (Judgment Labs): [Agent Judge: Solving Long-Horizon Evals for Production Agents](https://www.judgmentlabs.ai/blogs/agent-judge-solving-long-context-evaluations) · article (good) · Frames long-horizon agent evaluation as an agentic, multi-agent judge (search trajectory state as queryable objects, verify claimed actions against source-of-truth systems like DBs/APIs/GitHub, and iteratively refine the rubric), backed by a real benchmark table on internal hallucination-detection… 🆕
- Shreya Shankar (guest); Hugo Bowne-Anderson (host), Vanishing Gradients: [Ep 57: AI Agents and LLM Judges at Scale - Processing Millions of Documents (Without Breaking the Bank)](https://vanishinggradients.fireside.fm/57) · podcast (good) · Shreya Shankar (UC Berkeley EPIC Lab, author of DocETL) walks through an end-to-end methodology for 
reliable LLM-judge and agent pipelines at scale: treat unstructured-text LLM workflows as ETL; do error analysis on the first 50-100 traces with a human to surface failure modes; add guardrails viaβ¦ πβ Tejal Patwardhan (OpenAI frontier evals lead), host Andrew Mayne βWhy Tejal Patwardhan stopped underestimating the models (Ep 21)https://shows.acast.com/openai-podcast/episodes/why-tejal-patwardhan-stopped-underestimating-the-models-episΒ·podcast(good) β First-hand account from the person running OpenAI's frontier evals team on why established benchmarks saturate or get gamed as models improve, what distinguishes a benchmark that holds up, the "capability overhang" problem (models advance faster than we can measure them), and the shift from toyβ¦ πβ Alon Bochman (RagMetrics); host Demetrios Brinkmann (MLOps Community) βMaking AI Reliable is the Greatest Challenge of the 2020s (#312) β Alon Bochman (RagMetrics) with Demetrios Brinkmannhttps://www.youtube.com/watch?v=d4PGxNM3IisΒ·podcast(good) β Treat the LLM-judge's "human agreement rate" against domain experts on your own eval set as the primary success metric, and engage non-technical SMEs through a feedback loop (show input β model output β their preferred output, fine-tune the judge on 50-100 corrected pairs) rather than blankβ¦ πβ Lenny Rachitsky (host) with Brendan Foody (Mercor CEO) βWhy experts writing AI evals is creating the fastest-growing companies in history β Brendan Foody (Mercor CEO)https://www.lennysnewsletter.com/p/experts-writing-ai-evals-brendan-foodyΒ·podcast(good) β Two genuinely useful framings from someone selling evals to all top-5 labs: "if the model is the product, then the eval is the product requirement document," and that evals and RL verifier environments are the same data type β only the semantic use (benchmark vs. reward signal) differs. 
Theβ¦ πβ Anastasios Angelopoulos, Wei-Lin Chiang, Ion Stoica (LMArena) with Anjney Midha (a16z) βBeyond Leaderboards: LMArena's Mission to Make AI Reliablehttps://a16z.com/podcast/beyond-leaderboards-lmarenas-mission-to-make-ai-reliable/Β·podcast(good) β First-hand account from the team behind Chatbot Arena on why crowdsourced human-preference voting (vs. expert benchmarks) is needed for reliability, the Bradley-Terry ranking migration, style control to separate substance from formatting, building immunity to overfitting/gaming, why "fresh andβ¦ πβ Greg Kamradt (President, ARC Prize Foundation) βMeasuring Agents with Interactive Evaluations β Greg Kamradt (ARC Prize Foundation)https://www.youtube.com/watch?v=TK9MN22q6E0Β·talk (conference talk / video)(good) β Argues static benchmarks can't measure what agents actually do (multi-turn exploration, planning, long-horizon execution) and proposes interactive evals scored on action efficiency vs. a human baseline β how efficiently an agent converts environment information into a working strategy, grounded inβ¦ πβ Beth Barnes (CEO, METR) βThe Most Important Graph in AI Right Now (Measuring AI's Time Horizon) β Beth Barnes (METR)https://www.youtube.com/watch?v=jXtk68KzmmsΒ·talk(good) β Direct from the source (METR's CEO), this lays out the "time horizon" eval methodology: rather than scoring tasks pass/fail, you order tasks by how long human experts take and find the duration at which a model hits 50% success β a human-baselined y-axis that makes capability legible andβ¦ πβ Sayash Kapoor & Benedikt Stroebl (Princeton), interviewed by Connor Shorten (Weaviate) βAI Agents That Matter with Sayash Kapoor and Benedikt Stroebl (Weaviate Podcast #104)https://www.youtube.com/watch?v=gCP-W_BNzg4Β·talk / podcast interview(good) β The co-first authors walk through their TMLR paper's core thesis: accuracy-only agent benchmarking produces needlessly complex, expensive agents, and once you control for inference cost, dead-simple 
baselines (e.g. retrying/resampling a model) land on or above the cost-accuracy Pareto frontier ofβ¦ πβ Barry Zhang & Mahesh Murag (Anthropic) βDon't Build Agents, Build Skills Insteadhttps://www.youtube.com/watch?v=CEvIs9y1uogΒ·talk(good) β Anthropic's case for packaging procedural knowledge as composable "Skills" (organized folders + self-documenting scripts, loaded via progressive disclosure so metadata is cheap until a skill.md is invoked) rather than building bespoke domain agents. For evals specifically, the speakers name theβ¦ πβ Arvind Narayanan (Princeton); host Sam Charrington (TWIML) βAI Agents: Substance or Snake Oil β Arvind Narayanan (TWIML Podcast #704)https://www.youtube.com/watch?v=HScABWB98KwΒ·talk(good) β Grounded in Narayanan's "AI Agents That Matter" paper, it argues agent evals are systematically misleading: leaderboards ignore inference cost, so simple repeated-sampling baselines can match or beat complex agent architectures on benchmarks like HumanEval β making cost-vs-accuracy Paretoβ¦ π
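
The cost-controlled comparison that the "AI Agents That Matter" entries above describe can be sketched in a few lines. This is a minimal illustration, not code from the paper or from any linked tool: `AgentResult`, `pareto_frontier`, the agent names, and every number are hypothetical, and it assumes you already have a mean cost and an accuracy per agent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentResult:
    name: str        # hypothetical agent label
    cost_usd: float  # mean inference cost per task (invented numbers below)
    accuracy: float  # fraction of tasks solved, in [0, 1]


def pareto_frontier(results: list[AgentResult]) -> list[AgentResult]:
    """Return the agents not dominated on (lower cost, higher accuracy)."""
    # Sort by cost ascending; break cost ties by accuracy descending so the
    # best agent at a given cost is seen first.
    ordered = sorted(results, key=lambda r: (r.cost_usd, -r.accuracy))
    frontier: list[AgentResult] = []
    best_accuracy = float("-inf")
    for r in ordered:
        # Keep r only if it is strictly more accurate than every agent that
        # costs the same or less.
        if r.accuracy > best_accuracy:
            frontier.append(r)
            best_accuracy = r.accuracy
    return frontier


# Illustrative, invented numbers; not taken from any paper or leaderboard.
RESULTS = [
    AgentResult("single-call", 0.01, 0.70),
    AgentResult("retry-x5", 0.05, 0.88),
    AgentResult("complex-agent", 0.60, 0.86),
    AgentResult("debate-agent", 0.90, 0.89),
]

if __name__ == "__main__":
    for r in pareto_frontier(RESULTS):
        print(f"{r.name}: ${r.cost_usd:.2f}/task, {r.accuracy:.0%}")
```

With these invented numbers the frontier keeps `single-call`, `retry-x5`, and `debate-agent`; `complex-agent` drops out because the cheaper `retry-x5` is also more accurate. An accuracy-only ranking would hide that.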
- [pavlovslist.com](https://pavlovslist.com/) · directory - The RL-environment / eval startups directory ("for the RL-pilled").

Environment labs / RL-env companies (the "environments are the new data" venture wave, via pavlovslist): **BenchFlow** (benchflow.ai: SkillsBench, ClawsBench, runtime), **Prime Intellect** (verifiers, Environments Hub), **HUD**, **Mechanize**, **Plato**, **AfterQuery**, **Halluminate**, **Surge AI**, **Scale**, **Mercor**.

**Prime Intellect** (`verifiers`, Florian Brand) · Braintrust · **Arize** (Phoenix/AX, OpenInference) · **Galileo** · **LangChain / LangSmith** (agentevals) · **Sierra** (τ-bench) · **Core Automation** (Kanav Garg) · **Epoch AI** (benchmark audits) · **METR** (autonomy/horizon) · **FutureHouse** (HLE audit) · **UK AISI** (Inspect).
- Built by merging this project's research rounds (mining → adversarial verification → reference audit) with a `/deep-research` pass. Source detail lives in `research/citations.md`, `research/findings.json`, `research/reference-audit.md`, `research/notes/`, and the full link list in `research/url-inventory.md` (153 URLs). Verified-high (deep-research, 3/3 votes): Verifier's Law, the `verifiers` library, EvalGen, Inspect AI, promptfoo, the ABC benchmark-rigor paper, plus lm-eval-harness, Autoevals, agentevals, AI Agents That Matter. Flagged caveats: the MT-Bench 10/25 bias numbers are hedged by their own authors; Lee's "Agent Runtime" post URL and the WebArena/OSWorld/Terminal-Bench/Cybench links still need verification; the Kanav Garg talk is cited via a conference summary (no canonical primary URL yet).
This repo ships 146 deep reading notes in notes/: structured summaries with key points, verbatim quotes, and themes, for the highest-signal sources:
- `notes/articles/` - blog posts & practitioner essays
- `notes/talks/` - 47 transcribed talks, podcasts & lectures (with `[mm:ss]` timestamps)
- `notes/papers/` - papers surfaced by the citation graph
PRs welcome. Keep the bar high: show your work (real data/code/war-stories beat hot takes), give every entry a one-line why, verify the URL, and flag caveats. See CONTRIBUTING.md. Quality over quantity: a great list is as much about what it excludes.
To the extent possible under law, BenchFlow and contributors have waived all copyright and related rights to this work (CC0 1.0). The linked resources remain under their respective licenses.