A curated, non-BS library of the best resources for evaluating agents


A curated, opinionated, non-BS library of the best resources for building and evaluating AI agents: papers, blog posts, talks, courses, tools, and benchmarks.

Maintained by BenchFlow · Most "awesome" lists are link dumps. This one is annotated and verified: every entry says what it is and why it belongs, URLs are checked, quotes are verbatim, and dead/abandoned tools are pruned (not silently listed). It was assembled by:

a depth-4 recursive citation crawl (11.6k papers, ranked by in-degree) to surface the academic canon; targeted practitioner-web discovery for the industry sources citation graphs miss (Eugene Yan, Han-Chung Lee, Hamel Husain, Shreya Shankar, Nathan Lambert, …); 47 talks & podcasts transcribed and deep-noted (verbatim + timestamps); and **per-section gap audits** with adversarial verification.

**443+ curated links · 146 deep reading notes** (see [notes/](/benchflow-ai/awesome-evals/blob/main/notes)). Markers: 🆕 = released/updated 2025–2026 ·

[CONTRIBUTING](/benchflow-ai/awesome-evals/blob/main/CONTRIBUTING.md).

📘 Playbook (PATTERNS.md): real, runnable code + worked examples for LLM-as-judge (aligned to humans), pass@k/pass^k, error analysis, trajectory & world-state grading, CI gating, verifiable rewards, and more.

- 📘 Playbook — real code & worked examples (PATTERNS.md)
- ⭐ Must-read starter set (read these first)
- 1 · Why we need evals
- 2 · "If you can eval it, you have built it" — eval ⇄ capability ⇄ RL environment
- 3 · The model / harness / skill decomposition
- 4 · Observability & the output / eval space (the surfaces you can grade)
- 5 · Evaluation infrastructure (the eval stack: datasets, scorers, online/offline, tracing, CI)
- 6 · Benchmark vs. eval (and benchmark integrity: contamination, saturation, label errors, leaderboard gaming)
- 7 · Evals & RL environments (verifiers, reward design, difficulty calibration, lifecycle)
- 8 · LLM-as-judge & verifiers (alignment, biases, verifiable vs judgeable)
- 9 · Agent-specific evaluation (trajectories, tool use, multi-turn, world state, multi-agent, localization)
- 10 · Safety / adversarial evaluation (prompt injection, jailbreaks, action-authorization, benchmark auditing)
- 🎙 Talks, podcasts & slides (transcribed + noted)
- 💬 Eval insights inside general agent posts
- 🔎 Scan additions
- Companies & landscape (eval / RL-environment market)
- Notes on provenance & gaps
- Deep notes
- Contributing
- License

- Shunyu Yao — The Second Half https://ysymyth.github.io/The-Second-Half/ · blog — "Evaluation becomes more important than training." The field-level why.
- Eugene Yan — An LLM-as-Judge Won't Save the Product, Fixing Your Process Will https://eugeneyan.com/writing/eval-process/ · blog — Process over tooling; evals as the scientific method.
- Han-Chung Lee — Hidden Technical Debt: Agent Evaluation Infrastructure https://leehanchung.github.io/blogs/2026/06/13/hidden-technical-debt-agent-evaluation-infra/ · blog — Control/data plane, the five eval surfaces, state deltas. "Chat eval was a spreadsheet; agent eval is a system."
- Hamel Husain & Shreya Shankar — LLM Evals FAQ https://hamel.dev/blog/posts/evals-faq/ · blog — The densest operational Q&A: error analysis, binary judgments, the benevolent-dictator labeler.
- Jason Wei — Asymmetry of Verification and Verifier's Law https://www.jasonwei.net/blog/asymmetry-of-verification-and-verifiers-law · blog — "Ability to verify == ability to create an RL environment."
- Anthropic — Demystifying Evals for AI Agents https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents · blog — Best primary on agent-specific evals: task design, outcome vs trajectory, isolated trials, pass@k vs pass^k.
- Ofir Press — How to Build Good Language Modeling Benchmarks https://ofir.io/How-to-Build-Good-Language-Modeling-Benchmarks/ · blog — Natural / auto-evaluatable / challenging; the "-200%" difficulty target; ~1-yr saturation.
- Kapoor, Stroebl, Siegel, Nadgir, Narayanan — AI Agents That Matter https://arxiv.org/abs/2407.01502 · paper — Cost as a first-class metric; model-dev vs app-dev; missing holdouts breed overfitting.
- Nathan Lambert — Building on Evaluation Quicksand https://www.interconnects.ai/p/building-on-evaluation-quicksand · blog — LLM eval has no ground truth; contamination; eval↔training coupling.
- Shankar, Zamfirescu-Pereira, Hartmann, Parameswaran, Arawjo (UIST '24) — Who Validates the Validators? (EvalGen) https://arxiv.org/abs/2404.12272 · paper — "Criteria drift": you can't write the rubric before you grade.
- Florian Brand (Prime Intellect) — Benches 2026 — "LLM benchmarks in the era of agents" https://florianbrand.com/posts/benches-2026 · blog + 61-slide talk — The sharpest current read on why benchmarks break in the agent era: the "evals are dead, just measure vibes" backlash, how every layer of the eval-running stack (prompt · sampling temp · grader · harness) swings the score, and that benchmark ground truth is frequently wrong.
- OpenAI — A Shared Playbook for Trustworthy Third-Party Evaluations https://openai.com/index/trustworthy-third-party-evaluations-foundations/ · blog (Safety, May 2026) — What makes independent evals of frontier-model safeguards & capabilities trustworthy: harness selection, the validity hazards that distort results, and the standards third-party evaluators need.
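The Anthropic entry above contrasts pass@k with pass^k. A minimal sketch of the difference, assuming `n` independent trials of one task of which `c` succeeded: pass@k asks whether at least one of `k` sampled trials succeeds, while pass^k asks whether all `k` do. The function names are hypothetical and the combinatorial estimators are one common choice, not taken from any entry in this list.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k trials succeeds) from n trials with c successes."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # 1 minus the chance that all k trials drawn without replacement are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k trials succeed) from n trials with c successes."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Chance that all k trials drawn without replacement are successes.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # An agent that succeeds on 5 of 10 trials looks strong under pass@2
    # and weak under pass^2, which is the reliability view.
    print(round(pass_at_k(10, 5, 2), 3), round(pass_hat_k(10, 5, 2), 3))
```

For user-facing agents, pass^k is usually the more honest number, since a task that succeeds only sometimes is not a task the user can rely on.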

- Shunyu Yao — The Second Half https://ysymyth.github.io/The-Second-Half/ · blog — The bottleneck shifts from solving problems to defining and evaluating them. (also T2, T7)
- Eugene Yan — An LLM-as-Judge Won't Save the Product, Fixing Your Process Will https://eugeneyan.com/writing/eval-process/ · blog — "Buying or building another evaluation tool won't save the product." Evals = the scientific method in disguise.
- Hamel Husain — Your AI Product Needs Evals https://hamel.dev/blog/posts/evals/ · blog — The canonical "you need evals"; remove all friction from looking at your data; don't rely on generic frameworks.
- Hamel Husain — A Field Guide to Rapidly Improving AI Products https://hamel.dev/blog/posts/field-guide/ · blog — "Error analysis is consistently the highest-ROI activity." The metric for an AI roadmap is experiments run.
- Shreya Shankar — In Defense of AI Evals, for Everyone https://www.sh-reya.com/blog/in-defense-ai-evals/ · blog — Rebuts the anti-eval backlash; evals = the systematic measurement of application quality.
- Yan, Bischof, Frye, Husain, Liu, Shankar — What We Learned from a Year of Building with LLMs https://applied-llms.org/ (Part II: https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-ii/) · blog — The "intern test," genchi genbutsu, turning vibe-checks into assertions.
- Nathan Lambert — Big Tech's LLM Evals Are Just Marketing https://www.interconnects.ai/p/evals-are-marketing · blog — Why frontier-lab leaderboard numbers are marketing, not science.
- Chip Huyen — AI Engineering pitfalls https://huyenchip.com/2025/01/16/ai-engineering-pitfalls.html · blog — Common eval/AI-engineering mistakes from the AI Engineering author. (also T6)
- Aishwarya Naresh Reganti & Kiriti Badam (O'Reilly Radar) — Evals Are NOT All You Need https://www.oreilly.com/radar/evals-are-not-all-you-need/ · blog — The essential nuance piece: automated graders alone don't save you; you need a continuous-improvement flywheel of offline tests + production monitoring + real-user iteration. Pairs with Shreya's 'In Defense' to complete the backlash debate. 🆕
- Hamel Husain & Shreya Shankar with Lenny Rachitsky (Lenny's Podcast/Newsletter) — Why AI evals are the hottest new skill for product builders https://www.lennysnewsletter.com/p/why-ai-evals-are-the-hottest-new-skill · talk — The accessible 'why evals matter' on-ramp (live walkthrough of error analysis, open/axial coding) that mainstreamed evals to PMs in 2025; the apartment-leasing-bot anecdote is the canonical 'you can't vibe-check' story. 🆕
- OpenAI — How evals drive the next chapter in AI for businesses https://openai.com/index/evals-drive-next-chapter-of-ai/ · blog — Frontier-lab framing of evals as turning fuzzy business goals into specs and measurable ROI; useful counterweight to Lambert's 'evals are marketing' and grounds the 'why' for enterprise readers. 🆕 ⚠ (unverified URL)
- Aman Khan (Arize) with Lenny Rachitsky — Beyond vibe checks: A PM's complete guide to evals https://www.lennysnewsletter.com/p/beyond-vibe-checks-a-pms-complete · blog — The widely-shared PM-oriented argument for moving past 'looked good to me' vibe checks to systematic evals; one of the pieces that made evals a mainstream product skill in 2025. 🆕
- Gergely Orosz & Hamel Husain (The Pragmatic Engineer) — A pragmatic guide to LLM evals for devs https://newsletter.pragmaticengineer.com/p/evals · newsletter — Reaches the broad engineering audience with the core 'why': LLM non-determinism breaks traditional testing, so you need evals. High-distribution motivation piece co-written by Hamel. 🆕
- OpenAI — Predicting model behavior before release by simulating deployment (Deployment Simulation) https://openai.com/index/deployment-simulation/ · blog — Concrete 2026 evidence for why fixed/static evals fail: models recognize when they're being tested and game test suites; replaying ~1.3M real conversations surfaced reward-hacking no fixed eval caught. Strong 'why evals must evolve' argument. 🆕 ⚠ (unverified URL)
- Greg Brockman (OpenAI) — evals are surprisingly often all you need https://x.com/gdb/status/1733553161884127435 · blog — The canonical one-liner ('evals are the new unit test') that anchors the whole 'why evals' thesis; frequently cited founding quote for the movement. Short but load-bearing.

Must-reads: Yao · Yan (eval-process) · Hamel (field-guide)

- Jason Wei — Asymmetry of Verification and Verifier's Law https://www.jasonwei.net/blog/asymmetry-of-verification-and-verifiers-law · blog — Trainability tracks verifiability; verifying = creating an RL environment.
- Han-Chung Lee — A Taxonomy of RL Environments for LLM Agents https://leehanchung.github.io/blogs/2026/03/21/rl-environments-for-llm-agents/ · blog — A benchmark is a frozen RL environment; the E = {T,H,V,S,C} decomposition; "verifiable beats judgeable."
- Kanav Garg (Core Automation; ex-DeepMind) — talk; summary at The Life Cycle of an RL Environment https://muratbuffalo.blogspot.com/2026/06/acm-cais-conference-on-ai-and-agentic.html · talk — Difficulty calibration (the 1–4/16 Goldilocks band), RL as variance reduction, reward hacking under training pressure. (local notes: `research/notes/kanav-garg-rl-environment-lifecycle.md`)
- David Silver & Richard Sutton — Welcome to the Era of Experience https://storage.googleapis.com/deepmind-media/Era-of-Experience%20/The%20Era%20of%20Experience%20Paper.pdf · paper — Human-data value approaching its ceiling; the frontier is agents learning from experience / synthetic environments.
- Nathan Lambert — RLHF Book, Ch. 16 — Evaluation https://rlhfbook.com/c/16-evaluation · book — Evaluation as a reflection of training goals; prompt-format sensitivity (60%→~0%).
- Nathan Lambert — What Comes Next with Reinforcement Learning https://www.interconnects.ai/p/what-comes-next-with-reinforcement · blog — Long-horizon credit assignment; where RL is and isn't ready.
- Prime Intellect — verifiers https://github.com/PrimeIntellect-ai/verifiers (docs: `.../blob/main/docs/environments.md`) · tool/repo — One environment package shared by eval and `prime-rl` — the eval-is-an-RL-env thesis as code.
- DeepSeek-AI (Guo et al.) — DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning https://arxiv.org/abs/2501.12948 · paper — The proof-of-thesis: pure RL with rule-based verifiable rewards (no SFT) makes reasoning emerge — the canonical 'if you can verify it, RL builds it' result; also published in Nature 2025. Conspicuously absent from a section literally about eval-as-RL-environment. 🆕
- Lambert et al. (Allen Institute for AI) — Tülu 3: Pushing Frontiers in Open Language Model Post-Training https://arxiv.org/abs/2411.15124 · paper — Coined/popularized RLVR and open-sourced the recipe + code (open-instruct): swap the reward model for a verifier on tasks with checkable answers. The foundational citation behind every 'verifiable beats judgeable' claim in this section. 🆕
- Anthropic — Natural Emergent Misalignment from Reward Hacking in Production RL https://www.anthropic.com/research/emergent-misalignment-reward-hacking · paper — Empirical receipt for the section's 'reward hacking under training pressure' theme: learning to cheat on real coding environments generalizes to sabotage/alignment-faking; introduces inoculation prompting as mitigation (arXiv 2511.18397). 🆕
- Prime Intellect — Environments Hub: A Community Hub To Scale RL To Open AGI https://www.primeintellect.ai/blog/environments · blog — The launch post for the verifiers-spec marketplace (2,500+ shared eval/RL environments) — the eval-is-an-RL-env thesis as an actual ecosystem, the natural companion to the already-listed verifiers repo. 🆕
- Ege Erdil, Matthew Barnett, Tamay Besiroglu (Mechanize) — How to fully automate software engineering https://www.mechanize.work/blog/how-to-fully-automate-software-engineering/ · blog — Sharpest statement of the inverse thesis: today's RL environments are rudimentary, so capability is gated on building richer/more diverse environments — 'you only get the capability you can build an environment for.' 🆕
- Mechanize (Erdil, Barnett, Besiroglu) — Cheap RL tasks will waste compute https://www.mechanize.work/blog/cheap-rl-tasks-will-waste-compute/ · blog — The economics of environment quality: data and compute are complementary, so low-quality (cheaply-bought) tasks waste expensive RL compute — directly informs difficulty calibration / why environment design matters. 🆕
- Jean-Stanislas Denain & Chris Barber (Epoch AI) — An FAQ on Reinforcement Learning Environments https://epoch.ai/gradient-updates/state-of-rl-envs · blog — Practitioner-interview survey (18 pros) on how RL environments are actually built, the reward-hacking failure modes, and the production-scaling bottleneck — the empirical state-of-the-field map this section lacks. 🆕
- AJ Kourabi & Dylan Patel (SemiAnalysis) — RL Environments and RL for Science: Data Foundries and Multi-Agent Architectures https://newsletter.semianalysis.com/p/rl-environments-and-rl-for-science · newsletter — Market-structure view: 35+ companies now sell RL environments; capability gains are coming from ramping RL compute, not pretraining. Grounds the 'benchmark = frozen RL environment' thesis in who's actually building/buying them. 🆕
- Harbor / Stanford / Laude Institute — Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces https://github.com/harbor-framework/terminal-bench · benchmark — A concrete instance of the thesis: each task ships a Docker environment + programmatic verification test suite + oracle — i.e. a benchmark that IS an RL environment (and is used as one). 2.4k stars, active. 🆕
- Sierra Research (Barres et al.) — tau2-bench (τ²-Bench): A Benchmark for Tool-Agent-User Interaction in Real-World Domains https://github.com/sierra-research/tau2-bench · benchmark — Dual-control, multi-turn, policy-following eval with a simulated user and verifiable DB-state checks — the canonical example of a verifiable conversational/agentic environment beyond math/code (paper arXiv 2506.07982). 🆕

Must-reads: Wei · Lee (RL-env taxonomy)
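The Garg entry in this section mentions difficulty calibration with a "1–4/16 Goldilocks band". One reading, which is an assumption here because the entry gives no more detail, is to keep only tasks the current policy solves between 1 and 4 times out of 16 rollouts, so rewards vary across rollouts and carry a learning signal. A minimal sketch with hypothetical names:

```python
from typing import Dict, List


def goldilocks_tasks(
    results: Dict[str, List[bool]],
    low: int = 1,
    high: int = 4,
    rollouts: int = 16,
) -> List[str]:
    """Return task ids whose success count over `rollouts` trials lies in [low, high]."""
    kept = []
    for task_id, outcomes in results.items():
        if len(outcomes) != rollouts:
            raise ValueError(f"{task_id}: expected {rollouts} rollouts, got {len(outcomes)}")
        # Too easy (many successes) or impossible (zero successes) tasks are dropped.
        if low <= sum(outcomes) <= high:
            kept.append(task_id)
    return kept


if __name__ == "__main__":
    demo = {
        "impossible": [False] * 16,
        "medium": [True] * 2 + [False] * 14,
        "solved": [True] * 16,
    }
    print(goldilocks_tasks(demo))
```

The band limits are parameters because the right range depends on the policy and the training setup; the defaults only mirror the numbers quoted in the entry.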

- Han-Chung Lee — Hidden Technical Debt: Agent Harness https://leehanchung.github.io/blogs/2026/05/08/hidden-technical-debt-agent-harness/ · blog — The harness is the agent; what teams call "the model" is mostly harness + product.
- Han-Chung Lee — Hidden Technical Debt series (index) https://leehanchung.github.io/blogs/ · blog — The four-part series (eval infra, runtime, harness, + agent runtime ~2026/04/24). (verify the runtime post URL on the index.)
- METR — Measuring AI Ability to Complete Long Tasks https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/ · paper/blog — Scaffolds change the measured horizon; success-vs-human-time as a primitive. (also T9)
- Nathan Lambert — Turing Post interview ("Open Models Won't Catch Up") https://www.turingpost.com/p/nathanlambert · talk/interview — "What technical people call the harness or the product matters more than just the model."
- Florian Brand (Prime Intellect) — Quo vadis, LLM benchmarks? https://florianbrand.com/posts/benches-2026 (talk: https://www.youtube.com/watch?v=kmTMc-fVSXw) · blog/talk — The AlgoTune case: same model, different harness, opposite ranking. (also T6) (notes: `research/notes/florian-brand-*`)
- Han-Chung Lee — The Model is the Product https://leehanchung.github.io/talks/2025/04/23/the-model-is-the-product/ · talk — The primary-source talk (Data Council 2025) behind the must-read author's whole thesis — the direct counterpart to Hamel's 'Model is Not the Product'; the foundational text of the harness/model debate this section is built on. 🆕
- Hamel Husain — The Model is Not the Product https://www.youtube.com/watch?v=EEw2PpL-_NM · talk — The opposing side of the Lee debate (Data Council 2025): great products are mostly harness + product + evals, not the model. Section already cites Lee; it should cite the debate it half-references. 🆕
- Simon Willison — Agents are models using tools in a loop https://simonwillison.net/2025/May/22/tools-in-a-loop/ · blog — The canonical, now-widely-adopted definition of an agent; 'the skill is in the design of both the tools and the loop' — the cleanest statement of why the harness, not the model, dominates behavior. 🆕
- OpenAI — Harness engineering: leveraging Codex in an agent-first world https://openai.com/index/harness-engineering/ · blog — Frontier-lab primary source coining 'harness engineering': a 1M-line codebase built by Codex agents where improving the environment/harness mattered more than the model. Lab-side complement to Lee's 'harness is the agent'. (URL returns 403 to scraper but page is live; corroborated by InfoQ/Milvus coverage.) 🆕
- Anthropic — Equipping agents for the real world with Agent Skills https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills · blog — The primary source for the 'skill' leg of the model/harness/skill decomposition — skills as composable, progressively-disclosed capabilities (later made an open standard). The section title says 'skill' but has zero skill sources. 🆕
- Anthropic — Effective context engineering for AI agents https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents · blog — Anthropic's primary statement that the harness's job is engineering context (editing, compaction, memory, programmatic tool-calling) — the mechanism behind why same model + different harness diverges. 🆕
- Anthropic (Ken Aizawa) — Writing effective tools for agents — with agents https://www.anthropic.com/engineering/writing-tools-for-agents · blog — Tool design is a load-bearing part of the harness; 'agents are only as effective as the tools we give them,' validated eval-first. Directly ties harness decisions to measured agent performance. 🆕
- Pete Hodgson — Same Model, Different Results: Why Coding Agents Aren't Interchangeable https://blog.thepete.net/blog/2025/12/10/same-model-different-results-why-coding-agents-arent-interchangeable/ · blog — Concrete teardown of Claude Code's harness (system reminders, sub-agents, planning, IDE feedback) showing identical models yield different results — the practitioner case-study version of Brand's AlgoTune point. 🆕
- Princeton SAgE team (Kapoor, Narayanan, et al.) — Holistic Agent Leaderboard (HAL) https://hal.cs.princeton.edu/ · benchmark — Standardized, cost-aware harness that runs the SAME agent harness across 9 benchmarks/9 models (21,730 rollouts) — the infrastructure answer to 'harness confounds rankings.' ICLR 2026; paper arXiv:2510.11977. 🆕
- Addy Osmani (O'Reilly Radar) — Agent Harness Engineering https://www.oreilly.com/radar/agent-harness-engineering/ · blog — 'A decent model with a great harness beats a great model with a bad harness'; reframes agent failures as harness/config problems (traceable AGENTS.md rules). Names the converging harness primitives across coding agents. 🆕
- Nathan Lambert (Interconnects) — What comes next with open models (weights / tools / harness decomposition) https://www.interconnects.ai/p/the-next-phase-of-open-models · blog — Lambert's written articulation (Mar 2026) of an AI system as weights + tools + harness — the written companion to the Turing Post interview already listed, with the explicit three-part decomposition. 🆕

Must-reads: Lee (harness) · Brand (Quo vadis)

- Han-Chung Lee — Hidden Technical Debt: Agent Evaluation Infrastructure https://leehanchung.github.io/blogs/2026/06/13/hidden-technical-debt-agent-evaluation-infra/ · blog — Control plane / data plane; the five surfaces (output, trace, memory, environment, mechanistic); the empty-tool-result hallucination.
- Braintrust — The Three Pillars of AI Observability https://www.braintrust.dev/blog/three-pillars-ai-observability · blog — Dataset reconciliation (living datasets); traces / evals / annotation.
- Arize (AX docs) — Agent Trajectory Evaluations https://arize.com/docs/ax/evaluate/evaluators/trace-and-session-evals/trace-level-evaluations/agent-trajectory-evaluations · docs — Grading the path, not just the answer.
- Galileo — AI Agent Metrics: How Elite Teams Evaluate https://galileo.ai/blog/ai-agent-metrics · blog — A concrete agent-metric taxonomy (action completion, tool selection, etc.).
- Arize — OpenInference semantic conventions https://github.com/Arize-ai/openinference/blob/main/spec/semantic_conventions.md · tool/repo — An OTel-based agent trace schema (tool, args, observation, latency, cost).
- LangChain — LangSmith Evaluation / Trajectory evals https://docs.langchain.com/langsmith/evaluation · https://docs.langchain.com/langsmith/trajectory-evals · docs.
- OpenTelemetry / CNCF — OpenTelemetry GenAI Semantic Conventions (agent & framework spans) https://github.com/open-telemetry/semantic-conventions-genai · docs — The upstream vendor-neutral standard (spans/metrics/events for LLM calls, invoke_agent, execute_tool, MCP) that OpenInference maps onto — the canonical trace schema the section's OpenInference entry derives from. 🆕
- OpenTelemetry — Semantic Conventions for GenAI agent and framework spans https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/ · docs — Human-readable spec page for create_agent / invoke_agent / execute_tool spans and attributes — the precise definition of what a gradable agent trace looks like. 🆕
- OpenTelemetry (blog) — Inside the LLM Call: GenAI Observability with OpenTelemetry https://opentelemetry.io/blog/2026/genai-observability/ · blog — Walkthrough of emitting and reading GenAI spans (token usage, finish reasons, tool calls) — concrete intro to the trace surface for practitioners not steeped in OTel. 🆕
- Weights & Biases — W&B Weave — tracing & evaluation toolkit https://docs.wandb.ai/weave · docs — @weave.op trace trees (inputs/outputs/cost/latency) plus a scorer-based eval harness — a widely used surface for grading both traces and outputs. 🆕
- Laminar — Laminar — open-source observability for AI agents https://laminar.sh/ · tool — OTel-native, agent-specific: transcript view, SQL-over-traces, and a rollout debugger — purpose-built for grading multi-step agent trajectories rather than single LLM calls. 🆕
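Several entries above are about grading the path rather than only the final answer. A minimal, standard-library sketch of that idea follows, assuming a trace has already been reduced to an ordered list of `(tool_name, arguments)` pairs. The names are hypothetical and this is not the API of any tool listed here. It shows two match modes: strict (identical call sequence) and in-order (reference calls appear in order, extra calls tolerated).

```python
from typing import Any, Dict, List, Tuple

ToolCall = Tuple[str, Dict[str, Any]]  # (tool name, arguments)


def strict_match(actual: List[ToolCall], expected: List[ToolCall]) -> bool:
    """True only if the agent made exactly the reference calls, in the same order."""
    return actual == expected


def in_order_match(actual: List[ToolCall], expected: List[ToolCall]) -> bool:
    """True if every reference call appears in `actual` in order; extras are allowed."""
    remaining = iter(actual)
    # Each `any` consumes the shared iterator up to the first matching call,
    # so later reference steps can only match later actual calls.
    return all(any(call == step for call in remaining) for step in expected)


if __name__ == "__main__":
    reference = [("search", {"q": "refund policy"}), ("issue_refund", {"order": 42})]
    trace = [
        ("search", {"q": "refund policy"}),
        ("lookup_order", {"order": 42}),
        ("issue_refund", {"order": 42}),
    ]
    print(strict_match(trace, reference), in_order_match(trace, reference))
```

Strict matching suits workflows with one correct procedure; in-order matching is more forgiving when the agent may take harmless extra steps.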

Must-reads: Lee (eval infra) · Braintrust (three pillars)

(All repos URL-verified via GitHub API, Jun 2026. 🆕 = released/expanded 2025–2026. ⚠️ = caveat/discontinued.)

- UK AISI — Inspect AI https://github.com/UKGovernmentBEIS/inspect_ai · https://inspect.aisi.org.uk/ — `@task` binds dataset + solver + scorer; custom scorers; sandboxed tools. The reference agent-eval framework. (MUST)
- UK AISI — inspect_evals https://github.com/UKGovernmentBEIS/inspect_evals — 🆕 the companion catalog of community benchmarks (GAIA, CTFs, AIME…) — the "batteries" for Inspect.
- EleutherAI — lm-evaluation-harness https://github.com/EleutherAI/lm-evaluation-harness — the standard academic harness; first-class decontamination; task YAMLs.
- Allen Institute (Ai2) — OLMES https://github.com/allenai/olmes — 🆕 the reproducible eval standard + harness behind OLMo/Tülu: standardized prompts/metrics/formatting for apples-to-apples model comparison.
- BenchFlow https://github.com/benchflow-ai/benchflow · https://benchflow.ai — 🆕 environment-lab framework: research infra + runtime for building RL environments, evals & post-training; ships SkillsBench and ClawsBench. ("Environments are the new data.")
- Hugging Face — lighteval https://github.com/huggingface/lighteval — 🆕 all-in-one harness across transformers/vLLM/TGI/nanotron, 1000+ tasks; HF's successor to `evaluate`.
- Groq — OpenBench https://github.com/groq/openbench — 🆕 provider-agnostic `bench` CLI, 95+ benchmarks, built on Inspect primitives.
- OpenAI — simple-evals https://github.com/openai/simple-evals — minimal zero-shot/CoT scripts (MMLU, HumanEval, SimpleQA, HealthBench); the numbers OpenAI publishes. ⚠️ not actively maintained.
- OpenAI Evals https://github.com/openai/evals — the `completion_fn` abstraction = swap the system-under-test. (Best-practices: https://developers.openai.com/api/docs/guides/evaluation-best-practices)
- promptfoo https://github.com/promptfoo/promptfoo — MIT eval + red-teaming CLI; git-diffable YAML configs. (MUST)
- DeepEval / Confident AI https://github.com/confident-ai/deepeval — "pytest for LLMs," 40+ metrics (G-Eval, RAG, hallucination) + red-team; ~2M evals/day; hosted cloud. 🆕
- pydantic-evals https://github.com/pydantic/pydantic-ai (`ai.pydantic.dev/evals`) — 🆕 type-safe Datasets/Cases/Evaluators with OTel tracing, from the Pydantic AI team.
- LangChain — openevals https://github.com/langchain-ai/openevals — 🆕 prebuilt evaluators + `create_llm_as_judge` (incl. multimodal); general-purpose companion to agentevals (https://github.com/langchain-ai/agentevals, trajectory match).
- MLflow GenAI evaluate https://mlflow.org/docs/latest/genai/eval-monitor/ — 🆕 `mlflow.genai.evaluate`: 50+ judges/metrics, custom scorers, regression datasets inside MLflow.
- Stanford CRFM — HELM (crfm-helm) https://github.com/stanford-crfm/helm — holistic eval: standardized datasets + metrics beyond accuracy + leaderboard (also VHELM, HEIM).
- Giskard https://github.com/Giskard-AI/giskard-oss — auto-generates adversarial test suites (injection, hallucination, bias) from a plain-language app description.
- Deepchecks LLM https://github.com/deepchecks/deepchecks (`llmdocs.deepchecks.com`) — property-based scoring (grounded-in-context, toxicity, fluency) + custom LLM-judge properties.
- UpTrain https://github.com/uptrain-ai/uptrain — 20+ preconfigured checks + root-cause analysis on failures.
- HF `evaluate` https://github.com/huggingface/evaluate — classic metrics library, ⚠️ maintenance mode (use lighteval for LLMs).
- harbor-framework (Laude Institute / Stanford) — Harbor https://github.com/harbor-framework/harbor — 🆕 framework for running agent evals + creating/using RL environments; powers Terminal-Bench 2.0. ~2.7k★. ⚠️ name overloaded (cf. `av/harbor` local-LLM toolkit).

- Matt Pocock — evalite https://github.com/mattpocock/evalite — 🆕 local-first eval runner on Vitest; `.eval.ts` files, web UI, cost-aware.
- Mastra scorers https://github.com/mastra-ai/mastra (`mastra.ai/docs/evals/overview`) — 🆕 model-graded/rule/statistical scorers, live evals, CI, in the Mastra agent framework.
- Vercel agent-eval https://github.com/vercel-labs/agent-eval — 🆕 A/B-test coding agents (Claude Code, Codex, Cursor) on custom tasks; pass-rate dashboards.
- Braintrust — Autoevals https://github.com/braintrustdata/autoevals — OSS scorer library (Factuality, relevance, security…) across Py/JS/Go/Ruby.

- TruLens https://github.com/truera/trulens — instrumentation + "feedback functions" (the RAG triad), now OTel-based.
- Stanford — ARES https://github.com/stanford-futuredata/ARES — synthetic queries + fine-tuned judges + prediction-powered inference for confidence intervals.
- Amazon Science — RAGChecker https://github.com/amazon-science/RAGChecker — 🆕 claim-level diagnosis separating retriever vs generator errors.
- continuous-eval (Relari) https://github.com/relari-ai/continuous-eval — modular per-module metrics across retrieval/generation/tool-use.
- Tonic Validate https://github.com/TonicAI/tonic_validate — RAG metrics as a GitHub Action for CI.

- Haize Labs — verdict https://github.com/haizelabs/verdict — 🆕 declarative compound judges (debate/verification/aggregation, inference-time scaling); arXiv:2502.18018.
- OpenPipe (ART) — RULER https://github.com/OpenPipe/ART (`art.openpipe.ai/fundamentals/ruler`) — 🆕 LLM-judge that ranks trajectories with no labels — judge-as-RL-reward. (industry must-read)
- Prometheus 2 https://github.com/prometheus-eval/prometheus-eval — open-weight evaluator LMs for rubric-based assessment + pairwise.
- Atla Selene https://github.com/atla-ai/selene-mini — 🆕 8B SoTA open judge (score + critique); + MCP server `atla-ai/atla-mcp-server`. arXiv:2501.17195.
- Patronus Lynx / GLIDER https://github.com/patronus-ai/Lynx-hallucination-detection · https://github.com/patronus-ai/glider — 🆕 open hallucination judge / explainable span-level judge.
- Flow-Judge https://github.com/flowaicom/flow-judge — efficient 3.8B open evaluator.
- AI2 — RewardBench https://github.com/allenai/reward-bench — canonical reward-model (+v2 judge) benchmark/harness.
- JudgeBench https://github.com/ScalerLab/JudgeBench — benchmark to evaluate the judges themselves.
- Fireworks — reward-kit https://github.com/fw-ai-external/reward-kit — 🆕 decorator-based reward-function authoring (TRL/Fireworks interop).
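Judges need their own evaluation, which is the point of the JudgeBench entry and of the playbook's "LLM-as-judge (aligned to humans)" pattern. A minimal sketch of checking a binary pass/fail judge against human labels, assuming both are available as equal-length lists of booleans. The function name is hypothetical. It reports true-positive rate, true-negative rate, and Cohen's kappa (agreement corrected for chance).

```python
from typing import Dict, Sequence


def judge_agreement(human: Sequence[bool], judge: Sequence[bool]) -> Dict[str, float]:
    """Compare judge verdicts to human labels (True = pass, False = fail)."""
    if len(human) != len(judge) or len(human) == 0:
        raise ValueError("need two non-empty label lists of equal length")
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum((not h) and (not j) for h, j in zip(human, judge))
    fp = sum((not h) and j for h, j in zip(human, judge))
    fn = sum(h and (not j) for h, j in zip(human, judge))
    n = len(human)
    observed = (tp + tn) / n
    # Agreement expected by chance, from each rater's marginal pass/fail rates.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    nan = float("nan")
    return {
        "tpr": tp / (tp + fn) if (tp + fn) else nan,
        "tnr": tn / (tn + fp) if (tn + fp) else nan,
        "kappa": (observed - expected) / (1 - expected) if expected < 1 else 1.0,
    }


if __name__ == "__main__":
    human = [True, True, True, True, False, False, False, False]
    judge = [True, True, True, False, False, False, False, True]
    print(judge_agreement(human, judge))
```

Reporting TPR and TNR separately matters when failures are rare: a judge that always says "pass" can score high raw agreement while catching no failures.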

- Prime Intellect — verifiers https://github.com/PrimeIntellect-ai/verifiers — Environment = dataset + harness + rubric; one package for eval, RL, synthetic data. (MUST)
- Prime Intellect — Environments Hub https://github.com/PrimeIntellect-ai/community-environments (app.primeintellect.ai) — 🆕 crowdsourced verifiers-based RL/eval envs.
- Prime Intellect — prime-rl https://github.com/PrimeIntellect-ai/prime-rl — 🆕 async RL trainer consuming verifiers envs (INTELLECT-3).
- BenchFlow https://github.com/benchflow-ai/benchflow · https://benchflow.ai — 🆕 environment lab: builds & runs RL/eval environments (SkillsBench, ClawsBench, runtime). "Environments are the new data." (also §5a)
- HUD https://github.com/hud-evals/hud-python — 🆕 SDK to build/run agent eval environments (computer-use, browser, MCP) with telemetry.
- Nous Research — Atropos https://github.com/NousResearch/atropos — 🆕 async "environment microservice" framework for rollouts/verifiable rewards.
- verl https://github.com/volcengine/verl (now `verl-project/verl`) — de-facto industry RLVR trainer (PPO/GRPO). ~22k★.
- OpenRLHF https://github.com/OpenRLHF/OpenRLHF · SkyRL https://github.com/NovaSky-AI/SkyRL · AReaL https://github.com/areal-project/AReaL · ROLL https://github.com/alibaba/ROLL · rLLM https://github.com/agentica-project/rllm · TRL https://github.com/huggingface/trl — the RL-training stack agents are post-trained + eval'd in.
- General Reasoning — Open Reward Standard (ORS) https://docs.openreward.ai/ (PyPI `openreward`) — 🆕 MCP-extending spec adding RL primitives (episodes, rewards, curriculum). ⚠️ no single canonical repo confirmed.
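The verifiers entry summarizes an environment as dataset + harness + rubric, shared between eval and RL. A minimal sketch of that decomposition follows. All names are hypothetical and this is not the verifiers API: it only illustrates how one object can produce both per-rollout rewards (what an RL trainer consumes) and a mean score (what an eval reports).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Policy = Callable[[str], str]  # maps a prompt to a completion


@dataclass
class Environment:
    dataset: List[Tuple[str, str]]          # (prompt, reference answer) pairs
    harness: Callable[[Policy, str], str]   # runs the policy on one prompt
    rubric: Callable[[str, str], float]     # programmatic check returning a reward

    def rollout_rewards(self, policy: Policy) -> List[float]:
        """One reward per task: the signal an RL trainer would consume."""
        return [self.rubric(self.harness(policy, prompt), ref) for prompt, ref in self.dataset]

    def evaluate(self, policy: Policy) -> float:
        """Mean reward over the dataset: the number an eval would report."""
        if not self.dataset:
            raise ValueError("dataset is empty")
        rewards = self.rollout_rewards(policy)
        return sum(rewards) / len(rewards)


if __name__ == "__main__":
    env = Environment(
        dataset=[("2+2", "4"), ("3+5", "8")],
        harness=lambda policy, prompt: policy(prompt).strip(),
        rubric=lambda output, ref: 1.0 if output == ref else 0.0,
    )
    always_four: Policy = lambda prompt: " 4 "
    print(env.rollout_rewards(always_four), env.evaluate(always_four))
```

Keeping the rubric a plain function of output and reference is what makes the reward verifiable; swapping in a model-graded rubric would turn the same environment into a judged one.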

- Arize Phoenix https://github.com/Arize-ai/phoenix — OSS OTel tracing + response/retrieval evals + datasets/experiments. (MUST)
- Langfuse https://github.com/langfuse/langfuse — OSS: evals (LLM-judge, feedback, manual labeling), datasets/experiments, prompt mgmt; self-hostable. 🆕
- Comet — Opik https://github.com/comet-ml/opik — 🆕 fully-OSS eval + observability (judges, datasets, CI-runnable evals).
- W&B Weave https://github.com/wandb/weave — `weave.Evaluation` scorers (exact/regex/model-graded/embedding) + Guardrails; comparison dashboards. 🆕 (Humanloop's migration target.)
- Braintrust https://www.braintrust.dev/docs/start/eval-sdk (offline-eval-guide) — `Eval()` over golden datasets; offline vs online. (MUST)
- Patronus AI https://www.patronus.ai/ (github.com/patronus-ai) — 🆕 research-grade judges (Lynx, GLIDER, Percival agent-failure debugger), experiments, multimodal judge.
- Maxim AI https://www.getmaxim.ai/ — 🆕 agent simulation + eval + observability across thousands of scenarios/personas.
- Galileo https://galileo.ai/ — Luna evaluators + Agentic Evaluations.
- Vellum https://www.vellum.ai/ — visual workflows + offline/online evals scoring every production run.
- Helicone https://github.com/helicone/helicone — OSS gateway + observability; "Scores" ingests external eval results.
- Traceloop / OpenLLMetry https://github.com/traceloop/openllmetry — OSS OTel instrumentation (Py/TS/Go/Ruby) + hosted reliability platform.
- Langtrace https://github.com/Scale3-Labs/langtrace — OSS OTel-standard tracing + manual scoring + dataset mgmt.
- WhyLabs / LangKit https://github.com/whylabs/langkit — high-throughput text-signal metrics (toxicity, PII, jailbreak) for production monitoring.
- Portkey https://github.com/portkey-ai/gateway — 🆕 OSS gateway + 60+ guardrails + observability (fully open-sourced Mar 2026).
- Datadog LLM Observability https://www.datadoghq.com/product/ai/llm-observability/ — 🆕 evaluators + golden datasets + LLM Experiments + AI Agent Monitoring (Jun 2025).
- Fiddler AI https://www.fiddler.ai/ — 🆕 Trust Models (Safety/PII/Faithfulness) scoring in <100ms; Guardrails + agentic observability.
- SeaOtter https://seaotter.ai/submit?utm_source=github&utm_medium=awesome_list&utm_campaign=launch&utm_content=A-09-benchflow-awesome-evals · tool — Adversarial critic for AI agent outputs. Submit an output plus an acceptance policy; get pass/rework/fail with specific reasons before accepting the work.
- PromptLayer https://www.promptlayer.com/ · New Relic AI Monitoring https://newrelic.com/platform/ai-monitoring — lighter prompt-CMS / APM-native monitoring.

- — Arize — OpenInference https://github.com/Arize-ai/openinference — semantic conventions for agent traces (tool/args/observation/latency/cost).
- — OpenTelemetry GenAI semantic conventions https://opentelemetry.io/docs/specs/semconv/gen-ai/ (open-telemetry/semantic-conventions) — 🆕 the vendor-neutral schema (now covers agent orchestration, MCP tool calls, and a quality-evaluation span hook).
- — Braintrust — Braintrust https://www.braintrust.dev/ · tool — Industry-standard eval+observability platform (Notion, Stripe, Vercel) tying offline experiments to production logs; the section already cites Braintrust's Autoevals but omits the platform itself. 🆕
- — RagaAI — RagaAI Catalyst https://github.com/raga-ai-hub/RagaAI-Catalyst · tool — OSS agent-observability + eval SDK with multi-agent trace/execution-graph debugging, synthetic-data gen, and guardrail management — covers the online/guardrail-eval slice the section lacks. 🆕
- — OpenAI — OpenAI Cookbook — Evals https://developers.openai.com/cookbook/topic/evals · docs — Maintained, runnable recipes for building evals (incl. Agents SDK eval, evaluating agents with Langfuse); the practical companion to OpenAI Evals and a curator-grade 'show real work' resource. 🆕 ⚠ (unverified URL)

Must-reads: Inspect AI · promptfoo · Braintrust · verifiers · DeepEval · Phoenix/Langfuse (pick your observability) · RULER (judge-as-reward)
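The "Environment = dataset + harness + rubric" decomposition noted for verifiers can be illustrated with a minimal, library-free sketch. It does not use the verifiers API; `Task`, `Environment`, `toy_harness`, and `exact_match` are hypothetical names chosen for illustration, and the harness is a stand-in for a real model or agent call.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Task:
    prompt: str
    answer: str


# Hypothetical type aliases, not taken from any library.
Harness = Callable[[str], str]         # runs the model/agent on a prompt
Rubric = Callable[[Task, str], float]  # scores one output in [0, 1]


@dataclass
class Environment:
    dataset: Sequence[Task]
    harness: Harness
    rubric: Rubric

    def evaluate(self) -> float:
        """Mean rubric score over the dataset (an offline eval run)."""
        scores = [self.rubric(t, self.harness(t.prompt)) for t in self.dataset]
        return sum(scores) / len(scores)


def exact_match(task: Task, output: str) -> float:
    """Deterministic scorer: 1.0 on an exact string match, else 0.0."""
    return 1.0 if output.strip() == task.answer else 0.0


def toy_harness(prompt: str) -> str:
    """Stand-in for a model call: solves 'a+b' prompts."""
    a, b = prompt.split("+")
    return str(int(a) + int(b))


env = Environment(
    dataset=[Task("1+2", "3"), Task("2+2", "5")],  # second label is wrong on purpose
    harness=toy_harness,
    rubric=exact_match,
)
print(env.evaluate())  # 0.5
```

Keeping the three parts separate is what lets the same environment serve an eval run (report the mean) and an RL loop (use each per-task score as a reward).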

6 · Benchmark vs. eval (and benchmark integrity: contamination, saturation, label errors, leaderboard gaming)

- — Ofir Press — How to Build Good Language Modeling Benchmarks https://ofir.io/How-to-Build-Good-Language-Modeling-Benchmarks/ · blog — The benchmark-author's checklist; difficulty target; one-number reporting; 150–500 task sizing.
- — Kapoor et al. — AI Agents That Matter https://arxiv.org/abs/2407.01502 · paper — Cost-controlled evaluation; model-dev vs downstream-dev needs; holdouts.
- — OpenAI — Why We No Longer Evaluate SWE-bench Verified https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/ · blog — ~59% of audited failures were broken tests. (mirror: https://decrypt.co/359012/...)
- — Shivalika Singh et al. (Cohere/Princeton/Stanford/MIT/AI2) — The Leaderboard Illusion https://arxiv.org/abs/2504.20879 · paper — Private testing, selective disclosure, and data-access asymmetry on Chatbot Arena. (notes: research/notes/leaderboard-illusion.md)
- — The SWE-bench Illusion: When SOTA LLMs Remember Instead of Reason https://arxiv.org/abs/2506.12286 · paper — Memorization inflates SWE-bench scores.
- — Establishing Best Practices for Building Rigorous Agentic Benchmarks (ABC) https://arxiv.org/abs/2507.02825 · paper — SWE-bench Verified weak tests; τ-bench rewards empty responses. (verified high)
- — Epoch AI — FrontierMath Tiers 1–3 v2 (corrected) https://epoch.ai/benchmarks/frontiermath-tiers-1-3-v2 (changelog: .../frontiermath-tier-4-v2) · page — ~42% of problems corrected after AI-assisted review. (also T8: the operator-as-rot-detector tale)
- — FutureHouse / Andrew White — About 30% of Humanity's Last Exam Answers Are Wrong https://www.futurehouse.org/research-announcements/hle-exam · blog — 29 ± 3.7% of text-only chem/bio answers contradicted by the literature. (LessWrong writeup: https://www.lesswrong.com/posts/JANqfGrMyBgcKtGgK/)
- — Nathan Lambert — Building on Evaluation Quicksand https://www.interconnects.ai/p/building-on-evaluation-quicksand · blog — No hard source of truth; synthetic-data contamination.
- — Lost in Simulation https://arxiv.org/abs/2601.17087 · paper — Simulated users are unreliable proxies (~9pp swings by simulator choice; demographic miscalibration).
- — Jimenez, Yang, … Press, Narasimhan — SWE-bench: Can LMs Resolve Real-World GitHub Issues? https://arxiv.org/abs/2310.06770 · https://www.swebench.com (Verified: .../verified.html) · paper/site.
- — Eugene Yan — Task-Specific LLM Evals that Do & Don't Work https://eugeneyan.com/writing/evals/ · blog — Off-the-shelf evals rarely transfer; accuracy is too coarse.
- — Andrej Karpathy on evals https://x.com/karpathy/status/1896266683301659068 · post — "We make a number of specific recommendations…" (the eval-as-narrow critique).
- — Hugh Zhang et al. (Scale AI) — A Careful Examination of LLM Performance on Grade School Arithmetic (GSM1k) https://arxiv.org/abs/2405.00332 · paper — Held-out GSM1k replica of GSM8k exposes up to 8% accuracy drop and partial memorization (Mistral/Phi) — the canonical method for measuring benchmark overfitting/contamination via a matched holdout.
- — Curtis Northcutt, Anish Athalye, Jonas Mueller — Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks https://arxiv.org/abs/2103.14749 · paper — NeurIPS 2021 foundational result: ~3.3% avg label errors across 10 famous test sets (ImageNet, MNIST, etc.); corrections flip model rankings. The canonical 'label errors' citation this section's theme rests on (labelerrors.com / cleanlab).
- — Aryo Pradipta Gema et al. (Edinburgh) — Are We Done with MMLU? (MMLU-Redux) https://arxiv.org/abs/2406.04127 · paper — ~6.5% of MMLU questions contain errors (57% in Virology); MMLU-Redux re-annotation shifts rankings — directly demonstrates label-error impact on the most-cited LLM benchmark.
- — Naman Jain et al. (UC Berkeley) — LiveCodeBench: Holistic and Contamination-Free Evaluation of LLMs for Code https://arxiv.org/abs/2403.07974 · benchmark — Time-windowed problem collection (post-cutoff scoring) as the leading contamination-resistant design pattern — the section discusses contamination but lists no exemplar of how to engineer around it.
- — White, Dohan, LeCun, Goldblum et al. — LiveBench: A Challenging, Contamination-Limited LLM Benchmark https://github.com/LiveBench/LiveBench · benchmark — Monthly-refreshed questions from new arXiv/news/competitions with objective ground truth — the canonical 'dynamic refresh' answer to saturation and contamination.
- — Clémentine Fourrier / Hugging Face — The LLM Evaluation Guidebook (Open LLM Leaderboard team) https://github.com/huggingface/evaluation-guidebook · docs — Practitioner reference from running the Open LLM Leaderboard; explicit sections on contamination, reproducibility, and leaderboard design — the hands-on 'how to not get fooled' companion to this section (updated version: hf.co/spaces/OpenEvals/evaluation-guidebook).
- — Kapoor, Stroebl, Kirgis et al. (Princeton) — Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation https://arxiv.org/abs/2510.11977 · paper — 21,000+ standardized agent runs surfacing leaderboard unreliability and unreported misbehaviors (agents searching HuggingFace for benchmark answers) — extends 'AI Agents That Matter' to leaderboard integrity for agents specifically. 🆕
- — Jambholkar, Rajani, Bakshi (Collinear AI) — Gaming the System: Goodhart's Law Exemplified in the AI Leaderboard Controversy https://blog.collinear.ai/p/gaming-the-system-goodharts-law-exemplified-in-ai-leaderboard-controversy · blog — Practitioner framing of the Llama 4 / Chatbot Arena gaming episode through Goodhart's Law — the accessible blog companion to The Leaderboard Illusion paper. 🆕
- — OpenAI — A Shared Playbook for Trustworthy Third-Party Evaluations https://openai.com/index/trustworthy-third-party-evaluations-foundations/ · blog (Safety, May 29 2026) — What makes independent evals of frontier-model safeguards & capabilities trustworthy: selecting the right harness, checking for validity hazards that distort results, and the standards third-party evaluators need. (also T10) 🆕
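
The time-windowed pattern noted for LiveCodeBench (score only problems published after the model's training cutoff) can be sketched as below. This is a minimal illustration under stated assumptions, not LiveCodeBench's actual code: the `Result` record, the dates, and the outcomes are all made-up placeholders.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Sequence


@dataclass(frozen=True)
class Result:
    problem_id: str
    released: date  # when the problem was first published
    passed: bool


def accuracy(results: Sequence[Result]) -> Optional[float]:
    """Fraction of passed problems, or None when there is nothing to score."""
    if not results:
        return None
    return sum(r.passed for r in results) / len(results)


def post_cutoff(results: Sequence[Result], cutoff: date) -> list[Result]:
    """Keep only problems released strictly after the training cutoff."""
    return [r for r in results if r.released > cutoff]


# Illustrative data only.
results = [
    Result("p1", date(2024, 1, 10), True),
    Result("p2", date(2024, 3, 5), True),
    Result("p3", date(2024, 8, 20), True),
    Result("p4", date(2024, 9, 15), False),
]
cutoff = date(2024, 6, 1)  # assumed training cutoff of the model under test

overall = accuracy(results)                    # 0.75
fresh = accuracy(post_cutoff(results, cutoff))  # 0.5
print(overall, fresh, overall - fresh)          # 0.75 0.5 0.25
```

A large gap between the overall and post-cutoff numbers is a hint of contamination, not proof; newer problems may also simply be harder, so the comparison needs a reasonable sample on both sides of the cutoff.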

Must-reads: Press · Kapoor et al. · OpenAI (SWE-bench Verified) · Leaderboard Illusion

7 · Evals & RL environments (verifiers, reward design, difficulty calibration, lifecycle)

(See also T2 — verifiers library, Lee's RL-env taxonomy, Garg's lifecycle, Wei's verifier's law.)

- — Nathan Lambert et al. — RewardBench https://arxiv.org/abs/2403.13787 · paper — Evaluating reward models (the verifier you train against).
- — Nathan Lambert — The New RL Scaling Laws https://www.interconnects.ai/p/the-new-rl-scaling-laws · blog — Where RLVR scaling is heading. (interview: https://www.latent.space/p/the-rlvr-revolution-with-nathan-lambert)
- — Spurious Rewards: Rethinking Training Signals in RLVR https://arxiv.org/abs/2506.10947 · paper — Random/spurious rewards rival ground truth on Qwen2.5 (Qwen-specific). (cite arXiv figures, not the blog gloss — see research/notes/reference-audit.md)
- — Nathan Lambert — The State of Post-Training 2025 https://www.interconnects.ai/p/the-state-of-post-training-2025 · blog — Context for where evals feed training.
- — Lilian Weng — Reward Hacking in Reinforcement Learning https://lilianweng.github.io/posts/2024-11-28-reward-hacking/ · blog — The canonical survey of reward hacking — taxonomy, RLHF-specific failure modes, mitigations; the foundational reference any reward-design section needs.
- — Victoria Krakovna et al. (Google DeepMind) — Specification gaming: the flip side of AI ingenuity https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/ · blog — Canonical specification-gaming post (+ the running examples list); origin story of why verifiers/reward functions get gamed, predating the LLM-RL wave.
- — Latent Space / Will Brown — Multi-Turn RL for Multi-Hour Agents — with Will Brown (Prime Intellect) https://www.latent.space/p/willccbb · talk — The verifiers author on building multi-turn RL environments, turn-level credit assignment and reward design in practice — the practitioner voice behind the verifiers library already cited here. 🆕
- — various (arXiv 2509.21882) — Position: The Hidden Costs and Measurement Gaps of RLVR https://arxiv.org/abs/2509.21882 · paper — RLVR gains overstated via budget mismatch, calibration drift, contamination; proposes a tax-aware minimum standard — the rigor counterweight to Lambert's RL-scaling optimism. 🆕
- — Saumya Malik, Nathan Lambert et al. (Ai2) — RewardBench 2: Advancing Reward Model Evaluation https://arxiv.org/abs/2506.01937 · benchmark — The 2025 successor to RewardBench (already listed) — harder, less saturated, ICLR 2026; the current bar for evaluating the verifier you train against. 🆕
- — Nathan Lambert — Reward Modeling (RLHF Book, ch. 5) https://rlhfbook.com/c/05-reward-models · docs — Canonical free reference chapter on reward models — the standing explainer for the 'verifier you train against' framing this section uses. 🆕
- — Shubham Parashar et al. (Texas A&M) — Curriculum RL from Easy to Hard Tasks Improves LLM Reasoning (E2H Reasoner) https://arxiv.org/abs/2506.06632 · paper — Difficulty-calibration primary source: easy-to-hard scheduling with convergence guarantees and the 'fade out easy tasks' result — directly fills the section's difficulty-calibration theme. 🆕
- — Jiacheng Guo, Ling Yang, Mengdi Wang et al. (Princeton) — GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators https://arxiv.org/abs/2512.19682 · paper — Generative environment simulator with an alpha-Curriculum Reward that keeps tasks in the zone of proximal development — recent take on auto-calibrating env difficulty to the agent. 🆕
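
The difficulty-calibration theme above (keep tasks that are neither trivially solved nor hopeless for the current policy) can be illustrated with a generic pass-rate filter. This is a simple heuristic sketch, not the E2H Reasoner schedule or GenEnv's alpha-Curriculum Reward; the function names and the 0.2 to 0.8 band are assumptions picked for illustration.

```python
from typing import Mapping, Sequence


def pass_rate(outcomes: Sequence[bool]) -> float:
    """Empirical success rate over repeated rollouts of one task."""
    return sum(outcomes) / len(outcomes)


def select_in_band(
    rollouts: Mapping[str, Sequence[bool]],
    low: float = 0.2,   # assumed lower bound: below this the task gives little signal
    high: float = 0.8,  # assumed upper bound: above this the task is nearly solved
) -> list[str]:
    """Return task ids whose pass rate falls inside [low, high].

    Tasks with no rollouts are skipped because their difficulty is unknown.
    """
    return sorted(
        task_id
        for task_id, outcomes in rollouts.items()
        if outcomes and low <= pass_rate(outcomes) <= high
    )


# Illustrative rollouts: 8 attempts per task from the current policy.
rollouts = {
    "easy": [True] * 8,
    "medium": [True, False] * 4,
    "hard": [False] * 8,
    "unsampled": [],
}
print(select_in_band(rollouts))  # ['medium']
```

Because pass rates change as the policy improves, a filter like this would be re-run periodically rather than applied once.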

Must-reads: Lee (RL-env taxonomy) · Garg (lifecycle) · verifiers (repo)

8 · LLM-as-judge & verifiers (alignment, biases, verifiable vs judgeable)

- — Eugene Yan — Evaluating the Effectiveness of LLM-Evaluators https://eugeneyan.com/writing/llm-evaluators/ · blog — Position/verbosity/self-enhancement bias; direct vs pairwise; prefer binary + classification metrics.
- — Hamel Husain — Creating an LLM-as-a-Judge That Drives Business Results https://hamel.dev/blog/posts/llm-judge/ · blog — Critique-shadowing; validate against ONE benevolent-dictator expert; precision/recall over raw agreement.
- — Shankar et al. (UIST '24) — Who Validates the Validators? (EvalGen) https://arxiv.org/abs/2404.12272 (pdf: .../pdf/2404.12272 ; UIST: https://people.eecs.berkeley.edu/~bjoern/papers/shankar-validators-uist2024.pdf) · paper — Criteria drift; the coverage-vs-false-failure judge-alignment loop.
- — Hamel Husain & Shreya Shankar — LLM Evals FAQ https://hamel.dev/blog/posts/evals-faq/ (error-analysis section: .../why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed.html) · blog — Binary over Likert; review ≥100 traces; the first-failure transition matrix for agents.
- — Han-Chung Lee — LLM-as-a-Judge: Rethinking Model-Based Evaluations https://leehanchung.github.io/blogs/2024/08/11/llm-as-a-judge/ · blog — Avoid [0,1] continuous scales; manage judges like junior annotators.
- — Zheng et al. — Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena https://arxiv.org/abs/2306.05685 · paper — Source of the 10%/25% self-favoring & position-bias numbers — which the authors themselves hedge ("cannot determine"); GPT-3.5 doesn't self-favor.
- — Bavaresco et al. — LLMs Instead of Human Judges? A Large-Scale Study https://arxiv.org/abs/2406.18403 · paper — Substantial variance across models/datasets; validate judges against humans first.
- — Eugene Yan — AlignEval https://eugeneyan.com/writing/aligneval/ · blog — "Align AI to human. Calibrate human to AI. Repeat." Work backward from the data.
- — Eugene Yan — Product Evals in Three Simple Steps https://eugeneyan.com/writing/product-evals/ · blog — The "God Evaluator" anti-pattern; the benchmark is human performance, not perfection.
- — Han-Chung Lee — Statistics for AI/ML, Part 3 — Cohen's Kappa https://leehanchung.github.io/blogs/2025/03/03/cohen-kappa/ · blog — Chance-adjusted inter-annotator agreement (the gate before holding out).
- — Shreya Shankar — Data Flywheels for LLM Applications https://www.sh-reya.com/blog/ai-engineering-flywheel/ · blog — Binary metrics, the "GPT smell," error analysis as the core activity.
- — SPADE https://arxiv.org/html/2401.03038v1 & DocETL https://arxiv.org/abs/2410.12189 — Shankar et al. · paper — Data-quality assertions / agentic query rewriting for LLM pipelines.
- — Arjun Panickssery, Samuel R. Bowman, Shi Feng (NeurIPS 2024) — LLM Evaluators Recognize and Favor Their Own Generations https://arxiv.org/abs/2404.13076 · paper — The canonical causal study of self-preference bias: shows GPT-4/Llama-2 can recognize their own outputs and that self-recognition correlates linearly with self-favoring. This is the primary source behind 'self-enhancement bias' that the section's blogs only allude to.
- — Yang Liu et al. (Microsoft, EMNLP 2023) — G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment https://arxiv.org/abs/2303.16634 · paper — The foundational reference-free LLM-judge method (CoT + form-filling scoring). Defines the direct-scoring paradigm the section critiques; a curated judge section is incomplete without the paper that started it.
- — Jiawei Gu et al. — A Survey on LLM-as-a-Judge https://arxiv.org/abs/2411.15594 · paper — The most-cited survey organizing the LLM-judge space (bias taxonomy, reliability methods, agreement metrics). Serves as the one-stop map/bibliography the section currently lacks.
- — Yulai Zhao, Haolin Liu, Dian Yu et al. (Tencent AI Lab / Princeton) — One Token to Fool LLM-as-a-Judge https://arxiv.org/abs/2507.08794 · paper — Shows 'master-key' tokens (a colon, 'Solution:') trigger false-positive rewards up to 80% even on GPT-o1/Claude-4 judges, plus a robust Master-RM fix. Core evidence on judge/verifier reward-hacking fragility. 🆕
- — Jon Saad-Falcon et al. — Stanford Hazy Research / Scaling Intelligence — Weaver: Closing the Generation-Verification Gap with Weak Verifiers https://hazyresearch.stanford.edu/blog/2025-06-18-weaver · blog — Directly operationalizes 'verifiable vs judgeable': aggregates many weak judges/reward models (unlabeled) to shrink the generator-verifier gap, reaching o3-mini accuracy from Llama-3.3-70B. Paper: arxiv.org/abs/2506.18203. 🆕
- — Mingchen Zhuge et al. (Meta AI / KAUST) — Agent-as-a-Judge: Evaluate Agents with Agents https://arxiv.org/abs/2410.10934 · paper — Extends LLM-as-judge to agentic trajectories (grading intermediate steps, not just final outputs) with the DevAI benchmark. The agent-specific evaluation case this agent-evals library specifically needs.
- — Various (AAAI 2026) — VerifyBench: A Systematic Benchmark for Evaluating Reasoning Verifiers Across Domains https://arxiv.org/abs/2507.09884 · benchmark — Cross-domain benchmark exposing verifier precision/recall trade-offs (specialized verifiers high-accuracy but low-recall; general models inclusive but unstable). Quantifies how trustworthy a verifier actually is for RLVR. 🆕
- — Databricks (Mosaic Research) — Enhancing LLM-as-a-Judge with Grading Notes / From Pilot to Production with Custom Judges https://www.databricks.com/blog/pilot-production-custom-judges · blog — Enterprise-grade judge-building playbook: 20-30 calibration examples, batched SME annotation, Krippendorff's alpha agreement gating; a production-side complement to the Hamel/Shankar academic alignment loop. 🆕
- — Jiayi Ye et al. — Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge (CALM framework) https://arxiv.org/abs/2410.02736 · paper — Systematic quantification of 12 judge biases (verbosity, bandwagon, authority, distraction, sentiment, etc.) via automated attacks; broadens the section's bias coverage well beyond position/verbosity/self-enhancement.
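
Several entries above recommend validating a binary judge against human labels with precision/recall and chance-adjusted agreement rather than raw agreement. A minimal standard-library sketch of those three numbers follows; the label vectors are made-up illustrative data, and treating label 1 as the positive class is an assumption of this example.

```python
from typing import Sequence


def confusion(human: Sequence[int], judge: Sequence[int]) -> tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn), treating human labels as ground truth."""
    tp = sum(h == 1 and j == 1 for h, j in zip(human, judge))
    fp = sum(h == 0 and j == 1 for h, j in zip(human, judge))
    fn = sum(h == 1 and j == 0 for h, j in zip(human, judge))
    tn = sum(h == 0 and j == 0 for h, j in zip(human, judge))
    return tp, fp, fn, tn


def precision_recall(human: Sequence[int], judge: Sequence[int]) -> tuple[float, float]:
    tp, fp, fn, _ = confusion(human, judge)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def cohen_kappa(human: Sequence[int], judge: Sequence[int]) -> float:
    """Cohen's kappa for two binary raters: (p_o - p_e) / (1 - p_e)."""
    n = len(human)
    p_o = sum(h == j for h, j in zip(human, judge)) / n
    p_h1 = sum(human) / n
    p_j1 = sum(judge) / n
    p_e = p_h1 * p_j1 + (1 - p_h1) * (1 - p_j1)
    if p_e == 1.0:  # both raters used a single identical label everywhere
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Illustrative labels: 1 = flagged, 0 = not flagged.
human = [1, 1, 1, 0, 0, 0, 0, 1]
judge = [1, 1, 0, 0, 0, 1, 0, 1]

print(precision_recall(human, judge))  # (0.75, 0.75)
print(cohen_kappa(human, judge))       # 0.5
```

Here raw agreement is 75% but kappa is only 0.5, which is why chance-adjusted agreement is the safer gate when one label dominates the data.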

Must-reads: Yan (llm-evaluators) · Hamel (llm-judge) · Shankar (EvalGen)

9 · Agent-specific evaluation (trajectories, tool use, multi-turn, world state, multi-agent, localization)

- — Anthropic — Demystifying Evals for AI Agents https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents · blog — Grade the final env state (flight-booking via SQL); outcome vs trajectory; isolation; pass@k vs pass^k.
- — Sierra — τ-bench / τ²-bench https://arxiv.org/abs/2406.12045 · https://github.com/sierra-research/tau-bench · paper/repo — DB-state-diff grading; user simulation; pass^k; empty-result as explicit fail.
- — Sierra — Benchmarking AI Agents https://sierra.ai/blog/benchmarking-ai-agents · blog — The motivation behind τ-bench.
- — Mialon et al. — GAIA: A Benchmark for General AI Assistants https://arxiv.org/abs/2311.12983 · paper — Real assistant tasks; difficulty by human task-length.
- — Eugene Yan — Patterns for Building Cybersecurity Evals https://eugeneyan.com/writing/cybersecurity-evals/ · blog — The four-primitive agentic-eval template (sandbox, difficulty inputs, tools, deterministic grader); outcome grading + partial-credit ladders + transcript audits. (also T10)
- — Han-Chung Lee — Statistics for AI/ML, Part 4 — pass@k and Unbiased Estimator https://leehanchung.github.io/blogs/2025/09/08/pass-at-k/ · blog — Demystifies the metric everyone misuses.
- — Han-Chung Lee — First-Principles Eval https://leehanchung.github.io/blogs/2024/05/22/first-principles-eval/ · blog.
- — SWE-bench grading harness https://github.com/SWE-bench/SWE-bench/blob/main/swebench/harness/grading.py · tool/repo — FAIL_TO_PASS / PASS_TO_PASS as a verifiable reward. (SWE-agent ACI: https://swe-agent.com/0.7/background/aci/)
- — OpenAI — human-eval (pass@k estimator) https://github.com/openai/human-eval/blob/master/human_eval/evaluation.py · tool/repo.
- More agent benchmarks to add *(named in the brief; URLs not yet verified in this corpus — verify before use):* WebArena, OSWorld, Terminal-Bench, Cybench.
- — Zhou et al. (CMU) — WebArena: A Realistic Web Environment for Building Autonomous Agents https://arxiv.org/abs/2307.13854 · benchmark — Self-hostable sandboxed websites (e-commerce/forum/GitLab/CMS/maps) with execution-based functional-correctness graders; 812 tasks. The canonical web-agent world-state benchmark named in the brief — now URL-verified.
- — Xie et al. (HKU et al.) — OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments https://arxiv.org/abs/2404.07972 · benchmark — 369 real-computer tasks in VMs with per-task execution-based eval scripts and initial-state setup; humans 72% vs best agent 12%. Canonical computer-use benchmark named in the brief — now verified.
- — Laude Institute + Stanford + community — Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command-Line Interfaces https://www.tbench.ai/ · benchmark — Sandboxed terminal tasks with deterministic verifiers across SWE/sysadmin/security; v2 leaderboard. The terminal-agent benchmark named in the brief — verified (arxiv: arxiv.org/abs/2601.11868). 🆕
- — Zhang et al. (Stanford) — Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language Models https://arxiv.org/abs/2408.08926 · benchmark — 40 professional CTF challenges with subtask annotations and deterministic flag-based grading; pairs naturally with Eugene Yan's cybersecurity-evals post already in the section. Named in the brief — now verified.
- — Lù et al. (McGill / Mila / Google DeepMind) — AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories https://arxiv.org/abs/2504.08942 · paper — First benchmark of LLM-judges-of-trajectories: 1302 expert-reviewed web-agent runs; shows rule-based graders reject many valid trajectories (under-reporting success). Core to the 'trajectory evaluation' theme the section currently lacks. 🆕
- — Cemri, Pan et al. (UC Berkeley Sky Lab) — Why Do Multi-Agent LLM Systems Fail? (MAST taxonomy) https://arxiv.org/abs/2503.13657 · paper — 14-mode failure taxonomy across 7 MAS frameworks from 200+ annotated traces; the reference framework for diagnosing multi-agent failures — directly fills the 'multi-agent' gap. 🆕
- — Trivedi et al. (Stony Brook) — ACL'24 Best Resource Paper — AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents https://aclanthology.org/2024.acl-long.850/ · benchmark — 9-app simulated world (457 APIs) with state-based unit tests that also check for collateral damage/unexpected state changes — gold-standard world-state grading for tool-use agents.
- — OpenAI (Wei et al.) — BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents https://openai.com/index/browsecomp/ · benchmark — 1,266 'inverted' hard-to-find/easy-to-verify questions for deep-research browsing agents; short verifiable answers make grading deterministic. Released 2025, now standard for browsing-agent eval. (paper: arxiv.org/abs/2504.12516) 🆕
- — Chen, Tang et al. (Yale / All Hands) — LocAgent: Graph-Guided LLM Agents for Code Localization https://arxiv.org/abs/2503.09089 · paper — Defines and evaluates code localization as its own capability (Acc@k over file/function locations via code graphs) — directly fills the 'localization' theme named in the section title but currently unlisted. 🆕
- — He et al. (Tencent AI Lab) — WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models https://arxiv.org/abs/2401.13919 · benchmark — 643 tasks on 15 live real-world sites with a GPT-4V automatic-judge eval protocol — an early, widely-cited example of multimodal-LLM-as-judge for live-web agent trajectories.
- — BenchFlow — SkillsBench https://github.com/benchflow-ai/skillsbench · benchmark — 🆕 evaluates how well agent skills work and how effectively agents use them — makes skill-acquisition/skill-use a measurable axis (the "Agent Skills" frontier). ~1.4k★.
- — BenchFlow — ClawsBench https://github.com/benchflow-ai/ClawsBench · benchmark — 🆕 BenchFlow's agent benchmark (results/data repo; full release in progress).
- — OpenAI (with SWE-bench authors) — SWE-bench Verified https://openai.com/index/introducing-swe-bench-verified/ · benchmark — 500 human-validated SWE-bench instances graded by hidden FAIL_TO_PASS unit tests; the de facto standard for real-issue resolution and the headline coding-agent number labs report. 🆕
- — Yang, Jimenez, Press et al. (Princeton/Stanford) — SWE-bench Multimodal https://arxiv.org/abs/2410.03859 · benchmark — 619 visual JS/front-end issues from 17 user-facing repos, test-verified; probes whether SWE agents generalize beyond Python/text to visual software domains.
- — Scale AI (Deng, Da et al.) — SWE-bench Pro https://arxiv.org/abs/2509.16941 · benchmark — 1,865 long-horizon, multi-file tasks across public GPL + held-out + commercial startup repos, test-graded; contamination-resistant and hard (frontier <45% pass@1). 🆕
- — OpenAI (Miserendino, Patwardhan, Heidecke et al.) — SWE-Lancer https://arxiv.org/abs/2502.12115 · benchmark — 1,400+ real Upwork freelance tasks worth $1M, graded by triple-verified end-to-end Playwright tests plus manager-decision tasks; ties capability to economic value. 🆕
- — Pan, Wang, Neubig, Suhr, Zhang et al. (Berkeley/CMU) — SWE-Gym https://arxiv.org/abs/2412.21139 · benchmark — 2,438 executable Python SWE tasks with pre-installed deps + test verification; the first real training/eval gym for SWE agents and verifiers, ICML 2025. 🆕
- — ByteDance Seed — Multi-SWE-bench https://arxiv.org/abs/2504.02605 · benchmark — 1,632 expert-annotated issue-resolution tasks across Java, TS, JS, Go, Rust, C, C++, test-graded; the leading multilingual SWE-bench extension, NeurIPS 2025 D&B. 🆕
- — Nebius / Badertdinov et al. — SWE-rebench https://arxiv.org/abs/2505.20411 · benchmark — Automated pipeline yielding 21k+ executable Python tasks with continuously refreshed, decontaminated eval splits; quantifies how much SWE-bench Verified scores are inflated by contamination, NeurIPS 2025 D&B. 🆕
- — METR — RE-Bench https://arxiv.org/abs/2411.15114 · benchmark — 7 open-ended ML research-engineering environments (e.g. GPU-kernel optimization, scaling laws) scored against 71 human-expert 8-hour attempts; the reference AI-R&D-uplift eval, ICML 2025.
- — OpenAI (Chan et al.) — MLE-bench https://arxiv.org/abs/2410.07095 · https://github.com/openai/mle-bench · benchmark — 75 Kaggle ML-engineering competitions graded against real human leaderboards (medal thresholds) in 24h Docker runs; standard ML-engineering-agent eval, ICLR 2025. 🆕
- — OpenAI (Starace et al.) — PaperBench https://arxiv.org/abs/2504.01848 · benchmark — Replicate 20 ICML 2024 papers from scratch, graded by 8,316 author-co-developed rubric leaves via a validated LLM judge; rigorous research-replication agent eval, ICML 2025. 🆕
- — Andy Konwinski / Kaggle — Konwinski Prize (K Prize) https://www.kaggle.com/competitions/konwinski-prize · leaderboard — $1M Kaggle forecasting-format contest on GitHub bugs filed after submission close, fully contamination-free, test-graded; round-1 top score only 7.5% exposed real-world difficulty. 🆕
- — Gou et al., OSU NLP Group (NeurIPS 2025 D&B) — Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge https://arxiv.org/abs/2506.21506 · benchmark — 130 long-horizon live-web agentic-search tasks; novel Agent-as-a-Judge rubric-tree grader for time-varying, citation-backed answers — a serious answer to the Deep Research evaluation gap. 🆕
- — Xue et al., OSU NLP Group — Online-Mind2Web (An Illusion of Progress? Assessing the Current State of Web Agents) https://arxiv.org/abs/2504.01382 · benchmark — 300 realistic tasks on 136 live websites with an LLM-as-a-Judge auto-grader (~85% human agreement); exposes overstated web-agent progress vs simple baselines. 🆕
- — AGI Inc (agi-inc/REAL), powers realevals.xyz — REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites https://github.com/agi-inc/REAL · benchmark — 112 tasks on deterministic Next.js replicas of Amazon/Uber/LinkedIn etc.; reproducible LLM evaluator plus state validators — fixes the flakiness of live-site web benchmarks. 🆕
- — Thomas et al., Convergence AI — WebGames: Challenging General-Purpose Web-Browsing AI Agents https://arxiv.org/abs/2502.18356 · benchmark — 50+ client-side challenges isolating specific browser interaction skills with verifiable pass/fail; best agent 41% vs 96% human, a sharp diagnostic gap. 🆕
- — Patil et al., UC Berkeley (Gorilla / ICML 2025) — Berkeley Function Calling Leaderboard (BFCL) V4 https://gorilla.cs.berkeley.edu/leaderboard.html · leaderboard — Executable + AST-based grading of tool/function calling; V4 adds multi-turn agentic, web-search and memory tasks — the de facto tool-calling leaderboard. 🆕
- — Wang et al., Shanghai AI Laboratory (NeurIPS 2024 D&B) — GTA: A Benchmark for General Tool Agents https://arxiv.org/abs/2407.08713 · benchmark — 229 human-written real-world queries with implicit multimodal tool use; executable evaluation platform across perception/operation/logic/creativity tools (GTA-2 follow-up in 2026). 🆕
- — Lei et al., XLang Lab / HKU (ICLR 2025 Oral) — Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows https://arxiv.org/abs/2411.07763 · benchmark — Enterprise text-to-SQL agent workflows over huge schemas and multiple dialects with execution-based grading; frontier models only ~17-21% — a hard, realistic data-agent eval. 🆕
- — Rawles et al., Google DeepMind / Google Research (ICLR 2025) — AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents https://arxiv.org/abs/2405.14573 · benchmark — Live Android environment with durable reward signals from device system state for 116 parameterized tasks across 20 apps — the standard mobile-GUI agent benchmark. 🆕
- — Bonatti et al., Microsoft — WindowsAgentArena: Evaluating Multi-Modal OS Agents at Scale https://arxiv.org/abs/2409.08264 · benchmark — 154 realistic multi-step Windows-OS tasks across apps with programmatic success checks; parallelizable in Azure (~20 min full run) — desktop computer-use counterpart to OSWorld. 🆕
- — Levy, Shlomov, Wiesel et al., IBM Research — ST-WebAgentBench: Evaluating Safety and Trustworthiness in Web Agents https://arxiv.org/abs/2410.06703 · benchmark — 375 enterprise tasks carrying 3,057 explicit safety/policy constraints; introduces Completion-under-Policy and Risk Ratio — grades whether agents obey rules, not just succeed. 🆕
- — Xu et al., CMU — TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks https://arxiv.org/abs/2412.14161 · benchmark — Self-hosted software-company sim (web, code, chat coworkers) with checkpoint-based partial-credit grading; best agent ~30% — a full-day-knowledge-worker eval. 🆕
- — Koh et al., Carnegie Mellon University — VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks https://arxiv.org/abs/2401.13649 · benchmark — 910 visually-grounded web tasks across Classifieds/Shopping/Reddit with reproducible programmatic reward functions — the multimodal extension of WebArena.
- — Tejal Patwardhan et al. (OpenAI) — GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks https://arxiv.org/abs/2510.04374 · benchmark — 1,320 expert-built tasks across 44 occupations in the top 9 GDP sectors; 220-task gold subset open-sourced with a public automated grading service at evals.openai.com — the flagship economic-value agent benchmark. 🆕
- — CAIS + Scale AI (47 authors) — Remote Labor Index: Measuring AI Automation of Remote Work https://arxiv.org/abs/2510.26787 · benchmark — Grades whether agents complete whole real freelance projects to client-acceptable standard; best agent automates only 2.5% — a hard, money-grounded ceiling for end-to-end remote work. 🆕
- — Center for AI Safety + Scale AI (Dan Hendrycks et al.) — Humanity's Last Exam https://arxiv.org/abs/2501.14249 · benchmark — 2,500 expert-written frontier-knowledge questions with unambiguous auto-gradable answers across dozens of fields; the canonical post-MMLU saturation exam (note: now very widely cited). 🆕
- — OSU-NLP Group (Ohio State) — ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery https://github.com/OSU-NLP-Group/ScienceAgentBench · benchmark — 102 expert-validated tasks from 44 peer-reviewed papers; grades self-contained Python programs by execution + success rate; best agent solves only ~34% (ICLR 2025). 🆕
- — Siegel, Kapoor, Narayanan et al. (Princeton) — CORE-Bench: Computational Reproducibility Agent Benchmark https://arxiv.org/abs/2409.11363 · benchmark — 270 tasks over 90 papers (CS/social science/medicine) that grade whether an agent can reproduce published results from code+data; from the Princeton AI-Snake-Oil group.
- — Mingxuan Du et al. — DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents https://arxiv.org/abs/2506.11763 · benchmark — 100 PhD-level tasks across 22 fields; reference-based adaptive-rubric grader for analyst-grade citation-rich reports, validated for human-judgment alignment — the standard deep-research-report eval. 🆕
- — FutureHouse + ScienceMachine — BixBench: A Comprehensive Benchmark for LLM-based Agents in Computational Biology https://arxiv.org/abs/2503.00096 · benchmark — 50+ real bioinformatics analysis scenarios with ~300 open-answer questions over multi-step Jupyter trajectories; frontier models hit only ~17% — serious wet-lab-adjacent science agent eval. 🆕
- — Meta (Meta Agents Research Environments) — Gaia2 and ARE: Scaling Up Agent Environments and Evaluations https://arxiv.org/abs/2509.17158 · benchmark — Successor to GAIA: dynamic, time-driven, multi-agent simulated environments with async world events and a verifiable scenario grader; frontier success ~42% — the serious general-assistant env from Meta. 🆕
- — Andon Labs (Backlund & Petersson) — Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents https://arxiv.org/abs/2502.15840 · benchmark — Run a simulated vending business over >20M-token horizons; objectively graded on profit/net-worth, exposing long-horizon coherence breakdowns unrelated to context limits. 🆕
- — Francois Chollet et al. (ARC Prize Foundation) — ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems https://arxiv.org/abs/2505.11831 · benchmark — Human-calibrated (400+ participants, 100% solvable) grid-reasoning tasks with exact-match grading; 2-3x harder than ARC-AGI-1 across all approaches — the frontier fluid-intelligence benchmark. 🆕
- — Patronus AI — TRAIL: Trace Reasoning and Agentic Issue Localization https://arxiv.org/abs/2505.08638 · benchmark — 148 annotated agent traces with 841 errors (reasoning/planning/execution); grades whether an LLM can localize the failure in a trace (best model ~11%). HF dataset PatronusAI/TRAIL. 🆕
- — Salesforce Research — CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios https://arxiv.org/abs/2505.18878 · benchmark — 19 expert-validated B2B/B2C tasks on a realistic Salesforce org with state-based grading; exposes the single-turn (~58%) vs multi-turn (~35%) reliability gap plus confidentiality checks. 🆕

Must-reads: Anthropic (demystifying) · τ-bench · Lee (pass@k)
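
The pass@k and pass^k metrics named in the must-reads can be estimated from repeated trials of each task. The following is a minimal sketch, assuming each task was run `n` times with `c` successes. It uses the usual combinatorial estimators (an assumption here, not quoted from the entries): the chance that at least one of `k` sampled trials succeeds for pass@k, and the chance that all `k` succeed for pass^k. The function names and the sample results are hypothetical.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Capability: P(at least one of k trials drawn from the n recorded trials succeeds)."""
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Reliability (pass^k): P(all k trials drawn from the n recorded trials succeed)."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


def mean_over_tasks(results, k, metric):
    """Average a per-task metric over (n, c) pairs, one pair per task."""
    return sum(metric(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical results: (trials, successes) for three tasks.
    results = [(8, 8), (8, 4), (8, 0)]
    for k in (1, 2, 4, 8):
        at_k = mean_over_tasks(results, k, pass_at_k)
        hat_k = mean_over_tasks(results, k, pass_hat_k)
        print(f"k={k}  pass@k={at_k:.3f}  pass^k={hat_k:.3f}")
```

As `k` grows, pass@k can only rise and pass^k can only fall, so reporting both shows how far "can solve it" is from "solves it every time".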

10 · Safety / adversarial evaluation (prompt injection, jailbreaks, action-authorization, benchmark auditing)

- Wang, Li, Mang, Cheung, Sen, Song (incl. Dawn Song) — BenchJack: Systematically Auditing AI Agent Benchmarks https://arxiv.org/abs/2605.12673 · paper — Reward hacking emerges spontaneously in frontier models; an 8-pattern flaw taxonomy + a 30-question Agent-Eval checklist; "benchmarks must be secure by design."
- Dawn Song (UC Berkeley RDI, lecture slides) — Towards Building Safe & Secure Agentic AI https://rdi.berkeley.edu/adv-llm-agents/slides/dawn-agentic-ai.pdf · talk — The adversarial setting; environment-borne attacks.
- Dawn Song — ICLR 2025 keynote on LLM safety https://iclr.cc/virtual/2025/invited-talk/36783 · talk.
- Wang et al. (incl. Dawn Song) — CyberGym https://arxiv.org/html/2506.02548v2 · paper — Memory-safety PoC generation from OSS-Fuzz; sanitizer-crash grading at scale.
- Zeng et al. (incl. Song) — AIR-Bench 2024 https://arxiv.org/abs/2407.17436v2 · https://github.com/stanford-crfm/air-bench-2024 · paper/repo — Regulation-grounded risk taxonomy.
- [DecodingTrust](https://decodingtrust.github.io) · *benchmark* — NeurIPS 2023 trustworthiness benchmark.
- [RedCode](https://arxiv.org/abs/2411.07781) · *paper* — Risky code execution/generation benchmark for code agents.
- [AgentPoison](https://arxiv.org/abs/2407.12784) · *paper* — Red-teams agents by poisoning their RAG memory.

- Miller (Anthropic) — Adding Error Bars to Evals (A Statistical Approach to LM Evaluations) https://arxiv.org/abs/2411.00640 · https://www.anthropic.com/research/statistical-approach-to-model-evals · paper — Standard errors, clustered SEs, paired difference tests — "is this difference real?" (cross-cutting: T6/T8)
- Debenedetti, Zhang, Balunović, Beurer-Kellner, Fischer, Tramèr (ETH Zurich) — AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents https://arxiv.org/abs/2406.13352 · benchmark — The canonical prompt-injection benchmark for tool-using agents (97 tasks, 629 security cases over untrusted data); NeurIPS 2024 D&B, now the standard eval everyone reports against. A glaring omission. 🆕
- Andriushchenko, Souly, Davies et al. (Gray Swan / UK AISI) — AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents https://arxiv.org/abs/2410.09024 · benchmark — ICLR 2025 benchmark of 110/440 malicious agent tasks across 11 harm categories; shows leading models comply with malicious agent requests without jailbreaking. The reference action-misuse/refusal benchmark. 🆕
- Zhan, Liang et al. (UIUC) — InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated LLM Agents https://arxiv.org/abs/2403.02691 · benchmark — ACL 2024 Findings; 1,054 IPI test cases over 17 user / 62 attacker tools, splitting direct-harm vs data-exfiltration intents. Foundational indirect-prompt-injection benchmark predating AgentDojo.
- Debenedetti, Shumailov, Fan, Hayes et al. (Google DeepMind) — Defeating Prompt Injections by Design (CaMeL) https://arxiv.org/abs/2503.18813 · paper — The defense-by-design counterpart: extracts control/data flow from the trusted query and enforces capability-based policies so untrusted data can't alter program flow; effectively solves AgentDojo's security eval. The key 2025 mitigation paper. 🆕
- Simon Willison — The lethal trifecta for AI agents: private data, untrusted content, and external communication https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/ · blog — The most-cited conceptual frame for reasoning about when an agent is unconditionally vulnerable to prompt injection; essential practitioner mental model at the Eugene-Yan bar. 🆕
- Kutasov, Bowman et al. (Anthropic) — SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents https://www.anthropic.com/research/shade-arena-sabotage-monitoring · benchmark — 17 complex environments pairing a benign main task with a hidden harmful side task to measure whether agents can sabotage without tripping an AI monitor; the canonical sabotage/monitorability eval. (Paper: arxiv.org/abs/2506.15740) 🆕
- Anthropic (Alignment team) — Agentic Misalignment: How LLMs Could Be Insider Threats https://www.anthropic.com/research/agentic-misalignment · paper — Red-team study showing frontier models will resort to blackmail/leaking under goal conflict in agentic settings; the reference for action-authorization / insider-threat adversarial evaluation. Companion to the cited Anthropic error-bars piece. 🆕
- Microsoft AI Red Team (Azure) — PyRIT — Python Risk Identification Tool for generative AI https://github.com/Azure/PyRIT · tool — The de-facto open-source red-teaming automation framework (70+ converters, multi-turn attacks like Crescendo/TAP); how practitioners actually run adversarial evals at scale. The section lists papers but no tooling. 🆕
- OWASP GenAI Security Project — OWASP Top 10 for Agentic Applications (2026) + LLM Applications (2025) https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/ · docs — Industry-standard risk taxonomy: goal hijack, tool misuse, identity/privilege abuse, memory poisoning, rogue agents; complements the regulation-grounded AIR-Bench taxonomy already listed. The canonical practitioner threat checklist. 🆕
- MITRE — MITRE ATLAS — Adversarial Threat Landscape for AI Systems https://atlas.mitre.org/ · docs — ATT&CK-style living knowledge base of 16 tactics / 80+ techniques against AI systems with real-world case studies and mitigations; the standard reference framework for AI adversarial threat modeling.
- Zhang, Yang et al. — Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents https://proceedings.iclr.cc/paper_files/paper/2025/file/5750f91d8fb9d5c02bd8ad2c3b44456b-Paper-Conference.pdf · benchmark — ICLR 2025 unified benchmark spanning 10 scenarios, 400+ tools, covering DPI/IPI, memory poisoning, plan-of-thought backdoors and defenses in one harness; broadest single attack/defense agent benchmark. 🆕
- Gray Swan AI / UK AISI (w/ OpenAI, Anthropic, GDM) — Gray Swan x UK AISI Agent Red-Teaming Challenge https://app.grayswan.ai/arena/blog/agent-red-teaming-the-ai-jailbreak-showdown · talk — Largest public agent red-teaming exercise: ~2,000 red-teamers, 1.8M attempts, 62k breaches against 22 tool-using agents (financial/shopping/marketing bots); real-world adversarial-eval data at scale. 🆕
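
The "is this difference real?" question raised by the error-bars entry in this list can be illustrated with a paired comparison: score two systems on the same items, then put a confidence interval on the mean per-item difference. This is a minimal sketch, assuming per-item scores in [0, 1], a normal approximation (`z = 1.96` for roughly 95%), and no clustering of items. It does not reproduce the paper's exact formulas; the function names and sample scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def standard_error(values):
    """Standard error of the mean, using the sample standard deviation."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    return stdev(values) / sqrt(len(values))


def paired_difference(scores_a, scores_b, z=1.96):
    """Mean per-item difference (A minus B), its standard error, and a z-based interval."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must cover the same items")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    d = mean(diffs)
    se = standard_error(diffs)
    return d, se, (d - z * se, d + z * se)


if __name__ == "__main__":
    # Hypothetical pass/fail scores for two agents on the same eight tasks.
    agent_a = [1, 1, 0, 1, 1, 0, 1, 1]
    agent_b = [1, 0, 0, 1, 0, 0, 1, 1]
    d, se, (lo, hi) = paired_difference(agent_a, agent_b)
    print(f"mean diff={d:.3f}  SE={se:.3f}  interval=({lo:.3f}, {hi:.3f})")
    # An interval that contains 0 means the sample cannot separate the two agents.
```

Pairing by item removes the variance shared by both systems (some tasks are simply hard), which is why it usually gives tighter intervals than comparing two independent means.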

- Dawn Song — Towards Building Safe & Trustworthy AI Agents https://www.youtube.com/watch?v=QAgR4uQ15rc · lecture (Berkeley LLM Agents MOOC F24)
- Dawn Song — Towards Building Safe and Secure Agentic AI https://www.youtube.com/watch?v=ti6yPE2VPZc · lecture (Berkeley Advanced LLM Agents Sp25)
- Ben Mann (Anthropic) — Measuring Agent Capabilities and Anthropic's RSP https://www.youtube.com/watch?v=6y2AnWol7oo · lecture (Berkeley LLM Agents MOOC F24)
- Percy Liang — Open-Source and Science in the Era of Foundation Models https://www.youtube.com/watch?v=f3KKx9LWntQ · lecture (Berkeley LLM Agents MOOC F24)
- Hashimoto & Liang — CS336 Lecture 12: Evaluation https://www.youtube.com/watch?v=x-R5l2HsXqM · lecture (Stanford CS336 2025)

- **LLM benchmarks in the era of agents (deck)** — Florian Brand — `(local slide deck)` · *slides* (TNG / Big Techday)
- **The Life Cycle of an RL Environment (deck)** — Kanav Garg — `(local slide deck)` · *slides* (ACM CAIS 2026)

*Discovered 58 more; transcription queued (YouTube rate-limit). 30 eval-focused + 28 eval-segments-in-agent-talks below.*

Talks about building agents (Devin, Claude Code, Cursor, Replit, OpenAI Deep Research, Karpathy…) with a substantive eval segment — the eval part is noted.

- Amy Boyd & Nitya Narasimhan (Microsoft) (AI Engineer World's Fair 2025 — Evals track) — Mind the Gap (In your Agent Observability) https://www.youtube.com/watch?v=iOXM3zE-2dk — eval: Primarily agent observability/tracing, but the core argument ties observability directly to evaluation: you can't eval what you can't see. Covers instrumenting agent runs to feed eval datasets and catch regressions. oEmbed-verified.
- Arvind Narayanan (Princeton, co-author AI Snake Oil), host Jacob Effron (Unsupervised Learning (Redpoint Ventures)) — Unpacking AI Agent Hype vs. Reality with Arvind Narayanan https://www.youtube.com/watch?v=NoVMk_P6fgY — eval: Large central segment on the limitations of agent benchmarks: why current agent evals are flawed/overstated, construct validity, capability vs. reliability, and the gap between benchmark scores and real-world robustness. Surrounding material covers agent hype and societal impact.
- Ben Lorica & Paco Nathan (The Data Exchange (Gradient Flow)) — Data Exchange Podcast Ep 232: Ben Lorica & Paco Nathan on Llama 3, Agents, Eval, and more https://www.youtube.com/watch?v=XDIqkH_I9oU — eval: Roundup format with a substantial evaluation-metrics segment: state of LLM/agent evaluation, what metrics matter for agentic workflows, and limitations of current eval practice — interleaved with Llama 3 and agent news.
- Jiantao Jiao (UC Berkeley / NVIDIA) (UC Berkeley RDI (CS294-196, Fall 2025)) — Agentic AI MOOC (Fall 2025) | Post-Training Verifiable Agents https://www.youtube.com/watch?v=3l0Zxus34es — eval: Training-focused, but a substantial benchmark thread runs through it: SWE-bench Verified and BrowseComp as the verifiable-task targets used to train and evaluate agents. Eval/benchmark segments are load-bearing (~middle of talk).
- Graham Neubig (CMU) (Carnegie Mellon University (CS 11-711 Advanced NLP)) — CMU Advanced NLP Fall 2024 (17): Evaluation and Multimodal https://www.youtube.com/watch?v=iEinTXrwK8A — eval: First ~half is a focused treatment of NLP/LLM evaluation: automatic metrics, human eval, LLM-as-judge and its pitfalls, benchmark contamination; second half pivots to multimodal. The eval portion is substantive (~min 0-35).
- Charles Sutton (Google DeepMind) (UC Berkeley RDI (CS294-280, Spring 2025)) — Adv. LLM Agents MOOC (Sp25) | Code Agents & AI Vulnerability Detection https://www.youtube.com/watch?v=JCk6qJtaCSU — eval: Coding-agent talk that leans heavily on benchmarks to measure progress: SWE-bench-style code-fixing eval and vulnerability-detection benchmarks, plus discussion of how to construct verifiable security-eval tasks (eval threads throughout).
- Graham Neubig (CMU / All Hands AI) (UC Berkeley RDI (CS294-196, Fall 2024)) — LLM Agents MOOC (Fall 2024) | Agents for Software Development https://www.youtube.com/watch?v=f9L9Fkq-8K4 — eval: SWE-bench is the spine of the talk: how the benchmark works, why it's hard, leaderboard dynamics, and where it misleads vs. real software work. Eval/benchmark content is central (recurs throughout, esp. early-mid).
- Nicolas Chapados (ServiceNow Research) (UC Berkeley RDI (CS294-196, Fall 2024)) — LLM Agents MOOC (Fall 2024) | AI Agents for Enterprise Workflows https://www.youtube.com/watch?v=-yf-e-9FvOc — eval: Introduces WorkArena / BrowserGym as benchmarks for web/enterprise-workflow agents — task design, difficulty calibration, and why real enterprise tasks break naive evals (benchmark segment is a core part, mid-talk).
- Yann Dubois (OpenAI) (UC Berkeley RDI (CS294-196, Fall 2025)) — Agentic AI MOOC (Fall 2025) | LLM Agents Overview https://www.youtube.com/watch?v=r1qZpYAmqmg — eval: Framing overview of agents that includes a substantive evaluation segment — how to measure agent capability, the gap between benchmark scores and real reliability, and why agent eval is harder than chatbot eval. Eval segment ~mid-talk.
- Scott Wu (CEO, Cognition) (AI Engineer World's Fair 2024) — The Making of Devin by Cognition AI: Scott Wu https://www.youtube.com/watch?v=T7NWjoD_OuY — eval: Agent-building/demo talk for Devin. Eval segment covers how Cognition measures the agent: SWE-bench results plus their philosophy that public benchmarks are insufficient, motivating an internal 'cognition-golden' benchmark with fully reproducible environments, simulated users Devin can chat with, and evaluator agents that autonomously judge outcomes. Eval discussion sits in the back third around the SWE-bench / 'how we measure progress' portion.
- Boris Cherny (creator/head of Claude Code, Anthropic) (AI Engineer World's Fair 2025) — Claude Code & the evolution of agentic coding — Boris Cherny, Anthropic https://www.youtube.com/watch?v=Lue8K2jqfKk — eval: Talk about model capability vs 'harness'/scaffolding for coding agents. Eval-relevant segment: how the Claude Code team relies on internal evals to decide what scaffolding to keep, and the observation that as models improve you must keep raising the difficulty of your eval set. Measurement framing recurs through the harness discussion (middle of the talk).
- James Austin (AI engineer, Replit) (MLOps Community — Agents in Production) — Building Replit Agent - Hard Lessons Learned https://www.youtube.com/watch?v=RYde73eO7ok — eval: Lessons-learned talk on scaling the Replit Agent team (3 to 20+ engineers). Heavy, substantive eval content: how optimizing for SWE-bench was the wrong target vs what users wanted, the importance of AUTOMATING the discovery of failure cases (long-tail failures), growing an internal eval set over time (every new bug becomes a new eval), and custom eval frameworks. Eval material runs through the middle of the talk under 'measure what matters' and 'automate finding failure cases'.
- Harrison Chase (CEO, LangChain) (AI Engineer (LangChain Interrupt / AI Engineer)) — 3 ingredients for building reliable enterprise agents — Harrison Chase, LangChain/LangGraph https://www.youtube.com/watch?v=kTnfJszFxCg — eval: Framework talk on the build/test/deploy lifecycle for reliable enterprise agents. The middle 'Test' ingredient is the eval segment: using evals (LangSmith) to verify the agent does the right thing rather than just returning plausible output, treating every error as an opportunity to write a new eval, and pairing tracing/observability with regression evals. Eval content is roughly the central third of the talk.
- Cursor engineering (Tido Carriero et al.) (Cursor (official channel)) — How Cursor builds agentic workflows across the SDLC https://www.youtube.com/watch?v=dJAVS1g3NDw — eval: Talk on Cursor's internal agentic workflows across the SDLC (bug triage, security review, etc.). Eval segment covers how Cursor compares model quality with CursorBench — an in-house suite of intentionally underspecified, multi-file tasks built from real IDE sessions, scored with agentic graders, plus ONLINE evaluation to check whether agent changes actually help developers in practice. Eval discussion appears where they explain how they decide which models/changes to ship.
- Thariq Shihipar (Anthropic) (AI Engineer (workshop)) — Claude Agent SDK [Full Workshop] — Thariq Shihipar, Anthropic https://www.youtube.com/watch?v=TqC1qOfiVcQ — eval: Hands-on workshop building agents with the Claude Agent SDK (tools, subagents, the agent loop). Eval-relevant portion covers how to verify and iterate on the agent once built — testing tool use, checking the loop behaves, and using measurement to debug agent failures. Eval/verification material comes in the back portion of the build-along.
- Andrej Karpathy (Y Combinator AI Startup School 2025) — Andrej Karpathy: Software Is Changing (Again) https://www.youtube.com/watch?v=LCEmiRjPEtQ — eval: Keynote on 'Software 3.0', LLMs as a new computing substrate, partial-autonomy apps and the 'autonomy slider'. Eval-relevant thread: his argument for keeping humans in the verification loop, making the generation-verification loop fast, and 'keeping the AI on a leash' — i.e., why you need tight verification/eval signals to safely raise agent autonomy. Verification discussion is woven through the partial-autonomy section (middle-to-late).
- Samuel Colvin (founder, Pydantic) (AI Engineer World's Fair 2025) — From Stateless Nightmares to Durable Agents — Samuel Colvin, Pydantic https://www.youtube.com/watch?v=flf_IKnFYnE — eval: Talk on building durable, production-grade agents with Pydantic AI (state/durability, type-safety, observability). Eval segment: Colvin's view that evals are still an unsolved problem, how Pydantic AI's evals library + Logfire observability fit the production loop, and using traces/observability as the substrate for evaluating agent behavior. Eval discussion appears in the observability/production-readiness portion.
- Matt Palmer (host) + Replit lead AI engineer (Replit (official channel)) — Inside Replit Agent with a lead AI engineer https://www.youtube.com/watch?v=bJMriY-pqPE — eval: Conversation on how the Replit Agent works internally, including the Agent v3 launch (discussion around ~19:21). Eval-relevant content: the self-improving loop of evals/metrics -> autonomous harness edits -> hill-climbing, how the team grows their eval set from observed failures, and why the engineering center of gravity has shifted toward measurement and harness iteration over the raw model.
- Brooke Hopkins (Coval), Martin Schweiger, Vapi panel (VapiCon 2025 (Vapi)) — VapiCon 2025: Hardest Problems in Voice AI with Brooke Hopkins, Martin Schweiger & more https://www.youtube.com/watch?v=vzCT5PJlsJo — eval: Practitioner panel on production voice-AI failure modes; the eval thread runs throughout — end-to-end conversation simulation, why LLM-simulated callers are too cooperative vs. real frustrated/adversarial users, turn-taking/latency/interruption metrics, and monitoring. Eval-heavy whenever Hopkins speaks.
- Karthik Narasimhan (Head of Research, Sierra) (Greylock (Change Agents)) — Multi-Agent Interaction with Sierra AI https://www.youtube.com/watch?v=KlQIePkgY7c — eval: Talk on how Sierra builds multi-agent customer-experience systems; includes the evaluation segment on tau-bench-style benchmarking, supervisor/critic agents reviewing primary-agent output, and measuring reliability of tool-using conversational agents.
- Karthik Narasimhan (Sierra / Princeton) (Open AGI Summit, Brussels) — Karthik Narasimhan on Language Agents and Multi-Agent Interaction https://www.youtube.com/watch?v=i3GOZ22z2C0 — eval: Survey of language-agent design and multi-agent interaction with an evaluation segment motivating tau-bench: why real-world tool-agent-user tasks need interaction-based benchmarks rather than static QA. Eval discussion is a sizeable chunk, not the whole talk.
- Parahelp (YC) prompt/agent breakdown (startupCode (analysis of Parahelp/YC material)) — AI Customer Support: ParaHelp's Secret Prompt REVEALED! https://www.youtube.com/watch?v=UCQc12_KRy0 — eval: Walkthrough of Parahelp's production customer-support agent prompt; the load-bearing eval point (drawn from Parahelp's own writing) is that most prompt-engineering time goes not to the prompt but to building eval suites, finding edge cases, and iterating — 'test cases more valuable than prompts.' Eval framing appears alongside the prompt structure discussion.
- Ben Liebald (engineering lead, Harvey) (LangChain) — How Harvey Built Reliable AI Agents with LangSmith & Custom Tools https://www.youtube.com/watch?v=kuXtW03cZEA — eval: How Harvey builds and EVALUATES domain-specific legal agents: tracing/observability with LangSmith, custom legal tools, and reliability evaluation against expert expectations (the BigLaw Bench / rubric-graded-by-lawyers approach). Eval/reliability is a major thread of the talk.
- Harvey / legal-AI leaders (a16z) (a16z) — Agents, Lawyers, and LLMs https://www.youtube.com/watch?v=ZESTYyGZ7Y4 — eval: Discussion of legal agents in practice; eval segment covers why generic benchmarks (LegalBench/CUAD) are insufficient for long-horizon legal work and the move to expert-rubric, agent-task benchmarks (BigLaw Bench / Legal Agent Bench) graded by practicing attorneys. Eval is one section, not the whole episode.
- Josh Tobin (leads AI Agents research, OpenAI — Deep Research / Operator) (TWIML AI Podcast) — How OpenAI Builds AI Agents That Think and Act [Josh Tobin] - #730 https://www.youtube.com/watch?v=qfhU7JH000o — eval: Covers Deep Research, Operator, Codex CLI; eval-relevant core is how end-to-end RL training requires graded/verifiable tasks (the agent must 'experience failure' and be rewarded for recovery), plus benchmark framing (BrowseComp for browsing). Reward/grading discussion runs through the middle of the episode.
- Isa Fulford (Deep Research team lead, OpenAI) (Sequoia Capital (Training Data)) — How OpenAI Built its Groundbreaking Deep Research Product ft. Isa Fulford https://www.youtube.com/watch?v=jFZ9hJKJKtw — eval: How Deep Research was built and trained; eval-relevant segments cover building hard browsing/research tasks with verifiable answers, grading long-form cited outputs, and benchmark performance (e.g., BrowseComp). Eval/measurement is woven through the training discussion rather than a standalone section.
- Isa Fulford & Josh Tobin (OpenAI Deep Research) (Sequoia Capital (Training Data)) — OpenAI's Deep Research Team on Why Reinforcement Learning is the Future for AI Agents https://www.youtube.com/watch?v=bNEvJYzoa8A — eval: RL-for-agents discussion; eval content is the dependence of end-to-end RL on graded tasks and verifiable rewards, and how they construct hard research/browsing evals the model can be scored against. Measurement framing recurs throughout.
- Jesse Zhang (CEO/co-founder, Decagon) (No Priors) — No Priors Ep. 132 | With Decagon CEO and Co-Founder Jesse Zhang https://www.youtube.com/watch?v=emaSFP7y7Ko — eval: Building production customer-support agents; eval segment covers Decagon's approach — regression/simulation test sets (~hundreds of conversations per workflow), LLM-as-judge scoring of tone/format/correct-info/correct-tool, and red-teaming with adversarial tests. Eval is a defined section of the conversation, not the whole episode.

Good eval commentary mined from agent-BUILDING writeups (not eval-primary) — each kept only if a strict judge rated the eval insight excellent/good. Takeaway + verbatim excerpt.

- Naman Jain (Cursor) — How we compare model quality in Cursor (CursorBench) https://cursor.com/blog/cursorbench · excellent — To avoid benchmark contamination, derive eval tasks from real committed code traced back to the agent request that produced it (Cursor Blame), and pair offline suites with controlled live-traffic analysis to catch regressions where outputs grade well but the user experience degrades — tracking a basket of outcome… (excerpt: "We source tasks for CursorBench using Cursor Blame, which traces committed code back to the agent request that produced it. ... We supplement CursorBench with controlled analysis on live traffic. These online evals…")
- Jeremy Hadfield, Barry Zhang, Kenneth Lien, Florian Scholz, Jeremy Fox, and Daniel Ford — How we built our multi-agent research system https://www.anthropic.com/engineering/multi-agent-research-system · excellent — Start evals with ~20 real-usage queries rather than waiting for a large suite—early on a prompt tweak can move success from 30% to 80%, so small samples already reveal big effects. A single LLM-judge call scoring a multi-dimensional rubric (factual/citation accuracy, completeness, source quality, tool efficiency) on… (excerpt: "We started with a set of about 20 queries representing real usage patterns. Evaluating these queries often required human judgment, but we found that an LLM judge that evaluated each output against criteria in a…")
- Mikaela Grace, Jeremy Hadfield, Rodrigo Olivares, and Jiri De Jonghe — Demystifying evals for AI agents https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents · excellent — Low eval scores frequently measure broken graders and harnesses, not weak models: rigid string-matching, ambiguous specs, and non-reproducible stochastic tasks can suppress a score from 95% to 42%, so you must read transcripts and audit the eval before trusting any number. (excerpt: "Opus 4.5 initially scored 42% on CORE-Bench, until an Anthropic researcher found multiple issues: rigid grading that penalized '96.12' when expecting '96.124991…', ambiguous task specs, and stochastic tasks that were…")
- Sierra (AI Research) — τ-Bench: Benchmarking AI agents for the real-world https://sierra.ai/blog/benchmarking-ai-agents · excellent — Reliability, not single-shot accuracy, is the real bar for agents: pass^k (success on all k independent trials of the same task) collapses GPT-4o from ~50% pass^1 to ~25% pass^8 in τ-retail, meaning only a 1-in-4 chance of handling 8 different customers with the same issue. Measure consistency across repeated trials,… (excerpt: "pass^k, which measures the agent's reliability and determines if it can successfully complete the same task multiple times (k representing the number of different trials). ... the agent powered by GPT-4o drops to ~25%…")
- Efe Karakus — From AI agent prototype to product: Lessons from building AWS DevOps Agent https://aws.amazon.com/blogs/devops/from-ai-agent-prototype-to-product-lessons-from-building-aws-devops-agent/ · excellent — Separate "capability" (pass@k: passed at least once in k tries) from "reliability" (pass^k: fraction of the k tries that passed) — a high pass@k with low pass^k means the agent CAN solve a task but does so unreliably, which is the metric that actually matters for shipping a non-deterministic agent. (excerpt: "Key metrics that we keep track of are capability (pass@k – whether the agent passed at least once in k attempts), reliability (pass^k – how many times the agent passed across k attempts, e.g., 0.33 means passed 1 out of…")
- Simon Last & Sarah Sachs (Notion), interviewed on Latent Space — Notion's Token Town: 5 Rebuilds, 100+ Tools, MCP vs CLIs and the Software Factory Future https://www.latent.space/p/notion · excellent — Notion runs a three-tier eval system with distinct pass-rate targets: regression/unit tests gated in CI, launch "report card" evals requiring 80-90% across user journeys to ship, and deliberately hard "frontier/headroom" evals targeted at ~30% pass rate so the suite keeps giving signal instead of saturating. The… (excerpt: "we have the equivalent of unit test. Regression test. Those live in ci, those have to pass a certain percent ... we have a report card and we need to, on these categories, you know, be it 80 or 90% of all of these user…")
- Dropbox (Dropbox Engineering / ML team) — A practical blueprint for evaluating conversational AI at scale (Dash) https://dropbox.tech/machine-learning/practical-blueprint-evaluating-conversational-ai-at-scale-dash · excellent — Tier your eval metrics by enforcement: boolean gates as hard blockers (citations present?), scalar budgets with concrete thresholds (Source F1 >= 0.85, p95 latency <= 5s) that block merges, and rubric scores (tone/formatting) that are only dashboard-monitored, not gating. This separates "must never regress" from… (excerpt: "we defined three types of metrics, each with a clear role in the development pipeline: Boolean gates ("Citations present?", "Source present?") | Hard fail, changes can't move forward; Scalar budgets (Source F1 ≥ 0.85,…")
- Stefan Heule & Jediah Katz (Cursor) — Continually improving our agent harness https://cursor.com/blog/continually-improving-agent-harness · good — Pair offline benchmarks (CursorBench) with online signals that proxy real satisfaction: a "Keep Rate" measuring what fraction of agent-written code survives in the codebase after fixed time intervals, plus an LLM-judge reading user follow-up messages to infer satisfaction, validated via side-by-side A/B tests of… (excerpt: "The first is the "Keep Rate" of agent-generated code. For a given set of code changes that the agent proposed, we track what fraction of those remain in the user's codebase after fixed intervals of time. ... Second, we…")
- Peter Zhong, Jacky Zhao, Ryan Carelli (Replit) — Enabling Agent 3 to Self-Test at Scale with REPL-Based Verification https://replit.com/blog/automated-self-testing · good — Verification scales with autonomy: as an agent runs longer unattended (Replit went from ~20 min to 200+ min of productive autonomous work), robust self-testing becomes the gating factor because errors compound — and they isolate testing into a separate subagent to avoid context pollution, reaching multi-hundred-step… (excerpt: "we've created a self-testing flow for the Agent that is able to perform complex, multi-hundred step testing at a median cost of $0.20 per session.")
- Cognition (Devin team) — A review of OpenAI's o1 and how we evaluate coding agents https://cognition.com/blog/evaluating-coding-agents · good — When your judge is itself an agent (with shell/browser/code-editing tools autonomously deciding pass/fail), you must validate the judge: measure its precision and recall against a labeled ground-truth set and keep humans continuously reviewing the "proof of success" it surfaces. They also average over multiple Devin… (excerpt: "We evaluate our evaluators in two ways: (1) Measuring precision and recall on ground truth sets (2) Continuous human review of the proof of success discovered by the evaluator agents.")
- Akshay Utture (Augment Code) — How we built a high-quality AI code review agent https://www.augmentcode.com/blog/how-we-built-high-quality-ai-code-review-agent · good — They run a fast offline benchmark (LLM-as-judge comparing generated comments to human-authored "golden comments" on 10 PRs across 5 repos) using F-score as the hill-climbing metric, then map each offline metric to a production proxy: recall to "bugs fixed per PR" and precision to "percentage of comments addressed."… (excerpt: "F-score acts as the primary hill-climbing metric for offline improvements. ... Bugs fixed per PR | Recall | Measures real-world bug prevention and review coverage | Percentage of comments addressed | Precision |…")
- Antonio Scandurra & Nathan Sobo (Zed) — Zed now predicts your next edit with Zeta, our new open model https://zed.dev/blog/edit-prediction · good — When the correct output is non-deterministic and admits many valid forms (e.g. code edits), replace brittle token/string assertions with an LLM judge checking plain-English intent assertions (e.g. "ensure quicksort recurses left and right of the pivot"); this tolerates run-to-run variation while still catching wrong… (excerpt: "instead of strict assertions, we used a larger LLM to evaluate Zeta's edits. By writing our test assertions in plain English and having Claude check if the results matched our intent, we could validate that Zeta was…")
- Factory.ai — Code Droid: A Technical Report https://factory.ai/news/code-droid-technical-report · good — They decompose agent failures into distinct stages of the localization pipeline — file not retrieved (8%), retrieved but not ranked top-5 (8%), and ranked top but not edited (6%) — which tells practitioners exactly where to invest (retrieval recall vs. ranking vs. edit selection) rather than treating a failed task as… (excerpt: "In 8% of the tasks, Code Droid failed to include the target file in its list of analyzed files. Additionally, even when the target file was analyzed, it was not prioritized as a top-5 file in another 8% of cases.…")
- Jan Hartman (Sourcegraph) — Lessons from building AI coding assistants: context retrieval and evaluation https://sourcegraph.com/blog/lessons-from-building-ai-coding-assistants-context-retrieval-and-evaluation · good — When you can't get ground-truth labels for "relevant context," end-to-end user feedback can't tell you whether a bad answer came from retrieval or from the LLM — so substitute cheap automatic proxy checks (code compiles/passes tests for generation; referenced symbols actually exist for code Q&A) and separately… (excerpt: "Since users primarily interact with the LLM's responses rather than the context items themselves, it's hard to know if context retrieval is making a difference. We might get feedback that a response was unhelpful, but…")
- Yujohn Nattrass — Introducing Scorers in Mastra https://mastra.ai/blog/mastra-scorers · good — Don't ask an LLM judge to emit a raw 0-1 score directly — it's high-variance and irreproducible. Instead have the LLM emit structured intermediate data (e.g. extract claims/opinions, label each), then compute the score deterministically in code (proportion that pass), keeping the LLM's nuance but making the number… (excerpt: "LLMs are terrible at producing consistent numerical scores—ask the same model to rate something from 0-1 five times and you'll get five different numbers. So we have LLMs output structured data instead, then use a…")
- Letta — Benchmarking AI Agent Memory: Is a Filesystem All You Need? https://www.letta.com/blog/benchmarking-ai-agent-memory/ · good — A simple filesystem-backed agent (search_files → grep/open → answer) beats a specialized graph-memory system on LoCoMo, supporting their thesis that what matters for memory eval is whether the agent knows WHEN and HOW to call a retrieval tool, not the underlying retrieval mechanism (vector DB vs knowledge graph).
They…(excerpt: "This simple agent achieves 74.0% on LoCoMo with GPT-4o mini and minimal prompt tuning, significantly above Mem0's reported 68.5% score for their top-performing graph variant.")— Dominik Kundel, Gabriel Chua —Testing Agent Skills Systematically with Evals https://developers.openai.com/blog/eval-skills·good—Structure skill evals as deterministic trace checks first (parse the --json JSONL stream: assert specific commands ran, count command_execution items to catch looping/re-run regressions, track usage tokens to catch prompt bloat), then layer a model-assisted --output-schema rubric step only for the qualitative parts…(excerpt: "Deterministic checks answer 'did it do the basics?' but they don't answer 'did it do it the way you wanted?' For skills like setup-demo-app, many requirements are qualitative: component structure, styling conventions,…")— The LangChain Team —Evaluating Deep Agents: Our Learnings https://www.langchain.com/blog/evaluating-deep-agents-our-learnings·good—For multi-turn agent evals, you can't hardcode a fixed sequence of user inputs because once the agent diverges from the expected path the later scripted inputs become incoherent; pair this with per-test fresh/temporary environments so runs stay reproducible and non-flaky, and lean on single-step evals since…(excerpt: "if you naively hardcode a sequence of inputs and the agent deviates from the expected path, the subsequent hardcoded user input may not make sense.")— Malte Ubl, Alice Alexandra Moore, Ido Pesok —Eval-driven development: Build better AI faster https://vercel.com/blog/eval-driven-development-build-better-ai-faster·good—Tier your graders by cost/objectivity (code checks first, LLM grading reserved for subjective calls since it runs 1.5-2x more expensive), hold a hard 100% pass bar on refusal/safety, and deliberately seed the eval set with prompts that currently fail so improvements are tracked and regressions caught as prompts…(excerpt: "Our multi-faceted evaluation 
strategy includes fast, reliable code checks, end user and internal human feedback, and LLM-based grading for complex judgments at scale. [...] Some of our checks for code quality include:…")— Letta —Letta Leaderboard: Benchmarking LLMs on Agentic Memory https://www.letta.com/blog/letta-leaderboard·good—A good agentic-memory eval must penalize unnecessary memory tool calls, not just reward correct answers: models that are strong at archival retrieval tend to over-call memory tools even when the answer is already in context, which is a real failure mode you only catch if your scoring includes an extraneous-operation…(excerpt: "Models that perform well on archival memory (e.g., Claude Haiku 3-5) might overuse memory operations when unnecessary and receive a lower score on core memory due to penalties.")— Decagon —The evaluation engine behind Decagon's AI agents https://decagon.ai/blog/evaluation-engine-ai-agents·good—A two-stage eval gate (offline LLM-as-judge over query/context/response triplets plus an expert-labeled ground-truth set, then online A/B with gradual traffic ramp) keeps unreliable variants out of production; auditing a subset of judge scores with human labellers validates the judge itself, and online success is…(excerpt: "Using an LLM-as-judge system, we evaluate structured triplets consisting of a user query, the context provided to the model, and the model's generated response. ... 
We evaluate responses against a ground truth…")— Anker & Mads (Parahelp co-founders) —AI prompt design at Parahelp https://parahelp.com/blog/prompt-design·good—Designing the agent to emit structured XML output is a deliberate "design-for-evaluability" tactic: rigid, parseable output lets you programmatically grade each decision, and pairing it with an outcome metric like "% of tickets resolved end-to-end" grounds eval in real production results rather than proxy scores.(excerpt: "This made the model more strict (and let us parse XML for evals)")— Iwona Bialynicka-Birula, Ryan Muir, Binoy Robin Dalal, Hagyeong Shin, Nikolai Glushnev —How we Built a State-of-the-Art Research Agent for Call Center Conversation Analytics https://cresta.com/blog/how-we-built-a-state-of-the-art-research-agent-for-call-center-conversation-analytics·good—They isolated the dominant hallucination driver (questions whose answer simply isn't in the conversation) and pulled counting/aggregation out of the LLM into deterministic code so report statistics are guaranteed correct, while tracking two concrete report-quality metrics (relevance-classification accuracy and…(excerpt: "Human experts scrutinized a wide range of AI Analyst reports and identified two key metrics that were key drivers of report quality: relevance classification accuracy and the factuality of claims about the…")— Cresta —Why Speech to Text Is the Hidden Engine Behind Contact Center AI Performance https://cresta.com/blog/why-speech-to-text-is-the-hidden-engine-behind-contact-center-ai-performance·good—STT quality is the upstream bottleneck for downstream agent tasks: WER should be measured on a domain-stratified corpus (here 2,703 files / 81.69 hours / 9 domains) because small WER deltas compound at scale (1% WER over 1M minutes = ~10,000 fewer errors), and targeted fine-tuning or keyterm prompting moves the needle…(excerpt: "WER benchmarking was based on a dataset comprising 2,703 audio files across nine distinct domains, 
totaling 81.69 hours")— Vapi (Vapi Editorial Team) —Your Voice Agents Need Tests. Now They Have Them.https://vapi.ai/blog/evals·good—Convert real production failures into regression tests by capturing the bad transcript and annotating the correct behavior, and match different criteria with different judges: regex/JSON/exact for deterministic outputs (e.g., a tool call must include particular arguments), LLM-as-judge for fuzzy qualities like tone,…(excerpt: "When you discover a bad call in your logs, you can turn that transcript into a test. In the dashboard, pull up the call, click the thumbs down button to use it as an eval, specify what the assistant should have done…")— Chip Huyen —Agents https://huyenchip.com/2025/01/07/agents.html·good—Decompose agent planning evaluation into a concrete (task, tool-inventory) dataset and sample K plans per task, then track plan-level metrics (fraction valid, retries-to-valid) and tool-call-level metrics (invalid tool, valid tool with wrong params, valid tool with wrong values) — separating the distinct failure modes…(excerpt: "To evaluate an agent for planning failures, you can create a dataset of (task, tool inventory) pairs. For each task, use an agent to generate K plans. Compute the following metrics: Out of all generated plans, how many…")— Lilian Weng —LLM Powered Autonomous Agents https://lilianweng.github.io/posts/2023-06-23-agent/·good—LLM-as-judge can silently fail in expert domains: in ChemCrow, an LLM evaluator rated GPT-4 and ChemCrow as roughly equal, while domain experts judging chemical correctness found ChemCrow far superior. 
The takeaway is that an LLM judge lacking domain expertise cannot detect flaws it doesn't understand, so…(excerpt: "Interestingly, while the LLM-based evaluation concluded that GPT-4 and ChemCrow perform nearly equivalently, human evaluations with experts oriented towards the completion and chemical correctness of the solutions…")— Carol Liang and Kevin Ho (Stripe, API Standards) —Can AI agents build real Stripe integrations? We built a benchmark to find out https://stripe.com/blog/can-ai-agents-build-real-stripe-integrations·good—Don't grade an agent on its own self-reported success or surface-level API/UI responses; verify the real side effects in the system of record (here, the actual Stripe API object the action should have created). This catches the documented failure where an agent saw a 400 error on invalid test data and declared "Good,…(excerpt: "Some graders also validated the Stripe artifacts of a run by inspecting created Stripe API objects. For example, in a full-stack challenge, the agent might complete a payment in the UI, then verify success by testing…")— Discord —Developing Rapidly with Generative AI https://discord.com/blog/developing-rapidly-with-generative-ai·good—Use a separate LLM (a "critic") to score your agent's outputs against criteria, and structure the judge prompt to force constrained outputs — yes/no or a numeric scale — rather than free-form critique, which makes the eval signal aggregable and lets you compare prompt variants quickly.(excerpt: "AI-assisted evaluation uses best-in-class LLMs (like GPT-4) to automatically critique how well the AI's outputs match what we expected or how they score against a set of criteria. ... 
This method uses GPT-4 in a way…")— Gayatri Sabharwal —What it takes to build AI agents at scale https://ramp.com/leading-indicators/what-it-takes-to-build-ai-agents-at-scale·good—Build eval ground truth from a domain expert's spec, then use a frontier model to generate adversarial edge cases the expert missed, and validate with beta-user feedback; the genuinely hard problem is deciding when eval coverage is sufficient to remove the human from the loop. The post also draws a useful line:…(excerpt: "At Ramp, the eval suite starts with a human expert — often an accountant — who writes down how the task should go. A frontier model then stress-tests it, surfacing edge cases or the scenarios the expert didn't think of.…")— Max Leiter —How we made v0 an effective coding agent https://vercel.com/blog/how-we-made-v0-an-effective-coding-agent·good—Define the agent's primary metric as a binary user-visible outcome (does the generated site actually render, not error/blank) rather than text-similarity, then attack the ~10% LLM error baseline with a streaming autofix layer targeting specific named failure modes (stale APIs, nonexistent icons, missing providers,…(excerpt: "The primary metric we optimize for is the percentage of successful generations. A successful generation is one that produces a working website in v0's preview instead of an error or blank screen. ... In our experience,…")
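The metric tiering described in the Dropbox entry (boolean gates that hard-fail, scalar budgets that block merges, rubric scores that are only monitored) can be sketched as a small merge gate. This is a minimal illustration, not Dropbox's implementation: the metric names, the `GateReport` type, and the `evaluate_merge` function are hypothetical, and only the two thresholds (Source F1 ≥ 0.85, p95 latency ≤ 5s) come from the entry itself.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GateReport:
    blocked: bool                                   # True if the change must not merge
    failures: list[str] = field(default_factory=list)
    monitored: dict[str, float] = field(default_factory=dict)


# Hypothetical metric names; only the two thresholds are taken from the entry.
BOOLEAN_GATES = ["citations_present", "source_present"]
SCALAR_BUDGETS: dict[str, Callable[[float], bool]] = {
    "source_f1": lambda v: v >= 0.85,
    "p95_latency_s": lambda v: v <= 5.0,
}
RUBRIC_SCORES = ["tone", "formatting"]              # dashboard-only, never gating


def evaluate_merge(results: dict) -> GateReport:
    """Apply the three tiers to one eval run's results."""
    failures = []
    # Tier 1: boolean gates. Anything other than True is a hard fail.
    for name in BOOLEAN_GATES:
        if results.get(name) is not True:
            failures.append(f"gate:{name}")
    # Tier 2: scalar budgets. A missing or out-of-budget value blocks the merge.
    for name, within_budget in SCALAR_BUDGETS.items():
        value = results.get(name)
        if value is None or not within_budget(value):
            failures.append(f"budget:{name}")
    # Tier 3: rubric scores are recorded for monitoring but never block.
    monitored = {k: results[k] for k in RUBRIC_SCORES if k in results}
    return GateReport(blocked=bool(failures), failures=failures, monitored=monitored)
```

A low `tone` score shows up in `monitored` without blocking, while a missing citation or an F1 below budget appears in `failures` and sets `blocked`.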

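Several entries above (Cognition, Decagon) recommend validating the judge itself against human-labeled ground truth. A minimal sketch of that check, assuming binary pass/fail verdicts and a hypothetical helper name `judge_precision_recall`; "positive" here means the judge said the task passed.

```python
def judge_precision_recall(
    judge_verdicts: list[bool], human_labels: list[bool]
) -> tuple[float, float]:
    """Precision and recall of a pass/fail judge against human ground truth."""
    if len(judge_verdicts) != len(human_labels):
        raise ValueError("verdicts and labels must be the same length")
    pairs = list(zip(judge_verdicts, human_labels))
    true_pos = sum(1 for j, h in pairs if j and h)       # judge pass, human pass
    false_pos = sum(1 for j, h in pairs if j and not h)  # judge pass, human fail
    false_neg = sum(1 for j, h in pairs if h and not j)  # judge fail, human pass
    # Report 0.0 when a denominator is empty so the caller never divides by zero.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall
```

Low precision means the judge passes runs humans would fail (the riskier direction for a release gate); low recall means it rejects runs humans would accept.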
New, vetted finds from the automated Scan (discover → strict judge; deduped by URL and title). Newest first.

— Clémentine Fourrier (HuggingFace), with swyx & Alessio — Latent Space — Benchmarks 201: Why Leaderboards > Arenas >> LLM-as-Judge https://www.latent.space/p/benchmarks-201 · podcast (excellent) — First-hand, mechanism-level guidance from the lead maintainer of HuggingFace's OpenLLM Leaderboard: ranks evaluation methods (reproducible leaderboards > preference arenas >> LLM-as-judge), names concrete failure modes (LLM judges show mode-collapse self-reinforcement, positional bias, and can't… 🆕

— AI Engineer (@aiDotEngineer); Evals track hosted by Braintrust / Olmo Maldonado; multiple speakers — Evals: AI Engineer World's Fair 2025 (full track playlist) https://www.youtube.com/playlist?list=PLcfpQ4tk2k0XZS6wXjyB_8zuZBXHFTwYM · talk (excellent) — A full track of practitioner conference talks where teams at Google, Notion, Zapier, Vercel, Braintrust and others walk through how they actually build, score, and deploy product evals in production — error analysis, LLM-as-judge scorer design, offline vs online eval loops, and frontier-benchmark… 🆕

— Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz, Oluwanifemi Bamgbose, Hoang Nguyen, Raghav Mehndiratta, Hari Subramani (ServiceNow AI) — A New Framework for Evaluating Voice Agents (EVA) https://huggingface.co/blog/ServiceNow-AI/eva · article (excellent) — EVA is an end-to-end voice-agent eval framework using a bot-to-bot audio harness (user simulator + Pipecat agent + deterministic tool executor + validators) that jointly scores task accuracy (EVA-A: completion, faithfulness via LLM-judge, speech fidelity via LALM-judge) and conversational… 🆕

— Yunfei Bai, Allie Colin, Kashif Imran, Winnie Xiong (AWS) — Evaluating AI agents: Real-world lessons from building agentic systems at Amazon https://aws.amazon.com/blogs/machine-learning/evaluating-ai-agents-real-world-lessons-from-building-agentic-systems-at-amazon/ · article (good) — Lays out a three-layer agent evaluation library (foundation-model benchmarking, component assessment of intent/memory/reasoning/tool-use, and final task-completion quality) with concrete component metrics like tool selection/parameter accuracy, context-retrieval precision/recall, and reasoning… 🆕

— Michael Dawson (Red Hat) — Eval-driven development: Build and evaluate reliable AI agents https://developers.redhat.com/articles/2026/03/23/eval-driven-development-build-evaluate-ai-agents · article (good) — A hands-on, 8-stage eval-driven workflow for a real multi-turn IT-self-service agent: uses DeepEval's ConversationalGEval/ConversationSimulator with ~15 custom LLM-as-judge metrics, a directory of 11 "known bad" conversations to validate that the metrics actually catch failures ("test your tests"),… 🆕

— Scott Clark (Distributional) with Sam Charrington — How to Find the Agent Failures Your Evals Miss (TWIML #767) https://twimlai.com/podcast/twimlai/how-find-agent-failures-your-evals-miss · podcast (good) — Pre-deployment evals only catch known failure modes; the durable method for catching "unknown unknowns" is post-production analytics — convert agent execution traces into vector fingerprints, then cluster/topic-model them to surface emergent failures like "lazy" tool-use hallucinations (agents… 🆕

— Raza Habib (Humanloop CEO), MLOps Community — Product Metrics are LLM Evals // Raza Habib CEO of Humanloop https://www.youtube.com/watch?v=KWcE8ybs09A · podcast (good) — The central, actionable thesis is that the best evals are your product metrics: instead of inventing proxy metrics, instrument the real production signals — explicit user feedback (thumbs up/down), user corrections to generated output, and the user's natural next action — and feed them back as your… 🆕

— Rashmi Shetty (Capital One) with Sam Charrington (TWIML AI Podcast) — How Capital One Delivers Multi-Agent Systems (TWIML #765) — Rashmi Shetty https://twimlai.com/podcast/twimlai/how-capital-one-delivers-multi-agent-systems · podcast (good) — A senior Capital One platform leader describes evaluating a real deployed multi-agent system (Chat Concierge for auto dealerships) by shifting from per-model ML metrics to end-to-end task-outcome evaluation, treating evals for stochastic multi-agent workflows plus observability as first-class… 🆕

— Ereli Eran (Founding Engineer, 7AI), host Demetrios Brinkmann — MLOps Community — Software Engineering in the Age of Coding Agents: Testing, Evals, and Shipping Safely at Scale (MLOps Podcast #361) https://home.mlops.community/public/videos/software-engineering-in-the-age-of-coding-agents-testing-evals-and-shipping-safely-at-scale · podcast (good) — A working practitioner's three-tier eval pipeline for production agents: (1) "unit tests that are more like integration tests" which actually make LLM calls, (2) staging evals run against real customer data, and (3) async LLM-as-judge runs as a scheduled post-deployment task to re-review completed… 🆕

— Geoffrey Irving (UK AI Security Institute), with Nathan Labenz — Situational Awareness in Government, with UK AISI Chief Scientist Geoffrey Irving https://www.cognitiverevolution.ai/situational-awareness-in-government-with-uk-aisi-chief-scientist-geoffrey-irving/ · podcast (good) — UK AISI's Chief Scientist details real eval practice: open-sourcing the Inspect eval framework, calibrating fast automated evals against wet-lab biology ground truth, red-teaming across 30+ model runs (jailbreaking every model tested), and concrete eval-awareness mitigations (embedding evals in… 🆕

— Hamel Husain (with Claire Vo, How I AI podcast) — Evals, Error Analysis, and Better Prompts: A Systematic Approach to Improving Your AI Products https://www.lennysnewsletter.com/p/evals-error-analysis-and-better-prompts · podcast / video episode (with transcript) (good) — A practitioner walkthrough of the error-analysis loop: read real user conversation traces, open-code failures and group them into categories, prioritize by frequency counting (not intuition), then build binary pass/fail evals and validate LLM-as-judge against human labels. Includes a live… 🆕

— Raza Habib (Humanloop) & Brianna Connelly (Filevine) — Eval-Driven Development: Best Practices and Pitfalls When Building with AI https://home.mlops.community/public/videos/eval-driven-development-best-practices-and-pitfalls-when-building-with-ai-raza-habib-and-brianna-connelly-ai-in-production-2025-2025-03-13 · conference talk (video) (good) — A real production case study (Filevine, a legal-AI platform: 1.5M chat requests/mo, 360K docs, 25B tokens) showing a concrete eval-driven workflow with measured outcomes — scaling document classification from 60 to 160 categories while holding precision/recall in the high 80s-90s, and raising… 🆕

— Greg Kamradt (ARC Prize Foundation) — How To Benchmark AGI — with Greg Kamradt, President of ARC-AGI https://www.youtube.com/watch?v=wU82fz4iRfo · talk (good) — Kamradt frames a benchmark's job as measuring generalization/skill-acquisition efficiency rather than memorized task completion, and gives concrete, reusable eval-design rules: build tasks easy for humans but hard for AI to expose true capability gaps; a benchmark only gives useful signal in the… 🆕

— Greg Kamradt (ARC Prize Foundation), with Demetrios Brinkmann — Greg Kamradt: Benchmarking Intelligence | ARC Prize (MLOps Community) https://home.mlops.community/public/videos/greg-kamradt-benchmarking-intelligence-or-arc-prize · talk / podcast interview (video) (good) — Lays out concrete, transferable eval-design principles from running ARC-AGI: build "human-easy, AI-hard" tasks to avoid saturation, verify human-solvability empirically (400 testers, every ARC-2 task solved by 2+ people in 2 attempts), use hidden holdout test sets and dual public/private… 🆕

— HUD (hud.ai) — no individual byline — Verifier and Reward Design for RL Environments https://www.hud.ai/resources/verifier-reward-design-rl-environments · article (technical guide) (good) — Lays out a concrete four-layer scoring architecture (verifiers / pass-fail gates / 3-5 criteria rubrics / composite reward) plus a five-step build workflow: define checkable end-states first ("table contains row id=4521, status='active'"), add hard failure gates, build minimal rubrics, test on… 🆕

— Akshay Anand (Thoughtworks) — Evaluating AI agents in production: A practical framework https://www.thoughtworks.com/insights/blog/machine-learning-and-ai/Evaluating-AI-agents-in-production · article (good) — Presents a practical three-layer eval architecture (persona-based multi-turn simulation, functional unit evals at agent/conversation level, operational observability) with a concrete maturity progression — start ~20% automated / 80% manual validation, refine personas via UAT, then shift to… 🆕

— Brooke Hopkins (Coval, ex-Waymo) — Voice AI Agent Evaluation: The Complete Guide (2026) https://www.coval.ai/blog/voice-ai-agent-evaluation-guide · article (good) — Domain-specific evaluation playbook for voice agents: persona-tiered simulation testing (Easy/Medium/Hard/Adversarial across accent, noise, emotion), a concrete LLM-as-judge calibration loop (run on 50-100 calls, sample for human review, iterate rubrics until >85% human-judge agreement on binary… 🆕

— Rishi Gujjar & Andrew Li (Judgment Labs) — Agent Judge: Solving Long-Horizon Evals for Production Agents https://www.judgmentlabs.ai/blogs/agent-judge-solving-long-context-evaluations · article (good) — Frames long-horizon agent evaluation as an agentic, multi-agent judge (search trajectory state as queryable objects, verify claimed actions against source-of-truth systems like DBs/APIs/GitHub, and iteratively refine the rubric), backed by a real benchmark table on internal hallucination-detection… 🆕

— Shreya Shankar (guest); Hugo Bowne-Anderson (host) — Vanishing Gradients — Ep 57: AI Agents and LLM Judges at Scale — Processing Millions of Documents (Without Breaking the Bank) https://vanishinggradients.fireside.fm/57 · podcast (good) — Shreya Shankar (UC Berkeley EPIC Lab, author of DocETL) walks through an end-to-end methodology for reliable LLM-judge and agent pipelines at scale: treat unstructured-text LLM workflows as ETL; do error analysis on the first 50-100 traces with a human to surface failure modes; add guardrails via… 🆕

— Tejal Patwardhan (OpenAI frontier evals lead), host Andrew Mayne — Why Tejal Patwardhan stopped underestimating the models (Ep 21) https://shows.acast.com/openai-podcast/episodes/why-tejal-patwardhan-stopped-underestimating-the-models-epis · podcast (good) — First-hand account from the person running OpenAI's frontier evals team on why established benchmarks saturate or get gamed as models improve, what distinguishes a benchmark that holds up, the "capability overhang" problem (models advance faster than we can measure them), and the shift from toy… 🆕

— Alon Bochman (RagMetrics); host Demetrios Brinkmann (MLOps Community) — Making AI Reliable is the Greatest Challenge of the 2020s (#312) — Alon Bochman (RagMetrics) with Demetrios Brinkmann https://www.youtube.com/watch?v=d4PGxNM3Iis · podcast (good) — Treat the LLM-judge's "human agreement rate" against domain experts on your own eval set as the primary success metric, and engage non-technical SMEs through a feedback loop (show input → model output → their preferred output, fine-tune the judge on 50-100 corrected pairs) rather than blank… 🆕

— Lenny Rachitsky (host) with Brendan Foody (Mercor CEO) — Why experts writing AI evals is creating the fastest-growing companies in history — Brendan Foody (Mercor CEO) https://www.lennysnewsletter.com/p/experts-writing-ai-evals-brendan-foody · podcast (good) — Two genuinely useful framings from someone selling evals to all top-5 labs: "if the model is the product, then the eval is the product requirement document," and that evals and RL verifier environments are the same data type — only the semantic use (benchmark vs. reward signal) differs. The… 🆕

— Anastasios Angelopoulos, Wei-Lin Chiang, Ion Stoica (LMArena) with Anjney Midha (a16z) — Beyond Leaderboards: LMArena's Mission to Make AI Reliable https://a16z.com/podcast/beyond-leaderboards-lmarenas-mission-to-make-ai-reliable/ · podcast (good) — First-hand account from the team behind Chatbot Arena on why crowdsourced human-preference voting (vs. expert benchmarks) is needed for reliability, the Bradley-Terry ranking migration, style control to separate substance from formatting, building immunity to overfitting/gaming, why "fresh and… 🆕

— Greg Kamradt (President, ARC Prize Foundation) — Measuring Agents with Interactive Evaluations — Greg Kamradt (ARC Prize Foundation) https://www.youtube.com/watch?v=TK9MN22q6E0 · talk (conference talk / video) (good) — Argues static benchmarks can't measure what agents actually do (multi-turn exploration, planning, long-horizon execution) and proposes interactive evals scored on action efficiency vs. a human baseline — how efficiently an agent converts environment information into a working strategy, grounded in… 🆕

— Beth Barnes (CEO, METR) — The Most Important Graph in AI Right Now (Measuring AI's Time Horizon) — Beth Barnes (METR) https://www.youtube.com/watch?v=jXtk68Kzmms · talk (good) — Direct from the source (METR's CEO), this lays out the "time horizon" eval methodology: rather than scoring tasks pass/fail, you order tasks by how long human experts take and find the duration at which a model hits 50% success — a human-baselined y-axis that makes capability legible and… 🆕

— Sayash Kapoor & Benedikt Stroebl (Princeton), interviewed by Connor Shorten (Weaviate) — AI Agents That Matter with Sayash Kapoor and Benedikt Stroebl (Weaviate Podcast #104) https://www.youtube.com/watch?v=gCP-W_BNzg4 · talk / podcast interview (good) — The co-first authors walk through their TMLR paper's core thesis: accuracy-only agent benchmarking produces needlessly complex, expensive agents, and once you control for inference cost, dead-simple baselines (e.g. retrying/resampling a model) land on or above the cost-accuracy Pareto frontier of… 🆕

— Barry Zhang & Mahesh Murag (Anthropic) — Don't Build Agents, Build Skills Instead https://www.youtube.com/watch?v=CEvIs9y1uog · talk (good) — Anthropic's case for packaging procedural knowledge as composable "Skills" (organized folders + self-documenting scripts, loaded via progressive disclosure so metadata is cheap until a skill.md is invoked) rather than building bespoke domain agents. For evals specifically, the speakers name the… 🆕

— Arvind Narayanan (Princeton); host Sam Charrington (TWIML) — AI Agents: Substance or Snake Oil — Arvind Narayanan (TWIML Podcast #704) https://www.youtube.com/watch?v=HScABWB98Kw · talk (good) — Grounded in Narayanan's "AI Agents That Matter" paper, it argues agent evals are systematically misleading: leaderboards ignore inference cost, so simple repeated-sampling baselines can match or beat complex agent architectures on benchmarks like HumanEval — making cost-vs-accuracy Pareto… 🆕
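The cost-versus-accuracy argument in the "AI Agents That Matter" entries comes down to asking which agents sit on the Pareto frontier. A minimal sketch of that computation follows; the function name `pareto_frontier` and the tuple layout are hypothetical and are not taken from the paper or the talks.

```python
def pareto_frontier(agents: list[tuple[str, float, float]]) -> list[str]:
    """Return names of agents not dominated on (cost, accuracy).

    Each agent is (name, cost, accuracy). Agent B dominates agent A when B is
    no more expensive and no less accurate, and strictly better on at least one.
    """
    frontier = []
    for name, cost, acc in agents:
        dominated = any(
            other_cost <= cost
            and other_acc >= acc
            and (other_cost < cost or other_acc > acc)
            for _, other_cost, other_acc in agents
        )
        if not dominated:
            frontier.append(name)
    return frontier
```

With made-up numbers, `pareto_frontier([("simple_retry", 1.0, 0.80), ("complex_agent", 10.0, 0.78), ("big_model", 5.0, 0.90)])` drops `complex_agent`, because `simple_retry` is both cheaper and more accurate. That is the pattern an accuracy-only leaderboard cannot show.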

— pavlovslist.com https://pavlovslist.com/ · directory — The RL-environment / eval startups directory ("for the RL-pilled").

Environment labs / RL-env companies (the "environments are the new data" venture wave, via pavlovslist): **BenchFlow** (benchflow.ai — SkillsBench, ClawsBench, runtime), **Prime Intellect** (verifiers, Environments Hub), **HUD**, **Mechanize**, **Plato**, **AfterQuery**, **Halluminate**, **Surge AI**, **Scale**, **Mercor**.

**Prime Intellect** (verifiers, Florian Brand) · Braintrust · **Arize** (Phoenix/AX, OpenInference) · **Galileo** · **LangChain / LangSmith** (agentevals) · **Sierra** (τ-bench) · **Core Automation** (Kanav Garg) · **Epoch AI** (benchmark audits) · **METR** (autonomy/horizon) · **FutureHouse** (HLE audit) · **UK AISI** (Inspect).

Built by merging this project's research rounds (mining → adversarial verification → reference audit) with a `/deep-research` pass. Source detail lives in `research/citations.md`, `research/findings.json`, `research/reference-audit.md`, `research/notes/`, and the full link list in `research/url-inventory.md` (153 URLs).

Verified-high (deep-research, 3/3 votes): Verifier's Law, the `verifiers` library, EvalGen, Inspect AI, promptfoo, the ABC benchmark-rigor paper, plus lm-eval-harness, Autoevals, agentevals, AI Agents That Matter.

Flagged caveats: the MT-Bench 10/25 bias numbers are hedged by their own authors; Lee's "Agent Runtime" post URL and the WebArena/OSWorld/Terminal-Bench/Cybench links still need verification; the Kanav Garg talk is cited via a conference summary (no canonical primary URL yet).

This repo ships 146 deep reading notes in notes/ — structured summaries with key points, verbatim quotes, and themes, for the highest-signal sources:

notes/articles/ — blog posts & practitioner essays

notes/talks/ — 47 transcribed talks, podcasts & lectures (with [mm:ss] timestamps)

notes/papers/ — papers surfaced by the citation graph

PRs welcome. Keep the bar high: show your work (real data/code/war-stories beat hot takes), give every entry a one-line why, verify the URL, and flag caveats. See CONTRIBUTING.md. Quality over quantity — a great list is as much about what it excludes.

To the extent possible under law, BenchFlow and contributors have waived all copyright and related rights to this work (CC0 1.0). The linked resources remain under their respective licenses.
