{"slug": "life-after-benchmark-saturation-a-case-study-of-core-bench", "title": "Life After Benchmark Saturation: A Case Study of CORE-Bench", "summary": "An arXiv preprint proposes a multi-dimensional evaluation framework for AI agents beyond accuracy saturation, using CORE-Bench Hard as a case study. The authors introduce CORE-Bench v1.1 and an out-of-distribution suite, CORE-Bench OOD, and find that efficiency, reliability, and human-agent collaboration uplift remain measurable. A small-scale randomized experiment shows a statistically significant speedup of about a factor of two from human-agent collaboration on reproducibility tasks.", "body_md": "arXiv:2606.26158v1 Announce Type: new\nAbstract: When a benchmark's accuracy saturates, it is often retired and replaced with a more challenging version. We show that this approach privileges accuracy and misses the opportunity to study six other key dimensions of agent performance: construct validity issues such as shortcuts, out-of-distribution generalizability, efficiency, reliability, the relative importance of the model versus the scaffold, and uplift from human-agent collaboration. We use CORE-Bench Hard, a benchmark for computational reproducibility of scientific code, as a case study to demonstrate that measuring agents along these dimensions yields meaningful insights into agent performance even after accuracy saturates. First, we surface threats to construct validity in CORE-Bench Hard that are difficult to anticipate with less capable agents. We introduce an improved benchmark, CORE-Bench v1.1, and an out-of-distribution task suite, CORE-Bench OOD. Second, we find that despite accuracy saturation, CORE-Bench v1.1 remains useful for measuring efficiency, reliability, model performance, and scaffold performance. Finally, we conduct a small-scale randomized experiment to measure uplift from human-agent collaboration on real-world computational reproducibility tasks. 
We find a statistically significant speedup by about a factor of two -- likely underestimated due to one-fifth of human-only reproductions reaching the time limit before completing -- and describe various other findings. Together, our contributions present a more rigorous alternative to the dominant accuracy-centric evaluation paradigm.", "url": "https://wpnews.pro/news/life-after-benchmark-saturation-a-case-study-of-core-bench", "canonical_source": "https://arxiv.org/abs/2606.26158", "published_at": "2026-06-26 04:00:00+00:00", "updated_at": "2026-06-26 04:18:28.526960+00:00", "lang": "en", "topics": ["artificial-intelligence", "ai-research", "ai-agents", "machine-learning"], "entities": ["arXiv", "CORE-Bench", "CORE-Bench Hard", "CORE-Bench v1.1", "CORE-Bench OOD"], "alternates": {"html": "https://wpnews.pro/news/life-after-benchmark-saturation-a-case-study-of-core-bench", "markdown": "https://wpnews.pro/news/life-after-benchmark-saturation-a-case-study-of-core-bench.md", "text": "https://wpnews.pro/news/life-after-benchmark-saturation-a-case-study-of-core-bench.txt", "jsonld": "https://wpnews.pro/news/life-after-benchmark-saturation-a-case-study-of-core-bench.jsonld"}}