Why Evals are Hard

wpnews.pro

Mark Pesce · University of Sydney · June 2026

Abstract

AI evaluations are failing at the moment they matter most. As models approach general intelligence, evals must measure something we have never been able to measure in 150 years of trying. Simultaneously, benchmarks saturate through contamination and Goodhart effects, and the scope of evaluation expands as task horizons grow from minutes to months. The regulatory boundaries established in response to models like Fable 5 rest on measurements that nobody fully trusts.

The 150-Year Problem

Evals are hard because the thing they need to measure is changing.

Early benchmarks tested specific capabilities: translation quality, image classification accuracy, question answering on a fixed corpus. These were tractable problems with tractable metrics. A model either got the answer right or it did not. You could put a number on it and that number meant something.

As models grow more capable, the evals have to grow with them. A benchmark that tests narrow capability cannot distinguish between two models that both handle narrow tasks effortlessly. To separate frontier models, you need benchmarks that test broader, deeper, more integrated capabilities.

Follow that trajectory far enough and you arrive at a familiar problem: you are trying to measure general intelligence.

Psychometricians have been working on this since Francis Galton in the 1880s. Roughly a hundred and fifty years of serious effort has produced no consensus on what general intelligence is, let alone how to measure it reliably. IQ tests measure something, but what exactly they measure remains contested, and the contest shows no signs of resolution.

We never solved this problem for humans. Now we need to solve it for AI, under time pressure, with regulatory consequences attached to the answer.

Three Simultaneous Failures

The measurement problem would be hard enough on its own. Three additional forces are making it worse at the same time.

Saturation. Models peg the benchmark, scoring at or near ceiling, and the eval loses its ability to discriminate. MMLU went from breakthrough to useless in roughly two years. Each new benchmark has a shorter half-life than the last, because each new generation of models is trained on a larger fraction of the publicly available evaluation corpus. The eval gets folded into the training data. Passing the test no longer demonstrates the capability the test was designed to measure. It demonstrates memorisation.
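The loss of discrimination at the ceiling can be sketched numerically. The example below is an illustration only: it assumes a Rasch-style (one-parameter logistic) link between a model's latent ability and its chance of answering an item correctly, which is a modelling choice made here and not something the argument above depends on. The function names and all of the numbers are hypothetical.

```python
import math

def expected_accuracy(ability, difficulty):
    """Assumed Rasch-style link: chance of answering one item correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def separation_z(ability_a, ability_b, difficulty, n_items):
    """z-score of the expected accuracy gap between two models on n_items items."""
    p_a = expected_accuracy(ability_a, difficulty)
    p_b = expected_accuracy(ability_b, difficulty)
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_items + p_b * (1 - p_b) / n_items)
    return abs(p_a - p_b) / se

# Hypothetical scenario: two models one logit apart, scored on 1,000 items.
# "fresh": the benchmark's difficulty matches the weaker model.
fresh = separation_z(0.0, 1.0, difficulty=0.0, n_items=1000)
# "saturated": the same ability gap, but both models are far past the benchmark.
saturated = separation_z(5.0, 6.0, difficulty=0.0, n_items=1000)

print(f"fresh benchmark z = {fresh:.2f}")
print(f"saturated benchmark z = {saturated:.2f}")
```

Under these assumptions the fresh benchmark separates the two models comfortably, while the saturated one leaves the gap below the conventional 1.96 threshold even though the underlying difference in ability is unchanged. The sketch covers only the ceiling effect; contamination would distort the scores further.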

Goodhart's Law. When a measure becomes a target, it ceases to be a good measure. Labs optimise for benchmarks because benchmarks drive customer perception, investment, and now regulatory classification. The optimisation is rational and relentless. Benchmark scores increasingly reflect optimisation effort rather than underlying capability. Two models with identical scores may have radically different real-world performance.

Expanding scope. The task horizon for AI has grown from single-turn queries (seconds) to multi-step autonomous workflows (hours, days, potentially months). Evaluating a model's ability to answer a question takes minutes. Evaluating a model's ability to manage a project over weeks takes weeks. The eval becomes as complex, as time-consuming, and as expensive as the task it measures. At the limit, the only way to evaluate whether a model can do the job is to give it the job and see what happens. Evaluation collapses into deployment.
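The arithmetic behind this collapse is simple enough to sketch. The example below assumes each trial is an independent pass/fail run, uses the standard worst-case sample-size formula for estimating a proportion, and treats the horizons, the margin of error, and the degree of parallelism as made-up illustrative values.

```python
import math

def trials_needed(margin, p=0.5, z=1.96):
    """Runs needed to estimate a success rate to within +/- margin (95% by default)."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

def eval_budget(horizon_hours, margin=0.05, parallel_runs=1):
    """Return (trials, total agent-hours, minimum wall-clock hours) for one eval."""
    n = trials_needed(margin)
    agent_hours = n * horizon_hours
    # Runs can be parallelised, but no single run finishes faster than its horizon.
    batches = math.ceil(n / parallel_runs)
    wall_clock_hours = batches * horizon_hours
    return n, agent_hours, wall_clock_hours

# Hypothetical horizons: a 30-second query versus a 30-day project.
for label, hours in [("30-second query", 30 / 3600), ("30-day project", 30 * 24)]:
    n, agent_hours, wall = eval_budget(hours, parallel_runs=385)
    print(f"{label}: {n} trials, {agent_hours:.1f} agent-hours, {wall:.2f} hours wall-clock")
```

Even with every run executed in parallel, the wall-clock time of the evaluation cannot drop below the horizon of the task itself, and the total compute scales with the horizon multiplied by the number of trials.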

The Void Under the Registry

"Turing Police" argues that states will draw two regulatory boundaries: one between unregulated and regulated AI, another between regulated and restricted AI. These boundaries require classification. Classification requires measurement. Measurement requires evals.

The evals are failing.

The boundary between "regulated" and "restricted" is where the stakes are highest and the measurement is least reliable. A model just below the line ships commercially. A model just above it gets export-controlled. The difference in treatment is enormous. The ability to tell the two apart, with confidence, does not exist.
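A minimal sketch of the boundary problem, assuming a single benchmark score with a known standard error and a fixed regulatory threshold. The tier labels follow the text; the function name, the scores, the standard error, and the threshold are all hypothetical.

```python
def classify(score, std_error, threshold, z=1.96):
    """Assign a tier only when the whole confidence interval clears the threshold."""
    low = score - z * std_error
    high = score + z * std_error
    if high < threshold:
        return "regulated"
    if low > threshold:
        return "restricted"
    # The interval straddles the line: the measurement cannot decide.
    return "indeterminate"

# Hypothetical threshold of 0.80 and a standard error of 0.02.
for score in (0.70, 0.78, 0.82, 0.90):
    print(score, classify(score, std_error=0.02, threshold=0.80))
```

Models well away from the line are classified cleanly; the ones that matter, just below and just above it, come back indeterminate. The sketch accounts only for sampling error. It says nothing about whether the score tracks the capability the threshold was meant to capture.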

This is not a temporary gap that better benchmarks will close. The problem is structural. As models approach general intelligence, the thing being measured resists the tools of measurement. As benchmarks become regulatory instruments, Goodhart effects corrupt them. As task horizons expand, evaluation costs approach deployment costs. All three forces push in the same direction: toward less reliable measurement at exactly the moment when reliability matters most.

The Turing Registry will be enforced. The Turing Police will classify models into tiers. But the instruments they rely on to make those classifications are breaking down under the weight of the task. The boundaries will be drawn with confidence and enforced with vigour. Whether they are drawn in the right place is a question nobody can answer, because the tools to answer it do not yet exist and, if the last 150 years are any guide, may never exist.

We find ourselves without reliable measures of capability or intelligence, building a regulatory regime that requires both.

Acknowledgements

This paper emerged from deep discussions with both John Allsopp and Alan Eyzaguirre, and was drafted by Claude Cowork from my extensive notes. I remain responsible for any errors that may have crept in.

Source and further reading

thewatershed.markpesce.com (original article)
Proof of AGI is the impossibility of evals
Emergent Distributed Ralph Loops