Evaluating Intelligence

wpnews.pro

Mark Pesce · University of Sydney · June 2026

Abstract #

When it becomes as hard to evaluate machine intelligence as it has always been to evaluate human intelligence, a threshold has been crossed. Human intelligence testing has been intractable for 150 years. Machine intelligence testing has become intractable in four. They are failing for the same reasons: the constructs resist decomposition, the benchmarks saturate, and every measure becomes a target. The reason is the same because the thing being measured is the same: General Intelligence.

96 and 48 #

In the spring of 2026, professor Roberto Serrano gave his advanced mathematical economics class at Brown University a take-home midterm. The average score was 96 out of 100. Forty students scored a perfect 100.

He gave the same class an in-person final. The average dropped to 48. Of the 27 students who did not show up, 22 had scored perfect 100 on the midterm.

Same students. Same subject. Same semester. The variable that changed was the room. In the room, no AI. Outside the room, AI everywhere. The delta between 96 and 48 is the measure of what artificial intelligence contributed - or rather, hid - to the assessment of human intelligence. It was, in this case, larger than the contribution of the humans being assessed.

Serrano is a game theorist. He understands better than most that when you change the payoff matrix, you change the behaviour. AI made cheating undetectable and costless. Any game theorist would predict mass defection. He got it in his own classroom.

Princeton has responded to the same pressure by ending 133 years of trust-based examination. Since 1893, professors left the room during exams and the honor code held. AI broke it. After 133 years, the only answer is: put a body back in the room.

What is collapsing at Brown and Princeton is the terminal phase of a hundred-year failure in the measurement of intelligence, human and artificial alike.

The Hundred-Year Failure #

The Scholastic Aptitude Test arrived in 1926 as a mass-production instrument for sorting teenagers. For sixty years, perfect scores were rare. Then Goodhart's Law took hold: when a measure becomes a target, it ceases to be a good measure. The SAT stopped measuring aptitude and started measuring preparation for the SAT. During the pandemic, the Ivy League dropped it, then reinstated it when grade point averages turned out to be even less reliable, bent under the same force: grade inflation, another measurement become a target.

Then ChatGPT arrived and the already-hollowed instrument of assessment shattered entirely. Within weeks of launch, students had discovered a universal answer machine. By 2025, students used AI for everything, teachers used AI to grade them, and AI was used to detect whether AI had been used. Student assessments conducted outside of a controlled environment lost all value.

A Stanford Law report in May 2026 showed that law professors prefer AI-generated answers over those written by their colleagues. The Harvard Business Review threw up its hands at hiring evaluations the following month: every tell of a suitable candidate can be perfectly emulated.

Only embodiment discriminates; a face-to-face encounter, in real time, with follow-up questions that probe whether comprehension is real or performed, can establish whether the intelligence on display belongs to the human in the room or to a machine working through a human proxy. In February 2023, a Stanford professor already saw where this was heading. They handed ChatGPT output to their class and asked not "did you write this?" but "is this true?" The question of evaluation had shifted from capability to discernment.

The instruments for measuring human intelligence have not been replaced. They have been rendered meaningless by the arrival of a second intelligence that the instruments cannot distinguish from the first.

The Same Problem, Faster #

AI benchmarks followed the same arc as the SAT, compressing a century of failure into three years.

MMLU went from breakthrough to useless in roughly two years as training data absorbed the test. Goodhart's Law took hold: labs optimised for benchmark scores because those scores drive customer perception, investment, and now regulatory classification. Task horizons expanded from single-turn queries to multi-month autonomous workflows, until evaluating a model's capabilities became as complex and expensive as deploying the model. Evaluation collapsed into deployment.

Each of these failures has a human twin. SAT scores saturated as preparation caught up with the test; MMLU scores saturated as training data absorbed it. The SAT reshaped education around gaming the test; benchmarks reshaped model development around gaming the benchmark. Human assessment retreated from written to oral to embodied as each method was defeated; AI evaluation expanded until the only way to test whether a model can do the job was to give it the job.

Follow the trajectory far enough and we arrive at the problem underneath: measuring general intelligence. Psychometricians have been trying since the 1880s. A hundred and fifty years of serious effort has produced no consensus on what general intelligence is, let alone how to measure it. We never answered this question for humans. Now we need to answer it for machines, under time pressure, because states are classifying models into regulatory tiers and the classification depends on evals that can no longer do the job, and likely never will.

General Intelligence #

Too caught up in the failure of human intelligence assessments, we failed to recognise that the reason those systems had failed indicated another, separate-yet-equal element of intelligence in play. Ever since ChatGPT launched, we have been peri-AGI without recognising it.

AI evaluation is hard for exactly the same reasons that measuring human general intelligence is hard, because in both cases the thing being measured is general intelligence.

The difficulty is itself the evidence. We do not need a theoretical proof that artificial general intelligence has arrived. The practical failure of our evaluation instruments reveals it: When testing AI becomes as hard as testing humans, and hard for the same reasons, the question of artificial general intelligence has already been answered.

Acknowledgements #

This paper emerged from deep discussions with John Allsopp, and was drafted in a lengthy back-and-forth from my notes using Claude Cowork. I remain responsible for any errors that may have crept in.

source & further reading

thewatershed.markpesce.com — original article Proof of AGI is the impossibility of evals Why Evals are Hard Emergent Distributed Ralph Loops