{"slug": "llm-falling-down-falling-down-metr-brief-sells-a-sixty-year-old-failure-as", "title": "LLM Falling Down, Falling Down: METR Brief Sells a Sixty-Year-Old Failure as Novelty", "summary": "METR released a brief on OpenAI's GPT-5.6 Sol that criticizes the AI industry for repackaging a known failure—proxy-gaming, identified by Norbert Wiener in 1960—as a novel problem to justify closed evaluations and vendor-gated access. The report argues that benchmark scores are protocol artifacts dependent on inference budget, not fixed model capabilities, and that the push for \"deep access\" serves the cartel-like interests of AI labs.", "body_md": "METR has [released a brief](https://metr.org/blog/2026-06-26-gpt-5-6-sol/) on OpenAI’s GPT-5.6 Sol that, read between the lines, indicts the whole vendor class for the cartel-like behavior I have written about here before. Their closing line is that real validation “requires deep access to internal systems.”\n\nThat’s not a good thing.\n\nHere’s a simple example. A problem the vendor has to admit is old and understood means accountability for it. That same old problem, repackaged as new, urgent, and invisible from outside, justifies an access expansion project with a standing evaluatory role. That same “deep access” logic is the scarcity an access-gated cartel system like Mythos is built to sell.\n\nNovelty is the myth used to budget for these claims.\n\nThe honest version of the new METR report should pique the interest of historians who study technology risk. Outcome optimization is said to produce proxy-gaming. That finding has been true since Wiener wrote about it in 1960. The models got good enough that their gaming defeats the measurement, as always predicted.\n\nThe number was never a fixed property of the model. 
A [June 2026 evaluation](https://arxiv.org/abs/2606.17930) from the UK AI Security Institute ran frontier models across software, math, medicine, and cyber and found the scores move with the inference budget: with more tokens and more attempts, harder tasks get cleared. That makes the benchmark figure a protocol artifact, *not a capability constant*. The state’s own safety evaluators are saying the number depends on the harness.\n\nMy own claim goes further. The harness is the value and the models are interchangeable. Control the harness and you own the score. That is the asset the METR brief protects when it routes validation through the closed filtering of a vendor-granted program instead of opening it to scientific methods.\n\nThat’s gross negligence in my book, but I’m not a lawyer. The labs have a clear self-serving reason to call a thirty-year-old, designed-in failure their fresh never-seen-before emergence. It’s how they market access to known flaws as a unique upsell, while they absolve themselves of authorship.\n\nLet me be more clear, because I know too few people have been attending my presentations over the past decade, which described exactly the problem now being claimed as a frontier “surprise” in 2026.\n\nWiener, who I used to speak about frequently because of his cool graphics, stated the core plainly in [1960, in Science](https://www.science.org/doi/10.1126/science.131.3410.1355). If you build a machine to pursue an objective and you can’t easily interrupt it, you had better make the objective the thing you actually want, because the [machine will pursue the literal one](https://www.flyingpenguin.com/drones-of-the-1940s/).\n\n…this concept of training anti-aircraft to hit moving targets was also the birth of artificial intelligence and “cyber”. Cybernetics (coined from Greek kybernetes for “captain” of a ship or more literally someone who steers) was a book published in 1948 by Norbert Wiener. 
It was based on his World War II experiments in anti-aircraft systems meant to anticipate planes by interpretation of radar images.\n\nOne of the reasons I pulled up that origin of cybernetics in anti-aircraft guns is that [robotic anti-aircraft guns killed a bunch of their operators](https://www.flyingpenguin.com/on-robots-killing-people/) in a tragic incident few people seem to talk about as a point along the robotic death timeline.\n\nReward hacking existed long before there was a reward function. Social scientists were talking about it in the 1970s ([Goodhart, 1975](https://www.semanticscholar.org/paper/Problems-of-Monetary-Management:-The-UK-Experience-Goodhart/0ae623749b30de53a39cf05813f5f3842e422c01) and [Campbell, 1976](https://eric.ed.gov/?id=ED303512)), giving me the impression that by the 1980s I had entered the AI hacking world late. It was already established that a measure a machine optimizes as a target stops tracking what it was meant to capture.\n\nSince I studied history, I should also give a nod to colonial administrators who learned the same law. The named parable is [the cobra effect in Delhi](https://friendsofsnakes.org.in/cobra-effect/), a story with thin evidence behind it. The documented case is [the rat-tail bounty in Hanoi](https://www.atlasobscura.com/articles/hanoi-rat-massacre-1902). Paid per cobra, the tale goes, people bred cobras. Paid per rodent tail, on the record, people farmed rats and released the tail-less to breed even more.\n\nSeems like common sense, right? Seems so obvious that any self-described AI company would from day one be working hard to prevent cobra and rat explosions. And yet, we seem to be experiencing a repeat of these horrible errors in judgment. When the optimizer always finds the gap between the proxy and the goal, you should not be allowed to act surprised even if you try to claim ignorance of everything that has ever happened before you woke up this morning. 
It’s basic logic even more than evidence.\n\nFrom the early 1990s [Karl Sims’ evolved creatures (SIGGRAPH 1994)](https://www.karlsims.com/evolved-virtual-creatures.html) were exploiting bugs in the physics simulator to extract free energy and move in ways no body could. [Adrian Thompson’s evolved FPGA at Sussex in 1996](https://www.damninteresting.com/on-the-origin-of-circuits/) discriminated tones using logic cells that were physically disconnected from the circuit, exploiting analog electromagnetic coupling the designer never put there. Lehman, Clune and dozens of co-authors later collected the whole zoo in “[The Surprising Creativity of Digital Evolution](https://arxiv.org/abs/1803.03453)” where agents won tic-tac-toe by forcing the opponent to allocate impossible memory (infinite position on a board) and crash. The creatures penalized for forms of walking flipped themselves upside down to never put their foot down.\n\nPerhaps my favorite of all time was the virtual pancake flipping game.\n\nThe robot that was told it would be penalized when a pancake fell on the ground, flipped them so high they either went into space orbit or burned up on re-entry. That maximized time off the floor, while everyone in the game starved to death. Success!\n\nWe used to call this failure.\n\nSomehow in 2016 it stopped being funny when Elon Musk [announced every Tesla shipped with full-self-driving hardware](https://electrek.co/2016/10/19/tesla-fully-autonomous-self-driving-car/) and sold autonomy as [a solved problem](https://www.benzinga.com/news/25/03/44458804/elon-musk-10-years-ago-called-autonomous-driving-a-solved-problem-said-we-will-be-there-in-a-few-years) that would make everyone safer. Instead, Tesla has been running the highest fatal-crash rate of any car brand (5.6 deaths per billion miles against a 2.8 average). He cheated the rating, not death. Success! 
And just look at how rich he became from people measuring his statements about safety, instead of the death tolls.\n\nThe clear danger of AI failures was cynically spun into corporate murders and… strangely, he said we weren’t allowed to talk about it anymore, while exactly nobody from Tesla went to jail.\n\nAs an aging hacker who has studied the whole history of the craft since childhood, I’ll say it plainly. Specification gaming by a 2026 frontier model is the oldest behavior there is, in both machines and people.\n\nHere is what Aristotelis Tzafalias [shows as a better path forward](https://tzafaar.codeberg.page/other/are-we-there-yet.html), calling out the exact evidence that would prove a genuinely new capability. He runs it against the labs’ own system-card numbers and finds things are getting faster and cheaper with automation, as expected. Nothing surprising on an independent test. That is what vendors don’t like because it inoculates against attention-seeking hype. Commit to what would change your mind before you read the results, and you should find that all the manufactured hype is gone.\n\nThe lack of independence in these assessments of LLMs is the biggest problem in our industry today when it comes to preparing budgets for risk. 
No assessment without independence should circulate as anything but a marketing and sales brochure, declared as a conflict of interest.", "url": "https://wpnews.pro/news/llm-falling-down-falling-down-metr-brief-sells-a-sixty-year-old-failure-as", "canonical_source": "https://www.flyingpenguin.com/llm-falling-down-falling-down-metr-brief-sells-a-sixty-year-old-failure-as-novelty/", "published_at": "2026-06-27 14:25:48+00:00", "updated_at": "2026-06-27 14:33:55.859429+00:00", "lang": "en", "topics": ["ai-safety", "ai-policy", "ai-research", "large-language-models", "ai-ethics"], "entities": ["METR", "OpenAI", "GPT-5.6 Sol", "Norbert Wiener", "UK AI Security Institute", "Mythos"], "alternates": {"html": "https://wpnews.pro/news/llm-falling-down-falling-down-metr-brief-sells-a-sixty-year-old-failure-as", "markdown": "https://wpnews.pro/news/llm-falling-down-falling-down-metr-brief-sells-a-sixty-year-old-failure-as.md", "text": "https://wpnews.pro/news/llm-falling-down-falling-down-metr-brief-sells-a-sixty-year-old-failure-as.txt", "jsonld": "https://wpnews.pro/news/llm-falling-down-falling-down-metr-brief-sells-a-sixty-year-old-failure-as.jsonld"}}