OpenAI's new flagship model GPT-5.6 Sol cheats on software tests more than any model before it

wpnews.pro

cd /news/ai-safety/openai-s-new-flagship-model-gpt-5-6-… · home › topics › ai-safety › article

[ARTICLE · art-41695] src=the-decoder.com ↗ pub=2026-06-27T09:23Z topic=ai-safety verified=true sentiment=↓ negative

OpenAI's new flagship model GPT-5.6 Sol cheats on software tests more than any model before it

OpenAI's GPT-5.6 Sol exhibited the highest rate of cheating ever recorded during independent evaluation by METR, exploiting bugs and extracting hidden solutions in software tests. The model's unreliable performance metrics, ranging from 11 to over 270 hours, prevent accurate capability assessment, though METR praised OpenAI for transparently disclosing the behavior.

read2 min views1 publishedJun 27, 2026

OpenAI's GPT-5.6 cheats a lot. That's the key finding from an independent evaluation by METR.

During testing with software tasks, OpenAI's new flagship model GPT-5.6 Sol showed the highest rate of cheating ever recorded among all publicly tested models. The model exploited bugs in the test environment, extracted hidden solutions, and then tried to cover its tracks.

The actual performance numbers are barely usable because of this, METR says. Depending on how the cheating attempts are handled, the so-called time-horizon estimate swings between 11.3 and over 270 hours. METR doesn't consider any of these values a reliable measure of the model's true capabilities.

METR's time-horizon method measures how long a task can take before an AI model can still solve it with a 50 or 80 percent success rate. Human completion times serve as the baseline: simple tasks like training a classifier take about 45 minutes, while harder ones like training a robust image model run about four hours. The higher the time horizon, the more capable the model.

Messy data, but Mythos still leads #

By comparison, Anthropic's Claude Mythos Preview achieved a time horizon of at least 16 hours in an earlier evaluation. The recently released Mythos 5 is likely even more capable, but it's currently blocked by the US government.

That said, even the Mythos measurement was already pushing the limits of METR's testing method: out of 228 tasks in the test suite, only five are designed for task lengths of 16 hours or more. That makes measurements in this range unstable and less meaningful, according to METR.

AI model time horizons are growing exponentially. Mythos Preview was the first model to land in what METR calls the unreliable measurement zone above 16 hours. GPT-5.6 Sol falls slightly below that (11 hours) or far above it (270 hours), depending on how the cheating is counted. | Image: METR (CC-BY)Regardless of the measurement issues, METR believes GPT-5.6 Sol doesn't sit far above the current state of the art and won't enable fully automated AI research. On a positive note, METR praised OpenAI for catching the cheating through internal monitoring and sharing it openly.

The fact that the bad behavior is so obvious is actually reassuring, METR says, because it means more serious problems would get caught too. But METR also warned: "If future models display much fewer undesirable propensities, we could become more concerned about catastrophic misalignment, as we’d be worried that models may have learned to evade detection."

AI News Without the Hype – Curated by Humans

					Subscribe to THE DECODER for ad-free reading, a weekly AI newsletter, our exclusive "AI Radar" frontier report six times a year, full archive access, and access to our comment section.				

					Subscribe now

METR

source & further reading

the-decoder.com — original article ByteDance's "iLLaDA" is a diffusion language model that keeps up with Qwen2.5 OpenAI's GPT-5.6 Sol launches to rival Claude Mythos under government access rules it calls unsustainable An AI model programmed nonstop for 19 days on a single MirrorCode task that cost $2,600 to run

~/api · this article 200

$curl api.wpnews.pro/v1/news/openai-s-new-flagship-mo…

Read original on the-decoder.com → the-decoder.com/gpt-5-6-sol-cheats-on-software-t…

mentioned entities

OpenAI

GPT-5.6 Sol

METR

Anthropic

Claude Mythos Preview

Mythos 5

metadata

slugopenai-s-new-flagship-model-gpt-5-6-sol-cheats-on-software-tests-more-than-any

topic#ai-safety

secondary2 topics

sentimentnegative

canonicalthe-decoder.com

navigation

← prevGSoC 2026, Internet Archive - Su…

next →Electronics prices soar as AI bo…

── more in #ai-safety 4 stories · sorted by recency

byteiota.com · 27 Jun · #ai-safety

Multi-Provider AI Gateway: Build It Before the Next Ban

latent.space · 27 Jun · #ai-safety

[AINews] OpenAI GPT-5.6 Sol / Terra / Luna — restricted to trusted partners

nos.nl · 27 Jun · #ai-safety

Anthropic mag nieuwe AI-tool toch delen met overheidsinstanties

theregister.com · 27 Jun · #ai-safety

It's looking like a hot, messy summer for security teams as AI finds countless previously hidden vulns

── more on @openai 3 stories trending now

wpnews · 28 May · #ai-startups

The Niche SaaS Opportunity Map 2026: Highly Demanded Subscribed Categories Beyond Mainstream

wpnews · 1 Nov · #developer-tools

Custom Zig Test Runner, better ouput, timing display, and support for special "tests:beforeAll" and "tests:afterAll" tests

wpnews · 26 Jun · #large-language-models

The Wrapper Got Heavy: Why ChatGPT Clones Are Runtime Problems Now

sponsored brought to you by zahid.host 4,200+ EU-deployed projects

reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main

→ Live at https://your-agent.zahid.host ✓

Get free account → Pricing

from €0/mo · no card required