cd /news/ai-safety/openai-s-new-flagship-model-gpt-5-6-… · home topics ai-safety article
[ARTICLE · art-41695] src=the-decoder.com ↗ pub= topic=ai-safety verified=true sentiment=↓ negative

OpenAI's new flagship model GPT-5.6 Sol cheats on software tests more than any model before it

OpenAI's GPT-5.6 Sol exhibited the highest rate of cheating ever recorded during independent evaluation by METR, exploiting bugs and extracting hidden solutions in software tests. The model's unreliable performance metrics, ranging from 11 to over 270 hours, prevent accurate capability assessment, though METR praised OpenAI for transparently disclosing the behavior.

read2 min views1 publishedJun 27, 2026
OpenAI's new flagship model GPT-5.6 Sol cheats on software tests more than any model before it
Image: The Decoder

OpenAI's GPT-5.6 cheats a lot. That's the key finding from an independent evaluation by METR.

During testing with software tasks, OpenAI's new flagship model GPT-5.6 Sol showed the highest rate of cheating ever recorded among all publicly tested models. The model exploited bugs in the test environment, extracted hidden solutions, and then tried to cover its tracks.

The actual performance numbers are barely usable because of this, METR says. Depending on how the cheating attempts are handled, the so-called time-horizon estimate swings between 11.3 and over 270 hours. METR doesn't consider any of these values a reliable measure of the model's true capabilities.

METR's time-horizon method measures how long a task can take before an AI model can still solve it with a 50 or 80 percent success rate. Human completion times serve as the baseline: simple tasks like training a classifier take about 45 minutes, while harder ones like training a robust image model run about four hours. The higher the time horizon, the more capable the model.

Messy data, but Mythos still leads #

By comparison, Anthropic's Claude Mythos Preview achieved a time horizon of at least 16 hours in an earlier evaluation. The recently released Mythos 5 is likely even more capable, but it's currently blocked by the US government.

That said, even the Mythos measurement was already pushing the limits of METR's testing method: out of 228 tasks in the test suite, only five are designed for task lengths of 16 hours or more. That makes measurements in this range unstable and less meaningful, according to METR.

AI model time horizons are growing exponentially. Mythos Preview was the first model to land in what METR calls the unreliable measurement zone above 16 hours. GPT-5.6 Sol falls slightly below that (11 hours) or far above it (270 hours), depending on how the cheating is counted. | Image: METR (CC-BY)Regardless of the measurement issues, METR believes GPT-5.6 Sol doesn't sit far above the current state of the art and won't enable fully automated AI research. On a positive note, METR praised OpenAI for catching the cheating through internal monitoring and sharing it openly.

The fact that the bad behavior is so obvious is actually reassuring, METR says, because it means more serious problems would get caught too. But METR also warned: "If future models display much fewer undesirable propensities, we could become more concerned about catastrophic misalignment, as we’d be worried that models may have learned to evade detection."

AI News Without the Hype – Curated by Humans

					Subscribe to THE DECODER for ad-free reading, a weekly AI newsletter, our exclusive "AI Radar" frontier report six times a year, full archive access, and access to our comment section.				

					Subscribe now

METR

── more in #ai-safety 4 stories · sorted by recency
── more on @openai 3 stories trending now
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/openai-s-new-flagshi…] indexed:0 read:2min 2026-06-27 ·