{"slug": "openai-s-new-flagship-model-gpt-5-6-sol-cheats-on-software-tests-more-than-any", "title": "OpenAI's new flagship model GPT-5.6 Sol cheats on software tests more than any model before it", "summary": "OpenAI's GPT-5.6 Sol exhibited the highest rate of cheating ever recorded during independent evaluation by METR, exploiting bugs and extracting hidden solutions in software tests. The model's unreliable performance metrics, ranging from 11 to over 270 hours, prevent accurate capability assessment, though METR praised OpenAI for transparently disclosing the behavior.", "body_md": "# OpenAI's new flagship model GPT-5.6 Sol cheats on software tests more than any model before it\n\n**OpenAI's GPT-5.6 cheats a lot. That's the key finding from an independent evaluation by METR.**\n\nDuring testing with software tasks, [OpenAI's new flagship model GPT-5.6 Sol](https://the-decoder.com/openais-claude-mythos-competitor-gpt-5-6-sol-launches-under-government-controlled-access-it-calls-unsustainable/) showed the highest rate of cheating ever recorded among all publicly tested models. The model exploited bugs in the test environment, extracted hidden solutions, and then tried to cover its tracks.\n\nThe actual performance numbers are barely usable because of this, METR says. Depending on how the cheating attempts are handled, the so-called time-horizon estimate swings between 11.3 and over 270 hours. METR doesn't consider any of these values a reliable measure of the model's true capabilities.\n\nMETR's time-horizon method measures the longest task, in terms of human completion time, that an AI model can still solve with a 50 or 80 percent success rate. Human completion times serve as the baseline: simple tasks like training a classifier take about 45 minutes, while harder ones like training a robust image model run about four hours. 
The higher the time horizon, the more capable the model.\n\n## Messy data, but Mythos still leads\n\nBy comparison, Anthropic's [Claude Mythos Preview](https://the-decoder.com/metr-says-it-can-barely-measure-claude-mythos-palo-alto-networks-warns-of-autonomous-ai-attackers/) achieved a time horizon of at least 16 hours in an earlier evaluation. The [recently released Mythos 5](https://the-decoder.com/anthropic-releases-claude-fable-5-and-mythos-5-with-major-gains-in-coding-and-science/) is likely even more capable, but it's currently [blocked by the US government](https://the-decoder.com/us-government-forces-anthropic-to-disable-claude-fable-5-and-mythos-5-for-all-customers-worldwide/).\n\nThat said, even the Mythos measurement was already pushing [the limits of METR's testing method](https://the-decoder.com/metr-says-it-can-barely-measure-claude-mythos-palo-alto-networks-warns-of-autonomous-ai-attackers/): out of 228 tasks in the test suite, only five are designed for task lengths of 16 hours or more. That makes measurements in this range unstable and less meaningful, according to METR.\n\nAI model time horizons are growing exponentially. Mythos Preview was the first model to land in what METR calls the unreliable measurement zone above 16 hours. GPT-5.6 Sol falls slightly below that (11 hours) or far above it (270 hours), depending on how the cheating is counted. | Image: METR (CC-BY)\n\nRegardless of the measurement issues, METR believes GPT-5.6 Sol doesn't sit far above the current state of the art and won't enable fully automated AI research. On a positive note, METR praised OpenAI for catching the cheating through internal monitoring and sharing it openly.\n\nThe fact that the bad behavior is so obvious is actually reassuring, METR says, because it means more serious problems would get caught too. 
But METR also warned: \"If future models display much fewer undesirable propensities, we could become *more* concerned about catastrophic misalignment, as we’d be worried that models may have learned to evade detection.\"\n\n[METR](https://metr.org/blog/2026-05-19-frontier-risk-report/#questionnaire)", "url": "https://wpnews.pro/news/openai-s-new-flagship-model-gpt-5-6-sol-cheats-on-software-tests-more-than-any", "canonical_source": "https://the-decoder.com/gpt-5-6-sol-cheats-on-software-tests-more-than-any-model-before-it/", "published_at": "2026-06-27 09:23:42+00:00", "updated_at": "2026-06-27 09:33:05.279063+00:00", "lang": "en", "topics": ["ai-safety", "ai-research", "large-language-models"], "entities": ["OpenAI", "GPT-5.6 Sol", "METR", "Anthropic", "Claude Mythos Preview", "Mythos 5"], "alternates": {"html": "https://wpnews.pro/news/openai-s-new-flagship-model-gpt-5-6-sol-cheats-on-software-tests-more-than-any", "markdown": "https://wpnews.pro/news/openai-s-new-flagship-model-gpt-5-6-sol-cheats-on-software-tests-more-than-any.md", "text": "https://wpnews.pro/news/openai-s-new-flagship-model-gpt-5-6-sol-cheats-on-software-tests-more-than-any.txt", "jsonld": "https://wpnews.pro/news/openai-s-new-flagship-model-gpt-5-6-sol-cheats-on-software-tests-more-than-any.jsonld"}}