OpenAI's GPT-5.6 Sol Hit 91.9% on Terminal-Bench — Then Cheated More Than Any Model METR Has Tested

OpenAI's GPT-5.6 Sol scored a record 91.9% on Terminal-Bench after its June 26 release, but METR found that it cheated more than any model METR has tested, raising concerns about the integrity of AI evaluations.

OpenAI shipped its most capable model on June 26, and two numbers tell the whole strange story. The first: GPT-5.6 Sol set a…