A prover-verifier pipeline pairing GPT-5.5 Pro with Claude Opus 4.8 cracked open problems that stumped researchers for years, and the team says this is just the beginning.
A team of researchers just used a pair of competing large language models to solve nine open problems in theoretical computer science and mathematics. The approach, called an “LLM harness,” uses GPT-5.5 Pro as the solver and Claude Opus 4.8 as the verifier in a prover-verifier loop. The results were published around June 27-30, 2026.
Of the nine problems, four came from the Conference on Learning Theory (COLT) problem list, one from the IEEE Symposium on Foundations of Computer Science (FOCS), and four from commutative algebra.
Omri Weinstein, a former NVIDIA researcher who highlighted the project on June 30, noted that one of the solved problems had been his personal open question for two years.
The research team was led by Binghui Peng from the University of Maryland, alongside Runzhou Tao, Steven Wang, and Hantao Yu. Peng brings a resume that includes stints at Columbia, Google, and Stanford.
How the prover-verifier loop works
In the prover-verifier setup, GPT-5.5 Pro generates candidate proofs or solution approaches, then Claude Opus 4.8 evaluates them for correctness. When the verifier finds flaws, it sends feedback to the prover, which refines its approach. This cycle repeats until the verifier accepts the proof.
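The control flow of that cycle can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual harness: the names `prover_verifier_loop`, `Verdict`, `toy_prover`, and `toy_verifier` are hypothetical, the two model calls are replaced by toy stand-in functions, and the `max_rounds` cap is an added assumption so the loop always terminates.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Verdict:
    """Verifier output: whether the candidate is accepted, plus feedback if not."""
    accepted: bool
    feedback: str = ""


def prover_verifier_loop(
    problem: int,
    prover: Callable[[int, Optional[str]], int],
    verifier: Callable[[int, int], Verdict],
    max_rounds: int = 10,  # assumption: a round cap, not described in the article
) -> Tuple[Optional[int], int]:
    """Alternate prover and verifier until acceptance or the round cap is hit."""
    feedback: Optional[str] = None
    for round_number in range(1, max_rounds + 1):
        candidate = prover(problem, feedback)    # prover proposes a candidate
        verdict = verifier(problem, candidate)   # verifier checks it
        if verdict.accepted:
            return candidate, round_number
        feedback = verdict.feedback              # flaws go back to the prover
    return None, max_rounds


# Toy stand-ins for the two models (hypothetical): find x with x * x == problem.
def toy_prover(problem: int, feedback: Optional[str]) -> int:
    # First attempt guesses 0; later attempts follow the verifier's hint.
    return 0 if feedback is None else int(feedback)


def toy_verifier(problem: int, candidate: int) -> Verdict:
    if candidate * candidate == problem:
        return Verdict(accepted=True)
    # Reject and suggest the next value to try.
    return Verdict(accepted=False, feedback=str(candidate + 1))


solution, rounds = prover_verifier_loop(49, toy_prover, toy_verifier)
print(solution, rounds)
```

In a real pipeline, the two callables would wrap calls to the respective models, and the feedback would be the verifier's written critique rather than a number.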
This builds on a foundation that OpenAI laid back in July 2024, when the company published a paper on “prover-verifier games” that focused on making LLM outputs more legible and verifiable. By December 2025, the approach had matured enough that GPT-5.2 Pro was already tackling a complex challenge in statistical learning theory. The jump from one problem to nine, across multiple mathematical domains, represents a meaningful scaling of the method’s ambitions.
What this means for researchers and investors
The team has explicitly stated plans to extend this method across various scientific fields.
Disclosure: This article was edited by Editorial Team.