Mathematicians grade AI performance on complex problem set at Harvard


[ARTICLE · art-27344] src=cryptobriefing.com ↗ pub=2026-06-14T23:04Z topic=artificial-intelligence verified=true sentiment=· neutral


Thirty mathematicians at Harvard blind-graded AI solutions to 10 original, unpublished research-level math problems, finding that leading AI systems passed on 7 of the 10 problems. The results, released June 10, 2026, show nuanced AI performance that improved from early trials but still falls short of human expert levels.

Published Jun 14, 2026 · 2 min read

Thirty experts blind-graded AI solutions to original research-level math problems, and the results tell a nuanced story about where artificial intelligence actually stands

Here’s a question that keeps researchers up at night: can AI actually do math, or is it just really good at pattern-matching against problems it’s already seen? A group of 30 mathematicians at Harvard decided to find out the hard way, by giving leading AI systems a test they couldn’t possibly have studied for.

The project, called “First Proof, Second Batch,” assembled its expert panel at Harvard’s Center of Mathematical Sciences and Applications in early June 2026. Their task was straightforward but unprecedented in scale: blind-grade AI-generated solutions to 10 original, unpublished research-level mathematics problems. The results, released on June 10, paint a picture that’s neither the doom scenario nor the triumph that partisans on either side might prefer.

The setup: why unpublished problems matter

The entire exercise hinges on one critical design choice. Every problem in the set was drawn from active, unpublished research. None of these questions had appeared in textbooks, on arXiv, or anywhere else they could have been scraped into an AI’s training data.

The mathematicians behind the project aren’t exactly lightweights, either. The roster includes Mohammed Abouzaid from Stanford, Nikhil Srivastava from UC Berkeley, Rachel Ward from UT Austin, and Lauren Williams of Harvard.

What the AI actually got right, and wrong

Four leading AI systems participated in the evaluation, including models from OpenAI and Google. The headline number: the expert panel awarded passing grades on 7 of the 10 problems across the four systems tested.

In early trial runs, AI systems reportedly solved only 2 of the 10 problems. The gap between that early performance and the final results suggests that the models may have benefited from multiple attempts or different prompting strategies, though the blind grading protocol was designed to evaluate the quality of submitted solutions on their merits alone.

Building on earlier results

This second batch builds on an initial round of assessments conducted in February 2026. The First Proof project was designed from the start as an ongoing evaluation framework, not a one-time stunt. By running multiple rounds with fresh problems each time, the organizers can track whether AI capabilities are genuinely improving at research-level mathematics or simply plateauing after the initial rush of benchmark gains.

Standard math benchmarks, even difficult ones like competition-level problems, have increasingly fallen to frontier models. But competition problems, by definition, have known solutions and known solution methods. Research-level mathematics operates in a fundamentally different regime, where you often don’t know if a solution even exists, let alone what techniques might get you there.

Disclosure: This article was edited by the Editorial Team. For more information on how we create and review content, see our Editorial Policy.



