FrontierMath benchmark undergoes major audit as Epoch AI flags errors in one-third of math problems


[ARTICLE · art-25479] src=cryptobriefing.com ↗ pub=2026-06-12T17:44Z topic=ai-research verified=true sentiment=↓ negative

Epoch AI disclosed on May 11, 2026, that an internal audit of its FrontierMath benchmark, a 350-problem test developed with more than 60 mathematicians to evaluate AI reasoning, found fatal errors in roughly one-third of the dataset. The organization plans to release corrected scores after a full human review, which calls previously reported AI performance figures into question. The audit raises the estimated error rate from earlier figures of 7% to 10% to roughly 33%, potentially recalibrating how the industry measures machine intelligence capabilities.

Published Jun 12, 2026 · 2 min read

The AI reasoning benchmark built with 60+ mathematicians is getting a cleanup that could recalibrate how we measure machine intelligence

Epoch AI’s FrontierMath benchmark, a 350-problem test designed to push AI systems to their mathematical limits, is undergoing a significant correction after an internal review flagged errors in roughly one-third of its dataset. The audit, disclosed on May 11, 2026, revealed that the problems designed to stump the world’s most advanced AI models had a quality control issue of their own.

The organization plans to release updated scores once a thorough human review is completed.

What FrontierMath actually is, and why it matters

FrontierMath launched in November 2024 and was developed in collaboration with more than 60 mathematicians. The core dataset includes 300 problems across Tiers 1 through 3, spanning undergraduate to advanced graduate difficulty. Tier 4 adds another 50 problems at the research level, the kind of questions that can take even professional mathematicians multiple hours or days to solve, for 350 problems in total.

Earlier reviews of the dataset had suggested error rates in the range of 7% to 10%, based on limited secondary checks. The AI-assisted review that Epoch AI conducted painted a much less flattering picture, bumping that estimate to approximately 33% of problems containing what the organization described as fatal errors.
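To put those percentages in terms of problem counts, here is a minimal arithmetic sketch. It assumes the rates apply to all 350 problems, which the disclosure as reported here does not spell out; the function name `implied_problems` is purely illustrative.

```python
from fractions import Fraction

# Total size per the article: 300 problems in Tiers 1-3 plus 50 in Tier 4.
TOTAL_PROBLEMS = 300 + 50


def implied_problems(rate: Fraction, total: int = TOTAL_PROBLEMS) -> float:
    """Number of problems implied by an error rate (may be fractional)."""
    return float(rate * total)


# Assumption: each rate is taken over the full 350-problem dataset.
earlier_low = implied_problems(Fraction(7, 100))
earlier_high = implied_problems(Fraction(10, 100))
audit_estimate = implied_problems(Fraction(1, 3))

print(f"Earlier estimates: {earlier_low:.1f} to {earlier_high:.1f} problems")
print(f"AI-assisted audit: about {audit_estimate:.0f} problems")
```

Under that assumption, the earlier estimates correspond to roughly 25 to 35 flawed problems, while the audit figure corresponds to more than 100.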

The audit process and what went wrong

The errors flagged weren’t typos or formatting issues. They were described as fatal, meaning the problems themselves were fundamentally flawed in ways that would make correct answers impossible or ambiguous.

Epoch AI has committed to completing a full human review of every flagged problem before releasing corrected scores. Any model scores previously reported against FrontierMath should be taken with a generous grain of salt until the corrected version drops.

Why AI benchmarks should be on every crypto investor’s radar

FrontierMath has no connection to crypto, blockchain, or tokens. It lives squarely in the domain of pure mathematics and AI evaluation.

Updated scores from the cleaned dataset will likely shift the perceived capability frontier for leading models, potentially in either direction. As of June 12, 2026, no confirmation has been provided regarding a version 2 of the dataset.
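Why a cleanup can move scores in either direction can be illustrated with a small sketch. Everything here is hypothetical: the six problem IDs, the two models, and the choice to simply drop flagged problems. The article does not describe how Epoch AI will actually rescore, and it may repair problems rather than remove them.

```python
def accuracy(results: dict[str, bool]) -> float:
    """Fraction of problems marked correct."""
    return sum(results.values()) / len(results)


def rescore(results: dict[str, bool], flagged: set[str]) -> float:
    """Accuracy after dropping flagged problems (one possible policy, assumed here)."""
    kept = {pid: ok for pid, ok in results.items() if pid not in flagged}
    if not kept:
        raise ValueError("no problems left after removing flagged ones")
    return accuracy(kept)


# Hypothetical six-problem benchmark in which p1 and p2 are flagged as flawed.
flagged = {"p1", "p2"}

# Hypothetical model A was marked correct on both flawed problems.
model_a = {"p1": True, "p2": True, "p3": False, "p4": False, "p5": True, "p6": False}
# Hypothetical model B was marked wrong on both flawed problems.
model_b = {"p1": False, "p2": False, "p3": True, "p4": False, "p5": True, "p6": False}

for name, results in (("A", model_a), ("B", model_b)):
    print(f"Model {name}: {accuracy(results):.2f} -> {rescore(results, flagged):.2f}")
```

In this toy case model A falls from 0.50 to 0.25 while model B rises from about 0.33 to 0.50, so the direction of the shift depends on how each model happened to be graded on the flawed problems.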

Disclosure: This article was edited by the Editorial Team. For more information on how we create and review content, see our Editorial Policy.


Read original on cryptobriefing.com → cryptobriefing.com/frontiermath-benchmark-audit-…
