Mask-Proof: An LLM-based Automated Data Curation Pipeline on Mathematical Proofs

Source: arxiv.org · Published: 2026-06-16T04:00Z · Topic: large-language-models

Researchers introduced Mask-Proof, an LLM-based automated data curation pipeline that converts mathematical proofs into checkable masked-step tasks, creating the Mask-ProofBench benchmark with 292 problems. Testing 17 models showed reasoning-enhanced models outperformed standard ones by 12-27%, with the evaluator achieving 96.8% agreement with experts, enabling scalable step-level reasoning measurement.

arXiv:2606.15258v1. Abstract: Large language models (LLMs) are increasingly capable of mathematical problem solving and can even assist with research-level proofs, yet we still lack a scalable and reproducible way to measure step-level reasoning in long proofs across diverse sources. This evaluation gap limits trustworthy AI assistance in proof-certified scientific progress. Existing evaluations often emphasize final answers or rely on costly expert grading, while end-to-end proof generation remains open-ended and hard to verify automatically. We introduce Mask-Proof, a pipeline that turns real proofs into automatically checkable masked-step tasks. It masks key formula steps, provides the necessary surrounding context, and evaluates model reconstructions with an LLM-based equivalence judge using repeated votes for stability. The resulting Mask-ProofBench contains 292 curated problems across diverse research areas. Experiments with 17 models show that reasoning-enhanced models outperform standard models by 12% to 27%. Our evaluator achieves 96.8% agreement with expert annotators, enabling faithful, reproducible, and comparable measurement of step-level mathematical reasoning. Benchmark, annotations, and code are available at https://github.com/weating/Mask-Proof.
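
The masking and repeated-vote idea described in the abstract can be illustrated with a rough Python sketch. This is not the authors' implementation: the names `MASK_TOKEN`, `mask_step`, `majority_vote`, and `toy_judge` are hypothetical, the vote count of 5 is an arbitrary assumption, and a real pipeline would call an LLM where the toy judge below only compares strings with spaces removed.

```python
from collections import Counter
from typing import Callable, List, Tuple

MASK_TOKEN = "[MASK]"  # hypothetical placeholder, not taken from the paper


def mask_step(steps: List[str], index: int) -> Tuple[List[str], str]:
    """Return a copy of the proof with one step hidden, plus the hidden step."""
    masked = list(steps)
    reference = masked[index]
    masked[index] = MASK_TOKEN
    return masked, reference


def majority_vote(
    judge: Callable[[str, str], bool],
    candidate: str,
    reference: str,
    n_votes: int = 5,  # assumed value; the abstract does not state a count
) -> bool:
    """Ask the judge n_votes times and accept only on a strict majority."""
    votes = Counter(judge(candidate, reference) for _ in range(n_votes))
    return votes[True] > n_votes / 2


def toy_judge(candidate: str, reference: str) -> bool:
    """Stand-in for an LLM equivalence judge: compares with spaces removed."""
    return candidate.replace(" ", "") == reference.replace(" ", "")


if __name__ == "__main__":
    proof = ["a = b", "b = c", "a = c"]
    masked, reference = mask_step(proof, 1)
    print(masked)
    print(majority_vote(toy_judge, "b=c", reference))
```

Repeating the judgment matters only when the judge is non-deterministic, as an LLM typically is. With a deterministic stand-in like `toy_judge`, every vote agrees and the majority rule reduces to a single call.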

Read original on arxiv.org → arxiv.org/abs/2606.15258
