{"slug": "mask-proof-an-llm-based-automated-data-curation-pipeline-on-mathematical-proofs", "title": "Mask-Proof: An LLM-based Automated Data Curation Pipeline on Mathematical Proofs", "summary": "Researchers introduced Mask-Proof, an LLM-based automated data curation pipeline that converts mathematical proofs into checkable masked-step tasks, creating the Mask-ProofBench benchmark with 292 problems. Testing 17 models showed reasoning-enhanced models outperformed standard ones by 12-27%, with the evaluator achieving 96.8% agreement with experts, enabling scalable step-level reasoning measurement.", "body_md": "arXiv:2606.15258v1\nAbstract: Large language models (LLMs) are increasingly capable of mathematical problem solving and can even assist with research-level proofs, yet we still lack a scalable and reproducible way to measure step-level reasoning in long proofs across diverse sources. This evaluation gap limits trustworthy AI assistance in proof-certified scientific progress. Existing evaluations often emphasize final answers or rely on costly expert grading, while end-to-end proof generation remains open-ended and hard to verify automatically. We introduce Mask-Proof, a pipeline that turns real proofs into automatically checkable masked-step tasks. It masks key formula steps, provides the necessary surrounding context, and evaluates model reconstructions with an LLM-based equivalence judge using repeated votes for stability. The resulting Mask-ProofBench contains 292 curated problems across diverse research areas. Experiments with 17 models show that reasoning-enhanced models outperform standard models by 12% to 27%. Our evaluator achieves 96.8% agreement with expert annotators, enabling faithful, reproducible, and comparable measurement of step-level mathematical reasoning. 
Benchmark, annotations, and code are available at https://github.com/weating/Mask-Proof.", "url": "https://wpnews.pro/news/mask-proof-an-llm-based-automated-data-curation-pipeline-on-mathematical-proofs", "canonical_source": "https://arxiv.org/abs/2606.15258", "published_at": "2026-06-16 04:00:00+00:00", "updated_at": "2026-06-16 04:21:43.835067+00:00", "lang": "en", "topics": ["large-language-models", "ai-research", "ai-tools", "natural-language-processing"], "entities": ["Mask-Proof", "Mask-ProofBench", "arXiv"], "alternates": {"html": "https://wpnews.pro/news/mask-proof-an-llm-based-automated-data-curation-pipeline-on-mathematical-proofs", "markdown": "https://wpnews.pro/news/mask-proof-an-llm-based-automated-data-curation-pipeline-on-mathematical-proofs.md", "text": "https://wpnews.pro/news/mask-proof-an-llm-based-automated-data-curation-pipeline-on-mathematical-proofs.txt", "jsonld": "https://wpnews.pro/news/mask-proof-an-llm-based-automated-data-curation-pipeline-on-mathematical-proofs.jsonld"}}