{"slug": "data-and-evaluation-closed-loop-for-model-capability-enhancement", "title": "Data and Evaluation Closed-Loop for Model Capability Enhancement", "summary": "Researchers introduced a 'capability slice' unit and a closed-loop system that links evaluation failures to targeted data interventions in LLM pre-training. In two case studies, the system traced a BBH score drop to a masked EOS loss rather than to the data, and raised AIME2025/AIME2026 Pass@128 from 6.67/0.00 to 26.67 each through weakness-targeted sampling. The approach makes evaluation-to-data inference routine and auditable rather than intuitive.", "body_md": "arXiv:2606.28471v1 Announce Type: new\nAbstract: Model capability is the central variable in LLM pre-training, yet it is never observed directly: data shapes it prospectively, while evaluation reveals it only retrospectively, compressing samples, prompts, decoding, and scoring rules into one noisy score. Practical optimization runs this backward: a failure is observed first, and the engineer must infer the corpus fix. The two sides speak incompatible vocabularies -- benchmark names and per-sample correctness versus data sources, domains, and quality labels -- so this inference is usually intuition, not method. We close this gap with the *capability slice*: a group of evaluation samples sharing background condition, task type, solving operation, and output constraint -- precise enough to localize a single weakness yet stable enough to survive aggregation, unlike a benchmark name, which is too coarse, or a single sample, which is too noisy. Built around this unit, an evaluation taxonomy, a non-instruction data taxonomy, and mapping rules form a closed loop that turns a benchmark-level failure into a targeted, testable data intervention. We test this loop on two case studies pulling in opposite directions. 
First, the loop rules the data out: continued pre-training drives BBH down by $46.82\\%$, but diagnosis traces this to a single masked `<EOS>` loss rather than weakened reasoning; restoring that loss recovers BBH to $66.44$, above the original checkpoint, without changing the data. Second, the loop rules the data in: a persistent math-reasoning weakness is decomposed by solving operation into specific failing combinations, and a weakness-targeted sampling procedure built from it lifts AIME2025/AIME2026 Pass@128 from $6.67$/$0.00$ to $26.67$ each. The same unmodified loop reaches opposite, correct verdicts in both cases, showing that the evaluation-to-data inference can be routine, auditable, and experimentally validated rather than intuitive.", "url": "https://wpnews.pro/news/data-and-evaluation-closed-loop-for-model-capability-enhancement", "canonical_source": "https://arxiv.org/abs/2606.28471", "published_at": "2026-06-30 04:00:00+00:00", "updated_at": "2026-06-30 04:30:44.979215+00:00", "lang": "en", "topics": ["large-language-models", "ai-research", "ai-agents"], "entities": ["arXiv", "BBH", "AIME2025", "AIME2026"], "alternates": {"html": "https://wpnews.pro/news/data-and-evaluation-closed-loop-for-model-capability-enhancement", "markdown": "https://wpnews.pro/news/data-and-evaluation-closed-loop-for-model-capability-enhancement.md", "text": "https://wpnews.pro/news/data-and-evaluation-closed-loop-for-model-capability-enhancement.txt", "jsonld": "https://wpnews.pro/news/data-and-evaluation-closed-loop-for-model-capability-enhancement.jsonld"}}