Data and Evaluation Closed-Loop for Model Capability Enhancement

wpnews.pro


Source: arxiv.org | Published: 2026-06-30T04:00Z | Topic: large-language-models

Researchers introduced a "capability slice" unit and a closed-loop system that links evaluation failures to targeted data interventions in LLM pre-training. In two case studies, the system traced a BBH score drop to a masked <EOS> loss rather than to the data, and raised AIME2025/AIME2026 Pass@128 from 6.67/0.00 to 26.67 each through weakness-targeted sampling. The approach aims to make evaluation-to-data inference routine and auditable rather than driven by intuition.
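For readers unfamiliar with the metric, the sketch below shows one common way a Pass@k score is computed: the standard unbiased estimator 1 - C(n-c, k) / C(n, k), averaged over problems. The source does not say how Pass@128 was computed or how many problems each AIME set contains, so the estimator choice, the 30-problem set size, and the per-problem counts are all assumptions made for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples is correct) from n samples, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(correct_counts: list[int], n: int, k: int) -> float:
    """Average per-problem pass@k, reported on a 0-100 scale."""
    per_problem = [pass_at_k(n, c, k) for c in correct_counts]
    return 100.0 * sum(per_problem) / len(per_problem)


# Hypothetical run (not data from the paper): 30 problems, 128 samples each,
# 8 problems solved at least once and 22 never solved.
hypothetical_counts = [3] * 8 + [0] * 22
score = benchmark_score(hypothetical_counts, n=128, k=128)
print(round(score, 2))  # prints 26.67
```

When k equals n, as in this hypothetical run, the estimator reduces to the fraction of problems with at least one correct sample. Under the 30-problem assumption, 6.67 and 26.67 would correspond to 2 and 8 problems solved.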

Published Jun 30, 2026

arXiv:2606.28471v1 (announce type: new)

Abstract: Model capability is the central variable in LLM pre-training, yet it is never observed directly: data shapes it prospectively, while evaluation reveals it only retrospectively, compressing samples, prompts, decoding, and scoring rules into one noisy score. Practical optimization runs this backward: a failure is observed first, and the engineer must infer the corpus fix. The two sides speak incompatible vocabularies (benchmark names and per-sample correctness versus data sources, domains, and quality labels), so this inference is usually intuition, not method. We close this gap with the capability slice: a group of evaluation samples sharing background condition, task type, solving operation, and output constraint. It is precise enough to localize a single weakness yet stable enough to survive aggregation, unlike a benchmark name, which is too coarse, or a single sample, which is too noisy. Built around this unit, an evaluation taxonomy, a non-instruction data taxonomy, and mapping rules form a closed loop that turns a benchmark-level failure into a targeted, testable data intervention. We test this loop on two case studies pulling in opposite directions. First, the loop rules the data out: continued pre-training drives BBH down by 46.82%, but diagnosis traces this to a single masked <EOS> loss rather than weakened reasoning; restoring it recovers BBH to 66.44, above the original checkpoint, without changing the data. Second, the loop rules the data in: a persistent math-reasoning weakness is decomposed by solving operation into specific failing combinations, and a weakness-targeted sampling procedure built from it lifts AIME2025/AIME2026 Pass@128 from 6.67/0.00 to 26.67 each. The same unmodified loop reaches opposite, correct verdicts in both cases, showing that the evaluation-to-data inference can be routine, auditable, and experimentally validated rather than intuitive.
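To make the capability-slice idea concrete, here is a minimal sketch that groups evaluation samples by the four shared attributes named in the abstract, computes per-slice accuracy, and derives sampling weights from failure rates. It is not the authors' implementation: the `SliceKey` fields mirror the abstract's wording, but the attribute values, the `min_samples` threshold, and the failure-rate weighting rule are hypothetical choices for illustration only.

```python
from collections import defaultdict
from typing import NamedTuple


class SliceKey(NamedTuple):
    # The four shared attributes named in the abstract; values are free-form labels.
    background: str
    task_type: str
    operation: str
    output_constraint: str


def slice_accuracy(results):
    """results: iterable of (SliceKey, bool). Returns {SliceKey: (accuracy, n_samples)}."""
    totals = defaultdict(lambda: [0, 0])  # key -> [n_correct, n_total]
    for key, correct in results:
        totals[key][0] += int(correct)
        totals[key][1] += 1
    return {key: (c / n, n) for key, (c, n) in totals.items()}


def sampling_weights(acc_by_slice, min_samples=2):
    """Hypothetical rule: weight each slice by its failure rate, then normalize.

    Slices with fewer than min_samples samples are skipped as too noisy.
    """
    raw = {k: 1.0 - acc for k, (acc, n) in acc_by_slice.items() if n >= min_samples}
    total = sum(raw.values())
    if total == 0:
        return {}  # no measurable weakness to target
    return {k: w / total for k, w in raw.items()}


# Illustrative labels and outcomes, not taken from the paper.
modular = SliceKey("competition math", "word problem", "modular arithmetic", "integer")
counting = SliceKey("competition math", "word problem", "casework counting", "integer")
results = [(modular, False), (modular, False), (modular, True), (modular, False),
           (counting, True), (counting, True), (counting, True), (counting, False)]

accuracy = slice_accuracy(results)
weights = sampling_weights(accuracy)
```

In this toy example the weaker slice (modular arithmetic, 1 of 4 correct) receives three times the sampling weight of the stronger one. The paper's actual mapping rules from evaluation slices to data categories are not described in the abstract, so none are assumed here.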

source & further reading

Read original on arxiv.org → arxiv.org/abs/2606.28471
