Which Pairs to Compare for LLM Post-Training?

wpnews.pro

Source: arxiv.org · Published: 2026-06-19 · Topic: large language models

A new arXiv preprint proposes a framework for choosing which comparison pairs to label in preference-based post-training of large language models when the labeling budget is fixed. For Direct Preference Optimization (DPO), the paper gives matching upper and lower bounds that link comparison selection to downstream policy performance, and it reports better sample efficiency than common comparison-selection heuristics on synthetic settings and language-model post-training benchmarks.

arXiv:2606.19607v1. Abstract: Preference-based post-training has become a central paradigm for aligning language models. A common data-collection strategy is to generate a small set of completions for each prompt and label the resulting comparison pairs. However, human preference labels are often much more expensive than generating additional completions, suggesting a different use of the same labeling budget: generate a larger pool of completions, but label only the most informative comparison pairs. This paper studies which pairs should be compared in preference-based post-training. We formulate comparison curation as a sampling-design problem and evaluate designs by the quality of the final policy under the preference-based post-training objective. We instantiate this framework for Direct Preference Optimization (DPO), analyzing how the choice of labeled pairs propagates through DPO training to downstream policy performance. Our main results provide matching upper and lower bounds on the post-training optimality gap of the DPO-trained policy. The bounds show that comparison selection affects downstream performance through a single design-dependent information matrix, which links label allocation to parameter estimation error and policy suboptimality. This yields an explicit optimization criterion for budgeted comparison curation and motivates practical sampling designs for selecting informative pairs from large generated completion pools. Experiments on synthetic settings and language-model post-training benchmarks show that the proposed designs consistently improve sample efficiency over common comparison-selection heuristics.
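The abstract does not state the information matrix or the sampling designs themselves, so the following is only a rough illustration of what an information-matrix criterion for pair selection can look like, not the paper's method. It assumes a linear Bradley-Terry reward model over hypothetical completion feature vectors, evaluated at zero parameters so that each labeled pair (i, j) contributes 0.25 · z zᵀ with z the feature difference, and it swaps in a standard greedy D-optimal design (maximize the log-determinant of the accumulated matrix). The function name, the `ridge` regularizer, and the random feature pool are all assumptions made for the example.

```python
import itertools
import numpy as np


def greedy_d_optimal_pairs(features, budget, ridge=1e-3):
    """Greedily pick up to `budget` distinct pairs that grow log det of an information matrix.

    Illustrative assumptions (not taken from the paper):
    - reward is linear in `features`, with a Bradley-Terry preference model;
    - the model is evaluated at theta = 0, so each pair (i, j) adds
      0.25 * z z^T to the information matrix, where z = features[i] - features[j];
    - the matrix starts at ridge * I to keep it invertible.
    """
    n, d = features.shape
    pairs = list(itertools.combinations(range(n), 2))
    # Scale by 0.5 so that z z^T already carries the 0.25 Bradley-Terry weight.
    diffs = 0.5 * np.array([features[i] - features[j] for i, j in pairs])

    inv_info = np.eye(d) / ridge  # inverse of the initial matrix ridge * I
    available = np.ones(len(pairs), dtype=bool)
    chosen = []

    for _ in range(min(budget, len(pairs))):
        # Adding z raises log det by log(1 + z^T inv_info z), so rank by the quadratic form.
        scores = np.einsum("kd,de,ke->k", diffs, inv_info, diffs)
        scores[~available] = -np.inf
        k = int(np.argmax(scores))

        # Sherman-Morrison rank-one update of the inverse after adding z z^T.
        z = diffs[k]
        iz = inv_info @ z
        inv_info -= np.outer(iz, iz) / (1.0 + z @ iz)

        available[k] = False
        chosen.append(pairs[k])
    return chosen


# Hypothetical pool: 8 completions for one prompt, each with a 4-dimensional feature vector.
rng = np.random.default_rng(0)
pool = rng.normal(size=(8, 4))
print(greedy_d_optimal_pairs(pool, budget=5))
```

In this toy setting the greedy rule tends to favor pairs whose feature differences point in directions the already-selected pairs cover poorly, which is the intuition behind spending labels on informative pairs from a larger completion pool. A criterion derived from the paper's bounds may weight pairs differently.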

source & further reading

Read original on arxiv.org → arxiv.org/abs/2606.19607
