OPRD: On-Policy Representation Distillation

Researchers have introduced On-Policy Representation Distillation (OPRD), a method that aligns student and teacher model representations across selected layers during training, bypassing the language model head to eliminate sampling variance. The approach closes the student-teacher performance gap on AIME 2024/2025 and AIMO benchmarks, where output-space distillation methods plateau below the teacher. OPRD also achieves 1.44x faster training and 54% less memory usage compared to top-k on-policy distillation.
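As a rough illustration of what aligning representations across selected layers could look like, here is a minimal sketch in plain Python. It is not the authors' implementation: the mean-squared-error objective, the `layer_map` pairing of student and teacher layers, the equal hidden sizes, and the function names `layer_mse` and `representation_loss` are all assumptions made for this example.

```python
from typing import Dict, List

# hidden[layer] is a [tokens x dim] matrix of hidden states for one rollout.
Hidden = Dict[int, List[List[float]]]


def layer_mse(student_h: List[List[float]], teacher_h: List[List[float]]) -> float:
    """Mean squared error between two [tokens x dim] hidden-state matrices."""
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student_h, teacher_h):
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count


def representation_loss(student: Hidden, teacher: Hidden,
                        layer_map: Dict[int, int]) -> float:
    """Average per-layer MSE over the selected (student, teacher) layer pairs.

    Assumption: both models share the hidden size; with different sizes a
    learned projection would be needed before comparing.
    """
    losses = [layer_mse(student[s], teacher[t]) for s, t in layer_map.items()]
    return sum(losses) / len(losses)


# Toy hidden states for a 2-token rollout with hidden size 2.
student = {2: [[0.0, 1.0], [1.0, 0.0]], 4: [[1.0, 1.0], [0.0, 0.0]]}
teacher = {4: [[0.0, 1.0], [1.0, 0.0]], 8: [[1.0, 3.0], [0.0, 0.0]]}
layer_map = {2: 4, 4: 8}  # hypothetical pairing: student layer -> teacher layer

loss = representation_loss(student, teacher, layer_map)
print(loss)
```

Because the loss is computed directly on hidden states from the same rollout, it involves no sampling from the vocabulary distribution and no pass through the LM head.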

Computer Science > Machine Learning. Submitted on 4 Jun 2026.

PDF: /pdf/2606.06021
HTML (experimental): https://arxiv.org/html/2606.06021v1

Abstract: On-policy distillation (OPD) supervises the student only in output space by matching next-token probabilities. This output-only paradigm has two limits: (1) sampling variance from Monte Carlo KL estimates over large vocabularies (e.g., Qwen's ~150k tokens) persists throughout training, and (2) it treats the teacher as a black box, discarding all intermediate hidden states after the LM head. We propose On-Policy Representation Distillation (OPRD), which lifts distillation into hidden-state space by aligning student and teacher representations across selected layers on the same rollouts, bypassing the LM head entirely. Theoretically, OPRD eliminates sampling variance and provides richer per-layer structural information. Empirically, OPRD closes the student-teacher gap on AIME 2024/2025 and AIMO, while output-space OPD baselines plateau below the teacher. OPRD also trains 1.44x faster and uses 54% less memory than top-k OPD. Code: this https URL.
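The sampling-variance point can be illustrated with a small experiment. The sketch below compares an exact KL divergence with single-sample Monte Carlo estimates on a toy vocabulary. Everything in it is an assumption made for illustration: the 1,000-token vocabulary, the random logits, the choice of KL(student || teacher) with samples drawn from the student, and the helper names `softmax` and `exact_kl`. It is not taken from the paper.

```python
import math
import random
import statistics


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def exact_kl(p, q):
    """Exact KL(p || q), summing over the whole vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


rng = random.Random(0)
vocab = 1000  # toy size; real vocabularies are far larger

# Hypothetical next-token logits: the teacher is a noisy copy of the student.
student_logits = [rng.gauss(0.0, 1.0) for _ in range(vocab)]
teacher_logits = [s + rng.gauss(0.0, 0.5) for s in student_logits]
p = softmax(student_logits)  # student distribution (tokens are sampled from it)
q = softmax(teacher_logits)  # teacher distribution

kl = exact_kl(p, q)

# Each sampled token gives one unbiased but noisy estimate, log p(x) / q(x).
tokens = rng.choices(range(vocab), weights=p, k=2000)
estimates = [math.log(p[i] / q[i]) for i in tokens]
mean = statistics.mean(estimates)
var = statistics.variance(estimates)

print(f"exact KL: {kl:.4f}")
print(f"mean of single-sample estimates: {mean:.4f}")
print(f"variance of single-sample estimates: {var:.4f}")
```

The estimates average out close to the exact value, but each individual estimate is noisy. A loss defined directly on hidden states has no such sampling step.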