We can distill it for you wholesale

ServiceNow researchers have developed a new distillation method called π-Distill that allows smaller language models to learn from frontier models even when the teacher's chain-of-thought reasoning is hidden. The technique uses privileged information—such as tool calls or action trajectories—to train a student model that can replicate the teacher's behavior without accessing its internal reasoning. This approach shifts the distillation debate from "stealing secret knowledge" to transforming information unavailable at test time into learnable behaviors.

There has been a lot of drama [1] about distillation: how closed frontier models are being used by other labs to boost their own performance on particularly hard tasks. The drama is not fake, exactly. Anthropic, and recently OpenAI, have a notable lead in the agentic-coding domain, and some of that comes from having data that other people don’t. Getting it is… not cheap: this is why there are huge efforts going on at certain companies [2] to develop long-form agentic trajectories. But not everyone has the money, or the engineers, to do that. So there is an incentive to maybe, allegedly, copy some homework.

It’s not clear, though, exactly how to do that: the frontier labs generally don’t share the chain-of-thought that their models use while they reason, which means you only have a sparse signal to train your model on.

One piece of the puzzle is in a paper from February this year, “Privileged Information Distillation for Language Models” (https://arxiv.org/abs/2602.04942) by Emiliano Penaloza et al. at ServiceNow, which is probably not where most people expect the hot post-training discourse to come from. On-Policy Self-Distillation is spicy right now in post-training circles, and this is one of the earlier papers in the current zeitgeist [3]. The paper’s primary contribution is π-Distill: how do you do distillation when you have Privileged Information?

“We ground our work in the task of distilling frontier models for complex multi-turn agentic settings. Typically, the industry standard for these tasks involves Supervised Fine-Tuning (SFT) on frontier model outputs followed by Reinforcement Learning (RL). Unfortunately, some model providers restrict important information, most notably the model’s full Chain-of-Thought (CoT) reasoning traces (OpenAI et al., 2024), providing only a summary alongside the action they intend to take.
This opacity undermines standard distillation methods, as we can observe what successful agents do but not how they reason about it.”

The rough idea is to not use the frontier model as a teacher, but to use it as a source of that privileged information:

- You have one set of model weights, run in two modes: a privileged teacher and an unprivileged student.
- A frontier model solves a task in its tool-use harness. You may not see its chain-of-thought, but you can observe what it actually does: its action trajectory.
- That action trajectory is converted into the privileged information: tool names, tool calls with arguments, or a compact hint.
- The teacher-mode model sees the task/history plus this privileged trace in its prompt. The student-mode model sees only the task/history in its prompt.
- The teacher rolls out a trajectory and gets an RL reward. [4]
- The student is then trained with teacher forcing: the loss measures how likely the student would have been to predict each next token the teacher actually generated.
- The teacher and student losses are combined and applied to the single shared set of weights.

As the authors continue, it doesn’t even require a closed model to distill from. Other kinds of privileged information can help you do the same trick, which is the second variant of their recipe. If you don’t have an outside source but you do know some bonus details (e.g. hints on how to solve the task, or critiques of prior attempts), you can pass them to the teacher:

- Let the student roll out, without the privileged information.
- Then ask the informed teacher how compatible the student’s tokens were with what the teacher would have done.

The discussion about distillation has focused on the idea of stealing some kind of secret knowledge. What this method really shows, though, is that distillation is about turning information that the model will not have at test time into behaviors it will have.
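To make the two recipes above a bit more concrete, here is a toy sketch in plain Python. It is not the paper's implementation: the function names, the REINFORCE-style surrogate, the direction of the KL penalty, and the `alpha` and `kl_coef` weights are all assumptions chosen for illustration. Logits are plain lists, where a real trainer would use autograd tensors produced by one shared model under the two prompts.

```python
import math
from typing import List, Sequence, Tuple

Logits = Sequence[Sequence[float]]  # one row of vocabulary logits per generated token


def log_softmax(row: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def build_prompts(task: str, privileged_trace: str) -> Tuple[str, str]:
    """Same weights, two prompts: only the teacher sees the privileged trace."""
    teacher_prompt = f"{task}\n\n[privileged trace]\n{privileged_trace}"
    student_prompt = task
    return teacher_prompt, student_prompt


def token_logprobs(logits: Logits, tokens: Sequence[int]) -> List[float]:
    """Log-probability assigned to each chosen token."""
    return [log_softmax(row)[tok] for row, tok in zip(logits, tokens)]


def student_nll(student_logits: Logits, teacher_tokens: Sequence[int]) -> float:
    """Teacher forcing: mean negative log-likelihood of the teacher's tokens
    under the student-mode (unprivileged) prompt."""
    lps = token_logprobs(student_logits, teacher_tokens)
    return -sum(lps) / len(lps)


def kl(p_row: Sequence[float], q_row: Sequence[float]) -> float:
    """KL(p || q) between two next-token distributions given as logits."""
    lp, lq = log_softmax(p_row), log_softmax(q_row)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def teacher_loss(teacher_logits: Logits, student_logits: Logits,
                 teacher_tokens: Sequence[int], reward: float,
                 kl_coef: float = 0.1) -> float:
    """Assumed REINFORCE-style surrogate for the teacher rollout, plus a
    KL(teacher || student) penalty (direction is an assumption)."""
    lps = token_logprobs(teacher_logits, teacher_tokens)
    surrogate = -reward * sum(lps) / len(lps)
    penalty = sum(kl(t, s) for t, s in zip(teacher_logits, student_logits))
    return surrogate + kl_coef * penalty / len(teacher_tokens)


def joint_loss(teacher_logits: Logits, student_logits: Logits,
               teacher_tokens: Sequence[int], reward: float,
               kl_coef: float = 0.1, alpha: float = 0.5) -> float:
    """First recipe: both losses are combined and applied to the shared weights."""
    t = teacher_loss(teacher_logits, student_logits, teacher_tokens, reward, kl_coef)
    s = student_nll(student_logits, teacher_tokens)
    return alpha * t + (1.0 - alpha) * s


def on_policy_mismatch(student_logits: Logits, teacher_logits: Logits,
                       student_tokens: Sequence[int]) -> float:
    """Second recipe: score the student's own tokens against the informed
    teacher. This is a sampled estimate of KL(student || teacher)."""
    s = token_logprobs(student_logits, student_tokens)
    t = token_logprobs(teacher_logits, student_tokens)
    return sum(a - b for a, b in zip(s, t)) / len(s)
```

In both cases the teacher and student logits would come from the same weights, evaluated on the same tokens but conditioned on the two different prompts from `build_prompts`. The only structural difference between the recipes in this sketch is whose rollout supplies the tokens: the teacher's in `joint_loss`, the student's in `on_policy_mismatch`.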
As any good teacher knows, having a sense of how to get to the answer makes it easier to help your student. The “on-policy” part here is that the student and teacher are the same model; the difference is that the teacher is reading ahead in the study guide.

As tasks get longer, tool use gets richer, and agent traces get more valuable, the question is probably less “can labs hide the model’s reasoning?” and more “what clues can you train on?”

[1] And/or marketing.
[2] Notably including the one I work at.
[3] Other good reads are the Thinky Blog (https://thinkingmachines.ai/blog/on-policy-distillation/) and “Self-Distilled Reasoner” (https://arxiv.org/abs/2601.18734), which was released a few days before this and is where the name comes from.
[4] With a KL penalty that keeps it from drifting too far from the student.