[ARTICLE · art-21590] src=ianbarber.blog pub= topic=large-language-models verified=true sentiment=· neutral

We can distill it for you wholesale

ServiceNow researchers have developed a new distillation method called π-Distill that allows smaller language models to learn from frontier models even when the teacher's chain-of-thought reasoning is hidden. The technique uses privileged information—such as tool calls or action trajectories—to train a student model that can replicate the teacher's behavior without accessing its internal reasoning. This approach shifts the distillation debate from "stealing secret knowledge" to transforming information unavailable at test time into learnable behaviors.

4 min read · published Jun 1, 2026

There has been a lot of drama[1] about distillation: how (closed) frontier models are being used by other labs to boost their own performance on particularly hard tasks.

The drama is not fake, exactly. Anthropic, and recently OpenAI, have a notable lead in the agentic-coding domain, and some of that is from having data that other people don’t. Getting it is… not cheap:

This is why there are huge efforts going on at certain companies[2] to develop long-form agentic trajectories. But! Not everyone has the money, or the engineers, to do that.

So, there is an incentive to maybe, allegedly, copy some homework. It’s not clear though how exactly to do that: the frontier labs generally don’t share the chain-of-thought that their models are using while they reason, which means you only have a sparse signal to train your model on.

One piece of the puzzle is in a paper from February this year, “Privileged Information Distillation for Language Models” by Emiliano Penaloza et al. at ServiceNow, which is probably not where most people are expecting the hot post-training discourse to come from. On-Policy Self-Distillation is spicy right now in post-training circles, and this is one of the earlier papers in the current zeitgeist[3].

The paper’s primary contribution is π-Distill: how do you do distillation when you have Privileged Information?

“We ground our work in the task of distilling frontier models for complex multi-turn agentic settings. Typically, the industry standard for these tasks involves Supervised Fine-Tuning (SFT) on frontier model outputs followed by Reinforcement Learning (RL). Unfortunately, some model providers restrict important information, most notably the model’s full Chain-of-Thought (CoT) reasoning traces (OpenAI et al., 2024), providing only a summary alongside the action they intend to take. This opacity undermines standard distillation methods, as we can observe what successful agents do but not how they reason about it.”

The rough idea is to not use the frontier model as a teacher, but to use it as a source of that privileged information:

  • You have one set of model weights, run in two modes: a privileged teacher, and an unprivileged student.
  • A frontier model solves a task in its tool-use harness. You may not see its chain-of-thought, but you can observe what it actually does: its action trajectory.
  • That action trajectory is converted into the privileged information: tool names, tool calls with arguments, or a compact hint.
  • The teacher-mode model sees the task/history plus this privileged trace in the prompt. The student-mode model only sees the task/history in its prompt.
  • The teacher rolls out a trajectory and gets an RL reward.[4]
  • The student is then trained with teacher forcing: calculating loss based on how likely it would be to predict the actual next token the teacher generated.
  • The teacher and student losses are combined and applied to the single shared set of weights.
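To make the steps above concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function names, the prompt layout, the REINFORCE-style teacher term, and the mixing weights `alpha` and `beta` are all assumptions chosen for illustration. The only things taken from the text are the overall shape (one model, two prompts, a reward plus KL penalty for the teacher, a teacher-forced likelihood loss for the student, and a combined loss).

```python
def format_privileged(actions, mode="tool_calls"):
    """Turn an observed action trajectory into a privileged hint string.

    `actions` is a list of dicts like {"tool": ..., "args": ...}
    (a hypothetical format, not one defined by the paper).
    """
    if mode == "tool_names":
        return "\n".join(a["tool"] for a in actions)
    return "\n".join(f'{a["tool"]}({a["args"]})' for a in actions)


def build_prompts(task, actions):
    """Same task, two views: only the teacher prompt carries the trace."""
    student_prompt = task
    teacher_prompt = task + "\n[privileged trace]\n" + format_privileged(actions)
    return teacher_prompt, student_prompt


def combined_loss(teacher_logps, student_logps, reward, alpha=0.5, beta=0.1):
    """Combine teacher and student losses for one teacher rollout.

    teacher_logps[i]: log-prob of the i-th sampled token under the
                      teacher-mode prompt (privileged info visible).
    student_logps[i]: log-prob of the same token under the
                      student-mode prompt (no privileged info).
    """
    n = len(teacher_logps)
    # Sampled estimate of KL(teacher || student) on the teacher's own tokens.
    kl_estimate = sum(t - s for t, s in zip(teacher_logps, student_logps)) / n
    # Assumed REINFORCE-style term: raise the likelihood of rewarded rollouts,
    # with a KL penalty that keeps the teacher close to the student.
    teacher_loss = -reward * sum(teacher_logps) / n + beta * kl_estimate
    # Teacher forcing: negative log-likelihood of the teacher's tokens
    # as scored from the student's view.
    student_loss = -sum(student_logps) / n
    # Both terms update the single shared set of weights.
    return alpha * teacher_loss + (1 - alpha) * student_loss
```

In a real training loop the log-probabilities would come from two forward passes of the same network, one per prompt, and the combined loss would be backpropagated once.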

As the authors go on to show, the method doesn’t even require a closed model to distill from. Other kinds of privileged information can help you do the same trick, which is the second variant of their recipe. If you don’t have an outside source but you do know some bonus details (e.g. hints on how to solve the task, or critiques of prior attempts) you can pass them into the teacher:

  • Let the student roll out, without the privileged information.
  • Then ask the informed teacher how compatible the student’s tokens were with what the teacher would have done.
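One way to read "how compatible the student's tokens were" is as a per-position divergence between the student's next-token distribution and the informed teacher's, measured along the student's own rollout. The sketch below uses reverse KL, KL(student || teacher), which is a common choice for on-policy distillation; that choice, and the function names, are assumptions rather than something the text above specifies.

```python
import math


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for one next-token distribution."""
    return sum(
        ps * math.log(ps / pt)
        for ps, pt in zip(student_probs, teacher_probs)
        if ps > 0  # terms with zero student mass contribute nothing
    )


def on_policy_distill_loss(student_dists, teacher_dists):
    """Average reverse KL over the positions of a student rollout.

    student_dists[i]: the student's distribution at position i
                      (no privileged information in its prompt).
    teacher_dists[i]: the teacher's distribution at the same position,
                      same prefix, but with the privileged hint visible.
    """
    per_position = [reverse_kl(s, t) for s, t in zip(student_dists, teacher_dists)]
    return sum(per_position) / len(per_position)
```

The loss is zero when the student already behaves as the informed teacher would, and grows as the student puts probability on tokens the teacher considers unlikely.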

The discussion about distillation has focused on the idea of stealing some kind of secret knowledge. What this method really shows though is that distillation is about turning information that the model will not have at test time into behaviors it will have.

As any good teacher knows, having a sense of how to get to the answer makes it easier to help your student. The “on-policy” part here is that the student and teacher are the same model; the difference is that the teacher is reading ahead in the study guide.

As tasks get longer, tool use gets richer, and agent traces get more valuable, the question is probably less “can labs hide the model’s reasoning?” and more “what clues can you train on?”

[1] And/or marketing.
[2] Notably including the one I work at!
[3] Other good reads are the Thinky Blog and “Self-Distilled Reasoner”, which was released a few days before this, and is where the name comes from.
[4] With a KL penalty that keeps it from drifting too far from the student.
