[ARTICLE · art-17547] src=dev.to pub= topic=machine-learning verified=true sentiment=· neutral

How Model Distillation Actually Works (and What the 'China Distilled Our Model' Headlines Really Mean)

Model distillation is a standard deep learning technique that trains a smaller "student" model to imitate a larger "teacher" model by matching the teacher's full probability distribution, not just the correct answer. Contrary to headlines that frame it as theft or an exotic trick, the method is well-established and routinely used by AI labs themselves to create cheaper, faster production models. The actual controversy involves "black-box" distillation, where a student is trained on text outputs from a closed API like ChatGPT, losing the teacher's internal "dark knowledge" but still producing statistically similar behavior.

4 min read · published May 29, 2026

Every few weeks a headline drops: "Chinese lab distilled a frontier model from OpenAI / Anthropic." Cue the comments — half the thread thinks distillation is a synonym for theft, the other half thinks it's some exotic Chinese trick.

Both are wrong. Distillation is one of the most boring, well-established techniques in deep learning, and the labs raising the alarms use it on their own models constantly. The actual controversy is narrower and more interesting than the headlines. Let's separate the engineering from the geopolitics.

Knowledge distillation trains a small student model to imitate a large teacher model. The classic framing comes from Hinton et al. (2015): instead of training the student only on ground-truth labels, you also train it to match the teacher's output distribution.

Why does that help? Because the teacher's full probability distribution carries far more information than the single correct answer. If a teacher classifies an image of a dog, it might output dog: 0.9, wolf: 0.08, cat: 0.001. That "dog and wolf are similar, cat is not" signal, which Hinton called dark knowledge, is exactly what a small model struggles to learn from hard labels alone.

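There are two kinds of training signal: hard labels (the single ground-truth answer) and soft labels (the teacher's full probability distribution over all answers).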

The trick is temperature. You divide the logits by a temperature T > 1 before the softmax, which flattens the distribution and exposes those small-but-meaningful probabilities the student should learn from.
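To make the flattening effect concrete, here is a minimal standard-library sketch. The logit values are made up for illustration and the helper name `softmax_with_temperature` is an assumption, not something from a particular library.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits / T; T > 1 flattens the distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for the classes (dog, wolf, cat).
teacher_logits = [6.0, 3.5, -1.0]

sharp = softmax_with_temperature(teacher_logits, T=1.0)
soft = softmax_with_temperature(teacher_logits, T=4.0)

for name, p1, p4 in zip(["dog", "wolf", "cat"], sharp, soft):
    print(f"{name}: T=1 -> {p1:.4f}, T=4 -> {p4:.4f}")
```

At the higher temperature the ranking of the classes is unchanged, but the low-probability classes take a noticeably larger share, which gives the student more signal to learn from.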

The loss is a blend of two terms: a standard cross-entropy against the real labels, and a KL-divergence pulling the student's softened distribution toward the teacher's.
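Written out, one common form of this blended loss looks as follows. The notation is an assumption introduced here: z_s and z_t are the student and teacher logits, y is the ground-truth label, σ is the softmax, α is the blend weight, and T is the temperature.

```latex
\mathcal{L} = \alpha \, \mathrm{CE}\big(y, \sigma(z_s)\big)
            + (1 - \alpha) \, T^{2} \, \mathrm{KL}\big(\sigma(z_t / T) \,\|\, \sigma(z_s / T)\big)
```

The T² factor follows Hinton et al. (2015): the gradients of the soft term shrink roughly as 1/T², so multiplying by T² keeps the two terms on a comparable scale when the temperature changes.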

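import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Hard term: ordinary cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    # Soft term: KL(teacher || student) on temperature-softened distributions.
    # F.kl_div expects log-probabilities as input and probabilities as target.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    # T ** 2 compensates for the 1 / T ** 2 scaling of the soft-term gradients.
    soft_loss = F.kl_div(student_log_probs, soft_targets, reduction="batchmean") * (T ** 2)

    # alpha weights the hard term; (1 - alpha) weights the soft term.
    return alpha * hard_loss + (1 - alpha) * soft_loss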

For LLMs the same idea applies per token: the teacher's next-token distribution is the soft target. In practice teams mix hard and soft labels — recent work argues the gain from mixing comes less from "matching the teacher better" and more from reducing exposure bias (the train/inference distribution mismatch). The point: this is normal, published, peer-reviewed engineering.
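As a rough illustration of the per-token version, the sketch below averages a temperature-scaled KL term over the positions of a sequence. It uses only the standard library, the function names are hypothetical, and the logits are toy values over a four-token vocabulary. A real training loop would compute this on tensors and also handle padding and masking.

```python
import math

def log_softmax(logits, T=1.0):
    """Log-probabilities of softmax(logits / T), computed stably."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_total for z in scaled]

def token_level_kd_loss(student_logits, teacher_logits, T=2.0):
    """Mean over positions of T^2 * KL(teacher || student) on softened distributions.

    Both arguments are lists of per-position logit vectors over the same vocabulary.
    """
    if len(student_logits) != len(teacher_logits):
        raise ValueError("sequences must have the same length")
    total = 0.0
    for s_pos, t_pos in zip(student_logits, teacher_logits):
        s_logp = log_softmax(s_pos, T)
        t_logp = log_softmax(t_pos, T)
        # KL(p || q) = sum_i p_i * (log p_i - log q_i), with p = teacher, q = student.
        total += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
    return (T ** 2) * total / len(teacher_logits)

# Toy example: 3 token positions, vocabulary of 4 tokens (all values made up).
teacher = [[4.0, 1.0, 0.5, -2.0], [0.0, 3.0, 2.5, -1.0], [1.0, 1.0, 5.0, 0.0]]
student = [[3.0, 1.5, 0.0, -1.0], [0.5, 2.0, 2.0, -0.5], [0.0, 2.0, 4.0, 0.5]]

print(token_level_kd_loss(student, teacher))
print(token_level_kd_loss(teacher, teacher))  # identical distributions give (near) zero
```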

And labs distill their own models all the time. The cheap, fast variant of a flagship model that you actually get to call in production? Very often a distilled student. Anthropic itself, in the middle of its own complaint about Chinese firms, acknowledged that AI companies routinely distill their own models to make smaller, cheaper versions.

Here's the part the headlines skip. Everything above assumes you have the teacher's logits — the raw output distribution. That's white-box distillation, and it requires access to the model's internals or at least its full probability outputs.

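You do not get logits from a closed commercial API like Claude or GPT. You get text. That forces black-box (a.k.a. sequence-level) distillation: you send prompts to the teacher, collect its text responses, and fine-tune the student on those prompt-response pairs.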

You lose the dark knowledge in the soft labels, but it turns out you can get remarkably far just by training on a large, high-quality synthetic dataset generated by a strong teacher. This is exactly why "did model X learn from model Y's outputs?" is such a live and hard-to-prove question — the evidence isn't a stolen weights file, it's statistical fingerprints in behavior (a model that randomly claims to be ChatGPT, mirrors another model's quirks, etc.).
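The data-collection half of black-box distillation can be sketched in a few lines. Everything named here is hypothetical: `query_teacher` stands in for whatever client returns text from the teacher, `fake_teacher` is a stub so the example runs offline, and the prompt/response record layout is just one plausible choice.

```python
import json
from typing import Callable, Dict, Iterable, List

def build_distillation_dataset(
    prompts: Iterable[str],
    query_teacher: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Collect (prompt, response) pairs from a text-only teacher."""
    records = []
    for prompt in prompts:
        response = query_teacher(prompt)
        if response.strip():  # skip empty responses
            records.append({"prompt": prompt, "response": response})
    return records

def to_jsonl(records: List[Dict[str, str]]) -> str:
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

# Hypothetical stand-in for a real API client, so the sketch runs offline.
def fake_teacher(prompt: str) -> str:
    return f"Answer to: {prompt}"

dataset = build_distillation_dataset(
    ["What is KL divergence?", "Define softmax."], fake_teacher
)
print(to_jsonl(dataset))
```

The student would then be fine-tuned on these pairs with ordinary next-token cross-entropy. Only the sampled text is available, so the soft-label term from the white-box setup has no counterpart here.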

|  | White-box | Black-box (closed API) |
| --- | --- | --- |
| Needs | Logits / weights | Just text outputs |
| Signal richness | High (full distribution) | Lower (final answers) |
| Feasible against a closed model? | No | Yes |
| What the China allegations are about |  | This one |

Strip the drama and here's the documented timeline:

Two things matter here, and most coverage gets them backwards:

The question everyone asks is how much time and money this takes, and the honest answer is: dramatically less than training from scratch, which is the entire economic motive, but precise figures for any specific alleged case are not public. Anyone quoting you an exact "they did it in N days for $M" is guessing.

What we can say structurally:

So: distillation makes a strong-ish student fast and cheap. It does not let you leapfrog past the teacher — a student is generally capped by the teacher it learned from. You don't distill your way to the frontier; you distill your way to a cheap copy of someone else's.

If you want the deep technical version of any of these — the math of temperature scaling, why mixing hard and soft labels beats either alone, or how behavioral fingerprinting tries to detect distillation — let me know in the comments.
