Every few weeks a headline drops: "Chinese lab distilled a frontier model from OpenAI / Anthropic." Cue the comments — half the thread thinks distillation is a synonym for theft, the other half thinks it's some exotic Chinese trick.
Both are wrong. Distillation is one of the most boring, well-established techniques in deep learning, and the labs raising the alarms use it on their own models constantly. The actual controversy is narrower and more interesting than the headlines. Let's separate the engineering from the geopolitics.
Knowledge distillation trains a small student model to imitate a large teacher model. The classic framing comes from Hinton et al. (2015): instead of training the student only on ground-truth labels, you also train it to match the teacher's output distribution.
Why does that help? Because the teacher's full probability distribution carries far more information than the single correct answer. If a teacher classifies an image of a dog, it might output `dog: 0.9, wolf: 0.08, cat: 0.001`. That "dog and wolf are similar, cat is not" signal, which Hinton called dark knowledge, is exactly what a small model struggles to learn from hard labels alone.
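There are two kinds of training signal: hard labels (the single ground-truth answer) and soft labels (the teacher's full probability distribution).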
The trick is temperature. You divide the logits by a temperature `T > 1` before the softmax, which flattens the distribution and exposes those small-but-meaningful probabilities the student should learn from.
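To see the flattening effect in numbers, here is a minimal sketch in plain Python. The logits are made up for illustration (they are not taken from any real model), and `softmax_with_temperature` is a helper name introduced here, not a library function.

```python
import math


def softmax_with_temperature(logits, T=1.0):
    """Return softmax(logits / T) as a list of probabilities."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical teacher logits for the classes (dog, wolf, cat).
teacher_logits = [6.0, 3.5, -1.0]

sharp = softmax_with_temperature(teacher_logits, T=1.0)
soft = softmax_with_temperature(teacher_logits, T=4.0)

for name, p1, p4 in zip(["dog", "wolf", "cat"], sharp, soft):
    print(f"{name:>4}: T=1 -> {p1:.3f}   T=4 -> {p4:.3f}")
```

At the higher temperature the top class gives up probability mass to the others, while the ranking of the classes stays the same. That redistributed mass is the signal the student learns from.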
The loss is a blend of two terms: a standard cross-entropy against the real labels, and a KL-divergence pulling the student's softened distribution toward the teacher's.
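```python
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Hard-label term: ordinary cross-entropy against the ground truth.
    hard_loss = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence between temperature-softened distributions.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    # Multiply by T^2 so the soft term's gradients keep a comparable scale as T changes.
    soft_loss = F.kl_div(student_log_probs, soft_targets, reduction="batchmean") * (T ** 2)
    return alpha * hard_loss + (1 - alpha) * soft_loss
```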
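Written as a formula, the function above computes the following. The notation is introduced here for illustration: $z_s$ and $z_t$ are the student and teacher logits, $y$ is the ground-truth label, and $\sigma$ is the softmax.

$$
\mathcal{L} = \alpha \, \mathrm{CE}\big(y, \sigma(z_s)\big) + (1 - \alpha) \, T^2 \, \mathrm{KL}\big(\sigma(z_t / T) \,\|\, \sigma(z_s / T)\big)
$$

The $T^2$ factor is there because the gradients of the softened term shrink roughly as $1/T^2$. Multiplying by $T^2$ keeps the balance between the two terms roughly stable when you change the temperature, as Hinton et al. note.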
For LLMs the same idea applies per token: the teacher's next-token distribution is the soft target. In practice teams mix hard and soft labels — recent work argues the gain from mixing comes less from "matching the teacher better" and more from reducing exposure bias (the train/inference distribution mismatch). The point: this is normal, published, peer-reviewed engineering.
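To make the per-token version concrete, here is a rough sketch in plain Python. It assumes you already have teacher and student logits for every position of a sequence, which is a white-box assumption; the function names and the toy numbers are invented for illustration, and a real pipeline would do this on tensors rather than lists.

```python
import math


def log_softmax(logits, T=1.0):
    """Numerically stable log-softmax of logits / T."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_total for z in scaled]


def token_level_kd_loss(student_logits, teacher_logits, mask, T=2.0):
    """Mean KL(teacher || student) over unmasked token positions, times T^2.

    student_logits, teacher_logits: one list of vocabulary logits per position.
    mask: 1 for real tokens, 0 for padding positions to ignore.
    """
    total, count = 0.0, 0
    for s_pos, t_pos, keep in zip(student_logits, teacher_logits, mask):
        if not keep:
            continue
        s_logp = log_softmax(s_pos, T)
        t_logp = log_softmax(t_pos, T)
        # KL(p_teacher || p_student) at this position.
        total += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
        count += 1
    return (T ** 2) * total / max(count, 1)


# Toy example: 3 positions, vocabulary of 4 tokens; the last position is padding.
teacher = [[4.0, 1.0, 0.5, -2.0], [0.2, 3.0, 0.1, 0.0], [0.0, 0.0, 0.0, 0.0]]
student = [[2.0, 1.5, 0.0, -1.0], [0.0, 1.0, 0.5, 0.0], [5.0, 0.0, 0.0, 0.0]]
mask = [1, 1, 0]

print(f"token-level KD loss: {token_level_kd_loss(student, teacher, mask):.4f}")
```

The mask matters in practice: padded positions carry no information, so they are left out of both the sum and the average.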
And labs distill their own models all the time. The cheap, fast variant of a flagship model that you actually get to call in production? Very often a distilled student. Anthropic itself, in the middle of its own complaint about Chinese firms, acknowledged that AI companies routinely distill their own models to make smaller, cheaper versions.
Here's the part the headlines skip. Everything above assumes you have the teacher's logits — the raw output distribution. That's white-box distillation, and it requires access to the model's internals or at least its full probability outputs.
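You do not get logits from a closed commercial API like Claude or GPT. You get text. That forces black-box (a.k.a. sequence-level) distillation: you send prompts to the teacher, collect its text responses, and fine-tune the student on those prompt-response pairs.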
You lose the dark knowledge in the soft labels, but it turns out you can get remarkably far just by training on a large, high-quality synthetic dataset generated by a strong teacher. This is exactly why "did model X learn from model Y's outputs?" is such a live and hard-to-prove question — the evidence isn't a stolen weights file, it's statistical fingerprints in behavior (a model that randomly claims to be ChatGPT, mirrors another model's quirks, etc.).
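The data-collection half of that black-box recipe can be sketched in a few lines. This is an illustration under stated assumptions, not a description of what any lab actually ran: `query_teacher` stands for whatever call returns the teacher's text, `fake_teacher` is a canned stand-in so the sketch runs offline, and the JSON Lines record layout is just one common choice.

```python
import io
import json
from typing import Callable, Iterable, TextIO


def build_distillation_dataset(
    prompts: Iterable[str],
    query_teacher: Callable[[str], str],
    out: TextIO,
) -> int:
    """Write prompt/response pairs as JSON Lines; return how many were kept."""
    kept = 0
    for prompt in prompts:
        response = query_teacher(prompt).strip()
        if not response:  # skip empty generations
            continue
        record = {"prompt": prompt, "response": response}
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
        kept += 1
    return kept


def fake_teacher(prompt: str) -> str:
    # Hypothetical stand-in for a real API call; returns canned text.
    return f"(teacher answer to: {prompt})"


buffer = io.StringIO()
n = build_distillation_dataset(
    ["What is KL divergence?", "Explain softmax."], fake_teacher, buffer
)
print(n, "examples written")
```

The student is then fine-tuned on these pairs with ordinary next-token cross-entropy. There is no KL term, because there is no teacher distribution to match, only the teacher's final text.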
| | White-box | Black-box (closed API) |
|---|---|---|
| Needs | Logits / weights | Just text outputs |
| Signal richness | High (full distribution) | Lower (final answers) |
| Feasible against a closed model? | No | Yes |
| What the China allegations are about | — | This one |
Strip the drama and here's the documented timeline:
Two things matter here, and most coverage gets them backwards:
This is the question everyone asks, and the honest answer is: dramatically less than training from scratch — which is the entire economic motive — but precise figures for any specific alleged case are not public. Anyone quoting you an exact "they did it in N days for $M" is guessing.
What we can say structurally:
So: distillation makes a strong-ish student fast and cheap. It does not let you leapfrog past the teacher — a student is generally capped by the teacher it learned from. You don't distill your way to the frontier; you distill your way to a cheap copy of someone else's.
If you want the deep technical version of any of these — the math of temperature scaling, why mixing hard and soft labels beats either alone, or how behavioral fingerprinting tries to detect distillation — let me know in the comments.