How Model Distillation Actually Works (and What the 'China Distilled Our Model' Headlines Really Mean)

Source: dev.to · Published 2026-05-29T12:11Z

Model distillation is a standard deep learning technique that trains a smaller "student" model to imitate a larger "teacher" model by matching the teacher's full probability distribution, not just the correct answer. Contrary to headlines that frame it as theft or an exotic trick, the method is well-established and routinely used by AI labs themselves to create cheaper, faster production models. The actual controversy involves "black-box" distillation, where a student is trained on text outputs from a closed API like ChatGPT, losing the teacher's internal "dark knowledge" but still producing statistically similar behavior.

Every few weeks a headline drops: "Chinese lab distilled a frontier model from OpenAI / Anthropic." Cue the comments — half the thread thinks distillation is a synonym for theft, the other half thinks it's some exotic Chinese trick.

Both are wrong. Distillation is one of the most boring, well-established techniques in deep learning, and the labs raising the alarms use it on their own models constantly. The actual controversy is narrower and more interesting than the headlines. Let's separate the engineering from the geopolitics.

Knowledge distillation trains a small student model to imitate a large teacher model. The classic framing comes from Hinton et al. (2015): instead of training the student only on ground-truth labels, you also train it to match the teacher's output distribution.

Why does that help? Because the teacher's full probability distribution carries far more information than the single correct answer. If a teacher classifies an image of a dog, it might output dog: 0.9, wolf: 0.08, cat: 0.001. That "dog and wolf are similar, cat is not" signal, which Hinton called dark knowledge, is exactly what a small model struggles to learn from hard labels alone.

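There are two kinds of training signal: hard labels (the single ground-truth class for each example) and soft labels (the teacher's full probability distribution over all classes).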

The trick is temperature. You divide the logits by a temperature T > 1 before the softmax, which flattens the distribution and exposes those small-but-meaningful probabilities the student should learn from.
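To make the flattening effect concrete, here is a minimal sketch using only the standard library. The logit values are made up for illustration (they are not from any real model), and `softmax_with_temperature` is a helper name chosen here, not a library function.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax over a list of logits after dividing each by temperature T."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for the classes (dog, wolf, cat).
teacher_logits = [6.0, 3.5, -1.0]

sharp = softmax_with_temperature(teacher_logits, T=1.0)
soft = softmax_with_temperature(teacher_logits, T=4.0)

for name, p1, p4 in zip(("dog", "wolf", "cat"), sharp, soft):
    print(f"{name}: T=1 -> {p1:.3f}, T=4 -> {p4:.3f}")
```

At the higher temperature the ranking of the classes is unchanged, but "wolf" and "cat" receive visibly more probability mass, which is the part of the signal the student would otherwise barely see.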

The loss is a blend of two terms: a standard cross-entropy against the real labels, and a KL-divergence pulling the student's softened distribution toward the teacher's.

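```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Hard-label term: ordinary cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    # Soft-label term: KL divergence between temperature-softened distributions.
    # F.kl_div expects log-probabilities as input and probabilities as target.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    # Multiplying by T^2 keeps the soft-term gradients on a scale comparable to the hard term.
    soft_loss = F.kl_div(student_log_probs, soft_targets, reduction="batchmean") * (T ** 2)

    # alpha weights the hard-label term, (1 - alpha) the soft-label term.
    return alpha * hard_loss + (1 - alpha) * soft_loss
```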
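Written out as a formula, the function above computes roughly the following. The notation is introduced here for illustration: $z_s$ and $z_t$ are the student and teacher logits, $\sigma$ is the softmax, $y$ is the ground-truth label, and $\alpha$ and $T$ are the same `alpha` and `T` as in the code.

```latex
\mathcal{L}
  = \alpha \, \mathrm{CE}\big(y,\ \sigma(z_s)\big)
  + (1 - \alpha) \, T^{2} \,
    \mathrm{KL}\big(\sigma(z_t / T) \,\big\|\, \sigma(z_s / T)\big)
```

The $T^{2}$ factor is there because the gradients of the softened term shrink roughly as $1/T^{2}$; multiplying by $T^{2}$ keeps the balance between the two terms about the same when you change the temperature.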

For LLMs the same idea applies per token: the teacher's next-token distribution is the soft target. In practice teams mix hard and soft labels — recent work argues the gain from mixing comes less from "matching the teacher better" and more from reducing exposure bias (the train/inference distribution mismatch). The point: this is normal, published, peer-reviewed engineering.
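As a rough illustration of "per token", the sketch below averages the softened KL term over sequence positions using plain Python lists. It is a toy under stated assumptions: the function names and the logit values are invented here, and a real training loop would use batched tensors, masking for padding, and usually a hard-label term as well.

```python
import math

def log_softmax(logits, T=1.0):
    """Log-probabilities of softmax(logits / T), computed stably."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_total for z in scaled]

def token_level_kd_loss(student_seq, teacher_seq, T=2.0):
    """Mean over positions of T^2 * KL(teacher || student).

    Each argument is a list holding one logit vector (over the vocabulary)
    per sequence position.
    """
    total = 0.0
    for s_logits, t_logits in zip(student_seq, teacher_seq):
        s_logp = log_softmax(s_logits, T)
        t_logp = log_softmax(t_logits, T)
        # KL(p_t || p_s) = sum_i p_t[i] * (log p_t[i] - log p_s[i])
        total += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
    return (T ** 2) * total / len(teacher_seq)

# Toy example: 2 positions, vocabulary of 3 tokens (made-up numbers).
teacher = [[4.0, 1.0, 0.0], [0.5, 3.0, 0.2]]
student = [[2.0, 1.5, 0.0], [0.0, 1.0, 0.9]]

print(token_level_kd_loss(student, teacher))   # positive: distributions differ
print(token_level_kd_loss(teacher, teacher))   # zero: identical distributions
```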

And labs distill their own models all the time. The cheap, fast variant of a flagship model that you actually get to call in production? Very often a distilled student. Anthropic itself, in the middle of its own complaint about Chinese firms, acknowledged that AI companies routinely distill their own models to make smaller, cheaper versions.

Here's the part the headlines skip. Everything above assumes you have the teacher's logits — the raw output distribution. That's white-box distillation, and it requires access to the model's internals or at least its full probability outputs.

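You do not get logits from a closed commercial API like Claude or GPT. You get text. That forces black-box (a.k.a. sequence-level) distillation: you send prompts to the teacher, collect its text responses, and fine-tune the student on those prompt-response pairs.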

You lose the dark knowledge in the soft labels, but it turns out you can get remarkably far just by training on a large, high-quality synthetic dataset generated by a strong teacher. This is exactly why "did model X learn from model Y's outputs?" is such a live and hard-to-prove question — the evidence isn't a stolen weights file, it's statistical fingerprints in behavior (a model that randomly claims to be ChatGPT, mirrors another model's quirks, etc.).
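The data-collection half of black-box distillation can be sketched in a few lines. Everything named here is hypothetical: `query_teacher` stands for whatever text-in, text-out call a teacher exposes, `fake_teacher` is a stand-in so the example runs offline, and the `prompt`/`completion` record layout is just one common convention, not any vendor's format.

```python
import json
from typing import Callable, Dict, Iterable, List

def build_sft_dataset(
    prompts: Iterable[str],
    query_teacher: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Collect (prompt, completion) pairs from a text-only teacher."""
    records = []
    for prompt in prompts:
        completion = query_teacher(prompt)
        if completion.strip():  # skip empty responses
            records.append({"prompt": prompt, "completion": completion})
    return records

def to_jsonl(records: List[Dict[str, str]]) -> str:
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

def fake_teacher(prompt: str) -> str:
    # Stand-in for a real API call; not any vendor's actual client.
    return f"(teacher answer to: {prompt})"

dataset = build_sft_dataset(
    ["What is KL divergence?", "Explain softmax."],
    fake_teacher,
)
print(to_jsonl(dataset))
```

The student is then fine-tuned on these pairs with ordinary next-token cross-entropy. No teacher probabilities are involved at any point, which is why only the final text, and none of the soft-label signal, reaches the student.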

| | White-box | Black-box (closed API) |
|---|---|---|
| Needs | Logits / weights | Just text outputs |
| Signal richness | High (full distribution) | Lower (final answers) |
| Feasible against a closed model? | No | Yes |
| What the China allegations are about | - | This one |

Strip the drama and here's the documented timeline:

Two things matter here, and most coverage gets them backwards:

This is the question everyone asks, and the honest answer is: dramatically less than training from scratch — which is the entire economic motive — but precise figures for any specific alleged case are not public. Anyone quoting you an exact "they did it in N days for $M" is guessing.

What we can say structurally:

So: distillation makes a strong-ish student fast and cheap. It does not let you leapfrog past the teacher — a student is generally capped by the teacher it learned from. You don't distill your way to the frontier; you distill your way to a cheap copy of someone else's.

If you want the deep technical version of any of these — the math of temperature scaling, why mixing hard and soft labels beats either alone, or how behavioral fingerprinting tries to detect distillation — let me know in the comments.


Read original on dev.to → dev.to/p0rt/how-model-distillation-actually-work…
