# Somehow, more on distillation

> Source: <https://ianbarber.blog/2026/06/05/somehow-more-on-distillation/>
> Published: 2026-06-05 14:45:57+00:00

The capabilities in a large language model emerge, mysteriously, from the training data. Everyone agrees, roughly, that you start with a big pile of data, add some compute, and at the end you can vibe code. Opinions differ on what that pile of data should look like.

Microsoft AI recently released an incredibly in-depth [technical report](https://microsoft.ai/wp-content/uploads/2026/06/main_20260602_2.pdf) about the development of their first model, MAI-Thinking-1. Shortly after, Nvidia released their latest open model, Nemotron 3 Ultra, accompanied by another detailed [deep dive](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Ultra-Technical-Report.pdf). The two approach data from somewhat contrasting directions.

Nemotron is maximally distillation-pilled. Almost every corpus in their post-training stack comes from someone else’s model: math and science from DeepSeek V4 Pro, code and kernels from GPT-OSS and DeepSeek R1, chat from GLM-5, terminal traces from DeepSeek V3.2, SWE from MiniMax and Qwen3-Coder. The general reasoning teacher is trained to match DeepSeek V4 on a mixture that DeepSeek V4 generated. Even the pre-training data is 22% synthetic web crawl, plus synthetic QA, legal and fact-seeking sets [1].

This approach to data is what you do when the capabilities are the feature, not the product. Nvidia is a GPU company. It wants the behaviors and intelligence to be as widely available as possible. Now you can use them in the original models, or use them in an open, American-made package that runs beautifully on Blackwell. The model is a *vehicle* for inference, and as a vehicle it is excellent: strong, remarkably open, and incredibly well-tuned.

At the other end of the scale is Microsoft AI, who are working from rather different principles. They want capabilities that are learned, and can be predicted. They want inputs they can control, and can carefully ladder up.

This does not mean they disavow synthetic data: MAI self-distill, generate synthetic SWE envs and tool-calling, and create synthetic instruction-following rubrics and guidelines. There is plenty of model-generated data in the mix, especially in post-training where they train a bunch of specialists then distill them into the final model [2]. What they largely avoid is data from *third-party models*, particularly in pre- and mid-training [3].

Their goal isn’t just to get intelligence out into the world, it’s to build frontier capabilities themselves, and to sell them to enterprises. For that, you need a reproducible ladder you can actually climb (the paper refers to their whole process as a ‘hill climbing machine’). They spend a lot of energy on provenance: the corpus is (human-generated) publicly available and licensed data, and they specifically strip out AI-generated and other questionable material. They put an intense amount of effort into fitting scaling laws to a ladder of small models, judging every change by how much more baseline compute you’d need to reach the same quality, and whether those gains persist as you scale.
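To make "how much more baseline compute you’d need" concrete, here is a minimal sketch of that kind of accounting. It is not the report's method: it assumes the baseline ladder follows a plain power law, `loss = a * compute ** -b`, and every number, along with the function names `fit_power_law` and `compute_multiplier`, is hypothetical and made up for illustration.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) - b * log(compute). Returns (a, b)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

def compute_multiplier(a, b, compute_spent, new_loss):
    """Baseline compute needed to reach new_loss, divided by the compute actually spent."""
    baseline_compute_needed = (new_loss / a) ** (-1.0 / b)
    return baseline_compute_needed / compute_spent

# Hypothetical baseline ladder: four small models at increasing compute (FLOPs).
ladder_compute = [1e18, 1e19, 1e20, 1e21]
ladder_loss = [3.0 * (c / 1e18) ** -0.05 for c in ladder_compute]
a, b = fit_power_law(ladder_compute, ladder_loss)

# Hypothetical changed recipe, run at two rungs. By construction it reaches the
# loss the baseline would only reach with twice the compute.
changed_loss = {c: 3.0 * (2 * c / 1e18) ** -0.05 for c in (1e19, 1e20)}
multipliers = {c: compute_multiplier(a, b, c, l) for c, l in changed_loss.items()}

for c, m in multipliers.items():
    print(f"compute {c:.0e}: change is worth {m:.2f}x baseline compute")
```

If the multiplier holds steady from rung to rung, the gain is persisting with scale; if it shrinks toward 1.0 as compute grows, the change was mostly helping small models.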

They have to *prove* their models are getting better, entirely on their own terms. They trust the model because they tested exhaustively, and because they verified their tests hold with scale.

Nemotron has the opposite problem, and the opposite solution. They can see how their model is doing against the suite of models they are leveraging. Their risk is not hygiene, it is overfitting to their sources, so they spend effort on ensuring generalization. Evals like PinchBench (an OpenClaw-based eval, naturally) and ProfBench are held back as gates: evaluated only after the final model and never used in development. Tasks are trained under some harnesses and then checked on ones the model hasn’t seen before. They trust the model because it clears bars it was only introduced to at test time.

If your data is clean, and you can see all of it, you predict the model before you train it and confirm your forecast. If the distribution is one you can’t deeply inspect, you instead start trying to break the thing in novel ways.

Both seem to work! There are surprisingly few apples-to-apples benchmarks between the papers, but LiveCodeBench has them in similar territory: 89.0 for Nemotron, 87.7 for MAI. Researcher decisions might shape the language model, but they are also shaped by the business model.

- [1] That, to their credit, they [released](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Legal-v1).
- [2] Nemotron does a similar thing with their MOPD (multi-teacher on-policy distillation) technique. DeepSeek v4 was the first place I recall seeing this idea of train a bunch of specialists and then distill into the final model, but it seems to be another of the emerging best practices.
- [3] Even post-training really only uses external models, mainly GPT 5, for grading.
