{"slug": "somehow-more-on-distillation", "title": "Somehow, more on distillation", "summary": "Microsoft AI released a detailed technical report on the development of its first model, MAI-Thinking-1, emphasizing a controlled, reproducible training process built on human-generated data and proprietary scaling laws. In contrast, Nvidia released its open Nemotron 3 Ultra model, which relies heavily on distillation from third-party models like DeepSeek and GPT-OSS to maximize broad capability and performance on its hardware. The two approaches reflect differing business goals: Microsoft aims to build frontier capabilities for enterprise customers, while Nvidia uses its model as a vehicle to drive inference on its Blackwell GPUs.", "body_md": "The capabilities in a large language model emerge, mysteriously, from the training data. Everyone agrees, roughly, that you start with a big pile of data, add some compute, and at the end you can vibe code. Opinions differ on what that pile of data should look like.\n\nMicrosoft AI recently released an incredibly in-depth [technical report](https://microsoft.ai/wp-content/uploads/2026/06/main_20260602_2.pdf) about the development of their first model, MAI-Thinking-1. Shortly after, Nvidia released their latest open model, Nemotron 3 Ultra, accompanied by another detailed [deep dive](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Ultra-Technical-Report.pdf). The two approach data from somewhat contrasting directions.\n\nNemotron is maximally distillation-pilled. Almost every corpus in their post-training stack comes from someone else’s model: math and science from DeepSeek V4 Pro, code and kernels from GPT-OSS and DeepSeek R1, chat from GLM-5, terminal traces from DeepSeek V3.2, SWE from MiniMax and Qwen3-Coder. The general reasoning teacher is trained to match DeepSeek V4 on a mixture that DeepSeek V4 generated. 
Even the pre-training data is 22% synthetic web crawl, plus synthetic QA, legal and fact-seeking sets.[1]\n\nThis approach to data is what you do when the capabilities are the feature, not the product. Nvidia is a GPU company. It wants the behaviors and intelligence to be as widely available as possible. Now you can use them in the original models, or use them in an open, American-made package, which runs beautifully on Blackwell. The model is a *vehicle* for inference, and as a vehicle it is excellent: strong, remarkably open, and incredibly well-tuned.\n\nAt the other end of the scale is Microsoft AI, who are working from rather different principles. They want capabilities that are learned, and can be predicted. They want inputs they can control, and can carefully ladder up.\n\nThis does not mean they disavow synthetic data: MAI self-distill, generate synthetic SWE envs and tool-calling, and create synthetic instruction-following rubrics and guidelines. There is plenty of model-generated data in the mix, especially in post-training where they train a bunch of specialists then distill them into the final model.[2] What they largely avoid is data from *third-party models*, particularly in pre- and mid-training.[3]\n\nTheir goal isn’t just to get intelligence out into the world, it’s to build frontier capabilities themselves, and to sell them to enterprises. 
For that, you need a reproducible ladder you can actually climb (the paper refers to their whole process as a ‘hill climbing machine’). They spend a lot of energy on provenance: the corpus is (human-generated) publicly available and licensed data, and they specifically strip out AI-generated and other questionable material. They put an intense amount of effort into fitting scaling laws to a ladder of small models, judging every change by how much more baseline compute you’d need to reach the same quality, and whether those gains persist as you scale.\n\nThey have to *prove* their models are getting better, entirely on their own terms. They trust the model because they tested exhaustively, and because they verified their tests hold with scale.\n\nNemotron has the opposite problem, and the opposite solution. They can see how their model is doing against the suite of models they are leveraging. Their risk is not hygiene, it is overfitting to their sources, so they spend effort on ensuring generalization. Evals like PinchBench (an OpenClaw-based eval, naturally) and ProfBench are held back as gates: evaluated only after the final model and never used in development. Tasks are trained under some harnesses and then checked on ones the model hasn’t seen before. They trust the model because it clears bars it was only introduced to at test time.\n\nIf your data is clean, and you can see all of it, you predict the model before you train it and confirm your forecast. If the distribution is one you can’t deeply inspect, you instead start trying to break the thing in novel ways.\n\nBoth seem to work! There are surprisingly few apples-to-apples benchmarks between the papers, but LiveCodeBench has them in similar territory: 89.0 for Nemotron, 87.7 for MAI. Researcher decisions might shape the language model, but they are also shaped by the business model.\n\n- [1] That, to their credit, they [released](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Legal-v1). 
- [2] Nemotron does a similar thing with their MOPD (multi-teacher on-policy distillation) technique. DeepSeek V4 was the first place I recall seeing this idea of training a bunch of specialists and then distilling them into the final model, but it seems to be another of the emerging best practices.\n- [3] Even post-training really only uses external models, mainly GPT 5, for grading.", "url": "https://wpnews.pro/news/somehow-more-on-distillation", "canonical_source": "https://ianbarber.blog/2026/06/05/somehow-more-on-distillation/", "published_at": "2026-06-05 14:45:57+00:00", "updated_at": "2026-06-05 14:54:25.102576+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "machine-learning", "ai-research", "ai-infrastructure"], "entities": ["Microsoft AI", "MAI-Thinking-1", "Nvidia", "Nemotron 3 Ultra", "DeepSeek V4 Pro", "GPT-OSS", "DeepSeek R1", "GLM-5"], "alternates": {"html": "https://wpnews.pro/news/somehow-more-on-distillation", "markdown": "https://wpnews.pro/news/somehow-more-on-distillation.md", "text": "https://wpnews.pro/news/somehow-more-on-distillation.txt", "jsonld": "https://wpnews.pro/news/somehow-more-on-distillation.jsonld"}}