{"slug": "progressive-distillation", "title": "Progressive Distillation", "summary": "Knowledge distillation can transfer performance from larger AI models to smaller ones, enabling local deployment with lower computational costs. Progressive distillation incrementally transfers knowledge through a series of teacher models into a compact student model, as demonstrated by building a 250K parameter classifier from an 11M parameter model.", "body_md": "Now that almost everyone has thought about or is actively integrating AI workflows into their projects, some might ask: is this all worth the cost? Many think the current economics of the AI space don't scale and expect prices to rise. Others still might not be comfortable with sending their data to remote services for processing. Then there is the crowd that wants to deploy models in small spaces with limited compute.\n\nAre there ways we can deploy small models locally and run at a lower cost? Yes, with [Knowledge Distillation](https://en.wikipedia.org/wiki/Knowledge_distillation). Knowledge distillation can get a bad rap due to its questionable use in training some Large Language Models (LLMs). But it's a perfectly valid way to transfer performance from a larger model to a smaller one, especially when both models are yours and/or open.\n\nThis article will explore progressive distillation, which is a technique to incrementally transfer knowledge from a series of larger teacher models into a smaller student.\n\nInstall `txtai` and all dependencies.\n\n```\npip install txtai[pipeline-train] datasets\n```\n\nThe first step is to set up the training pipeline. 
We'll use the Hugging Face Training framework to build a series of models.\n\nThe following code defines a `train` method and a `test` method, then loads the classification training data.\n\n``` python\nfrom datasets import load_dataset\nfrom transformers import AutoModelForSequenceClassification, AutoTokenizer\nfrom txtai.pipeline import HFTrainer, Labels\n\ndef train(teacher, student, distillation, **kwargs):\n    trainer = HFTrainer()\n\n    # Load the student model and tokenizer\n    model = AutoModelForSequenceClassification.from_pretrained(student, trust_remote_code=True)\n    tokenizer = AutoTokenizer.from_pretrained(student, trust_remote_code=True)\n\n    # Train the student, distilling from the teacher when one is provided\n    return trainer(\n        (model, tokenizer),\n        ds[\"train\"],\n        columns=(\"sentence\", \"label\"),\n        maxlength=maxlength,\n        teacher=teacher,\n        distillation=distillation,\n        **kwargs\n    )\n\ndef test(model):\n    labels = Labels(model, dynamic=False, trust_remote_code=True)\n\n    # Determine accuracy on the validation (dev) set\n    results = [row[\"label\"] == labels(row[\"sentence\"], max_length=maxlength)[0][0] for row in ds[\"validation\"]]\n    print(sum(results) / len(ds[\"validation\"]))\n\n# Hugging Face dataset\nds = load_dataset(\"nyu-mll/glue\", \"sst2\")\nmaxlength = 128\n```\n\nWe're going to build a [bert-hash-femto](https://huggingface.co/NeuML/bert-hash-femto) classifier, which is an extremely small 250K parameter model. This model was pretrained using the same recipe as [BERT](https://arxiv.org/abs/1810.04805). The paper [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962) established that small models perform better with Knowledge Distillation tasks when they are pretrained. 
[This article](https://huggingface.co/blog/NeuML/bert-hash-embeddings) is also a good source for more information on the topic.\n\nThe next series of steps will do the following:\n\n- Distill `assemblyai/bert-large-uncased-sst2` into `google/bert_uncased_L-4_H-256_A-4`, saved as `bert-mini-sst2`\n- Distill `bert-mini-sst2` into `google/bert_uncased_L-2_H-128_A-2`, saved as `bert-tiny-sst2`\n- Distill `bert-tiny-sst2` into `neuml/bert-hash-femto`, saved as `bert-femto-sst2`\n\nThis path goes from an 11M parameter model -> 4.4M parameter model -> 250K parameter model.\n\n```\ntest(train(\n    \"assemblyai/bert-large-uncased-sst2\",\n    \"google/bert_uncased_L-4_H-256_A-4\",\n    (1.0, 1.0),\n    learning_rate=1e-4,\n    num_train_epochs=5,\n    per_device_train_batch_size=32,\n    output_dir=\"bert-mini-sst2\"\n))\n[transformers] BertForSequenceClassification LOAD REPORT from: google/bert_uncased_L-4_H-256_A-4\nKey                                        | Status     | \n-------------------------------------------+------------+-\ncls.predictions.transform.dense.weight     | UNEXPECTED | \ncls.predictions.decoder.bias               | UNEXPECTED | \ncls.predictions.transform.LayerNorm.bias   | UNEXPECTED | \ncls.predictions.decoder.weight             | UNEXPECTED | \ncls.seq_relationship.bias                  | UNEXPECTED | \ncls.seq_relationship.weight                | UNEXPECTED | \ncls.predictions.transform.dense.bias       | UNEXPECTED | \ncls.predictions.bias                       | UNEXPECTED | \ncls.predictions.transform.LayerNorm.weight | UNEXPECTED | \nclassifier.weight                          | MISSING    | \nclassifier.bias                            | MISSING    | \n\nNotes:\n- UNEXPECTED:   can be ignored when loading from different task/architecture; not ok if you expect identical arch.\n- MISSING:  those params were newly initialized because missing from the checkpoint. 
Consider training on your downstream task.\n\n0.875\n\ntest(train(\n    \"bert-mini-sst2\",\n    \"google/bert_uncased_L-2_H-128_A-2\",\n    (1.0, 1.0),\n    learning_rate=1e-4,\n    num_train_epochs=5,\n    per_device_train_batch_size=32,\n    output_dir=\"bert-tiny-sst2\"\n))\n[transformers] BertForSequenceClassification LOAD REPORT from: google/bert_uncased_L-2_H-128_A-2\nKey                                        | Status     | \n-------------------------------------------+------------+-\ncls.predictions.transform.dense.weight     | UNEXPECTED | \ncls.predictions.transform.LayerNorm.bias   | UNEXPECTED | \ncls.seq_relationship.weight                | UNEXPECTED | \ncls.seq_relationship.bias                  | UNEXPECTED | \ncls.predictions.transform.dense.bias       | UNEXPECTED | \ncls.predictions.bias                       | UNEXPECTED | \ncls.predictions.transform.LayerNorm.weight | UNEXPECTED | \nclassifier.weight                          | MISSING    | \nclassifier.bias                            | MISSING    | \n\nNotes:\n- UNEXPECTED:   can be ignored when loading from different task/architecture; not ok if you expect identical arch.\n- MISSING:  those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.\n\n0.8325688073394495\n\ntest(train(\n    \"bert-tiny-sst2\",\n    \"neuml/bert-hash-femto\",\n    (1.0, 1.0),\n    learning_rate=3e-4,\n    num_train_epochs=5,\n    per_device_train_batch_size=32,\n    output_dir=\"bert-femto-sst2\"\n))\n[transformers] BertHashForSequenceClassification LOAD REPORT from: neuml/bert-hash-femto\nKey                      | Status  | \n-------------------------+---------+-\nbert.pooler.dense.bias   | MISSING | \nbert.pooler.dense.weight | MISSING | \nclassifier.bias          | MISSING | \nclassifier.weight        | MISSING | \n\nNotes:\n- MISSING:  those params were newly initialized because missing from the checkpoint. 
Consider training on your downstream task.\n\n0.8084862385321101\n```\n\nLet's look at the performance. The 11M parameter model registered an accuracy score of `0.8750` on the SST2 dev set. Then the 4.4M parameter model scored `0.8326`, and finally the tiny femto model scored `0.8085`. As expected, accuracy drops at each step because each successive model has less capacity. Let's train a femto model directly to compare.\n\n```\ntest(train(\n    None,\n    \"neuml/bert-hash-femto\",\n    None,\n    learning_rate=3e-4,\n    num_train_epochs=5,\n    per_device_train_batch_size=32,\n    output_dir=\"bert-femto-sst2\"\n))\n[transformers] BertHashForSequenceClassification LOAD REPORT from: neuml/bert-hash-femto\nKey                      | Status  | \n-------------------------+---------+-\nbert.pooler.dense.bias   | MISSING | \nbert.pooler.dense.weight | MISSING | \nclassifier.bias          | MISSING | \nclassifier.weight        | MISSING | \n\nNotes:\n- MISSING:  those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.\n\n0.801605504587156\n```\n\nThis scored `0.8016`, about 0.7 percentage points below the progressively distilled model. Keep in mind the femto model only has `250K` parameters, so this is a measurable accuracy boost within its capabilities.\n\nThis example showed how progressive distillation can boost overall model performance, especially for tiny models. Incrementally compressing knowledge into a series of smaller models enabled the final model to learn more effectively than training it directly on the dataset without distillation. 
Put progressive distillation in the toolkit when working with tiny models!", "url": "https://wpnews.pro/news/progressive-distillation", "canonical_source": "https://dev.to/neuml/progressive-distillation-341i", "published_at": "2026-05-31 12:37:43+00:00", "updated_at": "2026-05-31 12:42:04.786128+00:00", "lang": "en", "topics": ["machine-learning", "artificial-intelligence", "large-language-models", "ai-tools", "ai-research"], "entities": ["txtai", "Hugging Face"], "alternates": {"html": "https://wpnews.pro/news/progressive-distillation", "markdown": "https://wpnews.pro/news/progressive-distillation.md", "text": "https://wpnews.pro/news/progressive-distillation.txt", "jsonld": "https://wpnews.pro/news/progressive-distillation.jsonld"}}