Progressive Distillation

Knowledge distillation can transfer performance from larger AI models to smaller ones, enabling local deployment with lower computational costs. Progressive distillation incrementally transfers knowledge through a series of teacher models into a compact student model, as demonstrated by building a 250K parameter classifier from an 11M parameter model.

Now that almost everyone has thought about or is actively integrating AI workflows into their projects, some might ask: is this all worth the cost? Many think the current economics of the AI space don't scale and that prices will rise. Others may not be comfortable sending their data to remote services for processing. Then there is the crowd that wants to deploy models in small spaces with limited compute.

Are there ways to deploy small models locally and run them at a lower cost? Yes, with Knowledge Distillation (https://en.wikipedia.org/wiki/Knowledge_distillation).

Knowledge distillation can get a bad rap due to its questionable use in training some Large Language Models (LLMs). But it's a perfectly valid way to transfer performance from a larger model to a smaller one, especially when both models are yours and/or open.

This article explores progressive distillation, a technique that incrementally transfers knowledge from a series of larger teacher models into a smaller student.

Install txtai and all dependencies.

```
pip install txtai[pipeline-train] datasets
```

The first step is to set up the training pipeline. We'll use the Hugging Face training framework to build a series of models. The following code establishes a train method and a test method, and loads the classification training data.
python from datasets import load dataset from transformers import AutoModelForSequenceClassification, AutoTokenizer from txtai.pipeline import HFTrainer, Labels def train teacher, student, distillation, kwargs : trainer = HFTrainer model = AutoModelForSequenceClassification.from pretrained student, trust remote code=True tokenizer = AutoTokenizer.from pretrained student, trust remote code=True return trainer model, tokenizer , ds "train" , columns= "sentence", "label" , maxlength=maxlength, teacher=teacher, distillation=distillation, kwargs def test model : labels = Labels model, dynamic=False, trust remote code=True Determine accuracy on test set results = row "label" == labels row "sentence" , max length=maxlength 0 0 for row in ds "validation" print sum results / len ds "validation" Hugging Face dataset ds = load dataset "nyu-mll/glue", "sst2" maxlength = 128 We're going to build a bert-hash-femto https://huggingface.co/NeuML/bert-hash-femto classifier which is a extremely small 250K parameter model. This model was pretrained using the same recipe as BERT https://arxiv.org/abs/1810.04805 . The paper Well-Read Students Learn Better: On the Importance of Pre-training Compact Models https://arxiv.org/abs/1908.08962 established that small models perform better with Knowledge Distillation tasks when they are pretrained. This article https://huggingface.co/blog/NeuML/bert-hash-embeddings is also a good source for more information on the topic. 
The next series of steps distills step by step: each trained student becomes the teacher for the next, smaller student. Starting with the assemblyai/bert-large-uncased-sst2 teacher, this path goes from an 11M parameter model -> 4.4M parameter model -> 250K parameter model.

```python
test(train(
    "assemblyai/bert-large-uncased-sst2", "google/bert_uncased_L-4_H-256_A-4", (1.0, 1.0),
    learning_rate=1e-4, num_train_epochs=5, per_device_train_batch_size=32, output_dir="bert-mini-sst2"
))
```

```
BertForSequenceClassification LOAD REPORT from: google/bert_uncased_L-4_H-256_A-4
Key                                        | Status     |
-------------------------------------------+------------+-
cls.predictions.transform.dense.weight     | UNEXPECTED |
cls.predictions.decoder.bias               | UNEXPECTED |
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED |
cls.predictions.decoder.weight             | UNEXPECTED |
cls.seq_relationship.bias                  | UNEXPECTED |
cls.seq_relationship.weight                | UNEXPECTED |
cls.predictions.transform.dense.bias       | UNEXPECTED |
cls.predictions.bias                       | UNEXPECTED |
cls.predictions.transform.LayerNorm.weight | UNEXPECTED |
classifier.weight                          | MISSING    |
classifier.bias                            | MISSING    |

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING: those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.

0.875
```

```python
test(train(
    "bert-mini-sst2", "google/bert_uncased_L-2_H-128_A-2", (1.0, 1.0),
    learning_rate=1e-4, num_train_epochs=5, per_device_train_batch_size=32, output_dir="bert-tiny-sst2"
))
```

```
BertForSequenceClassification LOAD REPORT from: google/bert_uncased_L-2_H-128_A-2
Key                                        | Status     |
-------------------------------------------+------------+-
cls.predictions.transform.dense.weight     | UNEXPECTED |
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED |
cls.seq_relationship.weight                | UNEXPECTED |
cls.seq_relationship.bias                  | UNEXPECTED |
cls.predictions.transform.dense.bias       | UNEXPECTED |
cls.predictions.bias                       | UNEXPECTED |
cls.predictions.transform.LayerNorm.weight | UNEXPECTED |
classifier.weight                          | MISSING    |
classifier.bias                            | MISSING    |

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING: those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.

0.8325688073394495
```

```python
test(train(
    "bert-tiny-sst2", "neuml/bert-hash-femto", (1.0, 1.0),
    learning_rate=3e-4, num_train_epochs=5, per_device_train_batch_size=32, output_dir="bert-femto-sst2"
))
```

```
BertHashForSequenceClassification LOAD REPORT from: neuml/bert-hash-femto
Key                      | Status  |
-------------------------+---------+-
bert.pooler.dense.bias   | MISSING |
bert.pooler.dense.weight | MISSING |
classifier.bias          | MISSING |
classifier.weight        | MISSING |

Notes:
- MISSING: those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.

0.8084862385321101
```

Let's look at the performance. The 11M parameter model registered an accuracy of 0.8750 on the SST-2 dev set. The 4.4M parameter model then scored 0.8326, and finally the tiny femto model scored 0.8085. Each score is lower than the last, which is expected since each model has less capacity than the one before it. Let's train a femto model directly to compare.
```python
test(train(
    None, "neuml/bert-hash-femto", None,
    learning_rate=3e-4, num_train_epochs=5, per_device_train_batch_size=32, output_dir="bert-femto-sst2"
))
```

```
BertHashForSequenceClassification LOAD REPORT from: neuml/bert-hash-femto
Key                      | Status  |
-------------------------+---------+-
bert.pooler.dense.bias   | MISSING |
bert.pooler.dense.weight | MISSING |
classifier.bias          | MISSING |
classifier.weight        | MISSING |

Notes:
- MISSING: those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.

0.801605504587156
```

This scored 0.8016, down from 0.8085 for the progressively distilled model, a gap of about 0.7 percentage points. Keep in mind the femto model only has 250K parameters, so distillation gave it an accuracy boost within its capabilities.

This example showed how progressive distillation can boost overall model performance, especially for tiny models. Incrementally compressing knowledge into a series of smaller models enabled the final model to learn more effectively than training it directly on the dataset without distillation.

Put progressive distillation in the toolkit when working with tiny models.
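The distillation calls in this article follow one pattern: each trained student's output directory is passed as the teacher for the next, smaller student. A minimal sketch of that pattern as a loop is shown below. `progressive_distill`, `stages` and `train_fn` are hypothetical names introduced for illustration and are not part of txtai; `train_fn` is assumed to take the same arguments as the `train` function defined earlier. To keep the example runnable without downloading models, it uses a stand-in `record_train` that only records which teacher is paired with which student.

```python
def progressive_distill(teacher, stages, train_fn, distillation=(1.0, 1.0)):
    """
    Distill down a chain of students (hypothetical helper).

    stages: list of (student_path, output_dir, training_kwargs) tuples,
    ordered from the largest student to the smallest.
    """
    model = None
    for student, output_dir, kwargs in stages:
        model = train_fn(teacher, student, distillation, output_dir=output_dir, **kwargs)
        # The saved student becomes the teacher for the next stage
        teacher = output_dir
    return model

# Students and learning rates taken from the calls in this article
stages = [
    ("google/bert_uncased_L-4_H-256_A-4", "bert-mini-sst2", {"learning_rate": 1e-4}),
    ("google/bert_uncased_L-2_H-128_A-2", "bert-tiny-sst2", {"learning_rate": 1e-4}),
    ("neuml/bert-hash-femto", "bert-femto-sst2", {"learning_rate": 3e-4}),
]

calls = []

def record_train(teacher, student, distillation, **kwargs):
    """Stand-in for train: records the teacher/student pairing instead of training."""
    calls.append((teacher, student))
    return student

final = progressive_distill("assemblyai/bert-large-uncased-sst2", stages, record_train)

for teacher, student in calls:
    print(f"{teacher} -> {student}")
```

Swapping `record_train` for the article's `train` function, and adding the remaining hyperparameters to each stage's keyword arguments, would reproduce the three distillation runs in sequence.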