The Professor: Multi-Teacher Unsupervised Prompt Distillation for Vision-Language Models

Researchers propose TheProfessor, a multi-teacher prompt distillation method for vision-language models that ensembles a domain-finetuned teacher and a zero-shot teacher. On four datasets, confidence-weighted ensembling improves average harmonic mean from 87.52 to 89.28, with the largest gain of +5.78 on domain-shifted EuroSAT.

arXiv:2606.23897v1 Announce Type: new Abstract: Prompt distillation compresses large vision-language models (VLMs) such as CLIP into lightweight student models by matching teacher predictions on unlabeled domain images. PromptKD (CVPR 2024) established this paradigm with a single PromptSRC-finetuned ViT-L/14 teacher and a ViT-B/16 student. We propose TheProfessor, a multi-teacher extension that distills from a fixed two-teacher ensemble: a domain-finetuned PromptSRC ViT-L/14 teacher and a zero-shot EVA-CLIP-L/14 teacher whose logits are pre-computed per dataset. We evaluate single-teacher PromptKD, equal-probability ensembling, and confidence-weighted ensembling on four base-to-novel datasets: Caltech-101, DTD, UCF101, and EuroSAT. In a 12-run single-seed sweep, confidence-weighted ensembling improves average HM from 87.52 to 89.28 (+1.77 points), while equal averaging improves average HM to 88.88 (+1.37 points). Gains are dataset dependent: they are negligible on Caltech-101 (+0.16 HM for confidence weighting), modest on UCF101 (+0.62), and largest on domain-shifted EuroSAT (+5.78). These results update our earlier Caltech-only analysis and show that multi-teacher prompt distillation is most useful when the second teacher contributes complementary supervision under domain shift.
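
The two ensembling strategies named in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not state the exact weighting formula, so the rule below (each teacher weighted by its maximum softmax probability, normalized per sample) is an assumption for illustration, not necessarily the method used in TheProfessor. The function names and the toy logits are likewise hypothetical.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax over the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def equal_ensemble(logits_a, logits_b):
    """Equal-probability ensembling: average the two teachers' class distributions."""
    return 0.5 * (softmax(logits_a) + softmax(logits_b))


def confidence_weighted_ensemble(logits_a, logits_b):
    """Per-sample weighting of each teacher's distribution.

    ASSUMPTION: confidence is the maximum softmax probability, and the two
    confidences are normalized to sum to 1 for each sample. The source
    abstract does not specify this rule.
    """
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    conf_a = p_a.max(axis=-1, keepdims=True)
    conf_b = p_b.max(axis=-1, keepdims=True)
    w_a = conf_a / (conf_a + conf_b)
    return w_a * p_a + (1.0 - w_a) * p_b


# Toy batch of one image and three classes (hypothetical values):
# teacher A is confident in class 0, teacher B is uninformative.
logits_a = np.array([[5.0, 0.0, 0.0]])
logits_b = np.array([[0.0, 0.0, 0.0]])

target_equal = equal_ensemble(logits_a, logits_b)
target_weighted = confidence_weighted_ensemble(logits_a, logits_b)
```

Under this assumed rule, the weighted target leans toward whichever teacher is more confident on a given image, whereas equal averaging always splits the difference. Either target could then serve as the soft label that the student's predictions are matched against.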