{"slug": "large-language-model-teaches-visual-students-cross-modality-transfer-of-fine", "title": "Large Language Model Teaches Visual Students: Cross-Modality Transfer of Fine-Grained Conceptual Knowledge", "summary": "Researchers propose LaViD, a framework that transfers semantic knowledge from a large language model to a vision-only student model using multiple-choice questions, outperforming existing distillation methods on fine-grained benchmarks and improving robustness to spurious correlations.", "body_md": "Large Language Models (LLMs) possess broad conceptual knowledge acquired through large-scale text pretraining, yet their potential to supervise models in other modalities remains underexplored. In this work, we propose LaViD--Language-to-Visual Knowledge Distillation--a simple and effective framework for transferring high-level semantic knowledge from a language-only teacher to a vision-only student model. Instead of relying on paired multimodal data, LaViD elicits conceptual signals from an LLM by prompting it to generate multiple-choice questions (MCQs) that probe semantic distinctions between visual classes. Each class is mapped to a soft label distribution over these MCQs, forming a rich conceptual signature that guides the student through an auxiliary distillation loss. Notably, despite using a language-only teacher without access to image data, LaViD consistently outperforms recent methods like MaKD that distill from vision-language models across multiple fine-grained benchmarks. It also achieves competitive or superior performance compared to state-of-the-art visual distillation methods such as DKD and MLKD, with further gains when combined with logit standardization. On the Waterbirds dataset, LaViD substantially improves worst-group accuracy, demonstrating enhanced robustness to spurious correlations with distillation. Code is available at https://github.com/lliangthomas/lavid.", "url": "https://wpnews.pro/news/large-language-model-teaches-visual-students-cross-modality-transfer-of-fine", "canonical_source": "https://arxiv.org/abs/2606.27527", "published_at": "2026-06-29 04:00:00+00:00", "updated_at": "2026-06-29 04:02:56.977283+00:00", "lang": "en", "topics": ["large-language-models", "computer-vision", "machine-learning", "ai-research"], "entities": ["LaViD", "MaKD", "DKD", "MLKD", "Waterbirds"], "alternates": {"html": "https://wpnews.pro/news/large-language-model-teaches-visual-students-cross-modality-transfer-of-fine", "markdown": "https://wpnews.pro/news/large-language-model-teaches-visual-students-cross-modality-transfer-of-fine.md", "text": "https://wpnews.pro/news/large-language-model-teaches-visual-students-cross-modality-transfer-of-fine.txt", "jsonld": "https://wpnews.pro/news/large-language-model-teaches-visual-students-cross-modality-transfer-of-fine.jsonld"}}