{"slug": "rt-vla-real-time-vision-language-action-models-via-knowledge-distillation", "title": "RT-VLA: Real-Time Vision-Language-Action Models via Knowledge Distillation", "summary": "Researchers propose RT-VLA, a lightweight vision-language-action model for autonomous driving that uses knowledge distillation from the larger SimLingo model to achieve real-time performance. RT-VLA reduces inference latency by up to 44.8x while maintaining competitive driving and reasoning capabilities, enabling explainable AI in safety-critical driving scenarios.", "body_md": "arXiv:2606.14010v1\nAbstract: Vision-Language-Action (VLA) models have shown strong potential for end-to-end autonomous driving by jointly modeling visual perception, language reasoning, explainability and action prediction. However, their large vision-language backbones and reasoning modules introduce substantial inference latency and thereby prevent their deployment in the unforgiving reality of the road networks. We propose RT-VLA, a lightweight, distilled VLA model that transfers the driving and reasoning capabilities of the state-of-the-art SimLingo model into a compact student through multi-level supervised distillation. RT-VLA preserves language-based reasoning and supports post-hoc explanation through offline language analysis of safety-critical driving moments without adding latency to real-time control. Compared to the SimLingo teacher, RT-VLA maintains competitive closed-loop driving and language reasoning performance while reducing inference time by 44.8X in vision-only mode and 7.9X in vision+language mode. These results suggest that supervised distillation is a practical approach for building real-time, explainable VLA-style autonomous driving models.", "url": "https://wpnews.pro/news/rt-vla-real-time-vision-language-action-models-via-knowledge-distillation", "canonical_source": "https://arxiv.org/abs/2606.14010", "published_at": "2026-06-15 04:00:00+00:00", "updated_at": "2026-06-15 04:14:13.452938+00:00", "lang": "en", "topics": ["autonomous-vehicles", "large-language-models", "computer-vision", "ai-research", "ai-safety"], "entities": ["RT-VLA", "SimLingo", "arXiv"], "alternates": {"html": "https://wpnews.pro/news/rt-vla-real-time-vision-language-action-models-via-knowledge-distillation", "markdown": "https://wpnews.pro/news/rt-vla-real-time-vision-language-action-models-via-knowledge-distillation.md", "text": "https://wpnews.pro/news/rt-vla-real-time-vision-language-action-models-via-knowledge-distillation.txt", "jsonld": "https://wpnews.pro/news/rt-vla-real-time-vision-language-action-models-via-knowledge-distillation.jsonld"}}