{"slug": "beyond-self-attention-sub-quadratic-vision-transformers-for-fast-image", "title": "Beyond Self-Attention: Sub-Quadratic Vision Transformers for Fast Image Captioning", "summary": "Researchers proposed a sub-quadratic vision transformer for image captioning that replaces standard self-attention with a Gaussian Mixture Model-based clustering mechanism, reducing computational complexity from O(n^2) to O(nK). The model, evaluated on the Flickr 30K dataset, achieved competitive performance while improving efficiency.", "body_md": "arXiv:2606.14753v1 Announce Type: new\nAbstract: Image captioning is a challenging and significant task that aims to generate coherent and semantically meaningful textual descriptions for given images. The task requires a deep understanding of visual content along with the ability to express that understanding in natural language. Despite remarkable progress with transformer-based architectures, existing approaches often suffer from limitations such as a lack of rich local feature representations and the high computational cost of quadratic self-attention. The proposed model focuses on improving computational efficiency by restructuring the vision transformer architecture. Specifically, the standard self-attention mechanism in Vision Transformers is replaced with a probabilistic approach based on a Gaussian Mixture Model (GMM), a soft-clustering technique. Instead of computing pairwise attention among all n image patches, the model groups similar patches into a fixed number K of clusters using an Expectation-Maximization (EM) algorithm. This clustering-based mechanism reduces the computational complexity from quadratic O(n^2) to O(nK), which is linear in n for fixed K, where K << n. An autoregressive GPT-based decoder is used for caption generation. 
The model is evaluated on the Flickr 30K dataset, where it achieves competitive performance and improved efficiency relative to existing works.", "url": "https://wpnews.pro/news/beyond-self-attention-sub-quadratic-vision-transformers-for-fast-image", "canonical_source": "https://arxiv.org/abs/2606.14753", "published_at": "2026-06-16 04:00:00+00:00", "updated_at": "2026-06-16 04:19:50.047628+00:00", "lang": "en", "topics": ["computer-vision", "natural-language-processing", "machine-learning", "ai-research"], "entities": ["Flickr 30K", "Gaussian Mixture Model", "Expectation-Maximization", "GPT"], "alternates": {"html": "https://wpnews.pro/news/beyond-self-attention-sub-quadratic-vision-transformers-for-fast-image", "markdown": "https://wpnews.pro/news/beyond-self-attention-sub-quadratic-vision-transformers-for-fast-image.md", "text": "https://wpnews.pro/news/beyond-self-attention-sub-quadratic-vision-transformers-for-fast-image.txt", "jsonld": "https://wpnews.pro/news/beyond-self-attention-sub-quadratic-vision-transformers-for-fast-image.jsonld"}}