Beyond Self-Attention: Sub-Quadratic Vision Transformers for Fast Image Captioning

wpnews.pro

[ARTICLE · art-28925] src=arxiv.org ↗ pub=2026-06-16T04:00Z topic=computer-vision verified=true sentiment=↑ positive

Researchers proposed a sub-quadratic vision transformer for image captioning that replaces standard self-attention with a Gaussian Mixture Model-based clustering mechanism, reducing computational complexity from O(n^2) to O(nK). The model, evaluated on the Flickr 30K dataset, achieved competitive performance while improving efficiency.

Published Jun 16, 2026 · 1 min read

arXiv:2606.14753v1 Announce Type: new Abstract: Image captioning aims to generate coherent, semantically meaningful textual descriptions of images. The task requires a deep understanding of visual content together with the ability to express that understanding in natural language. Despite remarkable progress with transformer-based architectures, existing approaches still have limitations, such as a lack of rich local feature representations and the high computational cost of quadratic self-attention. The proposed model targets computational efficiency by restructuring the vision transformer architecture: the standard self-attention mechanism in Vision Transformers is replaced with a probabilistic mechanism based on a Gaussian Mixture Model (GMM), a soft-clustering technique. Instead of computing pairwise attention among all n image patches, the model groups similar patches into a fixed number K of clusters using an Expectation-Maximization (EM) algorithm. This clustering-based mechanism reduces the computational complexity from quadratic, O(n^2), to O(nK), which is linear in n when K << n. An autoregressive GPT-based decoder generates the caption. The model is evaluated on the Flickr 30K dataset, where the authors report competitive results and a significant improvement over existing work.
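To make the O(nK) claim concrete, here is a minimal NumPy sketch of EM-based soft clustering over patch embeddings. It is not the authors' implementation: the abstract does not give the covariance structure, the initialisation, the number of EM iterations, or how clustered features are passed on, so the isotropic variances, random-patch initialisation, fixed iteration count, and responsibility-weighted reconstruction below are all assumptions, and the function name `gmm_cluster_mixing` is hypothetical.

```python
import numpy as np


def _e_step(x, means, var, log_prior):
    """Return (n, K) responsibilities; cost is O(n * K * d), with no n x n term."""
    d = x.shape[1]
    # Squared distance from every patch to every cluster mean: shape (n, K).
    sq_dist = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    log_p = log_prior - 0.5 * d * np.log(2.0 * np.pi * var) - 0.5 * sq_dist / var
    log_p -= log_p.max(axis=1, keepdims=True)  # stabilise the softmax
    resp = np.exp(log_p)
    return resp / resp.sum(axis=1, keepdims=True), sq_dist


def gmm_cluster_mixing(x, num_clusters=8, num_iters=5, seed=0):
    """Soft-cluster n patch embeddings into K Gaussians with EM.

    x: (n, d) array of patch embeddings.
    Returns (out, resp): out is (n, d), each patch rebuilt as a
    responsibility-weighted mix of cluster means; resp is (n, K).
    """
    n, d = x.shape
    rng = np.random.default_rng(seed)
    # Assumption: start from K distinct patches, unit isotropic variance,
    # uniform mixing weights.
    means = x[rng.choice(n, size=num_clusters, replace=False)].copy()
    var = np.ones(num_clusters)
    log_prior = np.full(num_clusters, -np.log(num_clusters))

    for _ in range(num_iters):
        resp, _ = _e_step(x, means, var, log_prior)
        # M-step: re-estimate means, variances, and mixing weights.
        mass = resp.sum(axis=0) + 1e-8
        means = (resp.T @ x) / mass[:, None]
        sq_dist = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
        var = (resp * sq_dist).sum(axis=0) / (d * mass) + 1e-6
        log_prior = np.log(mass / n)

    # Final E-step so the responsibilities match the final parameters.
    resp, _ = _e_step(x, means, var, log_prior)
    out = resp @ means  # (n, K) @ (K, d): each patch only touches K clusters
    return out, resp


# Example: 196 patches (a 14 x 14 grid) with 64-dimensional embeddings.
patches = np.random.default_rng(1).normal(size=(196, 64))
mixed, resp = gmm_cluster_mixing(patches, num_clusters=8)
```

In this sketch the largest intermediate is the n x K responsibility matrix, so with n = 196 and K = 8 each iteration compares 1,568 patch-cluster pairs rather than the 38,416 patch-patch pairs of full self-attention.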

source & further reading

arxiv.org — original article

Read original on arxiv.org → arxiv.org/abs/2606.14753

mentioned entities

Flickr 30K

Gaussian Mixture Model

Expectation-Maximization

GPT
