UtVAA: Ultra-tiny Vision Transformer with Affix Attention for Mobile Image Classification

Researchers introduced UtVAA, an ultra-tiny Vision Transformer architecture with a novel Affix Attention block for mobile image classification. The smallest variant has 204.67K parameters and 53.95M FLOPs, and the family achieves competitive accuracy on CIFAR-10, CIFAR-100, PlantVillage-Tomato, and SLIF-Tomato. The results indicate that transformer-based models can be redesigned to run on resource-constrained devices without significant loss in discriminative performance.

arXiv:2606.14735v1. Abstract: Vision Transformers (ViTs) have demonstrated strong representation capability in image classification. However, their quadratic self-attention complexity and large parameter counts limit deployment on resource-constrained mobile and edge devices. This paper introduces UtVAA, an ultra-tiny Vision Transformer architecture designed for efficient visual recognition under strict computational budgets. It incorporates a novel Affix Attention block that combines depthwise-pointwise local feature extraction, linear self-attention, coordinate attention for spatial dependency modelling, and a lightweight ternary fusion strategy to integrate local and global representations. In addition, Dilated Bottleneck blocks expand the receptive field using dilated depthwise separable convolutions while maintaining low FLOPs and stable optimisation through residual connections. UtVAA is implemented in scalable Tiny, Medium, and Large variants, with the smallest model containing 204.67K parameters and 53.95M FLOPs. Experimental results on CIFAR-10, CIFAR-100, PlantVillage-Tomato and SLIF-Tomato datasets show that UtVAA achieves competitive accuracy within a sub-million-parameter regime. Overall, the results demonstrate that transformer-based vision models can be redesigned into ultra-tiny architectures without significant loss in discriminative performance, making UtVAA suitable for mobile and edge deployment. Code is available at https://github.com/romiyal/UtVAA
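
To give a feel for why dilated depthwise separable convolutions suit a sub-million-parameter budget, the following minimal Python sketch compares weight counts for a standard convolution and a depthwise separable one, and computes the spatial extent a dilated kernel covers. The function names and the 64-channel, 3 x 3 layer are hypothetical values chosen for illustration; they are not taken from the UtVAA paper or its repository, and biases and normalisation parameters are ignored.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Depthwise k x k conv (one filter per input channel) plus a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


def effective_kernel(k: int, dilation: int) -> int:
    """Spatial extent covered by a k-tap kernel at the given dilation rate."""
    return k + (k - 1) * (dilation - 1)


# Hypothetical layer (assumption, not the paper's configuration).
c_in, c_out, k = 64, 64, 3

std = standard_conv_params(c_in, c_out, k)        # 36864 weights
dws = depthwise_separable_params(c_in, c_out, k)  # 4672 weights

print(f"standard conv:        {std} weights")
print(f"depthwise separable:  {dws} weights ({std / dws:.1f}x fewer)")

# Dilation widens the covered area without adding any weights.
for d in (1, 2, 3):
    e = effective_kernel(k, d)
    print(f"dilation {d}: 3 x 3 kernel covers {e} x {e} pixels")
```

In this toy setting the separable form needs roughly an eighth of the weights, and raising the dilation rate from 1 to 3 stretches a 3 x 3 kernel over a 7 x 7 window at no extra parameter cost, which is the general trade-off the abstract alludes to.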