{"slug": "benchmarking-convolutional-transformer-hybrid-and-vision-language-models-for", "title": "Benchmarking Convolutional, Transformer, Hybrid, and Vision Language Models for Multi Disease Retinal Screening", "summary": "A new study benchmarked twelve deep learning architectures across four model families (convolutional neural networks, vision transformers, hybrid CNN-transformer backbones, and vision-language models) for multi-disease retinal screening using the RFMiD dataset. Attention-based models, including SwinTiny, CoAtNet0, and MaxViTTiny, achieved the strongest binary screening results and improved macro and micro F1 in multi-label classification across 28 disease classes, while vision-language models were competitive with CNN baselines but did not surpass the best transformer and hybrid backbones. The findings provide a reproducible reference for model selection in automated retinal screening, with external validation on Messidor-2 showing AUC ranging from 66.8% to 84.7% for referable diabetic retinopathy.", "body_md": "arXiv:2605.26283v1\nAbstract: Modern deep learning offers powerful tools for automated retinal screening, but it remains unclear how different visual model families compare in realistic multi-disease settings and under domain shift. In this work, we benchmark twelve architectures across four model families: convolutional neural networks, vision transformers, hybrid CNN-transformer backbones, and vision-language models, using the Retinal Fundus Multi-disease Image Dataset (RFMiD). We evaluate two tasks: binary screening for any retinal disease and multi-label classification across 28 disease classes. Using standardized training, calibration, and evaluation protocols, we report AUC, F1, precision, recall, and sensitivity at a clinically relevant operating point with specificity near 80%. On RFMiD, all architectures perform well on binary screening, with AUC above 84%, but attention-based models perform best. 
SwinTiny and the hybrid CoAtNet0 and MaxViTTiny models achieve the strongest binary screening results and improve macro and micro F1 in the multi-label setting. Vision-language models, including CLIP ViT-B/16 and SigLIP-Base384, are competitive with CNN baselines but do not surpass the best transformer and hybrid backbones. In external validation on Messidor-2 for referable diabetic retinopathy, AUC ranges from 66.8% to 84.7%, with hybrid and transformer models again showing strong performance. These results provide a reproducible reference for model selection in multi-disease retinal screening and guide future automated screening tools for clinical deployment.", "url": "https://wpnews.pro/news/benchmarking-convolutional-transformer-hybrid-and-vision-language-models-for", "canonical_source": "https://arxiv.org/abs/2605.26283", "published_at": "2026-05-27 04:00:00+00:00", "updated_at": "2026-05-27 04:27:16.965269+00:00", "lang": "en", "topics": ["computer-vision", "machine-learning", "neural-networks", "artificial-intelligence", "ai-research"], "entities": ["RFMiD", "Messidor-2", "SwinTiny", "CoAtNet0", "MaxViTTiny", "CLIP ViT-B/16", "SigLIP-Base384"], "alternates": {"html": "https://wpnews.pro/news/benchmarking-convolutional-transformer-hybrid-and-vision-language-models-for", "markdown": "https://wpnews.pro/news/benchmarking-convolutional-transformer-hybrid-and-vision-language-models-for.md", "text": "https://wpnews.pro/news/benchmarking-convolutional-transformer-hybrid-and-vision-language-models-for.txt", "jsonld": "https://wpnews.pro/news/benchmarking-convolutional-transformer-hybrid-and-vision-language-models-for.jsonld"}}