From Structure to Synergy: A Survey of Vision-Language Perception Paradigm Evolution in Multimodal Large Language Models

Researchers present the first systematic survey of unified vision-language perception in Multimodal Large Language Models (MLLMs), tracing the paradigm evolution through a five-stage taxonomy and identifying open challenges toward artificial general intelligence.

arXiv:2606.26196v1 Announce Type: new Abstract: Multimodal Large Language Models MLLMs have recently made remarkable progress in unifying vision-language understanding and reasoning, especially following the introduction of models such as OpenAI's O-series and DeepSeek's R-series, which have driven a paradigm shift toward perception-centric intelligence. However, there remains a lack of systematic surveys that examine perception from a truly unified vision-language perspective -- one that treats vision and language as an inseparable modality. Existing reviews are often fragmented, focusing separately on either vision or language, and thus rarely capture the cross-modal evolution of perception as an integrated capability. To bridge this gap, we present the first systematic survey of unified vision-language perception in MLLMs. Specifically, we 1 formalize MLLM perception as an intrinsic, unified vision-language capability analogous to human innate perception, 2 introduce a five-stage taxonomy tracing the paradigm evolution of MLLM perception and survey representative methods and milestones at each phase, and 3 identify open challenges and outline promising research directions toward truly general, unified multimodal intelligence. We hope our study will provide both a foundational understanding and an actionable roadmap to foster further innovation on the path toward artificial general intelligence AGI .