Are We There Yet? Exploring the Capabilities of MLLMs in Assistive AI Applications

Researchers evaluated multimodal large language models (MLLMs) for assistive AI applications, finding they show promise in object recognition and multilingual text reading but have limitations in real-world egocentric tasks. The study used a head-mounted camera system called NetraLink to benchmark state-of-the-art models.

arXiv:2606.25084v1

Abstract: Multimodal Large Language Models (MLLMs) have redefined visual understanding by combining vision encoders with large-scale language models. This unified architecture enables strong performance on tasks like image captioning, visual question answering, and multimodal dialogue, often in zero- and few-shot settings. Their general-purpose capabilities and flexible interfaces make MLLMs a promising foundation for real-world vision-language applications. Assistive AI aims to help users interact with their environments through natural language. These scenarios demand robust visual recognition, contextual reasoning, and multilingual comprehension, capabilities that MLLMs are believed to offer. However, their effectiveness in assistive settings is not yet fully understood. In this work, we explore whether MLLMs can support Assistive AI by evaluating state-of-the-art models on real-world tasks: recognizing everyday objects like currency, answering questions based on scene text, and reading visually presented content across multiple languages. To this end, we developed a system, NetraLink, which uses a head-mounted GoPro to capture real-world egocentric data, and collected a benchmark covering these assistive scenarios. Our findings provide a comprehensive diagnostic of current MLLMs, highlighting their strengths and limitations in enabling assistive technologies grounded in visual perception and language interaction.