{"slug": "can-agents-read-the-room-benchmarking-visual-social-intelligence-in-multimodal", "title": "Can Agents Read the Room? Benchmarking Visual Social Intelligence in Multimodal Simulation", "summary": "Researchers introduced AgentViSS, a benchmark for evaluating visual social intelligence in multimodal agents, featuring 240 scenarios and four role-level tasks. Tests on seven multimodal large language models revealed that while agents excel at role-specific expression, they struggle with interaction regulation and visually grounded outcomes.", "body_md": "arXiv:2606.15152v1 Announce Type: new\nAbstract: Social interaction depends on both language and visible social signals, such as facial expressions, posture, gaze, and emotional shifts. Yet existing social-agent benchmarks are largely text-based and rarely test whether multimodal agents can use visual cues to guide interaction. We introduce AgentViSS, a benchmark evaluating visual social intelligence in multimodal social simulation. It contains 240 scenarios, 585 role instances, and 2,340 role-task instances, combining aligned textual-visual evidence, structured role profiles, and four role-level tasks: expression task, characteristic task, interaction regulation task, and interaction outcome task. Evaluating seven recent MLLMs under verbalized-vision and direct-vision settings reveals a clear gap between local role enactment and interaction management: role-specific expression and conflict handling are near saturation, whereas interaction regulation and visually grounded outcome achievement remain substantially more difficult. 
The code is released at https://github.com/JunsWan/AgentViSS, and the dataset is available at https://huggingface.co/datasets/JunsWan/AgentViSS.", "url": "https://wpnews.pro/news/can-agents-read-the-room-benchmarking-visual-social-intelligence-in-multimodal", "canonical_source": "https://arxiv.org/abs/2606.15152", "published_at": "2026-06-16 04:00:00+00:00", "updated_at": "2026-06-16 04:23:40.267452+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-research", "computer-vision", "ai-agents"], "entities": ["AgentViSS", "arXiv", "GitHub", "Hugging Face"], "alternates": {"html": "https://wpnews.pro/news/can-agents-read-the-room-benchmarking-visual-social-intelligence-in-multimodal", "markdown": "https://wpnews.pro/news/can-agents-read-the-room-benchmarking-visual-social-intelligence-in-multimodal.md", "text": "https://wpnews.pro/news/can-agents-read-the-room-benchmarking-visual-social-intelligence-in-multimodal.txt", "jsonld": "https://wpnews.pro/news/can-agents-read-the-room-benchmarking-visual-social-intelligence-in-multimodal.jsonld"}}