{"slug": "embodied3dbench-benchmarking-low-level-embodied-spatial-intelligence-of-vision", "title": "Embodied3DBench: Benchmarking Low-Level Embodied Spatial Intelligence of Vision Language Models", "summary": "Researchers have introduced Embodied3DBench, a robot-centric benchmark designed to evaluate low-level spatial intelligence in Vision Language Models (VLMs) within embodied 3D environments. The benchmark includes over 21,000 question-answer pairs across six task categories. An evaluation of 13 models shows that current VLMs are relatively strong at high-level spatial reasoning, such as object-to-object positional relations, but remain fragile in interaction-oriented perception such as affordance and grasp point prediction. To address this gap, the team synthesized a training dataset of 1.3 million QA pairs; fine-tuning on it significantly improved low-level spatial intelligence.", "body_md": "arXiv:2605.29074v1 Announce Type: new\nAbstract: Are current Vision Language Models (VLMs) ready to comprehend and reason about complex embodied interactions in 3D environments? We introduce Embodied3DBench, a robot-centric benchmark targeting low-level spatial intelligence in embodied 3D environments. To systematically evaluate these foundational perceptual capabilities, the benchmark includes 6 task categories divided into two core groups: Spatial Structural Understanding (Grounding, Spatial Relation Prediction, and Multi-view Correspondence) and Interaction-Oriented Perception (Affordance Prediction, Grasp Point Prediction, and Trajectory Prediction). The benchmark spans 12 subcategories and contains over 21k high-quality question-answer pairs. We evaluate 13 state-of-the-art models, and the results show that while current models exhibit relatively strong high-level spatial reasoning, such as understanding object-to-object positional relations, they remain fragile in interaction-oriented perception, highlighting a significant lack of robust 3D-aware interaction priors. 
To actively bridge this capability gap revealed by our benchmark, we further synthesize a large-scale training dataset comprising 1.3M QA pairs. Notably, fine-tuning on this dataset yields significant improvements in low-level spatial intelligence. Ultimately, Embodied3DBench fills a critical gap by providing both a systematic evaluation framework and a scalable data solution, setting a clear target for the development of interaction-aware multimodal systems.", "url": "https://wpnews.pro/news/embodied3dbench-benchmarking-low-level-embodied-spatial-intelligence-of-vision", "canonical_source": "https://arxiv.org/abs/2605.29074", "published_at": "2026-05-29 04:00:00+00:00", "updated_at": "2026-05-29 04:15:49.657051+00:00", "lang": "en", "topics": ["computer-vision", "robotics", "artificial-intelligence", "large-language-models", "machine-learning"], "entities": ["Embodied3DBench", "Vision Language Models", "VLMs"], "alternates": {"html": "https://wpnews.pro/news/embodied3dbench-benchmarking-low-level-embodied-spatial-intelligence-of-vision", "markdown": "https://wpnews.pro/news/embodied3dbench-benchmarking-low-level-embodied-spatial-intelligence-of-vision.md", "text": "https://wpnews.pro/news/embodied3dbench-benchmarking-low-level-embodied-spatial-intelligence-of-vision.txt", "jsonld": "https://wpnews.pro/news/embodied3dbench-benchmarking-low-level-embodied-spatial-intelligence-of-vision.jsonld"}}