{"slug": "drivespatial-a-benchmark-for-spatiotemporal-intelligence-in-vlms-for-autonomous", "title": "DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving", "summary": "Researchers have introduced DriveSpatial, a benchmark of 15,600 human-verified question-answer pairs across 20 tasks designed to evaluate spatiotemporal intelligence in vision-language models for autonomous driving. An evaluation of 15 representative models found that the strongest model trails humans by 28.4 points, with cognitive scene construction identified as the primary bottleneck. The findings suggest that current vision-language models lack the scene-construction ability required for reliable spatiotemporal reasoning in driving contexts.", "body_md": "arXiv:2605.23176v1 Announce Type: new\nAbstract: Spatiotemporal intelligence in autonomous driving (AD) requires an agent to integrate multi-view observations into a coherent scene representation, maintain object continuity across viewpoints and time, and reason about spatial relations, interactions, and future dynamics. However, existing AD vision-language benchmarks largely focus on single-view, static, ego-centric, or single-source question answering, leaving it unclear whether current Vision-Language Models (VLMs) can truly construct and reason over dynamic driving scenes. We introduce DriveSpatial, a benchmark of 15.6K human-verified QA pairs across 20 tasks from five large-scale AD datasets. DriveSpatial evaluates four abilities: Cognitive Scene Construction, Multi-view Relational Understanding, Temporal Reasoning, and Generalization. Unlike prior benchmarks, DriveSpatial is generated from a dynamic multi-relational scene graph that encodes object states, spatial relations, interactions, camera visibility, and temporal correspondences, enabling QA pairs that enforce genuine cross-view and spatiotemporal reasoning. 
Evaluating 15 representative VLMs reveals a substantial human-model gap: the strongest model trails humans by 28.4 points, with Cognitive Scene Construction emerging as the key bottleneck. Further diagnostics show that language-only prompting is insufficient, while explicit BEV grounding consistently improves performance. These results suggest that current VLMs lack the scene-construction ability needed for reliable spatiotemporal driving intelligence. DriveSpatial and its construction pipeline will be released to support future research.", "url": "https://wpnews.pro/news/drivespatial-a-benchmark-for-spatiotemporal-intelligence-in-vlms-for-autonomous", "canonical_source": "https://arxiv.org/abs/2605.23176", "published_at": "2026-05-25 04:00:00+00:00", "updated_at": "2026-05-25 15:21:14.877722+00:00", "lang": "en", "topics": ["autonomous-vehicles", "computer-vision", "large-language-models", "artificial-intelligence", "machine-learning"], "entities": ["DriveSpatial", "VLMs", "AD"], "alternates": {"html": "https://wpnews.pro/news/drivespatial-a-benchmark-for-spatiotemporal-intelligence-in-vlms-for-autonomous", "markdown": "https://wpnews.pro/news/drivespatial-a-benchmark-for-spatiotemporal-intelligence-in-vlms-for-autonomous.md", "text": "https://wpnews.pro/news/drivespatial-a-benchmark-for-spatiotemporal-intelligence-in-vlms-for-autonomous.txt", "jsonld": "https://wpnews.pro/news/drivespatial-a-benchmark-for-spatiotemporal-intelligence-in-vlms-for-autonomous.jsonld"}}