
DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving

Researchers have introduced DriveSpatial, a benchmark of 15,600 human-verified question-answer pairs across 20 tasks, designed to evaluate spatiotemporal intelligence in vision-language models for autonomous driving. Among the 15 representative models tested, even the strongest trails human performance by 28.4 points, with cognitive scene construction identified as the primary bottleneck. The findings suggest that current vision-language models lack the scene-construction ability required for reliable spatiotemporal reasoning in driving contexts.

Published May 25, 2026 · 1 min read

arXiv:2605.23176v1. Abstract: Spatiotemporal intelligence in autonomous driving (AD) requires an agent to integrate multi-view observations into a coherent scene representation, maintain object continuity across viewpoints and time, and reason about spatial relations, interactions, and future dynamics. However, existing AD vision-language benchmarks largely focus on single-view, static, ego-centric, or single-source question answering, leaving it unclear whether current Vision-Language Models (VLMs) can truly construct and reason over dynamic driving scenes. We introduce DriveSpatial, a benchmark of 15.6K human-verified QA pairs across 20 tasks from five large-scale AD datasets. DriveSpatial evaluates four abilities: Cognitive Scene Construction, Multi-view Relational Understanding, Temporal Reasoning, and Generalization. Unlike prior benchmarks, DriveSpatial is generated from a dynamic multi-relational scene graph that encodes object states, spatial relations, interactions, camera visibility, and temporal correspondences, enabling QA pairs that enforce genuine cross-view and spatiotemporal reasoning. Evaluating 15 representative VLMs reveals a substantial human-model gap: the strongest model trails humans by 28.4 points, with Cognitive Scene Construction emerging as the key bottleneck. Further diagnostics show that language-only prompting is insufficient, while explicit BEV grounding consistently improves performance. These results suggest that current VLMs lack the scene-construction ability needed for reliable spatiotemporal driving intelligence. DriveSpatial and its construction pipeline will be released to support future research.
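To make the scene-graph idea more concrete, here is a minimal Python sketch of how a QA pair that requires cross-view reasoning might be derived from per-object state and camera visibility. It is an illustration only and not the authors' pipeline, which has not been released at the time of writing. The names `ObjectState`, `SceneGraph`, and `cross_view_qa`, the camera names, the question template, and the ego-frame convention (x forward, y left) are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass(frozen=True)
class ObjectState:
    """One object at one frame (hypothetical schema, not from the paper)."""
    track_id: str         # identity shared across frames and cameras
    category: str         # e.g. "car", "pedestrian"
    frame: int            # frame index
    xy: tuple             # (x forward, y left) in metres, ego frame (assumed)
    cameras: frozenset    # names of cameras in which the object is visible


@dataclass
class SceneGraph:
    """Flat container of object states; relations are derived on demand."""
    states: list = field(default_factory=list)

    def at(self, frame):
        """Return all object states recorded at the given frame."""
        return [s for s in self.states if s.frame == frame]


def cross_view_qa(graph, frame):
    """Build left/right questions for object pairs that share no camera.

    Such a pair cannot be compared within a single image, so answering
    requires combining at least two views.
    """
    qa = []
    for a, b in combinations(graph.at(frame), 2):
        if not a.cameras or not b.cameras:
            continue  # at least one object is not visible in any camera
        if a.cameras & b.cameras:
            continue  # both appear in one camera: a single view suffices
        if a.xy[1] == b.xy[1]:
            continue  # no lateral offset, so left/right is undefined
        cam_a, cam_b = min(a.cameras), min(b.cameras)  # deterministic pick
        answer = "left" if a.xy[1] > b.xy[1] else "right"
        question = (
            f"Is the {a.category} in {cam_a} to the left or right "
            f"of the {b.category} in {cam_b}?"
        )
        qa.append((question, answer))
    return qa
```

Filtering on disjoint camera sets is one simple way to rule out questions that a single image could answer. The benchmark's actual graph also encodes interactions and temporal correspondences, which this sketch leaves out.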
