{"slug": "consistent-yet-wrong-evidence-insensitivity-in-spatial-vision-language-models", "title": "Consistent Yet Wrong: Evidence Insensitivity in Spatial Vision-Language Models", "summary": "Leading vision-language models (VLMs) produce view-invariant and consistent answers to spatial distance queries even when those answers are incorrect, revealing a weak link between predictions and actual visual evidence. Researchers introduced ViewDiag, a multi-view evaluation protocol across 80 scenes, finding that high prediction stability often coincides with substantial error, challenging the assumption that cross-view consistency indicates true geometric understanding. The findings suggest stable predictions may reflect prior-driven collapse rather than evidence-sensitive reasoning, undermining the reliability of VLMs for robotics and embodied AI.", "body_md": "arXiv:2606.02742v1 Announce Type: new\nAbstract: Spatial reasoning is fundamental to robotics, autonomy, and embodied AI, yet modern vision-language models (VLMs) remain unreliable on metric distance queries. A common assumption is that consistent predictions across viewpoints reflect geometric grounding. We test this assumption and find the opposite: leading VLMs often produce view-invariant and consistent answers even when those answers are incorrect, indicating weak coupling between predictions and viewpoint-specific visual evidence.\nWe introduce **ViewDiag**, a controlled multi-view evaluation protocol built from Hypersim, ScanNet, and KITTI360, comprising 176 object-pair tracks across 80 scenes with 2 to 10 views per track. The protocol evaluates models along three axes: metric accuracy, distributional concentration, and a latent feature probe for internal collapse that distinguishes decision collapse from representation collapse. 
Across diverse models, we observe a consistent pattern of high prediction stability paired with substantial error, clustering in a regime characterized by strong consistency but low accuracy.\nThese results challenge the common use of cross-view consistency as a proxy for geometric understanding. Instead, we show that stable predictions may reflect prior-driven collapse rather than evidence-sensitive reasoning. ViewDiag provides a controlled benchmark and diagnostic framework for evaluating spatial VLMs beyond accuracy alone. The code and data can be found [here](https://github.com/SDivakarBhat/Consistent_Yet_Wrong.git).", "url": "https://wpnews.pro/news/consistent-yet-wrong-evidence-insensitivity-in-spatial-vision-language-models", "canonical_source": "https://arxiv.org/abs/2606.02742", "published_at": "2026-06-03 04:00:00+00:00", "updated_at": "2026-06-03 04:18:44.519001+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "computer-vision", "large-language-models", "robotics"], "entities": ["ViewDiag", "Hypersim", "ScanNet", "KITTI360"], "alternates": {"html": "https://wpnews.pro/news/consistent-yet-wrong-evidence-insensitivity-in-spatial-vision-language-models", "markdown": "https://wpnews.pro/news/consistent-yet-wrong-evidence-insensitivity-in-spatial-vision-language-models.md", "text": "https://wpnews.pro/news/consistent-yet-wrong-evidence-insensitivity-in-spatial-vision-language-models.txt", "jsonld": "https://wpnews.pro/news/consistent-yet-wrong-evidence-insensitivity-in-spatial-vision-language-models.jsonld"}}