{"slug": "perceive-interact-reason-building-tool-augmented-visual-agents-for-spatial", "title": "Perceive, Interact, Reason: Building Tool-Augmented Visual Agents for Spatial Reasoning", "summary": "Researchers introduced PERIA, a tool-augmented visual agent that improves spatial reasoning in vision-language models by actively acquiring evidence and performing multi-step visual interaction. The agent uses two lightweight tool families, one for vision perception and one for vision interaction, and is trained with a unified recipe combining supervised tool-use trajectory synthesis, composite rewards, and Observation-Relaxed Group-in-Group Policy Optimization (OR-GIGPO). On 13 benchmarks from 8 datasets, PERIA-8B improves over its Qwen3-8B backbone by 10.0% on in-distribution benchmarks and 4.4% on out-of-distribution benchmarks, and achieves performance comparable to much larger models such as Qwen3-VL-235B-A22B-Thinking and GPT-5.", "body_md": "arXiv:2606.12830v1\nAbstract: While recent vision-language models (VLMs) demonstrate strong multimodal understanding, they remain limited in spatial reasoning tasks that require active evidence acquisition and multi-step visual interaction. This limitation suggests that relying solely on implicit visual representations from vision encoders is insufficient for recovering fine-grained spatial evidence. We introduce PERception-Interaction-reason Agent (PERIA), a tool-augmented visual agent for spatial reasoning tasks across map reasoning, visual probing, and vision reconstruction. PERIA uses two lightweight tool families: vision perception tools for exposing textual, symbolic, and spatial evidence, and vision interaction tools for manipulating visual context, tracing paths, and verifying spatial relations. To train PERIA, we develop a unified recipe that combines supervised tool-use trajectory synthesis, composite rewards, and Observation-Relaxed Group-in-Group Policy Optimization (OR-GIGPO) for effective multi-tool behavior. 
Experiments on 13 benchmarks from 8 datasets show that PERIA-8B improves over the Qwen3-8B backbone by 10.0% on in-distribution benchmarks and 4.4% on out-of-distribution benchmarks, while outperforming previous state-of-the-art baselines of similar size by 7.0%-14.8%. It also achieves performance comparable to much larger models such as Qwen3-VL-235B-A22B-Thinking and GPT-5, demonstrating the effectiveness of PERIA in enhancing spatial reasoning capabilities.", "url": "https://wpnews.pro/news/perceive-interact-reason-building-tool-augmented-visual-agents-for-spatial", "canonical_source": "https://arxiv.org/abs/2606.12830", "published_at": "2026-06-12 04:00:00+00:00", "updated_at": "2026-06-12 04:50:07.163339+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "computer-vision", "ai-agents", "large-language-models"], "entities": ["PERIA", "Qwen3-8B", "Qwen3-VL-235B-A22B-Thinking", "GPT-5"], "alternates": {"html": "https://wpnews.pro/news/perceive-interact-reason-building-tool-augmented-visual-agents-for-spatial", "markdown": "https://wpnews.pro/news/perceive-interact-reason-building-tool-augmented-visual-agents-for-spatial.md", "text": "https://wpnews.pro/news/perceive-interact-reason-building-tool-augmented-visual-agents-for-spatial.txt", "jsonld": "https://wpnews.pro/news/perceive-interact-reason-building-tool-augmented-visual-agents-for-spatial.jsonld"}}