{"slug": "toolgate-token-efficient-pre-call-control-for-tool-augmented-vision-language", "title": "ToolGate: Token-Efficient Pre-Call Control for Tool-Augmented Vision-Language Agents", "summary": "Researchers introduced ToolGate, a lightweight external controller that predicts whether to execute or skip tool calls proposed by vision-language agents, addressing the pre-call control problem where many tool executions are costly and unnecessary. Across five benchmarks, ToolGate reduced token costs to 64-69% of the unrestricted ReAct baseline while preserving average accuracy, and improved accuracy by 1.65 points with matched-domain training on Qwen3-VL-30B. The findings demonstrate that tool-augmented agents benefit from explicit control over when tool outputs are worth executing, not just from better perceptual tools.", "body_md": "arXiv:2606.03054v1 Announce Type: new\nAbstract: Tool-augmented vision-language agents can acquire external perceptual evidence through OCR, detection, segmentation, and other tools, but executing every proposed tool call is costly and sometimes unnecessary. We study the pre-call control problem: after a ReAct-style VLM agent proposes a perceptual tool call, should the call be executed, or skipped before its output enters the context? Across five benchmarks, we find that the baseline agent exhibits poor local selectivity: helpful and harmful calls occur at similar rates (11.8% vs. 9.9%), while most calls do not change the immediate forced-answer prediction. We introduce ToolGate, a lightweight external controller that predicts execute/skip decisions from trajectory text and simple structural features. Across two Qwen3-VL backbones, ToolGate reduces token cost to 64-69% of the unrestricted ReAct baseline while preserving average accuracy in cross-domain settings. With matched-domain trajectory training on Qwen3-VL-30B, it further improves average accuracy by 1.65 points. These results show that tool-augmented VLM agents benefit not only from better perceptual tools, but also from explicit control over when tool outputs are worth paying for.", "url": "https://wpnews.pro/news/toolgate-token-efficient-pre-call-control-for-tool-augmented-vision-language", "canonical_source": "https://arxiv.org/abs/2606.03054", "published_at": "2026-06-03 04:00:00+00:00", "updated_at": "2026-06-03 04:18:28.823576+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "computer-vision", "ai-agents"], "entities": ["ToolGate", "Qwen3-VL", "ReAct", "VLM", "arXiv"], "alternates": {"html": "https://wpnews.pro/news/toolgate-token-efficient-pre-call-control-for-tool-augmented-vision-language", "markdown": "https://wpnews.pro/news/toolgate-token-efficient-pre-call-control-for-tool-augmented-vision-language.md", "text": "https://wpnews.pro/news/toolgate-token-efficient-pre-call-control-for-tool-augmented-vision-language.txt", "jsonld": "https://wpnews.pro/news/toolgate-token-efficient-pre-call-control-for-tool-augmented-vision-language.jsonld"}}