Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents

wpnews.pro

Source: arxiv.org · Published: 2026-06-12T04:00Z · Topic: artificial-intelligence

Researchers introduced Evoflux, an inference-time evolutionary search method that repairs executable tool workflows for compact language model agents. On held-out MCP-Bench tasks spanning live MCP servers and 250 tools, Evoflux raised execution feasibility from roughly 3% to 17-24% across small planners. SFT and SFT+DPO trained on the same search-mined data matched, underperformed, or fell below zero-shot performance, while ReAct reached higher peaks at the cost of higher variance and more tokens. The authors conclude that execution-grounded search is more reliable than teacher-trace distillation when teacher traces are scarce, since small planners must handle tool discovery, parameter validation, and dependency tracking.

1 min read · published Jun 12, 2026

arXiv:2606.12674v1

Abstract: Compact language models (LMs) reduce cost, latency, and deployment risk for tool agents. Yet MCP-style tool use requires more than isolated function calling: an agent must discover tools from live catalogs, satisfy schemas, preserve dependencies across intermediate outputs, and ground final responses in executed evidence. Small planners often generate plausible workflow graphs that fail under tool resolution, parameter validation, dependency tracking, or execution. We argue that this failure mode is poorly handled by small-corpus distillation. A few hundred teacher traces can teach workflow format, but rarely cover the recovery behavior needed to repair failed plans over changing tool catalogs.

We introduce Evoflux, an inference-time evolutionary search method that treats compact tool use as the repair of executable tool workflows. It evolves typed workflow graphs through structured edits, execution feedback, adaptive intensity, meta-guided redesign, and diversity pruning.

On held-out MCP-Bench tasks spanning live MCP servers and 250 tools, Evoflux raises execution feasibility from roughly 3% to 17-24% across small planners. In contrast, SFT and SFT+DPO on the same search-mined data match, underperform, or collapse below zero-shot performance; ReAct reaches higher peaks, but with higher variance and token cost. These results show that execution-grounded search is more reliable under scarce teacher-trace budgets.
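To make the idea of repairing a workflow through structured edits more concrete, here is a minimal Python sketch. It is not the Evoflux implementation, which the abstract does not specify. Everything in it is an assumption made for illustration: the toy `CATALOG`, the `Step` type, the three error kinds (chosen to mirror the failure modes the abstract names), and the use of a static check in place of real execution feedback from live MCP servers. Adaptive intensity and meta-guided redesign are left out, and diversity pruning is reduced to dropping exact duplicates.

```python
import random
from dataclasses import dataclass

# Hypothetical toy catalog: tool name -> required parameter names.
CATALOG = {
    "search_papers": {"query"},
    "fetch_abstract": {"paper_id"},
    "summarize": {"text"},
}

@dataclass(frozen=True)
class Step:
    id: str
    tool: str
    params: tuple  # (name, value) pairs; a value "$<id>" refers to an earlier step's output

def errors(workflow):
    """Static stand-in for execution feedback: list every failing check."""
    found, seen = [], set()
    for i, step in enumerate(workflow):
        required = CATALOG.get(step.tool)
        if required is None:
            found.append((i, "unknown_tool", step.tool))
        else:
            names = {name for name, _ in step.params}
            for missing in sorted(required - names):
                found.append((i, "missing_param", missing))
        for _, value in step.params:
            if value.startswith("$") and value[1:] not in seen:
                found.append((i, "broken_dependency", value))
        seen.add(step.id)
    return found

def mutate(workflow, rng):
    """Apply one structured edit aimed at a randomly chosen failing check."""
    errs = errors(workflow)
    if not errs:
        return workflow
    i, kind, detail = rng.choice(errs)
    step = workflow[i]
    earlier = [s.id for s in workflow[:i]]
    # Rewire to an earlier step when one exists, else fall back to a literal.
    fix = "$" + rng.choice(earlier) if earlier else "placeholder"
    tool, params = step.tool, dict(step.params)
    if kind == "unknown_tool":
        tool = rng.choice(sorted(CATALOG))
    elif kind == "missing_param":
        params[detail] = fix
    elif kind == "broken_dependency":
        params = {n: (fix if v == detail else v) for n, v in params.items()}
    new_step = Step(step.id, tool, tuple(sorted(params.items())))
    return workflow[:i] + (new_step,) + workflow[i + 1:]

def evolve(seed, generations=30, population=8, rng=None):
    """Keep a small pool of candidates and retain those with the fewest errors."""
    rng = rng or random.Random(0)
    pool = [seed]
    for _ in range(generations):
        children = [mutate(rng.choice(pool), rng) for _ in range(population)]
        unique = list(dict.fromkeys(pool + children))  # drop exact duplicates
        pool = sorted(unique, key=lambda w: len(errors(w)))[:population]
        if not errors(pool[0]):
            break
    return pool[0]

seed = (
    Step("s1", "search_paper", (("query", "tool agents"),)),  # misspelled tool name
    Step("s2", "fetch_abstract", (("paper_id", "$s0"),)),     # refers to a step that does not exist
    Step("s3", "summarize", ()),                               # required "text" is missing
)
best = evolve(seed)
print(len(errors(seed)), "->", len(errors(best)))
```

Because the pool always retains its best candidate, the returned workflow never has more failing checks than the seed. Passing these checks only means the plan is well-formed against the toy catalog, not that it accomplishes the task; judging that would require actually executing the tools, which is the signal the abstract describes.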

source & further reading

Read original on arxiv.org → arxiv.org/abs/2606.12674
