{"slug": "are-online-skill-and-memory-modules-always-worth-their-tokens-a-budget-study-of", "title": "Are Online Skill and Memory Modules Always Worth Their Tokens? A Budget-Constrained Study of Web Agents", "summary": "A new study finds that online web agents augmented with memory, workflow, or skill modules often fail to outperform a token-matched vanilla baseline under a fixed inference budget. Testing across three models and multiple domains, researchers show that the apparent gains of augmentation methods vanish when controlling for token costs, and they recommend reporting run-to-run variance as a core evaluation criterion.", "body_md": "arXiv:2606.15017v1 Announce Type: new\nAbstract: Online web agents often augment a base actor with memory, workflow, or skill modules. These modules can improve performance, but they also consume test-time tokens, a cost rarely reported alongside the actor's inference cost. We study online augmentation, where this overhead is paid on every task, and re-evaluate its benefits under a fixed total inference budget. We compare AWM, ASI, and ReasoningBank with a token-matched vanilla baseline that uses the same budget for additional actor steps. Across three WebArena domains and three models, Gemini 3 Flash, GPT-5.4-mini, and Qwen 3.6-27B, the vanilla baseline matches or surpasses all three augmentation methods in aggregate success rate while often using fewer total tokens. We observe a similar trend on WorkArena-L1 with Qwen 3.6-27B, indicating that the effect extends to enterprise knowledge-work tasks. Our results suggest that skills and workflow memory can be useful in specific domains, but their apparent gains often vanish against a budget-matched actor. We further show that run-to-run variance materially affects outcomes and should be reported as a core evaluation criterion for online web agents.", "url": "https://wpnews.pro/news/are-online-skill-and-memory-modules-always-worth-their-tokens-a-budget-study-of", "canonical_source": "https://arxiv.org/abs/2606.15017", "published_at": "2026-06-16 04:00:00+00:00", "updated_at": "2026-06-16 04:22:59.272890+00:00", "lang": "en", "topics": ["large-language-models", "ai-agents", "ai-research"], "entities": ["Gemini 3 Flash", "GPT-5.4-mini", "Qwen 3.6-27B", "WebArena", "WorkArena-L1", "AWM", "ASI", "ReasoningBank"], "alternates": {"html": "https://wpnews.pro/news/are-online-skill-and-memory-modules-always-worth-their-tokens-a-budget-study-of", "markdown": "https://wpnews.pro/news/are-online-skill-and-memory-modules-always-worth-their-tokens-a-budget-study-of.md", "text": "https://wpnews.pro/news/are-online-skill-and-memory-modules-always-worth-their-tokens-a-budget-study-of.txt", "jsonld": "https://wpnews.pro/news/are-online-skill-and-memory-modules-always-worth-their-tokens-a-budget-study-of.jsonld"}}