{"slug": "stateful-inference-for-low-latency-multi-agent-tool-calling", "title": "Stateful Inference for Low-Latency Multi-Agent Tool Calling", "summary": "A stateful inference architecture for multi-agent tool calling reduces per-turn cost from reprocessing the full conversation to ingesting only the new tokens. Against vLLM and SGLang, the reference implementation is 2.1x faster per turn on a 6-turn agentic workflow and 4.2x faster on the median turn of a 35-turn workflow, halving end-to-end wall time. It combines a persistent KV cache, a radix prefix cache, and prompt-lookup speculative decoding, and the authors attribute the gain to stateful reuse and speculation rather than caching.", "body_md": "# Computer Science > Machine Learning\n\n[Submitted on 25 May 2026]\n\n# Stateful Inference for Low-Latency Multi-Agent Tool Calling\n\n[View PDF](/pdf/2605.26289)\n\n[HTML (experimental)](https://arxiv.org/html/2605.26289v1)\n\nAbstract: Multi-agent tool calling is becoming the dominant interaction pattern for LLM-based systems, yet existing inference frameworks treat each tool call as an independent request, re-processing the entire conversation from scratch even though 85-95% of the prompt is unchanged from the previous turn.\n\nWe present a stateful inference architecture that converts the $O(n_t)$ per-turn cost of conventional serving into an $O(\\Delta_t)$ delta-only cost: a persistent KV cache lives across turns and advances by ingesting only the new tokens, while a radix prefix cache extends this across interleaved multi-agent traffic and a prompt-lookup speculative decoder accelerates structured output. Against vLLM and SGLang on novel, fully-generated workloads, the reference implementation is $2.1\\times$ faster per turn on a 6-turn agentic workflow and $4.2\\times$ on the median turn of a 35-turn one, halving end-to-end wall time. 
The advantage comes from stateful reuse and speculation, not caching.", "url": "https://wpnews.pro/news/stateful-inference-for-low-latency-multi-agent-tool-calling", "canonical_source": "https://arxiv.org/abs/2605.26289", "published_at": "2026-05-27 12:05:58+00:00", "updated_at": "2026-05-27 12:17:29.146051+00:00", "lang": "en", "topics": ["large-language-models", "machine-learning", "ai-infrastructure", "ai-agents"], "entities": ["vLLM", "SGLang"], "alternates": {"html": "https://wpnews.pro/news/stateful-inference-for-low-latency-multi-agent-tool-calling", "markdown": "https://wpnews.pro/news/stateful-inference-for-low-latency-multi-agent-tool-calling.md", "text": "https://wpnews.pro/news/stateful-inference-for-low-latency-multi-agent-tool-calling.txt", "jsonld": "https://wpnews.pro/news/stateful-inference-for-low-latency-multi-agent-tool-calling.jsonld"}}