[ARTICLE · art-15232] src=arxiv.org pub= topic=large-language-models verified=true sentiment=↑ positive

Stateful Inference for Low-Latency Multi-Agent Tool Calling

Researchers have developed a stateful inference architecture for multi-agent tool calling that cuts the per-turn cost from reprocessing the full conversation to processing only the newly added tokens. Against vLLM and SGLang, the reference implementation is 2.1x faster per turn on a 6-turn agentic workflow and 4.2x faster on the median turn of a 35-turn workflow, halving end-to-end wall time. The system combines a persistent KV cache, a radix prefix cache, and prompt-lookup speculative decoding; the authors attribute the gain to stateful reuse and speculation rather than to caching.

2 min read · published May 27, 2026
[Submitted on 25 May 2026]


[View PDF](/pdf/2605.26289)

[HTML (experimental)](https://arxiv.org/html/2605.26289v1)

Abstract: Multi-agent tool calling is becoming the dominant interaction pattern for LLM-based systems, yet existing inference frameworks treat each tool call as an independent request, re-processing the entire conversation from scratch even though 85-95% of the prompt is unchanged from the previous turn.
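To make that redundancy concrete, here is a minimal back-of-the-envelope sketch. It counts how many prompt tokens are processed across a conversation when every turn is re-sent in full, versus how many tokens are actually new. The function name `prefill_tokens` and the numbers (a 2,000-token initial prompt, 200 new tokens per turn, 6 turns) are illustrative assumptions, not figures from the paper, and generated tokens are ignored for simplicity.

```python
def prefill_tokens(initial_prompt, delta_per_turn, turns):
    """Count prompt tokens processed over a conversation.

    Returns (reprocessed_total, new_only_total, unchanged_fractions), where
    unchanged_fractions[t] is the share of turn t's prompt that was already
    present in the previous turn. All inputs are hypothetical.
    """
    reprocessed_total = 0   # every turn re-reads the whole conversation
    new_only_total = 0      # only the tokens added since the last turn
    unchanged_fractions = []
    context = 0
    for turn in range(turns):
        added = initial_prompt if turn == 0 else delta_per_turn
        previous = context
        context += added
        reprocessed_total += context
        new_only_total += added
        unchanged_fractions.append(previous / context)
    return reprocessed_total, new_only_total, unchanged_fractions


full, new, fractions = prefill_tokens(initial_prompt=2000, delta_per_turn=200, turns=6)
print(full, new)                              # 15000 3000
print([round(f, 3) for f in fractions[1:]])   # share of each later prompt that is unchanged
```

With these assumed sizes, each prompt after the first is roughly 91-93% unchanged, which falls inside the 85-95% range quoted above. The token counts are not a latency prediction; they only show where the repeated work comes from.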

We present a stateful inference architecture that converts the $O(n_t)$ per-turn cost of conventional serving into an $O(\Delta_t)$ delta-only cost: a persistent KV cache lives across turns and advances by ingesting only the new tokens, while a radix prefix cache extends this across interleaved multi-agent traffic and a prompt-lookup speculative decoder accelerates structured output. Against vLLM and SGLang on novel, fully-generated workloads, the reference implementation is $2.1\times$ faster per turn on a 6-turn agentic workflow and $4.2\times$ on the median turn of a 35-turn one, halving end-to-end wall time. The advantage comes from stateful reuse and speculation, not caching.
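The prompt-lookup speculative decoder mentioned in the abstract can be illustrated with a toy drafter. This sketch is not the paper's implementation. It assumes the common formulation of prompt lookup: the trailing n-gram of the sequence is matched against earlier context, the tokens that followed the match are proposed as a draft, and the model then keeps only the prefix of the draft it agrees with. The names `propose_draft` and `accept_prefix`, the parameters, and the example tokens are all hypothetical, and `model_tokens` stands in for what a real verification pass would produce.

```python
def propose_draft(tokens, ngram=3, max_draft=4):
    """Propose draft tokens by looking up the trailing n-gram in earlier context."""
    if len(tokens) <= ngram:
        return []
    tail = tokens[-ngram:]
    # Scan right to left so the most recent earlier occurrence wins.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[start:start + ngram] == tail:
            return tokens[start + ngram:start + ngram + max_draft]
    return []


def accept_prefix(draft, model_tokens):
    """Keep the longest prefix of the draft that matches the model's own tokens."""
    accepted = []
    for drafted, verified in zip(draft, model_tokens):
        if drafted != verified:
            break
        accepted.append(drafted)
    return accepted


# A repeated tool-call shape: the second call starts like the first one.
history = ['{', '"tool"', ':', '"search"', ',', '"query"', ':', '"cats"', '}',
           '{', '"tool"', ':']
draft = propose_draft(history)
print(draft)  # ['"search"', ',', '"query"', ':']

# Stand-in for the model's verified continuation (an assumption for the demo).
model_tokens = ['"search"', ',', '"query"', ':', '"dogs"']
print(accept_prefix(draft, model_tokens))  # all four drafted tokens are accepted
```

Structured output such as tool-call JSON repeats keys and punctuation from earlier turns, which is why a lookup-based drafter is a plausible fit for this workload. When the draft is wrong from the first token, nothing is accepted and decoding simply proceeds one token at a time.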

