Stateful Inference for Low-Latency Multi-Agent Tool Calling


Source: arxiv.org · Published 2026-05-27T12:05Z · Topic: large-language-models

Researchers have developed a stateful inference architecture for multi-agent tool calling that cuts the per-turn cost from reprocessing the full conversation to processing only the newly added tokens. Against vLLM and SGLang, the reference implementation is 2.1x faster per turn on a 6-turn agentic workflow and 4.2x faster on the median turn of a 35-turn workflow, halving end-to-end wall time. The system combines a persistent KV cache, a radix prefix cache, and prompt-lookup speculative decoding; the authors attribute the gains to stateful reuse and speculation rather than to conventional caching.


[Submitted on 25 May 2026]



Abstract: Multi-agent tool calling is becoming the dominant interaction pattern for LLM-based systems, yet existing inference frameworks treat each tool call as an independent request, re-processing the entire conversation from scratch even though 85-95% of the prompt is unchanged from the previous turn.
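To make the "85-95% unchanged" observation concrete, the following is a minimal sketch of how the reusable share of a prompt could be measured between consecutive turns. It is not taken from the paper: the function names, the integer token IDs, and the 90-token history with 10 new tokens per turn are all hypothetical values chosen for illustration.

```python
def common_prefix_len(prev, curr):
    """Count the leading tokens shared by two token sequences."""
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n


def reuse_stats(turn_prompts):
    """For each turn after the first, return
    (prompt_len, delta_len, unchanged_fraction) relative to the previous turn."""
    stats = []
    for prev, curr in zip(turn_prompts, turn_prompts[1:]):
        shared = common_prefix_len(prev, curr)
        stats.append((len(curr), len(curr) - shared, shared / len(curr)))
    return stats


# Hypothetical conversation: each turn appends 10 new tokens
# (for example a tool result) to the history so far.
turn1 = list(range(90))
turn2 = turn1 + list(range(1000, 1010))
turn3 = turn2 + list(range(2000, 2010))

stats = reuse_stats([turn1, turn2, turn3])
for prompt_len, delta_len, frac in stats:
    print(f"prompt={prompt_len} new={delta_len} unchanged={frac:.1%}")
```

In this toy setup the unchanged fraction grows as the conversation lengthens, because the history grows while the per-turn addition stays fixed.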

We present a stateful inference architecture that converts the $O(n_t)$ per-turn cost of conventional serving into an $O(\Delta_t)$ delta-only cost: a persistent KV cache lives across turns and advances by ingesting only the new tokens, while a radix prefix cache extends this across interleaved multi-agent traffic and a prompt-lookup speculative decoder accelerates structured output. Against vLLM and SGLang on novel, fully-generated workloads, the reference implementation is $2.1\times$ faster per turn on a 6-turn agentic workflow and $4.2\times$ on the median turn of a 35-turn one, halving end-to-end wall time. The advantage comes from stateful reuse and speculation, not caching.
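The two ideas named in the abstract, delta-only ingestion and prompt-lookup drafting, can be illustrated with a small toy model. This is a sketch under stated assumptions, not the paper's reference implementation: `StatefulSession` and `prompt_lookup_draft` are hypothetical names, the "cache" stores token IDs rather than key/value tensors, and the verification step in which the target model accepts or rejects drafted tokens is omitted. The radix prefix cache for interleaved multi-agent traffic is not modeled.

```python
class StatefulSession:
    """Toy stand-in for a persistent per-conversation KV cache.

    Assumption: we store token IDs and count tokens processed,
    instead of holding real key/value tensors.
    """

    def __init__(self):
        self.cached = []
        self.tokens_processed = 0

    def advance(self, prompt):
        """Ingest only the part of `prompt` not already cached.

        Returns the number of newly processed tokens (the delta)."""
        shared = 0
        for a, b in zip(self.cached, prompt):
            if a != b:
                break
            shared += 1
        delta = prompt[shared:]
        # Keep the shared prefix, drop any diverged suffix, append the delta.
        self.cached = self.cached[:shared] + delta
        self.tokens_processed += len(delta)
        return len(delta)


def prompt_lookup_draft(context, ngram=2, max_draft=3):
    """Propose draft tokens by copying what followed the most recent
    earlier occurrence of the last `ngram` tokens in `context`."""
    if len(context) <= ngram:
        return []
    tail = context[-ngram:]
    # Scan backwards, skipping the position of the tail itself.
    for start in range(len(context) - ngram - 1, -1, -1):
        if context[start:start + ngram] == tail:
            return context[start + ngram:start + ngram + max_draft]
    return []


# Hypothetical three-turn conversation: 90 tokens, then +10, then +10.
turn1 = list(range(90))
turn2 = turn1 + list(range(1000, 1010))
turn3 = turn2 + list(range(2000, 2010))

session = StatefulSession()
deltas = [session.advance(p) for p in (turn1, turn2, turn3)]
stateless_cost = sum(len(p) for p in (turn1, turn2, turn3))
print("delta-only tokens:", session.tokens_processed)
print("full reprocessing tokens:", stateless_cost)

# Structured output tends to repeat: a second tool call can be drafted
# from the first one already present in the context.
context = ["{", '"tool"', ":", '"search"', "}", "{", '"tool"']
draft = prompt_lookup_draft(context)
print("draft:", draft)
```

The token counts here only illustrate the difference between per-turn cost proportional to the full prompt and cost proportional to the delta; they say nothing about the measured speedups, which depend on the real system and workloads.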


source & further reading

arxiv.org — original article


Read original on arxiv.org → arxiv.org/abs/2605.26289
