Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery

A study of a production enterprise LLM assistant with 110 agents and 584 tools found that routing F1 on under-specified requests drops by 16-23 percentage points as the catalog scales from 10 to 110 agents. The loss splits into a retrieval gap and a confusion gap. Embedding-based shortlisting recovered 10-11 points of F1 at full scale, and a human annotation study on real traffic confirmed the recovery.

arXiv:2606.17519v1. Abstract: Production LLM assistants route user requests to growing libraries of specialized tools, but how does routing accuracy degrade as the catalog scales? We study single-step routing on a 110-agent, 584-tool catalog from a deployed enterprise productivity assistant, evaluating three frontier models as the catalog grows from 10 to 110 agents. Routing F1 on under-specified requests drops 16-23 percentage points across models. An oracle analysis decomposes the degradation into a retrieval gap (the model cannot surface the right tool) and a confusion gap (even with perfect retrieval, the oracle ceiling drops 10pp). Embedding-based shortlisting recovers +10-11pp F1 at full scale across all three models and two providers. A production annotation study (1,435 human-labeled utterances, three annotators) confirms the recovery on real traffic at +10-17pp despite 10-15pp lower absolute performance.
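
To make the idea of embedding-based shortlisting concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not say which embedding model, similarity measure, or shortlist size was used. The sketch assumes cosine similarity over a toy bag-of-words vector as a stand-in for a real embedding model, and the tool names, descriptions, and the `shortlist` helper are all hypothetical. The point is only the shape of the technique: score every tool description against the request, keep the top k, and hand the router model that reduced catalog instead of the full one.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector.

    A real system would call a learned text-embedding model here (assumption).
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def shortlist(query, catalog, k=3):
    """Return the names of the k tools whose descriptions best match the query."""
    q = embed(query)
    ranked = sorted(
        catalog.items(),
        key=lambda item: cosine(q, embed(item[1])),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical six-tool catalog; the catalog studied has 584 tools.
CATALOG = {
    "calendar.create_event": "create a calendar event or meeting invite",
    "calendar.find_time": "find a free meeting time for attendees on the calendar",
    "email.send": "send an email message to recipients",
    "email.search": "search the mailbox for an email message",
    "files.share": "share a file or document with a colleague",
    "expenses.submit": "submit an expense report for reimbursement",
}

if __name__ == "__main__":
    # Only the shortlisted tools would be shown to the router model.
    print(shortlist("send an email to my manager", CATALOG, k=2))
```

In a pipeline like this, the shortlist size k trades off the two gaps the abstract describes: a small k leaves fewer similar tools to confuse, but raises the chance that the correct tool is cut before the model ever sees it.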