Cost-Optimal LLM Routing with Limited User Feedback under User Satisfaction Guarantees

Researchers introduced SLARouter, an online routing algorithm for large language model (LLM) applications that learns a cost-optimal policy from sparse, one-sided user feedback, with theoretical guarantees for both cost optimality and Service Level Agreement (SLA) compliance. In experiments across a range of LLM benchmarks, it satisfies SLA constraints without per-benchmark tuning and reduces operating cost by up to 2.2x relative to existing baselines, addressing the tension between inference cost and response quality in commercial LLM deployments.

arXiv:2606.19376v1

Abstract: Inference costs for large language model (LLM) applications are rapidly growing, driven by surging demand and rising infrastructure cost. Users expect high-quality responses, and in commercial settings this is formally codified in Service Level Agreements (SLAs), creating a fundamental tension between cost and quality. Recent progress on cost-aware LLM request routing has shown potential to resolve this tension, but existing approaches rely on complete feedback signals, offline training, and extensive per-workload tuning, and most lack SLA guarantees or inference-time adaptivity. We introduce SLARouter, an online routing algorithm that learns a cost-optimal policy from the sparse, one-sided user feedback available in production systems. SLARouter provides theoretical guarantees for both cost optimality and strict SLA compliance. Experiments across a wide range of LLM benchmarks show that SLARouter satisfies SLA constraints without the need for per-benchmark tuning, reducing operating cost by up to 2.2x over existing baselines.
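
The abstract does not spell out how SLARouter works, so the following is only a toy illustration of the general idea of cost-aware routing under a satisfaction constraint, not the paper's algorithm. It assumes that feedback arrives only as occasional complaints (one reading of "one-sided"), that the SLA can be expressed as a maximum complaint rate, and that a trusted fallback model exists. The names `Arm`, `route`, and `record` are hypothetical, and the confidence bound is a standard Hoeffding bound swapped in for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Arm:
    """One candidate model. All fields are illustrative assumptions."""
    name: str
    cost: float          # hypothetical cost per request
    pulls: int = 0       # requests routed to this model so far
    complaints: int = 0  # one-sided feedback: only dissatisfaction is reported

    def complaint_ucb(self, delta: float = 0.05) -> float:
        """Hoeffding upper confidence bound on the true complaint rate."""
        if self.pulls == 0:
            return 1.0  # no evidence yet, so assume the worst
        mean = self.complaints / self.pulls
        radius = math.sqrt(math.log(1.0 / delta) / (2.0 * self.pulls))
        return min(1.0, mean + radius)


def route(arms, max_complaint_rate, fallback):
    """Pick the cheapest arm whose pessimistic complaint rate meets the
    target; otherwise use the (assumed trusted) fallback model."""
    safe = [a for a in arms if a.complaint_ucb() <= max_complaint_rate]
    return min(safe, key=lambda a: a.cost) if safe else fallback


def record(arm, complained):
    """Update an arm's counters after a request has been served."""
    arm.pulls += 1
    arm.complaints += int(complained)


small = Arm("small-model", cost=1.0)
large = Arm("large-model", cost=10.0)
arms = [small, large]

# With no evidence, nothing is provably safe, so the fallback is used.
first_choice = route(arms, max_complaint_rate=0.1, fallback=large)

# Simulate 1000 requests served by the small model with 10 complaints.
for i in range(1000):
    record(small, complained=(i < 10))

later_choice = route(arms, max_complaint_rate=0.1, fallback=large)
print(first_choice.name, "->", later_choice.name)
```

This sketch never sends traffic to an untested cheap model on its own, so a real system would need some exploration mechanism; how SLARouter balances exploration against strict SLA compliance is not described in the abstract.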