{"slug": "modular-why-llm-inference-needs-a-new-kind-of-router-part-1", "title": "Modular: Why LLM Inference Needs a New Kind of Router - Part 1", "summary": "Modular announced that traditional HTTP-era load balancing algorithms like round-robin, consistent hashing, and least-connections are inadequate for large language model inference because GPU pods are stateful, specialized, and heterogeneous rather than interchangeable stateless backends. The company introduced a new routing approach in its Modular Cloud orchestration layer to account for expensive KV caches, prefill versus decode specialization, and multi-request conversations that require backends to maintain context across calls. This is the first installment of a three-part series explaining how inference routing must evolve beyond the \"blind\" assumptions of conventional load balancers.", "body_md": "May 8, 2026\n\nAayush Deshpande\n\nDeep Dhillon\n\nAlexandr Nikitin\n\nMichael Dunn-OConnor\n\nEngineering\n\nHTTP routing has been a solved problem for many years. Round-robin, consistent hashing, least-connections. Pick one, put it in front of a pool of identical servers, and the traffic spreads pretty evenly.\n\nBut then came Large Language Models.\n\nThe backends here aren't interchangeable web servers. They're GPU pods holding large, local KV caches in high-bandwidth memory, RAM, or SSD. That state is expensive to rebuild, not uniformly available across the cluster, and often determines whether a request returns quickly or spends seconds recomputing previous work. Some pods might specialize in prefill, others in decode. Conversations typically stretch across requests. A single inference call sometimes needs two backends in sequence. The old assumptions about \"interchangeable backends\" and \"independent requests\" don't support these requirements.\n\nTraditional routing is blind to all of this. It treats every backend as interchangeable, every request as independent, every pod as equally good. GPU pods are none of those things. They’re stateful, specialized and heterogeneous. Inference routing has to account for that.\n\nThis is the first post in a three-part series about what routing has to become to handle inference workloads. Modular Cloud’s orchestration layer is built around this routing problem, and this series explains how it solves it.\n\nHTTP-era load balancing is built on a small menu of algorithms, each tuned to a specific deployment shape. 
They have different policies, but they share the same precondition: stateless backends.\n\nRound-robin distributes requests uniformly across a pool of identical backends. It assumes every backend serves every request at the same cost. This might look like eight replicas of the same web service behind a load balancer, each getting 12.5% of the traffic. It’s simple, fair, stateless.\n\nConsistent hashing routes each request to a backend determined by hashing some property of the request (a key, URL, or session identifier) and picking the backend whose position on the hash ring is closest. It’s the routing strategy of choice when you want the same key to land on the same backend, for client-side caching or session affinity. The backend’s “stickiness” is a function of the request key, not of anything the backend is holding in memory.\n\nLeast-requests sends each new request to whichever backend has the fewest active requests, on the assumption that fewer active requests means more spare capacity. It works when every request takes roughly the same amount of work.\n\nThese policies share the same three assumptions: every backend is interchangeable, every request is independent, and every backend is equally good at serving any request.\n\nThose assumptions hold for stateless web services. LLM inference breaks all three.\n\nLLM workloads violate the stateless assumptions in four specific ways. Each one introduces a dimension that traditional routing has no mechanism to handle.\n\nWhen a pod serves an inference request, the forward pass builds a KV cache: the model's intermediate state for every token position, held in GPU memory. Modern engines retain that cache after the response completes, so later requests sharing a prefix can skip the equivalent compute.\n\nThis changes the routing problem drastically. A 100K-token prompt landing on a pod with the first 75K tokens already cached can prefill in milliseconds. The same prompt hitting a cold pod takes seconds. 
Round-robin, blind to cache state, would produce unpredictable time to first token (TTFT) for identical requests.\n\nCache state is the primary driver of prefill latency variance at scale. A router that selects pods based on cache residency eliminates prefill compute proportional to the shared prefix length for every hit. This frees up GPU cycles the cluster would otherwise spend recomputing work it has already done.\n\nLLM inference has two phases, and they stress hardware differently.\n\nPrefill processes the entire prompt in parallel. It’s compute-bound. This means GPU cores are saturated doing dense matrix multiplications across thousands of tokens at once.\n\nDecode generates tokens one at a time autoregressively, each token depending on every token before it. It’s memory-bandwidth-bound. This means most of the GPU’s time is spent fetching model weights and KV cache from high-bandwidth memory (HBM), and most of the compute sits idle.\n\nRunning both phases on the same pod means the hardware is never tuned for either. Prefill needs dense compute; decode needs memory bandwidth. A pod optimized for one underutilizes what the other requires. Disaggregated deployments use pods tuned for each phase separately. A single client request divides work across both.\n\nModern engines use chunked prefill to interleave the two phases on the same pod, blurring the boundary. But the underlying compute-vs-bandwidth distinction still holds, and when you disaggregate at the deployment level, your router has to know which pod can do what.\n\nMost LLM traffic is multi-turn. A user sends a message. The assistant replies. The user sends another message, and that message implicitly contains the entire conversation history as context.\n\nTurn N+1 shares a prefix with turn N: the system prompt, all prior turns, all prior assistant replies. If the KV cache from turn N is still resident on some pod, turn N+1 is effectively free to prefill for the shared portion. 
If the cache has been evicted, or if turn N+1 lands on a different pod, the shared prefix is recomputed from scratch.\n\nSession affinity in HTTP used to mean “route this user’s requests to the same backend so the application can use in-memory state.” In LLM inference it means the same thing, but the in-memory state is the KV cache. Getting it right is the difference between sub-second responses and multi-second responses on every turn after the first.\n\nA single client-facing request may require more than one backend.\n\nIn a disaggregated deployment, the prefill pod builds the KV cache and the decode pod generates tokens. Neither can serve the request alone. The router picks a prefill pod, then a decode pod, then orchestrates the sequence: send the prompt to prefill, wait for completion, send the same prompt plus a cache hint to decode, stream tokens back to the client.\n\nHTTP load balancers don’t do this. They pick one backend per request. Adding multi-step coordination to a single-dispatch router is a different shape of routing entirely.\n\nEach of the four dimensions above imposes requirements on a routing system. Those requirements fall into three distinct architectural concerns, each handled by a separate layer.\n\nThe data layer tracks LLM-specific state at the latencies routing decisions require. The question “which of N pods has these blocks cached?” has to be answerable in microseconds, under concurrent updates, resilient to pod churn. A hashmap with a mutex isn’t sufficient.\n\nThe decision layer expresses routing logic as compositions of small, testable, reusable components. Operators pick a filter, a few scorers, a picker, and assemble a profile. The framework validates the composition at build time, not under traffic at 3am.\n\nThe execution layer coordinates multi-step request flows on top of the decision layer. Single-dispatch routing is a degenerate case of multi-step: one pod, one step. 
Disaggregated prefill/decode is the general case: two pods, two steps, with the second decision informed by the first. The same framework handles both without requiring a new HTTP handler per variant.\n\nParts 2 and 3 of this series build these layers.\n\nThis series describes the routing layer inside Modular Cloud's distributed inference framework, and how it handles each of these four problems in production inference workloads.\n\nPrefix-aware routing (tokenization, block-level hashing, cache-aware scoring with load-aware tiebreaking, circuit breakers on upstream latency) ships as a profile configuration, not a new algorithm. When the team needs a new routing behavior, the work is composing plugins into a new profile rather than writing a new routing strategy from scratch. Each new deployment pattern reuses what's already there.\n\nLLMs introduced four dimensions that traditional load balancers have no mechanism to handle: KV cache state that makes backend selection a performance-critical decision, hardware specialization that splits a single request across pod types, conversation continuity that ties sessions to cache residency, and multi-step execution that requires coordinating a sequence of backends rather than picking one.\n\nThis problem has been tackled from multiple angles. NVIDIA Dynamo, llm-d, vLLM production-stack, AIBrix, KServe, and Envoy AI Gateway have each advanced inference routing in different directions: disaggregated prefill/decode, prefix-aware scheduling, KV-aware load balancing, production-grade serving primitives. Modular Cloud builds on that foundation. 
To support the range of deployment patterns it targets, Modular Cloud makes composable plugins and multi-step execution both first-class primitives, so a new deployment pattern becomes a profile you assemble rather than a strategy you fork.\n\nThat’s the gap Modular Cloud’s routing layer is designed to close: three architectural layers with composition as the extension model rather than forking or wrapping. The rest of this series shows how it’s built.\n\nPart 2: The data layer. The data structure that makes cache-aware routing possible: sharded bitmaps, Fibonacci-scrambled distribution, and binary search over cumulative block hashes that turns a P x N scan into O(K x log N).\n\nPart 3: The decision and execution layers. Turning cache state into routing decisions and then into execution. A five-stage composable pipeline, typed state between plugins, and the Selector/Workflow/Executor split that scales the same framework from round-robin to disaggregated prefill/decode.", "url": "https://wpnews.pro/news/modular-why-llm-inference-needs-a-new-kind-of-router-part-1", "canonical_source": "https://www.modular.com/blog/why-llm-inference-needs-a-new-kind-of-router-part-1", "published_at": "2026-05-08 00:00:00+00:00", "updated_at": "2026-05-29 23:58:05.610728+00:00", "lang": "en", "topics": ["ai-infrastructure", "large-language-models", "ai-products", "ai-tools", "generative-ai"], "entities": ["Modular", "Hippocratic AI", "DeepSeek", "FLUX", "Kimi", "MiniMax", "Wan", "Aayush Deshpande"], "alternates": {"html": "https://wpnews.pro/news/modular-why-llm-inference-needs-a-new-kind-of-router-part-1", "markdown": "https://wpnews.pro/news/modular-why-llm-inference-needs-a-new-kind-of-router-part-1.md", "text": "https://wpnews.pro/news/modular-why-llm-inference-needs-a-new-kind-of-router-part-1.txt", "jsonld": "https://wpnews.pro/news/modular-why-llm-inference-needs-a-new-kind-of-router-part-1.jsonld"}}