June 5, 2026
Aayush Deshpande
Deep Dhillon
Alexandr Nikitin
Michael Dunn-OConnor
Engineering
In Part 2 of this series, we built a data structure that can answer, in microseconds and for every request, which pods hold the user’s KV cache blocks. This post goes beyond the cluster and cache state to explain how Modular Cloud generates routing decisions and dispatches them across pods.
Most routing stacks ship with a fixed set of algorithms: round-robin, least-requests, consistent hashing, etc. These are generally independent implementations rather than composable components. As a result, when a customer asks for "consistent hashing with a concurrency cap" or "cache-aware with session stickiness," it requires adding a new algorithm from scratch. Disaggregated prefill/decode increases this proliferation. Every variant traditionally has its own HTTP handler, discovery logic, proxy code, and session management. That requires hundreds of lines of additional plumbing per variant.
Instead, Modular Cloud’s routing layer is built from a small number of stages where behaviors are expressed as composable plugins. New requirements can be satisfied with plugins, and new execution patterns are created by composing those plugins.
Every routing decision in Modular Cloud goes through the same five stages, in the same order: Prepare → Filter → Score → Pick → Execute.
Prepare enriches the routing context with whatever the downstream stages need: tokenizing the prompt, hashing it into blocks, extracting a session key from a header, and computing a hash key for consistent hashing. It runs once per request, before any candidate evaluation.
Filter removes candidates that can't serve the request, based on health checks, hardware-role matching (prefill requests must not go to decode-only pods), and concurrency limits. It outputs a smaller candidate list.
Score assigns a quality score to each remaining candidate based on factors like cache affinity, load, and node locality. A routing profile can include multiple scorers, each evaluating candidates on different factors.
Pick selects one or more candidates from the scored list using one of several strategies: MaxScorePicker (deterministic, highest score wins), RoundRobinPicker (stateful cycle), or SessionPicker (sticky lookup). Multiple scores can be composed with explicit weights to produce the final per-candidate score. Pick then outputs a RoutingPlan, which is a list of tuples representing the role, candidate, failure policy, and more.
Execute dispatches the RoutingPlan. For single-dispatch routing, this is one HTTP proxy call. For multi-step plans like disaggregated prefill/decode, it runs a sequenced flow: call the prefill pod, wait, call the decode pod, then stream the response.
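To make the stage order concrete, here is a minimal Go sketch of a single-dispatch pass through the five stages. Every name in it (`Candidate`, `RoutingContext`, `Profile`, `Route`, and the function types) is a hypothetical stand-in chosen for illustration, not Modular Cloud's actual API, and the Execute stage returns a string instead of proxying HTTP.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for a pod and the per-request routing context.
type Candidate struct {
	Name    string
	Healthy bool
	Load    int // in-flight requests
}

type RoutingContext struct {
	Prompt string
	Values map[string]any // scratch space filled in by Prepare
}

// One function type per stage keeps the sketch short.
type (
	Preparer func(ctx *RoutingContext)
	Filter   func(ctx *RoutingContext, in []Candidate) []Candidate
	Scorer   func(ctx *RoutingContext, in []Candidate) []float64
	Picker   func(in []Candidate, scores []float64) (Candidate, error)
	Executor func(ctx *RoutingContext, target Candidate) string
)

type Profile struct {
	Prepare []Preparer
	Filters []Filter
	Score   Scorer
	Pick    Picker
	Execute Executor
}

// Route runs Prepare, Filter, Score, Pick, Execute in that fixed order.
func (p Profile) Route(ctx *RoutingContext, pods []Candidate) (string, error) {
	for _, prep := range p.Prepare {
		prep(ctx) // runs once per request, before any candidate is evaluated
	}
	for _, f := range p.Filters {
		pods = f(ctx, pods)
	}
	if len(pods) == 0 {
		return "", errors.New("no candidates left after filtering")
	}
	target, err := p.Pick(pods, p.Score(ctx, pods))
	if err != nil {
		return "", err
	}
	return p.Execute(ctx, target), nil
}

func healthyOnly(_ *RoutingContext, in []Candidate) []Candidate {
	var out []Candidate
	for _, c := range in {
		if c.Healthy {
			out = append(out, c)
		}
	}
	return out
}

// leastLoad gives emptier pods a higher score.
func leastLoad(_ *RoutingContext, in []Candidate) []float64 {
	scores := make([]float64, len(in))
	for i, c := range in {
		scores[i] = 1.0 / float64(1+c.Load)
	}
	return scores
}

// maxScore is deterministic: the highest score wins, ties go to the first.
func maxScore(in []Candidate, scores []float64) (Candidate, error) {
	if len(in) == 0 || len(in) != len(scores) {
		return Candidate{}, errors.New("candidates and scores do not line up")
	}
	best := 0
	for i := range scores {
		if scores[i] > scores[best] {
			best = i
		}
	}
	return in[best], nil
}

func demoProfile() Profile {
	return Profile{
		Prepare: []Preparer{func(ctx *RoutingContext) { ctx.Values["promptLen"] = len(ctx.Prompt) }},
		Filters: []Filter{healthyOnly},
		Score:   leastLoad,
		Pick:    maxScore,
		Execute: func(_ *RoutingContext, t Candidate) string { return "proxied to " + t.Name },
	}
}

func main() {
	pods := []Candidate{{"pod-a", true, 4}, {"pod-b", false, 0}, {"pod-c", true, 1}}
	ctx := &RoutingContext{Prompt: "hello", Values: map[string]any{}}
	out, err := demoProfile().Route(ctx, pods)
	fmt.Println(out, err) // prints "proxied to pod-c <nil>"
}
```

In this toy run the unhealthy pod is filtered out even though it is idle, and the less loaded of the two remaining pods wins the pick.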
Much of our inspiration comes from Endpoint Picker (EPP), the routing component of the Gateway API Inference Extension (GAIE). EPP effectively returns a single endpoint, whereas we wanted the routing layer to support plans that execute across multiple endpoints. To enable that flexibility, we expanded our profiles with additional stages like Prepare and Execute.
Unlike EPP, Prepare is a distinct stage from scoring. Tokenization and block hashing are expensive, while extracting a session key from a header is cheap. Mixing these into scoring forces you to redo expensive work for each scorer and complicates the dependency graph for scorers that share derived inputs. Separating Prepare gives the framework a single place to run expensive transforms and cache their results for the rest of the pipeline.
Additionally, Execute is a first-class stage, not an implicit step. EPP's pipeline ends at Pick, and whatever executes the result lives outside the pipeline. That works for single-dispatch routing, where "execute" is a single HTTP proxy. It breaks down when execution has its own structure, as it does for disaggregated prefill/decode, where the plan is a sequence rather than a single step. By putting Execute in the pipeline, the same framework can handle both simple and complex dispatch with the same abstraction.
These five stages are composable because of their defined interfaces and clear separation of concerns. A scorer evaluates each candidate on one criterion and produces a per-candidate score. The framework combines multiple scorers with explicit weights, and a picker selects the winner. Different routing strategies are created by combining these reusable plugins, making it easier to implement new patterns.
RoundRobinScorer prioritizes the next endpoint in a rotation. When combined with LeastLoadScorer, that priority is weighted toward emptier queues.
Consistent hashing has a preparer to derive a hash key from the request and a picker to select a candidate matching that key. Potential candidates can be scored by load, cache affinity, locality, or any weighted combination.
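As an illustration of that preparer/picker split, the sketch below derives a hash key from a session identifier and picks the pod that owns that key on a hash ring with virtual nodes. The ring is one common way to implement consistent hashing and is an assumption here, as are the names `hashKeyPreparer`, `ring`, and `pick`; the post does not say which scheme Modular Cloud uses.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

func hash64(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

// hashKeyPreparer (hypothetical) turns a request attribute into a hash key.
func hashKeyPreparer(sessionID string) uint64 { return hash64(sessionID) }

// ring is a sorted list of virtual-node positions, each owned by one pod.
type ring struct {
	points []uint64
	owner  map[uint64]string
}

func newRing(pods []string, replicas int) *ring {
	r := &ring{owner: map[uint64]string{}}
	for _, p := range pods {
		for i := 0; i < replicas; i++ {
			pt := hash64(fmt.Sprintf("%s#%d", p, i))
			r.points = append(r.points, pt)
			r.owner[pt] = p
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// pick returns the owner of the first ring position at or after key,
// wrapping around to the start of the ring when key is past the last one.
func (r *ring) pick(key uint64) string {
	if len(r.points) == 0 {
		return ""
	}
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= key })
	if i == len(r.points) {
		i = 0
	}
	return r.owner[r.points[i]]
}

func main() {
	key := hashKeyPreparer("session-42")
	full := newRing([]string{"pod-a", "pod-b", "pod-c"}, 64)
	fmt.Println("picked:", full.pick(key))
}
```

The property that matters for routing is stability: removing a pod that does not own a key leaves that key's pick unchanged, so only the removed pod's sessions move.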
Cache-aware routing utilizes two preparers and two scorers. The preparers tokenize and hash the prompt up front. CacheAffinityScorer rewards pods that already hold the blocks, and LeastLoadScorer prevents hot-spotting. The weights control that tradeoff between reuse and capacity.
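The weight tradeoff can be seen in a small Go sketch. The types and functions here (`Pod`, `Weights`, `cacheAffinity`, `leastLoad`, `pick`) are hypothetical, and two details are assumptions of the sketch rather than facts from this post: affinity is measured as the fraction of leading prompt blocks a pod already holds, stopping at the first miss, and load is normalized against a fixed concurrency ceiling.

```go
package main

import "fmt"

// Pod is a hypothetical candidate with its resident block hashes and load.
type Pod struct {
	Name     string
	Cached   map[uint64]bool
	InFlight int
}

// cacheAffinity returns the fraction of the prompt's leading blocks that the
// pod already holds. Counting stops at the first miss (an assumption: only a
// contiguous cached prefix is treated as reusable).
func cacheAffinity(blocks []uint64, p Pod) float64 {
	if len(blocks) == 0 {
		return 0
	}
	hit := 0
	for _, b := range blocks {
		if !p.Cached[b] {
			break
		}
		hit++
	}
	return float64(hit) / float64(len(blocks))
}

// leastLoad returns the pod's free capacity in [0, 1].
func leastLoad(p Pod, maxInFlight int) float64 {
	if maxInFlight <= 0 {
		return 1
	}
	free := 1 - float64(p.InFlight)/float64(maxInFlight)
	if free < 0 {
		return 0
	}
	return free
}

// Weights expresses the tradeoff between cache reuse and spare capacity.
type Weights struct{ Cache, Load float64 }

// pick returns the name of the pod with the highest weighted score.
func pick(blocks []uint64, pods []Pod, w Weights, maxInFlight int) string {
	best, bestScore := "", -1.0
	for _, p := range pods {
		s := w.Cache*cacheAffinity(blocks, p) + w.Load*leastLoad(p, maxInFlight)
		if s > bestScore {
			best, bestScore = p.Name, s
		}
	}
	return best
}

func demoPods() []Pod {
	return []Pod{
		{Name: "warm-busy", Cached: map[uint64]bool{1: true, 2: true, 3: true}, InFlight: 9},
		{Name: "cold-idle", Cached: map[uint64]bool{}, InFlight: 0},
	}
}

func main() {
	blocks := []uint64{1, 2, 3, 4}
	fmt.Println(pick(blocks, demoPods(), Weights{Cache: 0.8, Load: 0.2}, 10)) // warm-busy
	fmt.Println(pick(blocks, demoPods(), Weights{Cache: 0.2, Load: 0.8}, 10)) // cold-idle
}
```

With the cache weight dominant, the request goes to the pod holding three of the four blocks despite its queue. Flipping the weights sends the same request to the idle pod, which is the hot-spot protection the load scorer is there to provide.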
Similarly, combining cache-aware routing with sticky sessions can be achieved by combining existing preparers and scorers. Once you have five stages with clean interfaces, familiar routing behaviors decompose into stage-level plugins. And the plugins compose into new execution patterns.
Plugins need to talk to each other. TokenizePreparer produces a token array that BlockHashPreparer consumes to produce block hashes, which CacheAffinityScorer consumes in turn.
How do they communicate without being coupled? The framework provides typed slots on the RoutingContext. A slot is a typed key with a name and a compile-time type. TokenizePreparer writes into the Tokens slot. BlockHashPreparer reads Tokens and writes BlockHashes. CacheAffinityScorer reads BlockHashes. None of the three plugins needs to reference the others to complete these reads and writes.
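One way to get a typed key with a compile-time type in Go is a generic slot whose identity is its own pointer. The sketch below is an assumption about how such slots could look, not the framework's implementation: `Slot`, `NewSlot`, `Read`, and `Write` are hypothetical names, the tokenizer is a toy, and the block hash is a simple chained mix rather than a real hashing scheme.

```go
package main

import (
	"fmt"
	"strings"
)

// Slot is a typed key. Its pointer is its identity, so two slots that happen
// to share a name can never collide the way two equal strings would.
type Slot[T any] struct{ name string }

func NewSlot[T any](name string) *Slot[T] { return &Slot[T]{name: name} }

type RoutingContext struct{ values map[any]any }

func NewRoutingContext() *RoutingContext { return &RoutingContext{values: map[any]any{}} }

// Write only compiles when v matches the slot's type:
// Write(ctx, Tokens, "oops") is rejected by the compiler.
func Write[T any](ctx *RoutingContext, s *Slot[T], v T) { ctx.values[s] = v }

func Read[T any](ctx *RoutingContext, s *Slot[T]) (T, bool) {
	raw, ok := ctx.values[s]
	if !ok {
		var zero T
		return zero, false
	}
	v, _ := raw.(T)
	return v, true
}

var (
	Tokens      = NewSlot[[]int]("tokens")
	BlockHashes = NewSlot[[]uint64]("block_hashes")
)

// tokenizePreparer is a toy: one token per word, token id = word length.
func tokenizePreparer(ctx *RoutingContext, prompt string) {
	var toks []int
	for _, w := range strings.Fields(prompt) {
		toks = append(toks, len(w))
	}
	Write(ctx, Tokens, toks)
}

// blockHashPreparer chains a running hash over the tokens and emits one
// value per full block, so each block hash depends on the whole prefix.
func blockHashPreparer(ctx *RoutingContext, blockSize int) {
	toks, ok := Read(ctx, Tokens)
	if !ok || blockSize <= 0 {
		return
	}
	var hashes []uint64
	h := uint64(14695981039346656037)
	for i, t := range toks {
		h = (h ^ uint64(t)) * 1099511628211
		if (i+1)%blockSize == 0 {
			hashes = append(hashes, h)
		}
	}
	Write(ctx, BlockHashes, hashes)
}

// cacheAffinityScorer counts how many of the request's blocks a pod holds.
func cacheAffinityScorer(ctx *RoutingContext, cached map[uint64]bool) int {
	hashes, _ := Read(ctx, BlockHashes)
	hits := 0
	for _, h := range hashes {
		if cached[h] {
			hits++
		}
	}
	return hits
}

func main() {
	ctx := NewRoutingContext()
	tokenizePreparer(ctx, "route this prompt to a warm pod")
	blockHashPreparer(ctx, 2)
	hashes, _ := Read(ctx, BlockHashes)
	fmt.Println("blocks:", len(hashes), "hits on empty pod:", cacheAffinityScorer(ctx, nil))
}
```

The three plugins share only the two slot variables. None of them imports or names another plugin, which is the decoupling the slots are meant to provide.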
There are two benefits of this indirection:
Compare this to EPP's approach. EPP shares per-request plugin state through a structure called CycleState, which is backed by a key-value store keyed by opaque strings with interface{}-typed values. Type-safe generic accessors were added on top, which catch type mismatches at the call site. However, the storage is still string-keyed at heart. Two plugins can independently choose the same key and conflict, and nothing in the system notices until something goes wrong in production. The Gateway API Inference Extension (GAIE) community is aware of this and is actively discussing improvements.
We can close that gap with typed slots resolved at build time.
When the framework composes a routing profile, it can check a set of invariants statically: every slot reader has a writer, every declared dependency appears in the composition, the dependency graph has no cycles, and no two plugins are mis-ordered or declare conflicting dependencies.
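A rough sketch of such startup checks follows, assuming each plugin declares the slot names it reads and writes. `PluginSpec` and `Validate` are hypothetical, and treating two writers of the same slot as a conflict is an assumption of this sketch rather than a rule stated in the post.

```go
package main

import (
	"errors"
	"fmt"
)

// PluginSpec is a hypothetical declaration of a plugin's slot usage.
type PluginSpec struct {
	Name   string
	Reads  []string
	Writes []string
}

// Validate checks a profile, listed in execution order, before serving.
func Validate(profile []PluginSpec) error {
	writer := map[string]int{} // slot name -> index of the plugin writing it
	for i, p := range profile {
		for _, s := range p.Writes {
			if j, dup := writer[s]; dup { // assumption: one writer per slot
				return fmt.Errorf("slot %q is written by both %s and %s", s, profile[j].Name, p.Name)
			}
			writer[s] = i
		}
	}
	// Build writer -> reader edges; every reader must have a writer.
	indeg := make([]int, len(profile))
	next := make([][]int, len(profile))
	for i, p := range profile {
		for _, s := range p.Reads {
			w, ok := writer[s]
			if !ok {
				return fmt.Errorf("%s reads slot %q but no plugin writes it", p.Name, s)
			}
			next[w] = append(next[w], i)
			indeg[i]++
		}
	}
	// Kahn's algorithm: if some plugin is never released, there is a cycle.
	var queue []int
	for i, d := range indeg {
		if d == 0 {
			queue = append(queue, i)
		}
	}
	visited := 0
	for len(queue) > 0 {
		n := queue[0]
		queue = queue[1:]
		visited++
		for _, m := range next[n] {
			if indeg[m]--; indeg[m] == 0 {
				queue = append(queue, m)
			}
		}
	}
	if visited != len(profile) {
		return errors.New("slot dependencies form a cycle")
	}
	// Ordering: a writer must be listed before each of its readers.
	for i, p := range profile {
		for _, s := range p.Reads {
			if w := writer[s]; w > i {
				return fmt.Errorf("%s is listed before %s, which writes slot %q", p.Name, profile[w].Name, s)
			}
		}
	}
	return nil
}

func main() {
	profile := []PluginSpec{
		{Name: "BlockHashPreparer", Reads: []string{"Tokens"}, Writes: []string{"BlockHashes"}},
		{Name: "TokenizePreparer", Writes: []string{"Tokens"}},
	}
	if err := Validate(profile); err != nil {
		fmt.Println("refusing to start:", err)
	}
}
```

The example profile lists the block hasher ahead of the tokenizer it depends on, so validation reports the ordering problem instead of letting a request hit an empty slot.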
If any check fails, the service doesn't start. The operator sees a structured error at deployment time, not during a failed read in production. You can build new profiles with confidence that errors surface immediately.
So far, we've described a single-dispatch pipeline: one pass through Prepare → Filter → Score → Pick → Execute, one pod selected, one HTTP call. Most routing looks like this, but disaggregated prefill/decode doesn't. Disaggregation means one client request is served by two pods. Prefill builds the KV cache, and decode generates tokens. Routing has to pick both, and the second pick depends on the first.
The framework handles this with a three-layer abstraction on top of the five-stage pipeline.
Selector: A Selector is one pass through Prepare → Filter → Score → Pick. It takes a set of candidates and returns a chosen candidate (or a ranked list).
Workflow: A Workflow composes one or more Selectors and produces the full RoutingPlan. A single-dispatch Workflow uses one Selector. A prefill/decode Workflow uses two: one for the prefill pod and one for the decode pod. The decode Selector's decisions can depend on what the prefill Selector chose, because they share a RoutingContext.
Executor: An Executor takes the RoutingPlan and dispatches it. A single-dispatch Executor does one HTTP proxy. A sequential Executor does the prefill-wait-then-decode pattern, with hooks for request mutation, body injection, and response streaming at each step.
The composition scales linearly. Single-dispatch uses one Selector, and disaggregated prefill/decode uses two. A hypothetical three-step workflow (route through a preprocessor, then prefill, then decode) would use three. The Workflow abstraction handles the composition of any number of required Selectors. Adding a new execution pattern requires a different Workflow and a different Executor, not a new HTTP handler.
The Selector, Workflow, and Executor split is most valuable in disaggregated prefill/decode, where a single request is served by two pods, requiring additional coordination. Decode can't start until prefill has built the KV cache. Selections for both pods can happen concurrently (prefill and decode selection are independent decisions), but execution is sequential: the prefill pod is called first, the Executor waits for it to finish, the decode pod is called next, and the response is then streamed back to the client.
A disaggregated Workflow uses two Selectors sharing one RoutingContext. Sharing one context lets the second decision build on the first. The prefill Selector records the pod it picks, and the decode Selector reads that value so it can prefer a decode pod close to the prefill pod, minimizing how far the cache must travel. The sequential Executor then carries out the steps above and streams the response back to the client.
Each Selector runs the same Prepare, Filter, Score, and Pick stages as single-dispatch routing, using the same plugins and scoring.
Whether disaggregation helps depends on the workload. The Hao AI Lab at UC San Diego published an 18-month retrospective on where disaggregation helps and where it doesn't. The short version: disaggregation wins on long-context, cache-cold workloads and loses on short-context, high-hit-rate ones.
When disaggregation does make sense, the framework uses the same plugins, the same RoutingContext, and the same scoring logic as single-dispatch routing. No parallel system is required.
Adding a new routing behavior to this framework takes four steps.
Say we want a geographic proximity scorer that prefers pods in the same region as the client. We would have to implement the Scorer interface with a Score(ctx, candidates) []float64 method, register the plugin type with the framework's plugin registry, add the scorer to whichever profiles want it, and weight it against other scorers. Profile authors decide the weight, and the framework validates the composition at startup.
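A hedged sketch of those four steps in Go follows. Only the `Score(ctx, candidates) []float64` method shape comes from this post. The parameter types, the `Register` function, the plugin name `region-proximity`, and the `WeightedScorer` profile entry are hypothetical placeholders for whatever the real framework exposes.

```go
package main

import "fmt"

type Candidate struct{ Name, Region string }

type RoutingContext struct{ ClientRegion string }

// Scorer mirrors the method shape named in the post; types are assumed.
type Scorer interface {
	Score(ctx *RoutingContext, candidates []Candidate) []float64
}

// Step 1: implement the Scorer interface.
type RegionProximityScorer struct{}

func (RegionProximityScorer) Score(ctx *RoutingContext, candidates []Candidate) []float64 {
	scores := make([]float64, len(candidates))
	for i, c := range candidates {
		if ctx.ClientRegion != "" && c.Region == ctx.ClientRegion {
			scores[i] = 1 // same region as the client
		}
	}
	return scores
}

// Step 2: register the plugin type with a (hypothetical) registry.
var registry = map[string]func() Scorer{}

func Register(name string, factory func() Scorer) { registry[name] = factory }

func init() {
	Register("region-proximity", func() Scorer { return RegionProximityScorer{} })
}

// Steps 3 and 4: a profile lists the scorer by name and gives it a weight.
type WeightedScorer struct {
	Plugin string
	Weight float64
}

// Combine sums weight * score across every scorer in the profile.
func Combine(ctx *RoutingContext, candidates []Candidate, profile []WeightedScorer) ([]float64, error) {
	total := make([]float64, len(candidates))
	for _, ws := range profile {
		factory, ok := registry[ws.Plugin]
		if !ok {
			return nil, fmt.Errorf("unknown scorer plugin %q", ws.Plugin)
		}
		for i, s := range factory().Score(ctx, candidates) {
			total[i] += ws.Weight * s
		}
	}
	return total, nil
}

func main() {
	pods := []Candidate{{"pod-eu", "eu-west"}, {"pod-us", "us-east"}}
	ctx := &RoutingContext{ClientRegion: "eu-west"}
	profile := []WeightedScorer{{Plugin: "region-proximity", Weight: 0.5}}
	scores, err := Combine(ctx, pods, profile)
	fmt.Println(scores, err)
}
```

Because the scorer only returns per-candidate numbers, the profile author can change its weight, or drop it entirely, without touching the scorer's code.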
Over these three posts, we demonstrated how Modular Cloud’s routing layer allows us to rapidly implement new routing optimizations. The data layer tracks which KV cache blocks live on which pods at microsecond query latency. The decision layer expresses routing behaviors as composable plugins rather than hard-coded algorithms. The execution layer coordinates multi-step flows when a single request touches multiple pods, as with disaggregated prefill/decode.
We plan to follow this series with a deep dive into inference scheduling. Routing decides which pod handles a request. Scheduling decides which requests run next, in what batch, against what cache state. That follow-up will explore how we approach scheduling as a single system, allowing us to perform holistic optimizations for large-scale inference.
The routing layer described in this series is running in production on Modular Cloud today. If you're serving text, image, or video models and want to test our vertically integrated stack against your workloads, request access to Modular Cloud.