Modular: Why LLM Inference Needs a New Kind of Router - Part 3

June 5, 2026

Aayush Deshpande

Deep Dhillon

Alexandr Nikitin

Michael Dunn-OConnor

Engineering

In Part 2 of this series, we built a data structure that can query which pods hold a user's KV cache blocks in microseconds, on every request. This post moves beyond cluster and cache state to explain how Modular Cloud generates routing decisions and dispatches them across pods.

Most routing stacks ship with a fixed set of algorithms: round-robin, least-requests, consistent hashing, and so on. These are generally independent implementations rather than composable components. As a result, when a customer asks for "consistent hashing with a concurrency cap" or "cache-aware with session stickiness," the request means writing a new algorithm from scratch. Disaggregated prefill/decode makes this proliferation worse: every variant traditionally has its own HTTP handler, discovery logic, proxy code, and session management, which adds up to hundreds of lines of extra plumbing per variant.

Instead, Modular Cloud’s routing layer is built from a small number of stages where behaviors are expressed as composable plugins. New requirements can be satisfied with plugins, and new execution patterns are created by composing those plugins.

Every routing decision in Modular Cloud goes through the same five stages, in the same order: Prepare → Filter → Score → Pick → Execute.

Prepare enriches the routing context with whatever the downstream stages need: tokenizing the prompt, hashing it into blocks, extracting a session key from a header, and computing a hash key for consistent hashing. It runs once per request, before any candidate evaluation.

Filter removes candidates that can't serve the request, based on health checks, hardware-role matching (prefill requests must not go to decode-only pods), and concurrency limits. It outputs a smaller candidate list.

Score assigns a quality score to each remaining candidate based on factors like cache affinity, load, and node locality. A routing profile can include multiple scorers, each evaluating candidates on different factors.

Pick selects one or more candidates from the scored list using one of several strategies: MaxScorePicker (deterministic, highest score wins), RoundRobinPicker (stateful cycle), or SessionPicker (sticky lookup). Multiple scores can be composed with explicit weights to produce the final per-candidate score. Pick then outputs a RoutingPlan, which is a list of tuples representing the role, candidate, failure policy, and more.

Execute dispatches the RoutingPlan. For single-dispatch routing, this is one HTTP proxy call. For multi-step plans like disaggregated prefill/decode, it runs a sequenced flow: call the prefill pod, wait, call the decode pod, then stream the response.
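
To make the stage order concrete, here is a minimal Go sketch of one single-dispatch pass. Every name in it (Profile, Route, demoProfile, the x-session header) is an assumption made for illustration and is not Modular Cloud's API. The sketch wires one plugin per stage, and its executor records calls instead of proxying HTTP.

```go
package main

// Hypothetical sketch of the five stages; none of these names come from
// Modular Cloud's code.

// Candidate is a pod that might serve the request.
type Candidate struct {
	Name    string
	Healthy bool
	Load    int // in-flight requests
}

// Context carries values produced by Prepare for the later stages.
type Context struct {
	SessionKey string
}

// Step is one entry of a routing plan: a role and the pod chosen for it.
type Step struct {
	Role string
	Pod  string
}

// Profile wires one plugin per stage. A real profile would hold several.
type Profile struct {
	Prepare func(ctx *Context, headers map[string]string)
	Filter  func(ctx *Context, c Candidate) bool
	Score   func(ctx *Context, c Candidate) float64
	Pick    func(cands []Candidate, scores []float64) []Step
	Execute func(plan []Step) []string
}

// Route runs the stages in their fixed order.
func (p Profile) Route(headers map[string]string, cands []Candidate) []string {
	ctx := &Context{}
	p.Prepare(ctx, headers) // once per request, before any candidate is evaluated

	var kept []Candidate
	for _, c := range cands {
		if p.Filter(ctx, c) {
			kept = append(kept, c)
		}
	}

	scores := make([]float64, len(kept))
	for i, c := range kept {
		scores[i] = p.Score(ctx, c)
	}

	return p.Execute(p.Pick(kept, scores))
}

// demoProfile keeps healthy pods, prefers the least loaded one, and
// dispatches once.
func demoProfile() Profile {
	return Profile{
		Prepare: func(ctx *Context, h map[string]string) { ctx.SessionKey = h["x-session"] },
		Filter:  func(_ *Context, c Candidate) bool { return c.Healthy },
		Score:   func(_ *Context, c Candidate) float64 { return -float64(c.Load) },
		Pick: func(cands []Candidate, scores []float64) []Step {
			if len(cands) == 0 {
				return nil
			}
			best := 0
			for i := range cands {
				if scores[i] > scores[best] {
					best = i
				}
			}
			return []Step{{Role: "serve", Pod: cands[best].Name}}
		},
		// A real executor would proxy HTTP; this one only records the calls.
		Execute: func(plan []Step) []string {
			var calls []string
			for _, s := range plan {
				calls = append(calls, s.Role+":"+s.Pod)
			}
			return calls
		},
	}
}
```

Calling demoProfile().Route with a mix of healthy and unhealthy pods returns a single recorded call for the healthy pod with the fewest in-flight requests.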

Much of our inspiration comes from Endpoint Picker (EPP), the routing component of the Gateway API Inference Extension (GAIE). EPP effectively returns a single endpoint, whereas we wanted the routing layer to support dispatching one request to multiple endpoints. To enable that flexibility, we expanded our profiles with additional stages like Prepare and Execute.

Unlike EPP, Prepare is a distinct stage from scoring. Tokenization and block hashing are expensive, while extracting a session key from a header is cheap. Mixing these into scoring forces you to redo expensive work for each scorer and complicates the dependency graph for scorers that share derived inputs. Separating Prepare gives the framework a single place to run expensive transforms and cache their results for the rest of the pipeline.

Additionally, Execute is a first-class stage, not an implicit step. EPP's pipeline ends at Pick, and whatever executes the result lives outside the pipeline. That works for single-dispatch routing, where "execute" is a single HTTP proxy. It breaks down when execution has its own structure, as it does for disaggregated prefill/decode, where the plan is a sequence rather than a single step. By putting Execute in the pipeline, the same framework can handle both simple and complex dispatch with the same abstraction.

These five stages are composable because of their defined interfaces and clear separation of concerns. A scorer evaluates each candidate on one criterion and produces a per-candidate score. The framework combines multiple scorers with explicit weights, and a picker selects the winner. Different routing strategies are created by combining these reusable plugins, making it easier to implement new patterns.

RoundRobinScorer prioritizes the next endpoint in a rotation. When combined with LeastLoadScorer, that priority is weighted toward emptier queues.
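
The sketch below shows one way a rotation scorer and a load scorer could be combined with explicit weights and handed to a max-score picker. The scorer names follow the text, but the types, the Capacity field, the 0-to-1 score range, and the tie-breaking rule are assumptions made for illustration.

```go
package main

// Hypothetical types; only the scorer names are taken from the article.

type Pod struct {
	Name     string
	InFlight int // requests currently queued or running
}

// Scorer rates every candidate on a single criterion.
type Scorer interface {
	Score(pods []Pod) []float64
}

// RoundRobinScorer gives 1 to the next pod in the rotation and 0 to the rest.
// It is stateful: each call advances the rotation by one.
type RoundRobinScorer struct{ next int }

func (r *RoundRobinScorer) Score(pods []Pod) []float64 {
	scores := make([]float64, len(pods))
	if len(pods) > 0 {
		scores[r.next%len(pods)] = 1
		r.next++
	}
	return scores
}

// LeastLoadScorer gives 1 to an idle pod and 0 to a pod at or above Capacity.
type LeastLoadScorer struct{ Capacity int }

func (l LeastLoadScorer) Score(pods []Pod) []float64 {
	scores := make([]float64, len(pods))
	for i, p := range pods {
		if l.Capacity > 0 && p.InFlight < l.Capacity {
			scores[i] = 1 - float64(p.InFlight)/float64(l.Capacity)
		}
	}
	return scores
}

// Weighted pairs a scorer with its weight in the profile.
type Weighted struct {
	Scorer Scorer
	Weight float64
}

// Combine returns the weighted sum of all scorers for each pod.
func Combine(pods []Pod, scorers []Weighted) []float64 {
	total := make([]float64, len(pods))
	for _, ws := range scorers {
		for i, s := range ws.Scorer.Score(pods) {
			total[i] += ws.Weight * s
		}
	}
	return total
}

// MaxScorePick returns the index of the highest score, or -1 for an empty
// list. Ties go to the earliest candidate.
func MaxScorePick(scores []float64) int {
	best := -1
	for i, s := range scores {
		if best == -1 || s > scores[best] {
			best = i
		}
	}
	return best
}
```

With equal weights the rotation dominates and the picks cycle through the pods. Lowering the rotation weight lets a lightly loaded pod win even when it is not next in the rotation.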

Consistent hashing has a preparer to derive a hash key from the request and a picker to select a candidate matching that key. Potential candidates can be scored by load, cache affinity, locality, or any weighted combination.
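
As a rough illustration of the preparer and picker pair, the sketch below uses a hash ring with virtual nodes and FNV-1a hashing. The article does not say which consistent-hashing scheme Modular Cloud uses, so the ring, the function names, and the allowed predicate (standing in for an upstream filter such as a concurrency cap) are all assumptions.

```go
package main

import (
	"hash/fnv"
	"sort"
	"strconv"
)

// hash64 is FNV-1a, chosen here only because it is in the standard library.
func hash64(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

// PrepareHashKey plays the preparer role: it derives the hash key from a
// request attribute (a hypothetical session ID).
func PrepareHashKey(sessionID string) uint64 { return hash64(sessionID) }

// Ring is a consistent-hash ring with virtual nodes per pod.
type Ring struct {
	points []uint64          // sorted positions on the ring
	owner  map[uint64]string // position -> pod name
}

// NewRing places vnodes points on the ring for every pod.
func NewRing(pods []string, vnodes int) *Ring {
	r := &Ring{owner: map[uint64]string{}}
	for _, p := range pods {
		for v := 0; v < vnodes; v++ {
			pt := hash64(p + "#" + strconv.Itoa(v))
			r.points = append(r.points, pt)
			r.owner[pt] = p
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Pick plays the picker role: it walks clockwise from the key and returns the
// first pod that the allowed predicate accepts. It reports false when no pod
// is accepted.
func (r *Ring) Pick(key uint64, allowed func(pod string) bool) (string, bool) {
	n := len(r.points)
	start := sort.Search(n, func(i int) bool { return r.points[i] >= key })
	for i := 0; i < n; i++ {
		pod := r.owner[r.points[(start+i)%n]]
		if allowed(pod) {
			return pod, true
		}
	}
	return "", false
}
```

The same key always lands on the same pod while that pod stays eligible. When the predicate rejects it, the walk continues to the next pod on the ring, which is how a cap or health filter can compose with the hash-based pick.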

Cache-aware routing utilizes two preparers and two scorers. The preparers tokenize and hash the prompt up front. CacheAffinityScorer rewards pods that already hold the blocks, and LeastLoadScorer prevents hot-spotting. The weights control that tradeoff between reuse and capacity.
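
A toy version of that flow might look like the following. It is a sketch under stated assumptions: the whitespace tokenizer stands in for a real model tokenizer, the chained FNV-1a block hash is one common prefix-hashing scheme rather than a confirmed detail of Modular Cloud, and all function names and the 0-to-1 score ranges are hypothetical.

```go
package main

import (
	"encoding/binary"
	"hash/fnv"
	"strings"
)

// Tokenize is a whitespace stand-in for a real model tokenizer (assumption).
func Tokenize(prompt string) []string { return strings.Fields(prompt) }

// BlockHashes hashes each full block of blockSize tokens. Every hash also
// covers the previous block's hash, so a block only matches when the whole
// prefix before it matches (up to hash collisions). A trailing partial block
// is ignored.
func BlockHashes(tokens []string, blockSize int) []uint64 {
	if blockSize <= 0 {
		return nil
	}
	var hashes []uint64
	var prev uint64
	for i := 0; i+blockSize <= len(tokens); i += blockSize {
		h := fnv.New64a()
		var buf [8]byte
		binary.LittleEndian.PutUint64(buf[:], prev)
		h.Write(buf[:])
		for _, t := range tokens[i : i+blockSize] {
			h.Write([]byte(t))
			h.Write([]byte{0}) // token separator
		}
		prev = h.Sum64()
		hashes = append(hashes, prev)
	}
	return hashes
}

// CacheAffinity returns the fraction of leading blocks the pod already holds.
func CacheAffinity(blocks []uint64, held map[uint64]bool) float64 {
	if len(blocks) == 0 {
		return 0
	}
	n := 0
	for _, b := range blocks {
		if !held[b] {
			break
		}
		n++
	}
	return float64(n) / float64(len(blocks))
}

// LeastLoad maps in-flight requests to a 0..1 score, where 1 means idle.
func LeastLoad(inFlight, capacity int) float64 {
	if capacity <= 0 || inFlight >= capacity {
		return 0
	}
	return 1 - float64(inFlight)/float64(capacity)
}

// CacheAwareScore blends reuse against capacity with explicit weights.
func CacheAwareScore(affinity, load, wCache, wLoad float64) float64 {
	return wCache*affinity + wLoad*load
}
```

Raising wCache favors pods that already hold the prompt's prefix, while raising wLoad spreads traffic away from busy pods even if they are warm.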

Similarly, combining cache-aware routing with sticky sessions can be achieved by combining existing preparers and scorers. Once you have five stages with clean interfaces, familiar routing behaviors decompose into stage-level plugins. And the plugins compose into new execution patterns.

Plugins need to talk to each other. TokenizePreparer produces a token array that BlockHashPreparer turns into block hashes, which CacheAffinityScorer later consumes.

How do they communicate without being coupled? The framework provides typed slots on the RoutingContext. A slot is a typed key with a name and a compile-time type. TokenizePreparer writes into the Tokens slot. BlockHashPreparer reads Tokens and writes BlockHashes. CacheAffinityScorer reads BlockHashes. None of the three plugins needs to reference another to complete these reads and writes.
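
One way to express typed slots in Go is with generics, as in the sketch below. The Slot, Put, and Get names, the pointer-identity keying, and the toy tokenizer and hash are assumptions for illustration; the article does not show the framework's real slot API.

```go
package main

// Slot is a typed key. Its identity is the pointer, not the name, so two
// slots that happen to share a name never collide. All names are hypothetical.
type Slot[T any] struct{ name string }

func NewSlot[T any](name string) *Slot[T] { return &Slot[T]{name: name} }

// Name is only for diagnostics such as error messages.
func (s *Slot[T]) Name() string { return s.name }

// RoutingContext stores per-request values keyed by slot.
type RoutingContext struct{ values map[any]any }

func NewRoutingContext() *RoutingContext {
	return &RoutingContext{values: map[any]any{}}
}

// Put stores v in the slot. Passing a value of the wrong type, for example
// Put(ctx, Tokens, "text"), does not compile.
func Put[T any](ctx *RoutingContext, s *Slot[T], v T) { ctx.values[s] = v }

// Get returns the slot's value and whether it was set.
func Get[T any](ctx *RoutingContext, s *Slot[T]) (T, bool) {
	v, ok := ctx.values[s]
	if !ok {
		var zero T
		return zero, false
	}
	t, ok := v.(T)
	return t, ok
}

// Slots shared by the plugins below.
var (
	Tokens      = NewSlot[[]int]("tokens")
	BlockHashes = NewSlot[[]uint64]("block_hashes")
)

// TokenizePreparer writes Tokens. Byte values stand in for real token IDs.
func TokenizePreparer(ctx *RoutingContext, prompt string) {
	toks := make([]int, 0, len(prompt))
	for _, b := range []byte(prompt) {
		toks = append(toks, int(b))
	}
	Put(ctx, Tokens, toks)
}

// BlockHashPreparer reads Tokens and writes BlockHashes, one toy rolling
// hash per four tokens. It never references TokenizePreparer.
func BlockHashPreparer(ctx *RoutingContext) {
	toks, ok := Get(ctx, Tokens)
	if !ok {
		return
	}
	var hashes []uint64
	var h uint64
	for i, t := range toks {
		h = h*31 + uint64(t)
		if (i+1)%4 == 0 {
			hashes = append(hashes, h)
		}
	}
	Put(ctx, BlockHashes, hashes)
}
```

Because the key is the slot value itself rather than a string, a second plugin that independently declares a slot named "tokens" gets its own storage instead of silently overwriting the first.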

There are two benefits of this indirection:

Compare this to EPP's approach. EPP shares per-request plugin state through a structure called CycleState, which is backed by a key-value store keyed by opaque strings with interface{}-typed values. Type-safe generic accessors were added on top, which catch type mismatches at the call site. However, the storage is still string-keyed at heart. Two plugins can independently choose the same key and conflict, and nothing in the system notices until something goes wrong in production. The Gateway API Inference Extension (GAIE) community is aware of this and is actively discussing improvements.

We can close that gap with typed slots resolved at build time.

When the framework composes a routing profile, it can check a set of invariants statically: every slot reader has a writer, every declared dependency appears in the composition, the dependency graph has no cycles, and no two plugins are mis-ordered or declare conflicting dependencies.

If any check fails, the service doesn't start. The operator sees a structured error at deployment time, not during a failed read in production. You can build new profiles with confidence that errors surface immediately.
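
A startup check of this kind can be sketched as a topological sort over declared slot reads and writes. The PluginSpec shape, the string slot names, the single-writer rule, and the error wording below are assumptions for illustration; plugin names are assumed to be unique.

```go
package main

import "fmt"

// PluginSpec is a hypothetical declaration of what a plugin reads and writes.
type PluginSpec struct {
	Name   string
	Reads  []string
	Writes []string
}

// Validate checks a composition and returns a valid execution order.
// It fails when a slot has two writers (an assumed rule), when a reader has
// no writer, or when the dependencies form a cycle.
func Validate(plugins []PluginSpec) ([]string, error) {
	writer := map[string]string{} // slot -> plugin that writes it
	for _, p := range plugins {
		for _, s := range p.Writes {
			if prev, dup := writer[s]; dup {
				return nil, fmt.Errorf("slot %q is written by both %s and %s", s, prev, p.Name)
			}
			writer[s] = p.Name
		}
	}

	indeg := map[string]int{}     // unmet dependencies per plugin
	next := map[string][]string{} // writer -> readers
	for _, p := range plugins {
		for _, s := range p.Reads {
			w, ok := writer[s]
			if !ok {
				return nil, fmt.Errorf("plugin %s reads slot %q but no plugin writes it", p.Name, s)
			}
			next[w] = append(next[w], p.Name)
			indeg[p.Name]++
		}
	}

	// Kahn's algorithm: repeatedly emit plugins whose dependencies are met.
	var queue, order []string
	for _, p := range plugins {
		if indeg[p.Name] == 0 {
			queue = append(queue, p.Name)
		}
	}
	for len(queue) > 0 {
		n := queue[0]
		queue = queue[1:]
		order = append(order, n)
		for _, m := range next[n] {
			indeg[m]--
			if indeg[m] == 0 {
				queue = append(queue, m)
			}
		}
	}
	if len(order) != len(plugins) {
		return nil, fmt.Errorf("dependency cycle among plugins")
	}
	return order, nil
}
```

A service would call a check like this once while loading its profiles and refuse to start on any error, so a misconfigured profile never reaches request handling.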

So far, we've described a single-dispatch pipeline: one pass through Prepare → Filter → Score → Pick → Execute, one pod selected, one HTTP call. Most routing looks like this, but disaggregated prefill/decode doesn't. Disaggregation means one client request is served by two pods. Prefill builds the KV cache, and decode generates tokens. Routing has to pick both, and the second pick depends on the first.

The framework handles this with a three-layer abstraction on top of the five-stage pipeline.

Selector: A Selector is one pass through Prepare → Filter → Score → Pick. It takes a set of candidates and returns a chosen candidate (or a ranked list).

Workflow: A Workflow composes one or more Selectors and produces the full RoutingPlan. A single-dispatch Workflow uses one Selector. A prefill/decode Workflow uses two. One for the prefill pod, one for the decode pod. The decode Selector's decisions can depend on what the prefill Selector chose, because they share a RoutingContext.

Executor: An Executor takes the RoutingPlan and dispatches it. A single-dispatch Executor does one HTTP proxy. A sequential Executor does the prefill-wait-then-decode pattern, with hooks for request mutation, body injection, and response streaming at each step.

The composition scales linearly. Single-dispatch uses one Selector, and disaggregated prefill/decode uses two. A hypothetical three-step workflow (route through a preprocessor, then prefill, then decode) would use three. The Workflow abstraction handles the composition of any number of required Selectors. Adding a new execution pattern requires a different Workflow and a different Executor, not a new HTTP handler.

The Selector, Workflow, and Executor split is most valuable in disaggregated prefill/decode, where a single request is served by two pods and needs additional coordination. Decode can't start until prefill has built the KV cache. Selection for the two pods can run concurrently when the two decisions are independent, but execution is always sequential: call the prefill pod, wait for it to finish, call the decode pod, then stream the response.

A disaggregated Workflow uses two Selectors sharing one RoutingContext. Sharing one context lets the second decision build on the first. The prefill Selector records the pod it picks, and the decode Selector reads that value so it can prefer a decode pod close to the prefill pod, minimizing how far the cache must travel. The sequential Executor then carries out the steps above and streams the response back to the client.
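
The shape of that flow can be sketched in a few dozen lines of Go. Everything here is hypothetical: the Pod fields, the zone-based locality bonus of 10, and the callback-based executor are assumptions chosen to keep the example small, and each Selector is collapsed into a single function instead of a full Prepare, Filter, Score, Pick pass.

```go
package main

import (
	"errors"
	"fmt"
)

type Pod struct {
	Name string
	Role string // "prefill" or "decode"
	Zone string
	Load int
}

// RoutingContext is shared by both Selectors in one Workflow.
type RoutingContext struct {
	PrefillZone string // written by the prefill Selector, read by decode
}

type Step struct {
	Role string
	Pod  string
}

// Selector stands for one Prepare → Filter → Score → Pick pass.
type Selector func(ctx *RoutingContext, pods []Pod) (Pod, bool)

// pickBest keeps pods of one role and returns the highest-scoring one.
func pickBest(pods []Pod, role string, score func(Pod) int) (Pod, bool) {
	var best Pod
	found := false
	for _, p := range pods {
		if p.Role != role {
			continue
		}
		if !found || score(p) > score(best) {
			best, found = p, true
		}
	}
	return best, found
}

// PrefillSelector prefers the least loaded prefill pod and records its zone.
func PrefillSelector(ctx *RoutingContext, pods []Pod) (Pod, bool) {
	p, ok := pickBest(pods, "prefill", func(p Pod) int { return -p.Load })
	if ok {
		ctx.PrefillZone = p.Zone
	}
	return p, ok
}

// DecodeSelector prefers decode pods in the prefill pod's zone.
func DecodeSelector(ctx *RoutingContext, pods []Pod) (Pod, bool) {
	return pickBest(pods, "decode", func(p Pod) int {
		s := -p.Load
		if p.Zone == ctx.PrefillZone {
			s += 10 // locality bonus; the weight is arbitrary
		}
		return s
	})
}

var ErrNoCandidate = errors.New("no candidate for role")

// Workflow composes two Selectors over one shared context.
type Workflow struct {
	Prefill, Decode Selector
}

// Plan runs the prefill Selector first so the decode Selector can read its choice.
func (w Workflow) Plan(pods []Pod) ([]Step, error) {
	ctx := &RoutingContext{}
	pre, ok := w.Prefill(ctx, pods)
	if !ok {
		return nil, fmt.Errorf("%w: prefill", ErrNoCandidate)
	}
	dec, ok := w.Decode(ctx, pods)
	if !ok {
		return nil, fmt.Errorf("%w: decode", ErrNoCandidate)
	}
	return []Step{{Role: "prefill", Pod: pre.Name}, {Role: "decode", Pod: dec.Name}}, nil
}

// SequentialExecute calls each step in order and stops at the first failure,
// so decode is never called when prefill fails.
func SequentialExecute(plan []Step, call func(Step) error) error {
	for _, s := range plan {
		if err := call(s); err != nil {
			return fmt.Errorf("%s on %s: %w", s.Role, s.Pod, err)
		}
	}
	return nil
}
```

In this sketch the decode pick deliberately runs after the prefill pick because it reads PrefillZone. A workflow whose decode scoring ignored the prefill choice could run the two Selectors concurrently instead.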

Each Selector runs the same Prepare, Filter, Score, and Pick stages as single-dispatch routing, using the same plugins and scoring.

Whether disaggregation helps depends on the workload. The Hao AI Lab at UC San Diego published an 18-month retrospective on where disaggregation helps and where it doesn't. The short version: disaggregation wins on long-context, cache-cold workloads and loses on short-context, high-hit-rate ones.

When disaggregation does make sense, the framework uses the same plugins, the same RoutingContext, and the same scoring logic as single-dispatch routing. No parallel system is required.

Adding a new routing behavior to this framework takes four steps.

Say we want a geographic proximity scorer that prefers pods in the same region as the client. We would have to implement the Scorer interface with a Score(ctx, candidates) []float64 method, register the plugin type with the framework's plugin registry, add the scorer to whichever profiles want it, and weight it against other scorers. Profile authors decide the weight, and the framework validates the composition at startup.
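
The four steps could look roughly like this in Go. The Score signature follows the one named in the text; the registry, the "geo-proximity" plugin name, the ScorerRef and Profile types, and the positive-weight rule are assumptions, since the article does not show the framework's registration or profile format.

```go
package main

import "fmt"

type RoutingContext struct{ ClientRegion string }

type Candidate struct {
	Name   string
	Region string
}

// Scorer mirrors the Score(ctx, candidates) []float64 method from the text.
type Scorer interface {
	Score(ctx *RoutingContext, candidates []Candidate) []float64
}

// Step 1: implement the Scorer interface.
type GeoProximityScorer struct{}

// Score gives 1 to candidates in the client's region and 0 to the rest.
func (GeoProximityScorer) Score(ctx *RoutingContext, candidates []Candidate) []float64 {
	scores := make([]float64, len(candidates))
	for i, c := range candidates {
		if c.Region == ctx.ClientRegion {
			scores[i] = 1
		}
	}
	return scores
}

// Step 2: register the plugin type under a name (hypothetical registry).
var registry = map[string]func() Scorer{}

func Register(name string, factory func() Scorer) { registry[name] = factory }

func init() {
	Register("geo-proximity", func() Scorer { return GeoProximityScorer{} })
}

// Steps 3 and 4: a profile lists scorers by name, each with a weight.
type ScorerRef struct {
	Plugin string
	Weight float64
}

type Profile struct {
	scorers []Scorer
	weights []float64
}

// NewProfile resolves plugin names at startup and rejects bad compositions.
func NewProfile(refs []ScorerRef) (*Profile, error) {
	p := &Profile{}
	for _, r := range refs {
		factory, ok := registry[r.Plugin]
		if !ok {
			return nil, fmt.Errorf("unknown scorer plugin %q", r.Plugin)
		}
		if r.Weight <= 0 {
			return nil, fmt.Errorf("scorer %q needs a positive weight", r.Plugin)
		}
		p.scorers = append(p.scorers, factory())
		p.weights = append(p.weights, r.Weight)
	}
	return p, nil
}

// Score returns the weighted sum of every scorer in the profile.
func (p *Profile) Score(ctx *RoutingContext, candidates []Candidate) []float64 {
	total := make([]float64, len(candidates))
	for i, s := range p.scorers {
		for j, v := range s.Score(ctx, candidates) {
			total[j] += p.weights[i] * v
		}
	}
	return total
}
```

A profile author would then add an entry such as {Plugin: "geo-proximity", Weight: 2} next to the existing scorers and tune the weight against them.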

Over these three posts, we demonstrated how Modular Cloud’s routing layer allows us to rapidly implement new routing optimizations. The data layer tracks which KV cache blocks live on which pods at microsecond query latency. The decision layer expresses routing behaviors as composable plugins rather than hard-coded algorithms. The execution layer coordinates multi-step flows when a single request touches multiple pods, as with disaggregated prefill/decode.

We plan to follow this series with a deep dive into inference scheduling. Routing decides which pod handles a request. Scheduling decides which requests run next, in what batch, against what cache state. That series will explore how we approach scheduling as a single system, allowing us to perform holistic optimizations for large-scale inference.

The routing layer described in this series is running in production on Modular Cloud today. If you're serving text, image, or video models and want to test our vertically integrated stack against your workloads, request access to Modular Cloud.
