{"slug": "modular-why-llm-inference-needs-a-new-kind-of-router-part-2", "title": "Modular: Why LLM Inference Needs a New Kind of Router - Part 2", "summary": "Modular has built a new data layer for LLM inference routing that solves the problem of querying cached blocks across hundreds of pods in microseconds. The company's architecture uses a specialized data structure with fast concurrent reads, batched writes, and idempotent event processing to replace traditional hashmap-and-mutex approaches that cannot handle over 2 million lookups per second. This infrastructure enables real-time patient conversations for Hippocratic AI by routing inference requests to pods with the best cached prefix on every request's critical path.", "body_md": "May 21, 2026\n\nAayush Deshpande\n\nDeep Dhillon\n\nAlexandr Nikitin\n\nMichael Dunn-OConnor\n\nEngineering\n\nIn Part 1, we argued that LLM routing is qualitatively different from HTTP routing. Inference backends hold state that traditional load balancers ignore. This post covers the first of the three layers we identified: the data layer that makes that state queryable on the hot path of every inference request.\n\nTo route a request to the pod with the best cached prefix, you need to know which blocks are cached on which pod. That sounds simple until you look at the numbers. You may have hundreds of pods, each with thousands of cached blocks. State can change hundreds of times per second. Across this complexity, queries need to return in microseconds because they sit on the critical path of every inference request.\n\nA hashmap and a mutex aren’t sufficient for the scale and velocity of inference routing. You need a data structure designed for this specific access pattern: a way to provide fast concurrent reads, batched writes, idempotent event processing, and efficient query responses.\n\nThis post walks through how we built that data structure.\n\nGiven a request tokenized into N block hashes (the query chain) and P live pods in the cluster, return a ranked list of pods ordered by how many of those N blocks each pod has cached.\n\nThe constraints:\n\nFor a concrete example: say we have 64 pods, 32 blocks per query, 1,000 requests per second. A naive approach (for each block, for each pod, check if the pod has it cached) is 32 × 64 = 2,048 lookups per request. That’s over 2 million per second in aggregate. Lock contention aside, this approach cannot meet the microsecond latency constraint. 
We need something different.\n\nThe data layer doesn't originate cache state. It consumes it.\n\nLLM engines emit block-level events as their cache contents change:\n\nHow those events reach the routing layer depends on the deployment. Some environments use a pub/sub fabric like NATS with JetStream. Others use direct gRPC streams from each engine instance to the router. Future deployments might use Kafka, Redis Streams, or something else entirely.\n\nThe data layer's job is to be indifferent to the transport. Adding a new event source means implementing one consumer interface. The indexer sees a stream of typed events and doesn't know or care what message infrastructure they came from.\n\nTwo requirements follow from this design.\n\nIdempotent operations. The same event might arrive twice. A replay of the last hour of block events on a consumer restart is normal. Registering a block that's already indexed is a no-op. Evicting a block that's already gone is a no-op. Shutting down a pod that's already dead is a no-op. These are valid states, not error cases.\n\nMulti-consumer safety. Two consumers (for example NATS and gRPC) might feed the same indexer simultaneously. They shouldn't conflict. This is guaranteed by idempotency: if both consumers see \"pod A registered block X\" and both write it, the result is the same as one consumer writing it once.\n\nThe central question the index answers is: which pods have this block?\n\nThe naive representation is a slice of pod identifiers per block: blockHash → [podA, podC, podF]. It works, but it's wasteful in ways that compound at scale. Every pod identifier is a string pointer. Every slice has its own backing array and header. Millions of small heap allocations put sustained pressure on Go's garbage collector. 
And checking \"does pod X have block Y?\" means a linear scan over the slice.\n\nThe question is a set-membership test across a bounded population of pods, which is most efficiently represented as a bitmap.\n\nA bitmap assigns each pod an index (0, 1, …, P−1) and represents \"the set of pods that have this block\" as a fixed-width bit vector. One bit per pod. Set operations (union, intersection, complement) become bitwise operations. Membership is a single bit test. Population count compiles down to one POPCNT instruction per 64-bit word on x86-64.\n\nWe cap the pod count at 256 per orchestrator instance as a deployment-level design choice. This fits naturally with a cell-based deployment architecture, where each routing instance owns a self-contained cell of pods and horizontal scale comes from running more cells rather than widening any one. The cap is larger than any single-cell deployment we have seen in production, and we could widen it later without redesigning the data structure. At 256 pods, the bitmap is 256 bits = 32 bytes = four uint64 words. The HostBitmap type is those four uint64 words laid out flat: a fixed-size array sized to the cap, with no pointers, heap allocation, or indirection.\n\nThe full index is map[blockHash] → HostBitmap. For every block the cluster has cached anywhere, the map stores which pods have it.\n\nA single map[blockHash] → HostBitmap has to sit behind a lock once writes are concurrent. Then every update (every Registered or Evicted event) serializes against every other update and every read. At hundreds of events per second across thousands of blocks, the lock becomes the bottleneck.\n\nThe fix is sharding. 
We split the index into 256 independent maps, each with its own lock. A block’s shard is chosen by hashing the block hash and taking the top bits. We picked 256 because it makes per-shard contention negligible at our target throughput while keeping memory overhead trivial. Writes to different shards proceed in parallel, as do reads. Reads and writes to the same shard still contend, but only for a single map operation, not a full scan.\n\nBlock hashes from cumulative hashing aren't uniformly random. They carry structure that reflects the input token distribution. Using the low bits of the hash directly for shard selection would cluster related blocks into the same shard, turning 256-way parallelism back into a handful of hot shards.\n\nWe run the block hash through a multiplicative hash with the golden-ratio constant (0x9E3779B97F4A7C15 for 64-bit), a technique known as Fibonacci hashing, before extracting the shard index from the top bits. The golden-ratio multiplier is 2^64 divided by φ (the golden ratio, ≈ 1.618). Even sequential inputs scatter across the output range with maximum spacing between consecutive images, a result that follows from the three-distance theorem.\n\nEach shard struct is padded so it starts on its own 64-byte cache-line boundary. Without padding, two adjacent shards might share a cache line. A write to one invalidates the line on every CPU core that has the other cached. This is false sharing and it quietly kills concurrency on workloads that look parallelism-friendly. The padding costs ~16 KB total (256 shards × 64 bytes) and eliminates the problem entirely.\n\nWith 256 shards and Fibonacci distribution, the probability that any given block lands in a specific shard is 1/256. At 1,000 QPS (queries per second) with 32 blocks per query, each shard sees roughly 125 lookups and a handful of writes per second. Contention is effectively minimized.\n\nStorage handles \"does pod X have block Y?\" in constant time. 
Routing needs to answer a harder question: for a query chain of N blocks, how many of them does each pod have?\n\nThe naive approach is the N × P loop. For each of the N blocks in the query chain, look up the bitmap in its shard, then check each of the P pods. At 64 pods and 32 blocks, that's 2,048 shard lookups per request. At 1,000 QPS this would require over 2 million lookups per second, making it impossible to meet the microsecond query latency target.\n\nCumulative block hashing is the solution. Block hashes in our system aren't independent. Each block's hash is a function of its own tokens and the hash of the previous block. Block 4's hash incorporates everything from blocks 0 through 3.\n\nThe consequence: if a pod has block 5 cached with a particular hash, it must have computed blocks 0 through 4 with exactly the same chain of hashes. Cache hits form a contiguous prefix. There's no \"I have block 5 but not block 3.\" If the block-5 hash matches, the prefix up to block 4 matched too.\n\nThis turns the query from an N × P scan into a binary search for each pod's drop-off point: the largest depth d such that the pod has the first d blocks of the query chain cached. Because the prefix is contiguous, binary search applies.\n\nHere’s an example of binary search prefix matching using four pods and eight blocks.\n\n```\nQuery chain: [H0, H1, H2, H3, H4, H5, H6, H7]\n\nPod A has cached: H0–H5    (prefix depth 6)\nPod B has cached: H0–H3    (prefix depth 4)\nPod C has cached: H0–H7    (prefix depth 8, full chain)\nPod D has cached: H0–H1    (prefix depth 2)\n```\n\nStep 1. Check block H3 (midpoint). Look up the shard for H3, intersect the bitmap with the alive-pods bitmap.\n\nResult: {A, B, C}. Pod D dropped off before H3.\n\nStep 2. For the survivors {A, B, C}, check H5 (midpoint of upper half).\n\nResult: {A, C}. Pod B dropped off between H3 and H5.\n\nStep 3. Check H6.\n\nResult: {C}. Pod A dropped off between H5 and H6.\n\nStep 4. Check H7.\n\nResult: {C}. 
Pod C has the full chain.\n\nFor pods that dropped off (A, B, D), we binary-search their exact drop-off depths. Total shard lookups in this example: roughly 10. The naive P × N scan would take 32.\n\nThe complexity is O(K × log N), where K is the number of distinct drop-off depths across the surviving pod set. If all pods have the same prefix depth, K = 1 and the search finishes in log N lookups. If every pod has a different depth, K = P.\n\nIn practice, pods in the same deployment tend to have similar cache states (they've all handled prior requests from the same distribution) so K is small.\n\nThe binary search cost scales with log N and K, neither of which is large. It does not scale with P. Doubling the pod count doesn't change the number of shard lookups. It changes the population count on the bitmaps, which is a constant-time operation per word.\n\nTwo different problems in this data layer both involve hashing, and it's worth being precise about what each one solves.\n\nCumulative chaining solves a correctness problem. Two different requests might happen to share the same tokens in block 3 while differing earlier in the sequence. If blocks were identified only by a local hash of their own tokens, those requests would produce the same H3 and the router would treat their block-3 KV cache entries as interchangeable. They aren't: the attention values depend on the full context, not just the local tokens. Sharing cache across mismatched prefixes produces incorrect model output.\n\nCumulative chaining ensures that every block hash encodes the full token sequence up to that point, not just the tokens in that block. Two requests that happen to share the same tokens in block 3 but differ earlier in the sequence will produce different H3 values. The KV cache entries are distinct, so the router treats them as distinct — and the model sees the right context.\n\nBitmap lookup solves an availability problem. 
Given a hash chain that's correct, which of those hashes is actually present on which pods right now? Blocks get evicted. Pods crash. The hash chain describes what a pod's cache would need to contain to serve a request; the bitmap records what it actually contains. Routing needs both: chaining to ask the right question, and bitmaps to answer it accurately against live cluster state.\n\nAs deployments add CPU-memory and SSD offload tiers (covered in our Five Eras of KV Cache post), the question expands from \"does this pod have the block?\" to \"where on this pod does it live?\" GPU memory is strictly faster than CPU memory, which is strictly faster than SSD. The router should prefer pods that have the block in the fastest accessible tier.\n\nThe tempting approach is to extend the HostBitmap to be per-tier: three bitmaps per block instead of one. That works, but it bloats the hot path. Most routing decisions don't need per-tier information. They just need \"who has this block?\" Paying the extra cost on every query is wasteful.\n\nInstead, we split the index into two layers.\n\nRouting shards. The 256-shard HostBitmap index described above. Answers \"who has this block?\" on the hot path. This is queried on every request.\n\nHost tier maps. A separate per-pod data structure that answers \"where on this pod does this block live?\" The tier map is consulted only after a winning pod is selected, to produce the cache hint injected into the downstream request.\n\nThe split follows naturally from the access pattern. Routing decisions query the hot path on every request. Tier information is pulled only for the winner. The two data structures can be sized and tuned independently.\n\nPods come and go. When a pod's events stop arriving (whether it crashed, was evicted, or is draining) the indexer needs to stop routing to it immediately. 
When a new pod takes its slot, routing needs to pick it up.\n\nThe naive approach is to walk every shard and clear the dead pod's bit from every bitmap. For tens of thousands of blocks, this holds shard locks for a long time and blocks routing queries.\n\nWe use two-phase removal instead.\n\nPhase 1: instant liveness update. Every pod has a bit in a global alive bitmap, separate from the per-block shards. When a pod is declared dead, we compare-and-swap (CAS) its bit to 0 in the alive bitmap. Every subsequent routing query intersects candidate bitmaps with the alive bitmap, so the dead pod is instantly excluded without touching a single shard.\n\nPhase 2: bounded-concurrency cleanup. The per-block shard entries for the dead pod are still there, masked out by the alive bitmap. A background goroutine walks the shards with a bounded concurrency limit and clears the dead pod's bits. When cleanup finishes, two things are reclaimed: the dead pod's per-tier maps of block → location (offset, length, device), and routing-shard entries for blocks that the dead pod was the last live holder of. Without that second reclamation, block hashes that no surviving pod holds would accumulate in the index as the cluster churns. The result of the two phases is that the hot path stays fast: cleanup happens at the indexer's own pace without blocking routing.\n\nThe data layer in Modular Cloud's routing system:\n\nThe result is a data structure that handles tens of thousands of queries per second on the hot path of an inference orchestrator while concurrent event streams update it in real time.\n\nCache-aware routing is only as fast as the data layer underneath it. If answering \"which pods have these blocks?\" takes milliseconds, you've spent the latency budget before any scoring logic runs. 
The architecture here (sharded bitmaps, Fibonacci-distributed for uniformity, queried via binary search over cumulative block hashes, with two-phase host lifecycle for pod churn) keeps that answer in the microsecond range.\n\nThe data layer answers one question: who has the cache? Cache affinity is one input into a routing decision. Load, session state, tenant priority, hardware role, node locality for KV cache transfer: these all come into play, and different deployments weight them differently. The data layer gives you the facts. The decision layer tells you what to do with them.\n\nPart 3 covers the decision and execution layers: turning cache state into routing decisions, then into execution. A five-stage composable pipeline, typed state between plugins, and the Selector/Workflow/Executor split that scales the same framework from round-robin to disaggregated prefill/decode.", "url": "https://wpnews.pro/news/modular-why-llm-inference-needs-a-new-kind-of-router-part-2", "canonical_source": "https://www.modular.com/blog/why-llm-inference-needs-a-new-kind-of-router-part-2", "published_at": "2026-05-21 00:00:00+00:00", "updated_at": "2026-05-29 23:57:33.102000+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-products", "generative-ai", "ai-research"], "entities": ["Modular", "Hippocratic AI", "DeepSeek", "FLUX", "Kimi", "MiniMax", "Wan", "Aayush Deshpande"], "alternates": {"html": "https://wpnews.pro/news/modular-why-llm-inference-needs-a-new-kind-of-router-part-2", "markdown": "https://wpnews.pro/news/modular-why-llm-inference-needs-a-new-kind-of-router-part-2.md", "text": "https://wpnews.pro/news/modular-why-llm-inference-needs-a-new-kind-of-router-part-2.txt", "jsonld": "https://wpnews.pro/news/modular-why-llm-inference-needs-a-new-kind-of-router-part-2.jsonld"}}