{"slug": "anatomy-of-a-high-performance-ep-kernel", "title": "Anatomy of a high-performance EP kernel", "summary": "A high-performance Expert Parallelism (EP) kernel is essential for running large Mixture-of-Experts (MoE) language models across multiple GPUs, as it handles the dynamic routing of tokens to experts located on different nodes. Unlike other parallelism methods with fixed communication patterns, EP must move activation data between GPUs in real-time based on the router's decisions, then return the results. DeepSeek's DeepEP library set the modern standard for these dispatch and combine kernels, enabling efficient large-scale serving by managing both high-throughput and low-latency data transfers across NVLink and RDMA connections.", "body_md": "# Anatomy of a high-performance EP kernel\n\n*El mundo físico*(1882), via\n\n[Wikimedia Commons](https://commons.wikimedia.org/wiki/File:El_mundo_f%C3%ADsico,_1882_%22Estaci%C3%B3n_telef%C3%B3nica_central_en_Paris%22._%284074931516%29.jpg).\n\nLarge language models are large. Because they’re large, we need lots of GPUs to run them. It would be nice if LLM inference were ‘embarrassingly parallel’ and we could just always compute independent things on different GPUs. But alas, to use lots of GPUs on LLM inference, we need to get those GPUs talking to one another.\n\nThere are lots of different ways to get different GPUs working together: Tensor\nParallelism, Pipeline Parallelism, Context Parallelism, Expert Parallelism,\netc. All have their place. 
But for MoE models, in the MoE layers, when you want\nto serve at large scale, ‘wide Expert Parallelism’ (wideEP) is king. (See vLLM’s original [DeepSeek large-scale serving\npost](https://vllm.ai/blog/2025-12-17-large-scale-serving) for a demonstration\nat production scale: DeepSeek at 2.2k tokens/s per GPU on an H200 cluster,\nserved with wideEP and data parallel attention.)\n\nThe other kinds of parallelism all require communication between GPUs, but their patterns are fixed by the architecture: who sends, who receives, and how much are all known before the forward pass begins, and are the same on every step. The comms can run as standard collectives.\n\nExpert parallelism is different. Which tokens need to reach which GPUs is\ndecided by the router, from the data, at runtime, fresh in every MoE layer.\nAnd the tokens have somewhere to be reached *from*: we’ll assume the ‘data\nparallel attention’ arrangement DeepSeek serves with, where each token lives\non exactly one rank (a rank being one GPU somewhere in our cluster). The\nexperts are spread across those same ranks, so a token and the experts it’s\nrouted to will generally not be in the same place. Here’s an example, with 8\nGPUs split across 2 nodes, two experts per GPU, 1 token per rank, and 2 routed\nexperts per token:\n\n*Interactive figure: each rank’s token makes a round trip to its two routed experts.* Four of the sixteen experts drew no tokens at all this step: the routing is lumpy.\n\nWhen it comes time to run our MoE layers, our tokens have to go and meet their experts, wherever they might be in the network fabric. It’s the job of the EP communication kernel to make that happen.\n\nThe modern shape of these kernels was set by DeepSeek’s\n[DeepEP](https://github.com/deepseek-ai/DeepEP) library. 
In this post we’ll\nbuild up the anatomy of a DeepEP-style dispatch and combine kernel: the\nhigh-throughput shape first, then the low-latency one.\n\n## The job we have to do\n\nLet’s make the setup concrete. We have 8 GPUs, split across 2 nodes, connected with RDMA, and each data parallel rank owns a single GPU. Attention runs on each GPU over a batch of $T_r$ tokens, where $T_r$ can vary between GPUs. We’re doing expert parallel with $E = 16$ experts, two per GPU, of which $k = 2$ are routed for each token.\n\nAt each rank $r$, at the entrance to the EP layer, we have a tensor of shape $[T_r, H]$, where $H$ is the hidden size. The routing layer will run locally, and give us expert assignments for each token. We’re routing 2-out-of-16: for each token, the router gives us a set of logits of length $E$ (i.e. a tensor of shape $[T_r, E]$), from which we’ll take the indices of the top 2, to get a tensor of shape $[T_r, 2]$. For example, if token $t$ is routed to experts $a$ and $b$, then row $t$ will be $[a, b]$.\n\nSo at the entrance to the EP layer each rank holds two things: the activation rows it produced, and, after the local routing pass, the top-2 expert assignment for each of those rows.\n\nNot all of the experts live locally. Some live next door, on neighbouring NVLink peers, and some live far away, on nodes reachable only over RDMA. The goal of the expert parallelism kernels is to get the activations where they need to go, run the expert GEMMs when they get there, and then bring them back home.\n\nWe’re doing communications here, and with communications it’s handy to specialise on what we care about most: throughput, or latency. The split maps onto the two phases of inference: prefill brings big, compute-bound batches with plenty of other work to hide communication behind, while at decode there is hardly anything else to do, so the transfer itself is what we wait on. 
We’ll start with the throughput-optimised standard EP shape, then discuss the latency-optimised shape.\n\n## High throughput: ask, then send\n\n### Dispatch\n\nThe point of dispatch is to feed a grouped GEMM. (A grouped GEMM is one kernel launch running a separate matrix multiply per expert, each over a different number of rows. Routing produces exactly this raggedness: each expert’s group is however many tokens it drew.) After routing, every expert has to run a matrix multiply over exactly the tokens assigned to it. Before routing, those tokens are scattered across every rank in the cluster.\n\nSo on each rank, dispatch has to gather the tokens bound for the local experts into a single dense buffer that a grouped GEMM can consume in one go.\n\nThe difficulty is that we don’t know the shape of that buffer ahead of time. Which token goes to which expert is decided by the router at runtime, and the distribution is lumpy and changes every step. An expert might draw a thousand tokens now and none next time. We can’t know how many our local experts will receive until every other rank has run its router. So neither the size of the local activation buffer nor the slot in that buffer into which any particular token lands is known in advance.\n\nThere are two ways to cope with not knowing the size.\n\n- We can reserve enough room for the worst case, by allocating a fixed rectangle with a padded slot per expert.\n- We can find out the real counts first and then allocate exactly what we need.\n\nThe fixed rectangle is simpler, but it has to be sized for the worst case.\nThe worst case doesn’t scale with the tokens you actually receive: all\nthe peers might route their whole batches at the same expert, so every padded\nslot has to be big enough for everyone at once. At prefill batch sizes that means far\nmore HBM holding emptiness than data, and spare HBM is exactly the resource\nwe want back, because it becomes KV cache, which is what keeps sequences in\nflight. 
The price of exactness, meanwhile, is affordable here: when batches\nare large and the GEMMs are compute bound, the extra communication it takes\nto learn the real counts can hide behind compute. (Serving stacks manufacture room to hide it in: with two-batch\noverlap, the step is split into two microbatches, so that while one\nmicrobatch’s tokens are on the wire, the other’s GEMMs keep the SMs busy. See\n[SGLang’s large-scale EP\nwriteup](https://lmsys.org/blog/2025-05-05-large-scale-ep/).) So the throughput path\nallocates a ragged buffer, sized to the tokens\nwe’ll actually receive. The fixed rectangle is the low-latency story, which\nwe’ll come to.\n\nIf we want to allocate only what we need, we have to learn the counts before\nany activations move. We can do so by running a *coordination pass*. Every rank\nalready knows from its own routing how many tokens it’s sending to each peer.\nIf everyone trades those numbers, each rank can add up how much it’s about\nto receive. (The exchange mirrors the fabric: counts cross between nodes over RDMA, then\nbetween GPUs within a node over NVLink, gathering as they go, so the coordination\ncosts the same two hops the real dispatch will.)\n\nThe coordination pass is cheap in bytes, only a handful of integers per peer rather than megabytes of hidden state. Once a rank knows how many tokens are coming from each source, a write-safe layout of the buffer comes naturally as a prefix sum: the first source’s tokens start at zero, the next source’s start where those end, and so on. The counts hand us both of the things we were missing: how big to make the buffer, and where every block sits inside it.\n\nWith the layout fixed, we can actually send the activations. The sender never writes to the final buffer directly. It couldn’t if it wanted to: RDMA writes can only land in memory that was registered with the NIC ahead of time, and the compact buffer is allocated fresh each step, at a size we only just learned. 
So the sender streams its tokens into a small fixed-size queue on the destination, carved out of pre-registered memory, and the receiver, which owns the compact buffer, drains that queue and copies each token into the slot the prefix sum assigned it. The queue also lets the two sides run at their own pace, with its depth bounding how far ahead the sender can get before it has to wait for the receiver to catch up. The queue is fixed-size too, but fixed at a constant.\n\nFor the queues that cross nodes there’s one more hop hiding in the picture. A token never travels point-to-point to an arbitrary remote GPU: it goes over RDMA to the GPU with the same index on the destination node, and that GPU forwards it over NVLink to its final host. Each GPU then only ever talks to its own counterparts across nodes, which keeps every RDMA flow on its own rail of the fabric and caps the number of connections each NIC has to keep fed.\n\nWhat lands in the compact buffer is grouped by where it came from, not by which expert it’s for: the transfer is coarser than the routing. (A token is sent to a peer once if any of that peer’s experts want it, even if two do, with the per-token expert assignment carried alongside. That is what makes the transfer coarser than the routing.) The grouped GEMM wants contiguous per-expert blocks, so the last step of dispatch is a local permute, from by-source order into by-expert order. In DeepEP this last step is the caller’s: dispatch hands back the by-source buffer along with per-expert counts, and the serving framework does the reordering, or feeds the indices straight to a GEMM that can consume them.\n\n### Combine\n\nThe point of combine is to un-run the dispatch kernel, and add up the contributions for each token.\n\nThe GEMM left its outputs grouped by expert, so the first thing we have to do is to undo the permutation we did on the way in. 
The inverse permutation puts the outputs back into the by-source-rank layout that dispatch delivered tokens in.\n\nFrom there the transport runs in reverse. The rank that hosted the expert is now the sender, streaming its outputs back through the home rank’s per-peer queues into the positions the tokens came from.\n\nWe don’t need to do the coordination pass, since combine is handed the same routing information dispatch produced, so it already knows where everything needs to return to.\n\nEach token was routed to two experts, and those experts can sit on different ranks, so several partial outputs converge on the token’s home rank. There they are summed, weighted by the router’s gate weights, into the single vector that is the layer’s output for that token. (The transport itself just adds the returning contributions together. The gate weights are applied separately, either folded into the activations before the expert GEMM or in the reduction step, so the kernel moving bytes doesn’t have to know about them.)\n\nSo combine mirrors dispatch at every step. The permute becomes an unpermute, the coordination pass is replaced by reusing dispatch’s routing, and where dispatch compacted arriving tokens into contiguous positions, combine sums them into slots.\n\n## Low latency: send without asking\n\nThe reason to optimise the kernel for latency is the decode regime. Each rank holds only a handful of tokens, often one per sequence in the batch. The coordination pass was cheap in bytes, but it’s a full network round trip with barriers, and it has to finish before any activations move. At decode there is little to overlap it with, and it becomes a large fraction of the layer. (A second important penalty: dynamic shapes like the ones in the high-throughput kernels are tough to push into CUDA graphs, which are more important during decode.)\n\nSo we want to figure out how to skip it. 
The coordination pass only ever existed to turn counts into write offsets, and that was only needed because the compact buffer made each sender’s offset depend on what every other sender did. If we’re willing to give up compactness, we can prearrange space for each peer rank to write into. Instead of one packed buffer we pre-reserve a fixed, private region for every (source rank, expert) pair. This is the fixed rectangle from the dispatch fork, with one refinement: the padding is per (source rank, expert) rather than per expert, so no two senders ever write into the same region.\n\nThe address a sender writes to is now a formula, the region for its (source, expert) plus a local slot, and every rank can compute it alone:\n\n$$\\text{offset}(e, s, i) = (e \\cdot R + s) \\cdot C + i$$\n\nwith $e$ the local expert, $s$ the source rank, $R$ the number of ranks, $C$ the per-region cap, and $i$ the local slot within the region. The dynamic prefix sum over real counts becomes a static stride times a fixed maximum, and the first thing that happens in the layer is the data send itself.\n\nSince the transfer is now the thing we wait on, the bytes are the latency, and DeepEP’s low-latency dispatch shrinks them by quantising the payload to FP8 on the wire by default. The return path stays in BF16: combine’s sums are where precision matters.\n\nThe catch is that each private region has to be big enough for whatever its source might send, sized for the worst case rather than the actual count. Left unbounded that worst case is enormous, so we need to cap how many tokens a rank may dispatch in one call to a fixed chunk size, and microbatch anything larger. (Without a cap, each region would have to hold the most tokens any rank could ever present, and there is one region per source rank, so the receive buffer would grow with the total number of tokens in flight across the system rather than with a fixed budget. The cap replaces that with a constant.) With the cap each region is one chunk tall, and a receiver’s buffer is $E_{\\text{local}} \\cdot R \\cdot C$ rows, with $E_{\\text{local}}$ the number of experts the rank hosts. 
Each slab is sized for a source dumping its whole chunk into one expert, but routing spreads those tokens across all the experts, so the slabs sit mostly empty.\n\nMostly empty means the receiver cannot just hand the buffer to the GEMM, because most of it is uninitialised. It needs to know how many rows each source actually wrote, and it needs to learn that without reintroducing the round trip we just removed. So when a sender finishes filling its region, it writes one more value, into a fixed slot on the receiver: the count of tokens it put there. The count does double duty: the slot starts empty, so its arrival is also the signal that the region’s data has landed. (Two things make this safe. The count is written in a form the receiver can tell apart from the empty initial value, so even a source that sends zero tokens produces a signal distinguishable from one that simply has not arrived yet. And the count is ordered after the data it vouches for: it’s issued on the same ordered channel (the same RDMA queue pair, or behind a memory fence over NVLink), so by the time the count shows up, the data is already in place.) The receiver watches the counts and learns the valid range of every region as it fills; the grouped GEMM masks the padded rows.\n\nCombine works the same way. It never needed a coordination pass, but the throughput path still staged its returns through queues; here even those go away. Dispatch delivered each token tagged with where it came from, so the expert’s host can compute a return address directly: a private slot on the token’s home rank, indexed by the token’s position there and which of its experts this one was. The sender writes the output into that slot, raises the same kind of flag, and the home rank does the weighted sum once all contributions have landed.\n\nIn sum: low latency inverts the throughput tradeoff. 
Throughput spends a round trip to keep memory tight; low latency spends memory, in the form of mostly-empty worst-case buffers, to remove the round trip.\n\n## Coming home\n\nThat’s the story. If you open\n[DeepEP](https://github.com/deepseek-ai/DeepEP)’s codebase you should\nrecognise the shape, though you’ll now find these kernels under `legacy/`:\nthe recent V2 rewrite rebuilds the library on top of NCCL’s new device-side\ncommunication API. What the anatomy looks like after that rewrite is worth a\npost of its own.\n\nDeepEP itself is built for NVIDIA’s stack: Hopper and Blackwell GPUs, NVLink\nwithin the node, InfiniBand-class RDMA between nodes. The\n[UCCL](https://github.com/uccl-project/uccl) project reimplements the same\nprimitives for much more hardware: AMD as well as NVIDIA GPUs, and any RDMA NIC\n(AWS’s EFA, Broadcom’s), at comparable performance.\n\nThere’s a growing stack of optimisations on top. Expert load balancing\n([EPLB](https://github.com/deepseek-ai/EPLB)) computes replication and\nplacement plans for hot experts, which serving systems apply periodically:\nplacement is only an indirection the kernel consults, so it’s free to\nchange. [vLLM’s elastic\nEP](https://vllm.ai/blog/2026-05-14-elastic-expert-parallelism) grows and\nshrinks the deployment: the world size only enters through the number of ranks $R$, in the\ncounts and the regions, so ranks can join and leave. 
And the routing\nstatistics are observable at exactly this layer: at Doubleword [we\nfound](https://blog.doubleword.ai/moe-expert-coactivations) that similar\nrequests co-activate similar experts, so you can use EPLB to gather\nco-activated experts into domains with good networking, and steer the\nrequests that light them up to those domains.\n\nWork is ongoing to fuse these kinds of comms primitives into the compute kernels\nthemselves ([mKernel](https://uccl-project.github.io/posts/mkernel/),\n[ParallelKittens](https://hazyresearch.stanford.edu/blog/2025-11-17-pk)), so we\ncan do fine-grained overlap and better pipelining: one SM can be receiving data\nwhile the GEMM tiles start firing off of it.\n\nHowever those boundaries move, the job stays the one we started with: the tokens have to go and meet their experts, and then come home.\n\nLast modified: 12 Jun 2026", "url": "https://wpnews.pro/news/anatomy-of-a-high-performance-ep-kernel", "canonical_source": "https://fergusfinn.com/blog/anatomy-of-a-high-performance-ep-kernel/", "published_at": "2026-06-10 00:00:00+00:00", "updated_at": "2026-06-12 13:14:42.403509+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "machine-learning", "ai-infrastructure", "ai-research"], "entities": ["vLLM", "DeepSeek", "H200", "NVIDIA", "MoE"], "alternates": {"html": "https://wpnews.pro/news/anatomy-of-a-high-performance-ep-kernel", "markdown": "https://wpnews.pro/news/anatomy-of-a-high-performance-ep-kernel.md", "text": "https://wpnews.pro/news/anatomy-of-a-high-performance-ep-kernel.txt", "jsonld": "https://wpnews.pro/news/anatomy-of-a-high-performance-ep-kernel.jsonld"}}