Mixture of Experts (MoE): what it actually does under the hood, and when it pays off

A developer explains the sparse Mixture of Experts (MoE) architecture used in models like Mixtral, DeepSeek-MoE, and Grok-1, detailing how the router selects which experts to activate per token and why load-balancing is the hardest training challenge. The post clarifies that MoE offers better compute efficiency than a dense model of equivalent total parameters, but still requires memory proportional to the total parameter count, debunking the '70B performance at 7B compute' claim. The analysis includes a comparison table showing that Mixtral 8x7B, with 45B total parameters and ~12.9B active per token, needs ~90 GB of memory for its FP16 weights, making it unsuitable for single consumer GPUs.

You deployed a 7B model in production. Response times are fine (45 ms per token), but you want to scale to a 70B without buying four more GPUs. Someone mentions MoE: "70B performance at 7B compute." It sounds like a free lunch. So you look at the Mixtral 8x7B paper, you see 45 billion parameters and a claim that each token only activates about 13 billion of them, and you wonder: how is that physically possible, and what is the catch?

This post explains the sparse MoE architecture that powers Mixtral, DeepSeek-MoE, Qwen2.5-MoE, DBRX, and Grok-1: what the router actually does, why load-balancing is the hardest problem in training them, and the three specific constraints that determine whether MoE is the right choice for your deployment.

A dense transformer like Llama 3.2 activates 100 percent of its parameters for every token. The FFN layer in each transformer block runs the same matrix multiplication for every input. This makes memory use predictable and throughput easy to model, but it also means that scaling from 7B to 70B multiplies both memory and compute by 10x.

MoE decouples the two. The model stores more parameters (more memory), but each token only uses a fraction of them (less compute). Here is the core trade-off expressed in numbers:

| Metric | Dense 7B | Dense 70B | MoE 45B (Mixtral) |
|---|---|---|---|
| Total parameters | 7B | 70B | 45B (8 experts) |
| Active per token | 7B | 70B | ~12.9B (2 experts) |
| Compute per token | 7B-equiv | 70B-equiv | ~13B-equiv |
| Memory (weights, FP16) | ~14 GB | ~140 GB | ~90 GB |
| Throughput (tokens/s) | high | low | medium-high |

The headline is this: MoE gives you better compute efficiency than a dense 70B, but you still pay the memory cost of a much larger model. You cannot run Mixtral on a single consumer GPU. At FP16, ~90 GB of weights do not fit on two 24 GB cards (48 GB combined); that setup only works with roughly 4-bit quantization, and unquantized serving needs something like two 80 GB cards. The computational savings only show up once the model is already loaded. That is the catch that the "70B performance at 7B compute" tagline often omits.

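
The arithmetic behind the table can be sketched in a few lines. This is a rough estimate, not a sizing tool: it assumes FP16 weights (2 bytes per parameter), ignores KV cache and activations, and uses the approximate parameter counts quoted above. The function names are illustrative.

```python
def weight_memory_gb(total_params_billions: float, bytes_per_param: int = 2) -> float:
    """Weight memory in GB (10^9 bytes). FP16/BF16 uses 2 bytes per parameter."""
    return total_params_billions * bytes_per_param


def active_fraction(active_params_billions: float, total_params_billions: float) -> float:
    """Share of the stored parameters that a single token actually uses."""
    return active_params_billions / total_params_billions


# (total parameters, active parameters per token), both in billions
models = {
    "Dense 7B": (7.0, 7.0),
    "Dense 70B": (70.0, 70.0),
    "Mixtral 8x7B": (45.0, 12.9),
}

for name, (total, active) in models.items():
    memory = weight_memory_gb(total)
    share = active_fraction(active, total)
    print(f"{name}: ~{memory:.0f} GB of weights, {share:.0%} of parameters active per token")
```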
In a standard transformer, every layer has an FFN block (two linear projections with an activation in between). In a sparse MoE transformer, each FFN is replaced by multiple parallel "expert" FFNs plus a learned router that picks which experts to use for each token.

Here is the data flow for a single token passing through one MoE layer:

```mermaid
flowchart LR
    A["Input token<br/>hidden states"] --> B["Router / Gate<br/>learned linear layer"]
    B --> C{"Softmax over<br/>N experts"}
    C --> D["Select top-k<br/>experts"]
    D --> E1["Expert 1<br/>FFN"]
    D --> E2["Expert 2<br/>FFN"]
    D --> E3["...<br/>idle"]
    D --> E4["Expert N<br/>idle"]
    E1 --> F["Weighted sum<br/>by router scores"]
    E2 --> F
    F --> G["Output token<br/>hidden states"]
```

The router is a small learned linear layer that takes the token's hidden state and outputs a score for each expert. You take the softmax over all experts, pick the k with the highest scores, run the token through only those experts, and combine the results weighted by the router scores. For Mixtral, k=2 out of 8 experts. For DeepSeek-MoE, k=6 out of 64 routed experts. The router itself adds negligible compute: a single matrix multiply of size (hidden_dim, n_experts).

A common mental model is that the router is a load balancer that assigns tokens to experts, similar to how a distributed scheduler assigns work to machines. This is misleading. The router is a learned differentiable gate trained end-to-end with the rest of the model through backpropagation. It learns which experts specialize in which types of patterns (subject-matter expertise, syntactic structures, token positions) without any explicit supervision.

When you inspect the routed outputs after training, individual experts do develop preferences. One expert may handle arithmetic-heavy tokens disproportionately often, another function words and punctuation, a third code syntax. (The Mixtral paper's own routing analysis found these preferences to be more syntactic than topical.) Either way, the specializations are soft, not hard: there is no constraint that says "expert 3 is the math expert."
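
The routing step described above can be sketched for a single token in plain Python. This is a toy illustration, not any model's actual implementation: `route_token`, the hand-picked router weights, and the scaling "experts" are all hypothetical. Renormalizing the gate weights over the selected experts is an assumption here; it is equivalent to taking the softmax over only the top-k logits.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(hidden, router_weights, experts, k=2):
    """Send one token through its top-k experts and mix the results.

    hidden:         list of floats, the token's hidden state
    router_weights: one row of floats per expert (the router's linear layer)
    experts:        one callable per expert, mapping hidden -> output list
    """
    # Router: one dot product per expert gives a score (logit) per expert.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]
    probs = softmax(logits)

    # Keep only the k experts with the highest router probability.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    # Assumption: renormalize the gate weights so they sum to 1 over the chosen experts.
    norm = sum(probs[i] for i in chosen)

    output = [0.0] * len(hidden)
    for i in chosen:
        gate = probs[i] / norm
        expert_out = experts[i](hidden)  # only the chosen experts are ever run
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, chosen


# Toy setup: 4 experts, where expert i simply scales its input by (i + 1).
toy_experts = [lambda h, s=i + 1: [s * x for x in h] for i in range(4)]
toy_router = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]

output, chosen = route_token([1.0, 0.0], toy_router, toy_experts, k=2)
print("chosen experts:", chosen)
print("output:", output)
```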
The router simply learns the assignment that minimizes the loss.

The hardest part of MoE training is preventing the router from sending every token to the same two experts. If there is no corrective signal, the router quickly collapses: it sends everything to the experts that happen to initialize well, those experts get more gradient updates, they get better, the router sends even more traffic their way, and the unused experts atrophy.

The standard fix is an auxiliary load-balancing loss added to the total training loss. The most common formulation (used in Mixtral, GShard, and ST-MoE) penalizes the router for imbalance:

```python
import torch


# Simplified load-balancing loss following the Switch Transformer formulation
def load_balancing_loss(router_logits, num_experts, num_tokens, top_k=2):
    """
    router_logits: [num_tokens, num_experts], raw router scores before softmax
    """
    router_probs = torch.softmax(router_logits, dim=-1)  # [tokens, experts]
    fraction_per_expert = router_probs.mean(dim=0)  # [experts], avg probability per expert

    # Fraction of routed token slots assigned to each expert
    _, selected_experts = router_probs.topk(k=top_k, dim=-1)
    tokens_per_expert = torch.zeros(num_experts, device=router_logits.device)
    tokens_per_expert.scatter_add_(
        0,
        selected_experts.flatten(),
        torch.ones(num_tokens * top_k, device=router_logits.device),
    )
    load_per_expert = tokens_per_expert / (num_tokens * top_k)  # [experts], normalized token count

    # Auxiliary loss: scaled dot product of average probability and load.
    # Equals 1.0 under perfectly uniform routing and rises as probability
    # and load concentrate on the same experts.
    aux_loss = num_experts * (fraction_per_expert * load_per_expert).sum()
    return aux_loss
```

The `num_experts` multiplier keeps the loss on the same scale regardless of expert count: perfectly uniform routing always gives 1.0. Typical aux loss coefficients are between 0.001 and 0.01. Too high and the router loses discriminative power. Too low and the expert collapse returns.

Recent work has introduced alternatives that reduce or eliminate the auxiliary loss.

Serving an MoE model requires different infrastructure than a dense model.
The key insight is that expert weights are wide but narrowly used: every expert has to stay resident in memory, yet each token touches only k of them.

Here is a concrete serving comparison:

```yaml
# vLLM configuration for MoE vs dense on 4x A100-80GB

# Dense 70B
model: meta-llama/Llama-3.3-70B-Instruct
tensor_parallel_size: 2
max_model_len: 8192
# estimated throughput: ~1800 tokens/s
---
# MoE 45B (Mixtral)
model: mistralai/Mixtral-8x7B-Instruct-v0.1
tensor_parallel_size: 2
max_model_len: 32768  # sliding window attention
# estimated throughput: ~3200 tokens/s
```

The MoE throughput advantage is real but narrower than the parameter count suggests, because the dispatch overhead and the memory ceiling eat into the margin.

Router collapse during training. Even with load-balancing loss, the router can still collapse in the first few thousand steps. Monitor the expert utilization histogram during training. If one expert receives more than 30 percent of tokens while another receives less than 5 percent, increase the auxiliary loss coefficient or switch to a different routing strategy (e.g., DeepSeek's shared-expert design).

Ignoring dispatch overhead in latency budgets. The all-to-all communication in expert routing adds 5-15 ms per MoE layer depending on batch size and interconnect bandwidth. For a 32-layer model with 16 MoE layers, that is 80-240 ms of overhead before any compute happens. For latency-sensitive applications, this cost can erase the throughput gains.

Training on too-small batch sizes. MoE models require larger batch sizes than dense models because the expert capacity constraint means that each expert sees only a fraction of the batch. A batch of 256 tokens with 8 experts and k=2 means each expert processes roughly 64 tokens (256 × 2 / 8). Training on small batches leads to underutilized experts and noisy gradients.

Using MoE for fine-tuning without adaptation. Most MoE models were trained from scratch with MoE architecture. Taking a dense checkpoint and converting it to MoE (as in DeepSpeed-MoE's d2s approach) requires careful initialization and a warm-up schedule.
Simple LoRA fine-tuning on an existing MoE model can break the learned routing patterns. Always evaluate the downstream task before and after fine-tuning to verify the routing did not drift.

Measuring memory wrong. The active parameter count determines compute per token, but the memory you need to serve the model is the sum of all experts plus the shared layers. For DeepSeek-MoE-16B, the 64 routed experts per MoE layer, each with intermediate size 1408 at hidden size 2048, put the expert weights alone at roughly 30 GB at FP16 across the model. The 16B label is the total parameter count; only about 2.8B parameters are active per token, so budget memory for the full 16B, not for the active subset.

MoE is not always the right architecture for your model:

- You need consistent latency for every request. Because the router's top-k selection varies per token, and because batch composition affects which experts are active, MoE latency has higher variance than dense models. If your SLO requires 99th percentile latency under 200 ms per token, a dense model is easier to calibrate.
- You are deploying on a single GPU with less than 48 GB VRAM. MoE models with real quality (anything above 2-3 billion active parameters) typically require at least two GPUs to fit the total weights. If your deployment is a single RTX 4090 or A5000, stick with dense models in the 7B-13B range.
- You are building a small model under 3B parameters. The overhead of the router, the auxiliary loss, and the expert parallelism infrastructure is not worth it at this scale. MoE starts to pay off when the dense baseline you are trying to beat is above 30-50B parameters.
- Your batch size is small and latency-critical. A batch of 1 (streaming chat) does not benefit from expert parallelism because the dispatch overhead dominates. The throughput advantage of MoE is most visible at batch sizes above 64.
- You cannot afford the engineering complexity.
MoE serving requires custom kernel support (Triton or CUDA kernels for fused experts, dispatch, and combine), non-trivial CI for load-balancing validation, and integration with inference engines that are still maturing their MoE support. If your team has limited ML infrastructure, a dense model with QLoRA is the safer bet.

Next post: structured output, covering how JSON mode, function calling, and grammar-constrained decoding work under the hood, and when each approach fails.