Mixture of Experts (MoE): what it actually does under the hood, and when it pays off


You deployed a 7B model in production. Response times are fine (45 ms per token), but you want to scale to a 70B without buying four more GPUs. Someone mentions MoE: "70B performance at 7B compute." It sounds like a free lunch. So you look at the Mixtral 8x7B paper, you see about 47 billion parameters and a claim that each token only activates about 13 billion of them, and you wonder: how is that physically possible, and what is the catch?

This post explains the sparse MoE architecture that powers Mixtral, DeepSeek-MoE, Qwen2.5-MoE, DBRX, and Grok-1: what the router actually does, why load-balancing is the hardest problem in training them, and the specific constraints that determine whether MoE is the right choice for your deployment.

A dense transformer (like Llama 3.2) activates 100 percent of its parameters for every token. The FFN layer in each transformer block runs the same matrix multiplication for every input. This makes memory use predictable and throughput easy to model, but it also means that scaling from 7B to 70B multiplies both memory and compute by 10x.

MoE decouples the two. The model stores more parameters (more memory), but each token only uses a fraction of them (less compute). Here is the core trade-off expressed in numbers:

Metric	Dense 7B	Dense 70B	MoE 47B (Mixtral 8x7B)
Total parameters	7B	70B	~46.7B (8 experts per layer)
Active per token	7B	70B	~12.9B (2 experts per layer)
Compute per token	7B-equiv	70B-equiv	~13B-equiv
Memory (weights, FP16)	~14 GB	~140 GB	~93 GB
Throughput (tokens/s)	high	low	medium-high

The headline is this: MoE gives you better compute efficiency than a dense 70B, but you still pay the memory cost of a much larger model. You cannot run Mixtral on a single consumer GPU: at FP16 the weights alone take more than 90 GB, and even two 24 GB cards only fit them with aggressive (roughly 4-bit) quantization. The computational savings only show up once the model is already loaded. That is the catch that the "70B performance at 7B compute" tagline often omits.
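The memory side of this trade-off is plain arithmetic. The sketch below is a back-of-the-envelope estimate, not a measurement: it assumes 2 bytes per parameter (FP16/BF16), counts weights only (no KV cache or activations), and uses the parameter counts reported for Mixtral 8x7B (about 46.7B total, 12.9B active). The function name is illustrative.

```python
def weight_memory_gb(total_params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes); ignores KV cache and activations."""
    # (billions of params * 1e9) * bytes_per_param / 1e9 bytes per GB
    return total_params_billions * bytes_per_param


models = {
    # name: (total params in billions, active params per token in billions)
    "dense-7B": (7.0, 7.0),
    "dense-70B": (70.0, 70.0),
    "mixtral-8x7B": (46.7, 12.9),
}

for name, (total_b, active_b) in models.items():
    mem = weight_memory_gb(total_b)
    share = active_b / total_b
    print(f"{name}: ~{mem:.0f} GB of weights at FP16, {share:.0%} of parameters active per token")
```

Passing `bytes_per_param=0.5` gives a rough 4-bit estimate, which is how a model of this size ends up near the capacity of one or two 24 GB cards.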

In a standard transformer, every layer has an FFN block (two linear projections with an activation in between). In a sparse MoE transformer, each FFN is replaced by multiple parallel "expert" FFNs plus a learned router that picks which experts to use for each token.

Here is the data flow for a single token passing through one MoE layer:

flowchart LR
    A[Input token<br/>hidden states] --> B[Router / Gate<br/>learned linear layer]
    B --> C{Softmax over<br/>N experts}
    C --> D[Select top-k<br/>experts]
    D --> E1[Expert 1<br/>FFN]
    D --> E2[Expert 2<br/>FFN]
    D --> E3[...<br/>idle]
    D --> E4[Expert N<br/>idle]
    E1 --> F[Weighted sum<br/>by router scores]
    E2 --> F
    F --> G[Output token<br/>hidden states]

The router is a small learned linear layer that takes the token's hidden state and outputs a score for each expert. You take the softmax over all experts, pick the k with the highest scores, run the token through only those experts, and combine the results weighted by the router scores (Mixtral renormalizes the k selected scores so they sum to 1). For Mixtral, k=2 out of 8 experts. For DeepSeek-MoE, k=6 out of 64 experts. The router itself adds negligible compute: a single matrix multiply with a weight of shape (hidden_dim, n_experts).
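The data flow above can be sketched in a few lines of PyTorch. This is a minimal illustration, not any specific model's implementation: the class name `TinyMoELayer`, the SiLU two-layer experts, and the per-expert Python loop are assumptions chosen for readability (production kernels fuse dispatch and combine), and the renormalization over the selected k is one common choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyMoELayer(nn.Module):
    """Illustrative top-k MoE feed-forward layer (hypothetical, for explanation only)."""

    def __init__(self, hidden_dim: int, ffn_dim: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router: one linear map from hidden state to one score per expert.
        self.router = nn.Linear(hidden_dim, n_experts, bias=False)
        # Each expert is an ordinary two-layer FFN.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_dim, ffn_dim), nn.SiLU(), nn.Linear(ffn_dim, hidden_dim))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_dim)
        probs = F.softmax(self.router(x), dim=-1)               # (tokens, experts)
        weights, chosen = probs.topk(self.top_k, dim=-1)        # both (tokens, k)
        weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over the chosen k
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Rows routed to expert e, and which of their k slots picked it.
            token_idx, slot = (chosen == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue  # expert e is idle for this batch
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
        return out


torch.manual_seed(0)
layer = TinyMoELayer(hidden_dim=16, ffn_dim=32)
tokens = torch.randn(5, 16)
print(layer(tokens).shape)  # torch.Size([5, 16])
```

Each token runs through only `top_k` of the expert FFNs, which is where the compute saving comes from; all experts still have to be held in memory.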

A common mental model is that the router is a load balancer that assigns tokens to experts similar to how a distributed scheduler assigns work to machines. This is misleading. The router is a learned differentiable gate trained end-to-end with the rest of the model through backpropagation. It learns which experts specialize in which types of patterns — subject-matter expertise, syntactic structures, token positions — without any explicit supervision.

When you inspect the routing decisions after training, individual experts do develop preferences: one may handle arithmetic-heavy tokens disproportionately often, another function words and punctuation, a third code syntax. In Mixtral's own routing analysis, the authors report that these preferences follow syntax more clearly than topic. Either way, the specializations are soft, not hard: there is no constraint that says "expert 3 is the math expert." The router simply learns the assignment that minimizes the loss.

The hardest part of MoE training is preventing the router from sending every token to the same two experts. If there is no corrective signal, the router quickly collapses: it sends everything to the experts that happen to initialize well, those experts get more gradient updates, they get better, the router sends even more traffic their way, and the unused experts atrophy.

The standard fix is an auxiliary load-balancing loss added to the total training loss. The most common formulation (used in Mixtral, GShard, and ST-MoE) penalizes the router for imbalance:

import torch

def load_balancing_loss(router_logits, num_experts, num_tokens, top_k=2):
    """
    Auxiliary load-balancing loss for a top-k router.

    router_logits: (num_tokens, num_experts), raw router scores before softmax
    top_k: number of experts selected per token (2 for Mixtral)
    """
    router_probs = torch.softmax(router_logits, dim=-1)             # (tokens, experts)
    fraction_per_expert = router_probs.mean(dim=0)                  # (experts,) avg router probability per expert

    # Count how many (token, slot) assignments each expert actually receives.
    _, selected_experts = router_probs.topk(k=top_k, dim=-1)        # (tokens, top_k)
    tokens_per_expert = torch.zeros(num_experts, device=router_logits.device)
    tokens_per_expert.scatter_add_(0, selected_experts.flatten(),
                                   torch.ones(num_tokens * top_k, device=router_logits.device))
    load_per_expert = tokens_per_expert / (num_tokens * top_k)      # (experts,) normalized token count

    # Equals 1.0 when routing is perfectly uniform; grows as traffic concentrates.
    aux_loss = num_experts * (fraction_per_expert * load_per_expert).sum()
    return aux_loss

The num_experts multiplier normalizes the loss so that a perfectly balanced router scores 1.0 regardless of the expert count. Typical aux_loss coefficients are between 0.001 and 0.01. Too high and the router loses discriminative power. Too low and the expert collapse returns.
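To see how the loss responds to imbalance, it helps to feed it two synthetic routing patterns. The sketch below uses a compact variant of the same formula (the name `aux_loss` and the hand-built logits are illustrative assumptions, and shapes are read from the tensor instead of being passed in).

```python
import torch


def aux_loss(router_logits: torch.Tensor, top_k: int = 2) -> torch.Tensor:
    """Balancing loss: num_experts * sum(mean router prob * fraction of routed slots)."""
    num_tokens, num_experts = router_logits.shape
    probs = torch.softmax(router_logits, dim=-1)
    mean_prob = probs.mean(dim=0)                                    # (experts,)
    chosen = probs.topk(top_k, dim=-1).indices.flatten()             # (tokens * k,)
    load = torch.bincount(chosen, minlength=num_experts).float() / (num_tokens * top_k)
    return num_experts * (mean_prob * load).sum()


num_experts, num_tokens = 8, 64
rows = torch.arange(num_tokens)

# Balanced: token i prefers experts i and i+1 (mod 8), so traffic is spread evenly.
balanced = torch.zeros(num_tokens, num_experts)
balanced[rows, rows % num_experts] = 5.0
balanced[rows, (rows + 1) % num_experts] = 5.0

# Collapsed: every token prefers experts 0 and 1.
collapsed = torch.zeros(num_tokens, num_experts)
collapsed[:, :2] = 5.0

print(f"balanced:  {aux_loss(balanced).item():.3f}")
print(f"collapsed: {aux_loss(collapsed).item():.3f}")
```

The balanced pattern lands at about 1.0 and the collapsed one at several times that, which is the gradient signal that pushes the router back toward spreading tokens out.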

Recent work has introduced alternatives that reduce or eliminate the auxiliary loss:

Serving an MoE model requires different infrastructure than a dense model. The key insight is that expert weights are wide but narrowly used:

Here is a concrete serving comparison:

  model: meta-llama/Llama-3.3-70B-Instruct
  tensor_parallel_size: 2
  max_model_len: 8192
  estimated throughput: ~1800 tokens/s

  model: mistralai/Mixtral-8x7B-Instruct-v0.1
  tensor_parallel_size: 2
  max_model_len: 32768  # sliding window attention
  estimated throughput: ~3200 tokens/s

The MoE throughput advantage is real but narrower than the parameter count suggests, because the dispatch overhead and the memory ceiling eat into the margin.

Router collapse during training. Even with a load-balancing loss, the router can still collapse in the first few thousand steps. Monitor the expert utilization histogram during training. If one expert receives more than 30 percent of tokens while another receives less than 5 percent, increase the auxiliary loss coefficient or switch to a different routing strategy (e.g., DeepSeek's shared-expert design).
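A utilization check like the one described here needs very little code. The sketch below is framework-agnostic and assumes you can log the selected expert ids for each (token, top-k slot) pair from one MoE layer; the function names are hypothetical and the 30 percent / 5 percent thresholds are the ones quoted above.

```python
from collections import Counter


def expert_utilization(selected_experts, num_experts):
    """Fraction of routed token slots that went to each expert.

    selected_experts: non-empty iterable of expert ids, one per (token, top-k slot).
    """
    counts = Counter(selected_experts)
    total = sum(counts.values())
    return [counts.get(e, 0) / total for e in range(num_experts)]


def looks_collapsed(utilization, high=0.30, low=0.05):
    """True if one expert is above `high` while another is below `low`."""
    return max(utilization) > high and min(utilization) < low


healthy = [0, 1, 2, 3] * 25                           # every expert gets 25% of slots
skewed = [0] * 70 + [1] * 20 + [2] * 8 + [3] * 2      # expert 0 gets 70%, expert 3 gets 2%

print(looks_collapsed(expert_utilization(healthy, 4)))  # False
print(looks_collapsed(expert_utilization(skewed, 4)))   # True
```

In practice you would run this per MoE layer every few hundred steps, since collapse can affect some layers and not others.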

Ignoring dispatch overhead in latency budgets. The all-to-all communication in expert routing adds 5-15 ms per MoE layer depending on batch size and interconnect bandwidth. For a 32-layer model with 16 MoE layers, that is 80-240 ms of overhead before any compute happens. For latency-sensitive applications, this cost can erase the throughput gains.
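The overhead arithmetic is worth putting next to your latency budget. This sketch only multiplies out the per-layer range quoted above; the function names are illustrative, and the 200 ms budget is an assumed example value, not a measurement.

```python
def dispatch_overhead_ms(num_moe_layers, per_layer_ms=(5.0, 15.0)):
    """Return (best, worst) total dispatch overhead in ms for one forward pass."""
    low, high = per_layer_ms
    return num_moe_layers * low, num_moe_layers * high


def share_of_budget(overhead_ms, budget_ms):
    """Fraction of a per-token latency budget consumed by dispatch alone."""
    return overhead_ms / budget_ms


best, worst = dispatch_overhead_ms(num_moe_layers=16)
print(f"dispatch overhead: {best:.0f}-{worst:.0f} ms")
print(f"share of a 200 ms budget: {share_of_budget(best, 200):.0%} to {share_of_budget(worst, 200):.0%}")
```

If the worst case exceeds the whole budget before any expert compute runs, the deployment needs a faster interconnect, fewer devices per replica, or a dense model.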

Training on too-small batch sizes. MoE models require larger batch sizes than dense models because the expert capacity constraint means that each expert sees only a fraction of the batch. A batch of 256 tokens with 8 experts and k=2 means each expert processes roughly 64 tokens. Training on small batches leads to underutilized experts and noisy gradients.
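The per-expert token count follows directly from batch size, expert count, and k. The sketch below reproduces that arithmetic and adds a capacity calculation in the style of capacity-factor routing; the function names and the `capacity_factor` of 1.25 are illustrative assumptions, not values taken from a specific model.

```python
import math


def expected_tokens_per_expert(batch_tokens, num_experts, top_k):
    """Average number of token slots each expert sees under perfectly even routing."""
    return batch_tokens * top_k / num_experts


def expert_capacity(batch_tokens, num_experts, top_k, capacity_factor=1.25):
    """Per-expert buffer size: the even share plus headroom (assumed capacity-factor scheme)."""
    return math.ceil(capacity_factor * batch_tokens * top_k / num_experts)


print(expected_tokens_per_expert(256, 8, 2))   # 64.0
print(expert_capacity(256, 8, 2))              # 80
print(expected_tokens_per_expert(32, 64, 6))   # 3.0
```

The last line shows why small batches hurt fine-grained designs: with 64 experts and k=6, a 32-token batch gives each expert about three tokens to learn from per step.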

Using MoE for fine-tuning without adaptation. Most MoE models were trained from scratch with MoE architecture. Taking a dense checkpoint and converting it to MoE (as in DeepSpeed-MoE's d2s approach) requires careful initialization and a warm-up schedule. Simple LoRA fine-tuning on an existing MoE model can break the learned routing patterns. Always evaluate the downstream task before and after fine-tuning to verify the routing did not drift.

Measuring memory wrong. The memory you need to serve an MoE model is set by its total parameter count (what summing over model.parameters() reports: all experts plus the shared layers), not by the number of parameters active per token. For DeepSeek-MoE-16B, the 16B label is the total: only about 2.8B parameters are active per token, but all 64 routed experts in each MoE layer (each with intermediate_size 1408 at hidden_size 2048) must stay resident, so the weights occupy roughly 33 GB at FP16. Sizing hardware from the active parameter count will underestimate the requirement several times over.
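The gap between stored and active expert parameters can be estimated from a model config alone. The sketch below uses Mixtral 8x7B's published config values (hidden_size 4096, intermediate_size 14336, 32 layers, 8 experts, top-2) and assumes each expert is a gated FFN with three weight matrices; it counts expert weights only, so attention, embeddings, and norms come on top. The function name is illustrative.

```python
def expert_params(hidden_size, intermediate_size, num_layers, experts_per_layer, matrices_per_expert=3):
    """Parameters held in expert FFNs across all MoE layers (biases ignored)."""
    per_expert = matrices_per_expert * hidden_size * intermediate_size
    return per_expert * experts_per_layer * num_layers


# Stored: all 8 experts per layer must be in memory.
stored = expert_params(4096, 14336, num_layers=32, experts_per_layer=8)
# Active: only the top-2 experts per layer run for a given token.
active = expert_params(4096, 14336, num_layers=32, experts_per_layer=2)

print(f"stored expert params: {stored / 1e9:.1f}B (~{stored * 2 / 1e9:.0f} GB at FP16)")
print(f"active expert params: {active / 1e9:.1f}B per token")
```

Adding the roughly 1.6B non-expert parameters to each figure recovers the reported totals of about 46.7B stored and 12.9B active.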

MoE is not always the right architecture for your model:

You need consistent latency for every request. Because the router's top-k selection varies per token, and because batch composition affects which experts are active, MoE latency has higher variance than dense models. If your SLO requires 99th percentile latency under 200 ms per token, a dense model is easier to calibrate.

You are deploying on a single GPU with less than 48 GB VRAM. MoE models with real quality (anything above 2-3 billion active parameters) generally require at least two such GPUs to fit the total weights at FP16. If your deployment is a single RTX 4090 or A5000, stick with dense models in the 7B-13B range.

You are building a small model under 3B parameters. The overhead of the router, the auxiliary loss, and the expert parallelism infrastructure is not worth it at this scale. MoE starts to pay off when the dense baseline you are trying to beat is above 30-50B parameters.

Your batch size is small and latency-critical. A batch of 1 (streaming chat) does not benefit from expert parallelism because the dispatch overhead dominates. The throughput advantage of MoE is most visible at batch sizes above 64.

You cannot afford the engineering complexity. MoE serving requires custom kernel support (Triton or CUDA kernels for fused experts, dispatch, and combine), non-trivial CI for load-balancing validation, and integration with inference engines that are still maturing their MoE support. If your team has limited ML infrastructure, a dense model with QLoRA is the safer bet.

Next post: structured output — how JSON mode, function calling, and grammar-constrained decoding work under the hood, and when each approach fails.

source & further reading

dev.to (original article)
