Mixture of Experts (MoE): what it actually does under the hood, and when it pays off

A developer explains the sparse Mixture of Experts (MoE) architecture used in models like Mixtral, DeepSeek-MoE, and Grok-1, detailing how the router selects which experts to activate per token and why load-balancing is the hardest training challenge. The post clarifies that MoE offers better compute efficiency than a dense model of equivalent total parameters, but still requires memory proportional to the total parameter count, debunking the "70B performance at 7B compute" claim. The analysis includes a comparison table showing that Mixtral 8x7B, with 45B total parameters and ~12.9B active per token, needs ~90 GB of memory, making it unsuitable for single consumer GPUs.

You deployed a 7B model in production. Response times are fine (45 ms per token), but you want to scale to a 70B without buying four more GPUs. Someone mentions MoE: "70B performance at 7B compute." It sounds like a free lunch. So you look at the Mixtral 8x7B paper, you see 45 billion parameters and a claim that each token only activates about 13 billion of them, and you wonder: how is that physically possible, and what is the catch?

This post explains the sparse MoE architecture that powers Mixtral, DeepSeek-MoE, Qwen2.5-MoE, DBRX, and Grok-1: what the router actually does, why load-balancing is the hardest problem in training them, and the three specific constraints that determine whether MoE is the right choice for your deployment.

A dense transformer like Llama 3.2 activates 100 percent of its parameters for every token. The FFN layer in each transformer block runs the same matrix multiplication for every input. This makes memory use predictable and throughput easy to model, but it also means that scaling from 7B to 70B multiplies both memory and compute by 10x.

MoE decouples the two. The model stores more parameters (more memory), but each token only uses a fraction of them (less compute).
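To make that decoupling concrete, here is a back-of-envelope sketch in Python. It rests on two assumptions that are mine rather than part of any model's specification: weights are stored at 16-bit precision (2 bytes per parameter), and a forward pass costs roughly 2 FLOPs per active parameter per token, which is a common rule of thumb. `ModelSpec` and its methods are hypothetical names used only for illustration. The parameter counts are the ones used in this post.

```python
from dataclasses import dataclass

# Assumption: 16-bit weights, i.e. 2 bytes per parameter.
BYTES_PER_PARAM = 2


@dataclass
class ModelSpec:
    """Hypothetical helper: parameter counts are given in billions."""
    name: str
    total_params_b: float   # parameters that must sit in memory
    active_params_b: float  # parameters actually used per token

    def weight_memory_gb(self) -> float:
        # billions of params * bytes per param = gigabytes (10^9 bytes)
        return self.total_params_b * BYTES_PER_PARAM

    def gflops_per_token(self) -> float:
        # Rule-of-thumb assumption: ~2 FLOPs per active parameter per token.
        return 2 * self.active_params_b

    def active_fraction(self) -> float:
        return self.active_params_b / self.total_params_b


dense_7b = ModelSpec("Dense 7B", 7.0, 7.0)
dense_70b = ModelSpec("Dense 70B", 70.0, 70.0)
mixtral = ModelSpec("MoE 45B (Mixtral)", 45.0, 12.9)

for m in (dense_7b, dense_70b, mixtral):
    print(
        f"{m.name:<20} memory ~{m.weight_memory_gb():.0f} GB, "
        f"compute ~{m.gflops_per_token():.1f} GFLOPs/token, "
        f"active {m.active_fraction():.0%}"
    )
```

Memory follows `total_params_b` while compute follows `active_params_b`. In a dense model the two are the same number. In an MoE they are not, and that gap is the entire trick.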
Here is the core trade-off expressed in numbers:

| Metric | Dense 7B | Dense 70B | MoE 45B (Mixtral) |
|---|---|---|---|
| Total parameters | 7B | 70B | 45B (8 experts) |
| Active per token | 7B | 70B | ~12.9B (2 experts) |
| Compute per token | 7B-equiv | 70B-equiv | 14B-equiv |
| Memory (weights, 16-bit) | ~14 GB | ~140 GB | ~90 GB |
| Throughput (tokens/s) | high | low | medium-high |

The headline is this: MoE gives you better compute efficiency than a dense 70B, but you still pay the memory cost of a much larger model. You cannot run Mixtral on a single consumer GPU. At 16-bit precision, ~90 GB of weights needs at least four 24 GB cards; two 24 GB cards (48 GB) are only enough if you quantize the weights to roughly 4 bits per parameter. The computational savings only show up once the model is already loaded. That is the catch that the "70B performance at 7B compute" tagline often omits.

In a standard transformer, every layer has an FFN block (two linear projections with an activation in between). In a sparse MoE transformer, each FFN is replaced by multiple parallel "expert" FFNs plus a learned router that picks which experts to use for each token.

Here is the data flow for a single token passing through one MoE layer:

php flowchart LR A Input token