{"slug": "mixture-of-experts-moe-explained-why-active-parameters-decide-what-runs-on-your", "title": "Mixture-of-Experts (MoE), Explained: Why “Active Parameters” Decide What Runs on Your Machine", "summary": "Mixture-of-Experts (MoE) architecture allows large language models to use only a fraction of their total parameters for each token, enabling models with 671 billion total parameters to run at speeds comparable to a 37-billion-parameter model. The key metric is \"active parameters\" — the subset of experts activated per token — which determines compute speed, while total parameters dictate memory requirements. This design breaks the traditional trade-off between model quality and speed, as demonstrated by models like Mixtral 8×7B and DeepSeek-V3, which achieve frontier-level intelligence at a fraction of the computational cost.", "body_md": "Here is a puzzle that trips up almost everyone new to local AI: a 671-billion-parameter model can run at usable speeds on the right desktop, while a \"smaller\" 70B model feels sluggish on the same hardware. How? The answer is an architecture called **Mixture-of-Experts (MoE)** — and once you understand the single number it hinges on, model names like \"Qwen 35B-A3B\" or \"DeepSeek-V3 671B-A37B\" suddenly tell you exactly what your machine can and can't do.\n\nThis is the plain-English version: what MoE actually is, the one number that predicts performance, the memory trap that catches buyers, and the research behind it. No assuming you've read the papers.\n\n## The old rule MoE broke\n\nIn a traditional \"dense\" model, every parameter fires for every token. A 70B dense model does 70 billion parameters' worth of math to produce each word. That makes quality and speed a straight trade-off: bigger means smarter *and* slower. For years that was the iron law of local LLMs.\n\nMixture-of-Experts breaks it. Instead of one giant network, an MoE model splits much of itself into many smaller sub-networks called **experts**. 
For each token, a small **router** network picks just a few experts to actually run — the rest sit idle. So the model can be enormous in total, but only a slice of it does work at any moment. You get the knowledge of a huge model at the compute cost of a small one.\n\n## The one number that matters: active vs total parameters\n\nEvery MoE model has two parameter counts, and confusing them is the single most common mistake:\n\n- **Total parameters**: every weight in the model, all experts included. This determines how much **memory** the model needs.\n- **Active (or \"activated\") parameters**: what actually runs per token. This determines **speed** and compute.\n\nThe naming convention you'll see encodes both. \"**A3B**\" means 3B *active*; the number before it is the total. Some real examples:\n\n| Model | Total params | Active params | Runs like a… |\n|---|---|---|---|\n| Mixtral 8×7B | 47B | 13B (top-2 of 8 experts) | 13B for speed, 47B for smarts |\n| \"35B-A3B\" class | ~35B | ~3B | 3B-fast, 35B-smart |\n| DeepSeek-V3 | 671B | 37B | 37B for speed, 671B for smarts |\n\nMixtral is the model that made this mainstream: per its [technical report](https://arxiv.org/abs/2401.04088?ref=vettedconsumer.com), each token routes to 2 of 8 experts, so it touches just **13B of its 47B parameters** — yet it matches or beats Llama 2 70B and GPT-3.5. [DeepSeek-V3](https://arxiv.org/abs/2412.19437?ref=vettedconsumer.com) pushes the idea to the frontier: **671B total, 37B active**. It thinks like a 671B model but computes like a 37B one.\n\n## Under the hood: what an “expert” actually is\n\nIt’s tempting to picture experts as little specialists — one for code, one for French, one for math. That’s a myth. An “expert” is simply an independently trained copy of the model’s **feed-forward block** (the dense number-crunching layer that sits after attention). 
An MoE swaps the single feed-forward block in each layer for many of them, and the router learns which combination to use — the specialization that emerges is statistical and messy, not human-readable.\n\nTwo details matter for understanding the behavior:\n\n- **Attention usually stays dense.** Only the feed-forward layers are split into experts; the attention mechanism still runs in full for every token. That’s part of why MoE quality doesn’t collapse despite the sparsity — the part of the model that mixes context together is untouched.\n- **The router is tiny but decisive.** A small gating network scores the experts per token and picks the top few. Train it badly and you get “dead” experts that never fire or hot experts that overload — the load-balancing problem that [Switch Transformers](https://arxiv.org/abs/2101.03961?ref=vettedconsumer.com) and later [DeepSeek-V3](https://arxiv.org/abs/2412.19437?ref=vettedconsumer.com) spent real effort solving.\n\nNewer designs add a twist: **shared experts** that run for *every* token (capturing common knowledge) alongside the **routed experts** that specialize, plus “fine-grained” experts — more, smaller experts for finer routing. That’s the DeepSeek recipe, and it’s why its active count (37B) buys more than the raw number suggests.\n\n## The memory trap: MoE saves compute, NOT memory\n\nThis is the part that catches buyers, so read it twice. **Active parameters set your speed. Total parameters set your memory.** The router only *runs* a few experts per token — but it could pick *any* of them next token, so **all** the experts must be sitting in fast memory, ready. You don't get to store only the active slice.\n\nConcretely:\n\n- A ~35B-total MoE at 4-bit needs roughly **~18–20 GB just to hold the weights**, the same as a 35B dense model, even though only ~3B are active. The memory bill is set by the total. 
\n- DeepSeek-V3's 671B, even quantized to 4-bit, wants **~380 GB**, which is server-and-cluster territory, despite \"only\" 37B active. Fast to run *if* you can hold it; almost nobody can.\n\nThis is exactly why the **large-unified-memory box** became the local-LLM darling. A 128 GB Strix Halo, Framework Desktop, or Mac Studio isn't about raw compute — it's about having enough fast memory to *hold every expert* of a big MoE, so the model's tiny active footprint can then rip through tokens. MoE is the software trend that makes high-capacity-memory hardware worth buying. (More on the boxes in our [Unified-Memory AI](https://vettedconsumer.com/tag/unified-memory-ai/) guides and [how much VRAM you actually need](https://vettedconsumer.com/how-much-vram-do-you-actually-need-to-run-a-70b-model-locally/).)\n\n## What it looks like in the real world\n\nThe speed side of the bargain is dramatic, and owners notice immediately. In a popular r/LocalLLaMA thread titled [\"Qwen3.5-35B-A3B is a gamechanger for agentic coding\"](https://redlib.catsarch.com/r/LocalLLaMA/comments/1rdxfdu/qwen3535ba3b_is_a_gamechanger_for_agentic_coding/?ref=vettedconsumer.com) (u/jslominski), owners reported running a ~35B-total model at speeds you'd normally only see from a 3B model — because, for compute, it *is* a 3B model:\n\n\"On RTX 5060 Ti 16 GB + 32 GB RAM, I got 800 t/s pp and 35 t/s tg.\" (a commenter in that thread, running a 35B-class MoE on a mid-range GPU)\n\n\"On M2 Max 64 GB MBP, I got 350 t/s pp and 27 t/s tg.\" (another commenter, same thread)\n\nHere \"pp\" is prompt-processing speed and \"tg\" is token-generation speed, both in tokens per second.\n\nA 35B dense model on that 16 GB card would crawl — if it fit at all. The MoE flies because only ~3B parameters fire per token, while whatever doesn't fit in the card's 16 GB waits in the 32 GB of system RAM. 
That's the whole magic trick in one benchmark: **memory holds the total, speed follows the active.**\n\n## A 60-second history (with the receipts)\n\nMoE isn't new: the sparsely gated form used in today's LLMs is a 2017 idea that finally met the right moment:\n\n- **2017 — the origin.** Shazeer et al., [\"Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer\"](https://arxiv.org/abs/1701.06538?ref=vettedconsumer.com), introduced the trainable router + experts design, showing >1000× more model capacity at minor extra compute.\n- **2021 — made practical.** Fedus, Zoph & Shazeer's [Switch Transformers](https://arxiv.org/abs/2101.03961?ref=vettedconsumer.com) simplified routing to a single expert per token, hit up to **7× faster pre-training**, and scaled to a trillion parameters.\n- **2024 — mainstream.** [Mixtral 8×7B](https://arxiv.org/abs/2401.04088?ref=vettedconsumer.com) proved an open MoE could beat a much larger dense model, and the local community ran with it.\n- **2024–25 — the frontier.** [DeepSeek-V3](https://arxiv.org/abs/2412.19437?ref=vettedconsumer.com) refined the recipe (fine-grained experts plus always-on \"shared\" experts, and load balancing without an auxiliary loss) to reach 671B/37B at a fraction of the training cost.\n\n## The trade-offs (because there's no free lunch)\n\nMoE isn't strictly better than dense — it's a different set of compromises:\n\n- **Lower quality *per total parameter*.** A 47B MoE is not as capable as a hypothetical 47B dense model would be; in practice it lands somewhere between a 13B and a 47B dense model. You're trading parameter efficiency for speed. 
The win is that the memory is often affordable while the compute is cheap.\n- **It's memory-bound, not compute-bound.** Performance leans on memory *bandwidth* and capacity, not raw TFLOPS — which is why Apple Silicon and unified-memory boxes punch above their weight on MoE, and why offloading idle experts to slower RAM works at all.\n- **Routing can be uneven,** and some users find MoEs weaker at tasks needing tight, consistent global reasoning — one r/LocalLLaMA benchmark thread flatly noted \"MoEs struggle with strict global rules.\" Treat the architecture as a speed/capacity win, not a guaranteed quality win.\n- **Quantization still applies on top.** The total parameter count sets the memory baseline *before* you quantize; you then pick a quant (see our [GGUF vs GPTQ vs AWQ guide](https://vettedconsumer.com/gguf-vs-gptq-vs-awq-the-plain-english-guide-to-llm-quantization-and-which-one-to-pick/)) to fit it in your box.\n\n## The cheat sheet\n\n| Question | Look at… |\n|---|---|\n| Will it fit in my machine? | Total parameters (× bytes-per-weight for your quant) |\n| How fast will it generate? | Active parameters |\n| What does \"35B-A3B\" mean? | 35B total (memory), 3B active (speed) |\n| Why buy a 128 GB box? | To hold every expert of a big MoE |\n| Dense vs MoE at the same size? | Dense = smarter per GB; MoE = faster per GB |\n\nThe one-line rule to remember: **buy memory for the total, expect speed from the active.** Once you read model names that way, the whole local-LLM hardware market stops being mysterious — you can look at \"671B-A37B\" and instantly know it's a cluster model, or \"30B-A3B\" and know it'll fly on a mid-range box with enough RAM.\n\n## Dense or MoE: which should you actually download?\n\nBoth architectures are everywhere now, and the right choice comes down to which resource you’re short on.\n\n**Reach for an MoE when:**\n\n- You have **plenty of memory but modest compute**: the classic unified-memory box, or a mid-range GPU paired with lots of system RAM. 
MoE turns spare memory into speed.\n- You want **big-model knowledge at interactive speeds**: agentic coding, long chat sessions, anything where tokens-per-second matters as much as raw smarts.\n- You can **offload idle experts to CPU RAM.** Because only the active experts are read for each token, engines like llama.cpp can park the rest in slower system memory with a smaller speed penalty than you’d expect — which is how a 16 GB GPU runs a 35B-total model at all.\n\n**Reach for a dense model when:**\n\n- You’re **memory-constrained and want maximum quality per gigabyte.** A 14B dense model can be a better use of 10 GB than a 30B MoE you can barely fit.\n- Your task needs **tight, consistent reasoning**, where MoE routing variability has been a sore point for some users.\n- You’re on a **single fast GPU** whose compute — not memory — is the bottleneck; dense models use that compute fully.\n\nA useful mental model: **dense models are limited by compute, MoE models are limited by memory.** Match the architecture to whichever you have to spare. If you bought a 128 GB box precisely so you’d never run out of memory, MoE is how you cash in that investment.\n\n## Sources & how we researched this\n\nThis explainer synthesizes the primary MoE literature — Shazeer et al. ([2017](https://arxiv.org/abs/1701.06538?ref=vettedconsumer.com)), Switch Transformers ([2021](https://arxiv.org/abs/2101.03961?ref=vettedconsumer.com)), the [Mixtral](https://arxiv.org/abs/2401.04088?ref=vettedconsumer.com) report (2024), and the [DeepSeek-V3](https://arxiv.org/abs/2412.19437?ref=vettedconsumer.com) technical report (2024) — for the architecture and the active/total parameter figures, which come straight from those papers. 
The real-world speed numbers and the \"gamechanger\" framing are owner reports from [r/LocalLLaMA](https://redlib.catsarch.com/r/LocalLLaMA/comments/1rdxfdu/qwen3535ba3b_is_a_gamechanger_for_agentic_coding/?ref=vettedconsumer.com), linked so you can verify; we have not benchmarked these machines first-hand. Memory estimates are weights-only approximations rounded for clarity (add headroom for context and KV cache).", "url": "https://wpnews.pro/news/mixture-of-experts-moe-explained-why-active-parameters-decide-what-runs-on-your", "canonical_source": "https://vettedconsumer.com/mixture-of-experts-moe-explained-why-active-parameters-decide-what-runs-on-your-machine/", "published_at": "2026-06-11 21:46:31+00:00", "updated_at": "2026-06-11 22:21:37.019193+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "neural-networks", "ai-research"], "entities": ["Mixture-of-Experts", "MoE", "Qwen", "DeepSeek-V3"], "alternates": {"html": "https://wpnews.pro/news/mixture-of-experts-moe-explained-why-active-parameters-decide-what-runs-on-your", "markdown": "https://wpnews.pro/news/mixture-of-experts-moe-explained-why-active-parameters-decide-what-runs-on-your.md", "text": "https://wpnews.pro/news/mixture-of-experts-moe-explained-why-active-parameters-decide-what-runs-on-your.txt", "jsonld": "https://wpnews.pro/news/mixture-of-experts-moe-explained-why-active-parameters-decide-what-runs-on-your.jsonld"}}