Why a 30B model can run like a 3B: dense vs MoE for running models locally


You’re scrolling HuggingFace for something to run tonight. Two names sit next to each other: Qwen3-30B-A3B and Llama-3.3-70B. The first looks like the safe pick: half the size, surely the kinder one for your GPU.

The size part is right. The “kinder to run” part is where the name stops telling you the whole truth. That A3B is the giveaway: the model holds 30 billion parameters in memory but only fires about 3 billion of them per token. It fits like a 30B and runs like a 3B. The 70B next to it is dense: it works every one of its 70 billion parameters on every token, which makes it slower and hungrier on compute.

Two numbers are hiding in those names, and most “which model should I run” confusion comes from reading one when you needed the other. Here’s the decoder.

The two numbers that actually matter

Every model card is really telling you two things, even when it only prints one:

- **Total parameters**: how much weight you have to hold in memory. This is your VRAM/RAM bill. You pay it whether or not a given parameter does anything for the current token.
- **Active parameters**: how much compute runs per token. This sets your speed and roughly your compute cost.

In a dense model these two numbers are the same. Every parameter participates in every token. Llama-3.1-8B is 8B total and 8B active. A 70B dense model is 70B total and 70B active. One number, printed once, because there’s nothing to split.

A Mixture-of-Experts (MoE) model splits them on purpose. The feed-forward layers are carved into many “experts,” and a small router picks a handful to run for each token. The rest sit in memory doing nothing this pass. So total stays large — you still have to store every expert — but active drops to whatever the router switched on.

That’s the whole mental model:

Total params = memory. Active params = speed and compute.

Dense ties them together. MoE pries them apart. Once you can read a card for both numbers separately, the rest of this post is bookkeeping.
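The routing step described above can be sketched in a few lines. This is a toy illustration, not any particular model's implementation: the expert functions, the router scores, and the names `route_token` and `moe_forward` are all made up, and real models differ in details such as where the softmax is applied.

```python
import math

def route_token(router_logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    norm = sum(exps)
    return [(i, e / norm) for i, e in zip(top, exps)]

def moe_forward(x, experts, router_logits, k):
    """Run only the chosen experts and blend their outputs by router weight."""
    chosen = route_token(router_logits, k)
    output = sum(weight * experts[i](x) for i, weight in chosen)
    return output, [i for i, _ in chosen]

# Toy setup (hypothetical): 8 "experts" that just scale the input by 1..8.
experts = [lambda x, scale=scale: scale * x for scale in range(1, 9)]
# Made-up router scores for one token; a real router computes these from x.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.3, 0.2]

output, used = moe_forward(1.0, experts, logits, k=2)
print(used)    # indices of the 2 experts that ran; the other 6 stayed idle
print(output)
```

All eight experts have to exist in the `experts` list (the memory bill), but only two of them are called for this token (the compute bill).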

Decoding the names

HuggingFace names are compressed spec sheets. Here’s how to expand them.

- **8x7B (Mixtral).** Reads like “eight sevens,” and people assume 8 × 7 = 56B. It doesn’t add up that cleanly, because the experts only replace the feed-forward blocks, not the whole network; attention and embeddings are shared. Mixtral-8x7B is 46.7B total, and its router uses 2 of 8 experts per token, so ~12.9B active. Mistral’s own announcement puts it plainly: it “processes input and generates output at the same speed and for the same cost as a 12.9B model.” Memory of a ~47B model, speed of a ~13B one.
- **30B-A3B (Qwen3).** The A is the active count, spelled out for you. Qwen3-30B-A3B is 30.5B total, 3.3B activated, routing 8 of 128 experts per token. This is the naming convention worth memorizing, because Qwen ships a whole family this way: 235B-A22B is 235B total, 22B active, same 8-of-128 routing. Once you see A_B, you’re reading active params directly off the name.
- **671B (DeepSeek-V3).** The big scary number. DeepSeek-V3 is 671B total parameters with 37B activated for each token, straight from the model card. You cannot load it on a hobbyist rig; 671B of weights is 671B of weights. But the compute per token is in 37B territory, which is why an inference provider can serve it at a price that doesn’t match its headline size.
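The naming patterns above can be read mechanically. Here is a minimal sketch of a name decoder; `decode_name` and its return keys are hypothetical, the regexes only cover the patterns discussed here, and the numbers in a name are rounded (30.5B shows up as 30), so the model card stays the authority.

```python
import re

def decode_name(name):
    """Best-effort read of size hints from a model name (values in billions)."""
    # Pattern like "30B-A3B": total first, active after the "A".
    m = re.search(r"(\d+(?:\.\d+)?)B-A(\d+(?:\.\d+)?)B", name)
    if m:
        return {"kind": "moe",
                "total_b": float(m.group(1)),
                "active_b": float(m.group(2))}
    # Pattern like "8x7B": expert count and a nominal size.
    # The product is NOT the total, because attention and embeddings are shared.
    m = re.search(r"(\d+)x(\d+(?:\.\d+)?)B", name)
    if m:
        return {"kind": "moe",
                "n_experts": int(m.group(1)),
                "nominal_expert_b": float(m.group(2))}
    # A bare size such as "70B" cannot tell you dense vs MoE by itself.
    m = re.search(r"(\d+(?:\.\d+)?)B\b", name)
    if m:
        return {"kind": "unknown", "total_b": float(m.group(1))}
    return {"kind": "unknown"}

for name in ["Qwen3-30B-A3B", "Mixtral-8x7B", "Llama-3.3-70B", "DeepSeek-V3"]:
    print(name, decode_name(name))
```

DeepSeek-V3 comes back with no numbers at all, which is the point: when the name is silent, the total and active counts have to come from the card.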

Put them in one place and the pattern jumps out:

| Model | Type | Total params | Active params | Experts routed per token |
|---|---|---|---|---|
| Llama-3.1-8B | Dense | ~8B | ~8B (all) | n/a |
| Mixtral-8x7B | MoE | 46.7B | 12.9B | 2 / 8 |
| Qwen3-30B-A3B | MoE | 30.5B | 3.3B | 8 / 128 |
| Qwen3-235B-A22B | MoE | 235B | 22B | 8 / 128 |
| DeepSeek-V3 | MoE | 671B | 37B | 8 / 256 (+1 shared) |

(Figures from each model’s HuggingFace card / the Mixtral announcement, verified June 2026. Counts drift between checkpoints, so re-read the card on the day you download.)

Read across the dense row and the two numbers are identical. Read across any MoE row and the gap between them is the entire point.
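To put a number on that gap, the sketch below computes the share of weights that are active per token, then back-solves the Mixtral figures to show why 8 × 7 is not 56. The back-solve assumes a simplified two-bucket model (one always-on shared bucket plus equally sized experts), which is my assumption rather than anything on the card; the function names are hypothetical.

```python
# (total, active) in billions, taken from the figures quoted in this post.
MODELS = {
    "Llama-3.1-8B": (8.0, 8.0),
    "Mixtral-8x7B": (46.7, 12.9),
    "Qwen3-30B-A3B": (30.5, 3.3),
    "Qwen3-235B-A22B": (235.0, 22.0),
    "DeepSeek-V3": (671.0, 37.0),
}

def active_fraction(total_b, active_b):
    """Share of the stored weights that do work on a given token."""
    return active_b / total_b

def split_shared_and_expert(total_b, active_b, n_experts, k_active):
    """Solve total = shared + n*e and active = shared + k*e for (shared, e).

    Simplifying assumption: every expert is the same size and everything
    else (attention, embeddings, router) sits in one shared bucket.
    """
    per_expert = (total_b - active_b) / (n_experts - k_active)
    shared = active_b - k_active * per_expert
    return shared, per_expert

for name, (total_b, active_b) in MODELS.items():
    print(f"{name:16s} {active_fraction(total_b, active_b):.1%} active per token")

shared, per_expert = split_shared_and_expert(46.7, 12.9, n_experts=8, k_active=2)
print(f"Mixtral: ~{shared:.1f}B shared, ~{per_expert:.1f}B per expert")
```

Under that assumption each Mixtral expert works out to roughly 5.6B rather than 7B, with about 1.6B shared, which is consistent with the 46.7B total.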

What this means for your machine

Three consequences fall straight out of the two-number model.

Memory tracks total, not active. Your GPU has to hold every expert, because the router might call any of them on the next token. The rough arithmetic: at 16-bit precision, weights cost about 2 GB per billion parameters; 4-bit quantization cuts that to roughly 0.5 GB per billion. So Qwen3-30B-A3B is ~61 GB at fp16, ~15 GB at 4-bit, sized by its 30.5B total, not its 3.3B active. The A3B buys you nothing on the memory line. This is the trap in the opening: a “30B” MoE is not a free lunch on VRAM just because it runs fast.
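The memory arithmetic is simple enough to script. A rough sketch, assuming weights only: it ignores the KV cache, activations, and the extra bits real quantization formats spend on scales, so treat the results as a floor. `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(total_params_b, bits_per_param):
    """Approximate weight storage in GB: params (billions) * bytes per param."""
    return total_params_b * bits_per_param / 8

# Sized by TOTAL parameters, regardless of how many are active per token.
print(weight_memory_gb(30.5, 16))   # Qwen3-30B-A3B at 16-bit
print(weight_memory_gb(30.5, 4))    # Qwen3-30B-A3B at 4-bit
print(weight_memory_gb(671.0, 4))   # DeepSeek-V3 at 4-bit, still enormous
```

Note that the active count never appears in the function signature.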

Throughput tracks active. Once the weights are loaded, speed follows the active count. Qwen3-30B-A3B generates at roughly the pace of a 3B dense model, because that’s how much math runs per token. That’s the upside — it punches well above its active weight on quality while staying quick. A 70B dense model that fits the same VRAM budget (heavily quantized) will feel sluggish by comparison, because all 70B fire every token.
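For a feel of the compute gap, here is a back-of-the-envelope sketch. It assumes the common rule of thumb of about 2 floating-point operations per active parameter per generated token, and it ignores attention over the context, memory bandwidth, and offload effects, all of which matter on real hardware. `flops_per_token` is a hypothetical helper.

```python
def flops_per_token(active_params_b):
    """Rule-of-thumb forward-pass cost: ~2 FLOPs per active parameter."""
    return 2 * active_params_b * 1e9

moe = flops_per_token(3.3)     # Qwen3-30B-A3B: 3.3B active
dense = flops_per_token(70.0)  # a 70B dense model: all 70B active

ratio = dense / moe
print(f"70B dense does about {ratio:.0f}x the math per token")
```

The ratio is a statement about math per token, not a tokens-per-second prediction.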

Quality-per-compute is where MoE wins. For a fixed compute budget per token, a well-trained MoE generally lands more capability than a dense model with the same active count — you’re getting the knowledge stored across 128 experts, with the bill of running 8. The cost you pay back is memory. That’s the classic trade: MoE spends memory to buy quality-per-FLOP.

Quantization and offload: the prosumer combo

If MoE is memory-bound, the obvious move is to shrink the memory. That’s why the runnable-at-home story is almost always MoE + quantization + offload.

- **Quantization** (GGUF for llama.cpp/Ollama, AWQ/GPTQ for vLLM) drops the per-param cost. A 4-bit DeepSeek-V3 is still enormous, but a 4-bit Qwen3-30B-A3B fits comfortably in 24 GB with room for context.
- **CPU/RAM offload** parks experts you can’t fit in VRAM out in system RAM, paging them in when the router calls them. MoE is unusually friendly to this because only a few experts are needed per token.

The gotchas are real, though:

- **VRAM is still dominated by total size.** Offload moves the overflow to slower memory; it doesn’t make the model small.
- **Expert thrash**: if the router keeps calling experts that live in RAM, you eat the PCIe transfer over and over, and tokens-per-second falls off a cliff. Offload is smoothest when your hardware holds most experts and spills only the tail.

Quantization buys you the download. Offload buys you the load. Neither rewrites the rule that total params set the memory floor.
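A quick way to see how much would spill is to compare the quantized weight size against your VRAM. This is a minimal sketch under loud assumptions: `plan_offload` and its `reserve_gb` headroom for context and runtime overhead are hypothetical, and actual runtimes decide what to offload in their own ways.

```python
def plan_offload(total_params_b, bits_per_param, vram_gb, reserve_gb=2.0):
    """Estimate how many GB of weights fit in VRAM and how many spill to RAM.

    reserve_gb is an assumed headroom for KV cache and runtime overhead.
    """
    weights_gb = total_params_b * bits_per_param / 8
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    in_vram_gb = min(weights_gb, usable_gb)
    spilled_gb = weights_gb - in_vram_gb
    return {
        "weights_gb": weights_gb,
        "in_vram_gb": in_vram_gb,
        "spilled_gb": spilled_gb,
        "spilled_fraction": spilled_gb / weights_gb,
    }

# 4-bit Qwen3-30B-A3B on a 24 GB card vs. a 12 GB card.
print(plan_offload(30.5, 4, vram_gb=24))
print(plan_offload(30.5, 4, vram_gb=12))
```

The larger the spilled fraction, the more often the router will land on an expert that lives in system RAM, which is where thrash comes from.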

The decision rule

No flowchart graphic needed — it fits in a few lines:

- **VRAM-tight and the task is simple** (classification, extraction, autocomplete, structured output)? → **Small dense, ≤8–14B.** Predictable, no router surprises, fits everywhere.
- **You want maximum quality, have plenty of RAM, but limited compute/patience?** → **MoE.** You’ll spend the memory; you’ll get capability-per-token a same-speed dense model can’t match.
- **Tiny or shared hardware where latency must be predictable?** → **Small dense.** Offload thrash and variable expert routing make MoE latency spikier on constrained rigs.
- **Serving many users / cost-per-token matters and you have the VRAM?** → **MoE.** High total for quality, low active for throughput is exactly the economics that made DeepSeek-V3 cheap to serve.

The shortest version: pick dense when memory is your constraint, pick MoE when compute is.
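If you prefer the rule as code, here is one way to write it down. It is a simplification of the four cases above into three yes/no inputs; the function name, the inputs, and the "either" fallback for the unconstrained case are my own framing, not part of the rule itself.

```python
def pick_shape(memory_tight, compute_tight, latency_must_be_predictable=False):
    """Return a rough model-shape suggestion from three yes/no constraints."""
    # Memory is the bill MoE cannot avoid, and offload makes latency spiky.
    if memory_tight or latency_must_be_predictable:
        return "small dense"
    # Memory to spare but limited compute or patience: pay in RAM, save per token.
    if compute_tight:
        return "moe"
    # Neither constraint binds (assumed fallback): choose on quality and tooling.
    return "either"

print(pick_shape(memory_tight=True, compute_tight=False))
print(pick_shape(memory_tight=False, compute_tight=True))
```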

The 2026 reality check

If MoE feels like it’s suddenly everywhere on HuggingFace, that’s because it is. As of mid-2026, most new frontier open-weight releases ship MoE — the DeepSeek line, Qwen3’s bigger members, GPT-OSS, the Mixtral lineage, Llama 4’s MoE variants. The frontier discovered that scaling total params for quality while holding active params for cost is the better deal when you can afford the memory. Dense isn’t going anywhere, though — it still owns the small end, roughly ≤8–14B, where there’s little to gain from routing and a lot to lose in complexity. That’s the model you reach for when you want something that just fits and just runs.

So the cards will keep showing you both shapes. The names will keep compressing the spec. But you’ve got the decoder now: find the total, find the active, and read them as two different bills, one paid in memory, one paid in speed. The A3B was never lying. You just had to know it was talking about a different number than the one in front of it.

Once you’ve picked a shape, you still need a harness to drive it. If you’re wiring a local or self-hosted model into an agent, terminal agents in 2026 covers which ones let you bring your own weights.

What are you running locally right now — and did the param count surprise you once you split it into total and active?

source & further reading

- outofcontext.dev (original article)
- Cursor auto-review vs. YOLO – picking the middle safety tier
- Terminal Agents in 2026: goose, Claude Code, OpenCode, and Pi Compared
- Claude Code Safe Mode: Find Which Customization Broke Your Agent