Last week the headline was GLM-5.2, the most powerful open-weight model yet, and a 1.51 TB monster almost nobody can run at home. This week's story is the opposite, and frankly the more useful one: the model most people should actually be running locally is small, fast, Apache-licensed, and fits on a single 24 GB graphics card. It's Qwen3-30B-A3B, from Alibaba's Qwen team, and the local-AI community has quietly made it the default answer to "what should I run?" We haven't run it first-hand; what follows synthesizes Qwen's docs, the technical report, owner reports, and the hardware math.
What it is, and the trick in the name #
That "A3B" is the whole story. Qwen3-30B-A3B is a Mixture-of-Experts model with 30.5 billion total parameters but only ~3.3 billion active per token, the network is large, but for any given token only a small slice fires (the mechanism is in our MoE explainer). That single design choice is why a "30B" model runs as quickly as a 3B one while answering like something far bigger. Per Qwen's Hugging Face model card, it's Apache-2.0 licensed, ships a 262K-token context, and is part of the broader Qwen3 family that spans 0.6B to 235B in both dense and MoE flavors.
Which version should you download? #
"Qwen3-30B-A3B" is really a small family, and picking the right build matters more than the exact quant:
Instruct (the 2507 update), the general-purpose default; the all-rounder for chat, RAG and agent work.** Thinking**, tuned to reason step-by-step by default; stronger on hard math and logic, at the cost of more tokens and latency.** Coder**, a code-specialized variant if programming is your main use.
Qwen has since shipped newer A3B-class successors (the Qwen3.5 / Qwen3.6 35B-A3B line), if you want the bleeding edge, check the Qwen Hugging Face org. But the 30B-A3B remains the proven, widely-quantized baseline most local guides still point to, which is why it's our reference point here.
The research behind it #
Unlike a lot of point releases, Qwen3 has a real, readable paper: the Qwen3 Technical Report (Qwen Team, May 2025). Its headline contribution is a unified "thinking" and "non-thinking" mode in one model, you can let it reason step-by-step for hard math/coding, or switch to fast direct answers, without swapping models, plus a "thinking budget" that lets you cap how much reasoning compute it spends per query, trading latency for accuracy on demand. The report also documents a jump in multilingual coverage from Qwen2.5's 29 languages to 119. For a local runner, the practical upshot is that one ~17 GB download covers both "quick assistant" and "slow careful reasoner" duty.
Why the community calls it "the new default" #
The reception among people who actually run models locally has been unusually unanimous: quantized 30B-A3B is, in the words of one widely-shared community write-up, the new default for local RAG and agent work, landing roughly 98% of frontier-class quality at a fraction of the latency, cost, and energy. The numbers owners report back this up: community quantizations run around ~45 tokens/second on a 24 GB GPU at solid accuracy, and an Apple-silicon MLX port has been clocked near ~64 tokens/second, both comfortably in "feels instant" territory for chat and fast enough for agent loops.
It isn't flawless, and the honest caveats are worth stating. Thinking mode is token-hungry and verbose, great for accuracy, but it'll happily spend hundreds of tokens "reasoning" about a simple question unless you cap the budget or switch to non-thinking mode. And the leaderboard crowd will always point at something bigger. But that's exactly the trap this model dodges: for local use, picking the biggest model on the chart is rarely the right call, picking the best one you can actually run well almost always is.
Why this and not a faster 8B model, or a bigger one? An 8B dense model runs even quicker but visibly trails on reasoning and coding; the 235B-class models answer a notch better but need a small server's worth of memory. Qwen3-30B-A3B sits in the gap that actually matters for a single machine, roughly 8B-class speed with near-frontier quality, which is precisely why the community treats it as the default rather than a compromise.
What it takes to run it #
This is where Qwen3-30B-A3B earns its "default" status, the hardware bar is low and broad:
| Your hardware | How it runs |
|---|---|
| A 24 GB GPU (used RTX 3090, RTX 4090, RTX 5090) | The sweet spot. Q4_K_M is ~17 GB, so it fits with context headroom and runs ~45 tok/s. |
| 16 GB GPU (5060 Ti / 4060 Ti / Arc) | Works with a lower quant or a few layers offloaded to system RAM, slower, still very usable. |
| Apple Silicon (M-series, 24 GB+ unified) | Excellent, the MLX build hits ~64 tok/s; unified memory makes the fit easy. |
| CPU + plenty of RAM | Genuinely viable: only 3.3 B params are active per token, so it's far less brutal on CPU than a dense 30B. |
The standout value play is the one most of r/LocalLLaMA already landed on: a used RTX 3090 at around $700 gives you 24 GB of fast VRAM, exactly enough to run this model comfortably at Q4 with room for long context. Newer 24 GB cards like the RTX 5090 are faster but the 3090 is the price-per-token king here. There's no $9,000 Mac in this story.
To make it concrete for your machine: run it through our Can I run it? calculator to see the exact fit and speed, use the quant picker to grab the right GGUF (Q4_K_M from bartowski or unsloth is the consensus), and if you're weighing a GPU purchase against just renting, the cost calculator shows where the lines cross.
The bottom line #
If GLM-5.2 is the model you admire from a distance, Qwen3-30B-A3B is the one you actually install. For the overwhelming majority of local users, coding assistants, RAG, agents, private chat, a quantized 30B-A3B on a single 24 GB card is the right answer: ~98% of the quality of models you can't run, at speeds that feel instant, under a no-strings Apache license. It's the strongest argument going that the local-AI frontier isn't about who has the biggest box; it's about how much capability now fits in a modest one.
Sources & how we researched this #
We have not run Qwen3-30B-A3B first-hand. This synthesizes the official Hugging Face model card and the Qwen3 repo (specs, license, modes); the verified Qwen3 Technical Report (the thinking-mode and thinking-budget design); community write-ups and owner reports for the "new default" consensus and the tok/s figures (e.g. this community post); and quant/VRAM references for the hardware math. Tokens/sec are owner-reported and directional, not controlled benchmarks; verify against your own setup.
Related guides #
GLM-5.2: the most powerful open-weight model yet, and the brutal reality of running it locally(the model you can't run; this is the one you can)The used RTX 3090 in 2026: still local AI's best dealMixture-of-Experts, explained: why "active parameters" decide what runs on your machineCan I run it? calculator·Quant picker