Qwen-AgentWorld-35B-A3B: a local 'world model' you can run at home

Qwen shipped something on June 22 that does not behave like the chat models you are used to, and the most common reaction in the first few days has been a confused "wait, what is this even for?" Qwen-AgentWorld-35B-A3B is not a coding assistant or a general chatbot. Alibaba calls it a language world model: a model trained to predict what an environment will do next when an agent takes an action, rather than to pick the action itself. It is Apache-2.0, the weights are on Hugging Face, GGUF quants already exist, and the active-parameter count is small enough that a used GPU can run it fast. So it is worth a serious look, as long as you understand what you are downloading.

We have not run it first-hand. What follows is built from Qwen's model card and technical report, the early community reaction, and the hardware math, with every source linked at the bottom.

What Qwen claims #

The pitch, in Alibaba's framing, is that a regular LLM picks the next action, while a world model predicts the next state. Feed it the current screen or terminal output plus a proposed action, and it simulates what happens next. Qwen says it covers seven agent domains in one model: MCP tool-calling, Search, Terminal, software engineering (SWE), Android, Web, and OS, spanning both text environments and GUI ones. The training pipeline is described as three stages: continued pre-training to inject environment knowledge, supervised fine-tuning to activate next-state-prediction, and reinforcement learning to sharpen simulation fidelity, over more than 10 million real interaction trajectories.
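The difference between picking an action and predicting a state is easiest to see as two function signatures. The sketch below is illustrative only: `Step`, `build_prompt`, `look_ahead`, and `toy_model` are hypothetical names, the prompt layout is an assumption (the real chat template is whatever the model card specifies), and `toy_model` is a stand-in so the example runs without any model at all.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    observation: str  # current screen or terminal output
    action: str       # action the agent proposes to take


# A policy maps an observation to an action; a world model maps
# (observation, action) to a predicted next observation.
Policy = Callable[[str], str]
WorldModel = Callable[[Step], str]


def build_prompt(step: Step) -> str:
    """Hypothetical prompt layout, not the model's real chat template."""
    return (
        "Current observation:\n" + step.observation + "\n\n"
        "Proposed action:\n" + step.action + "\n\n"
        "Predicted next observation:\n"
    )


def look_ahead(model: WorldModel, observation: str, candidates: list[str]) -> dict[str, str]:
    """Simulate each candidate action without executing any of them."""
    return {action: model(Step(observation, action)) for action in candidates}


def toy_model(step: Step) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    if step.action.startswith("rm "):
        return step.observation.replace(step.action[3:], "").strip()
    return step.observation


preds = look_ahead(toy_model, "a.txt b.txt", ["rm a.txt", "ls"])
for action, predicted in preds.items():
    print(f"{action!r} -> {predicted!r}")
```

In a real agent loop, `toy_model` would be replaced by a call that sends `build_prompt(step)` to a locally served model, and the agent would compare the predicted observations before committing to one action.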

The numbers that matter for a local runner: 35 billion total parameters, roughly 3 billion active per token, a mixture-of-experts design with 256 experts (8 routed plus 1 shared activated), and a native context window of 262,144 tokens (Qwen's docs suggest 128K as the practical floor for the intended use). There is a much larger sibling, Qwen-AgentWorld-397B-A17B, which is the headline-benchmark model; the 35B-A3B is the one most people can realistically run.
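To make "8 routed plus 1 shared" concrete, here is a generic top-k routing sketch. It is not Qwen's implementation: the router scores are random, and normalising with a softmax over only the selected experts is an assumption, since the source does not say how the gate weights are computed.

```python
import math
import random

NUM_EXPERTS = 256  # from the model card
TOP_K = 8          # routed experts activated per token


def route(logits: list[float], k: int = TOP_K) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


random.seed(0)
router_logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # made-up scores
weights = route(router_logits)

# The shared expert runs for every token in addition to the routed ones.
experts_run = len(weights) + 1
print("experts run for this token:", experts_run)
print("fraction of routed experts used:", TOP_K / NUM_EXPERTS)
```

Only the selected experts' weights are read for a given token, which is why the active-parameter count is so much smaller than the total.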

On Qwen's own benchmark, AgentWorldBench, the 35B-A3B scores 56.39 out of 100 overall, with the strongest results on OS (65.92), SWE (65.63), and MCP tool-calling (64.79), and the weakest on Search (36.69). Treat those as the creator's claimed numbers on the creator's own test, not independent results.

The architecture is the interesting part #

The spec sheet hides the genuinely novel bit. The model card describes the stack as a hybrid of Gated DeltaNet blocks and gated attention blocks feeding the MoE layers, not standard full attention all the way down. Gated DeltaNet is a linear-attention variant, and the practical payoff for anyone running this at home is the KV cache. Full attention grows the KV cache with context length, which is what makes long-context runs eat memory; a linear-attention hybrid keeps that growth in check. Pair that with a 256K native window and you have a model designed to chew through long agent traces (a full terminal session, a multi-step browser task) without the memory blowup a dense 35B would hit at the same context.
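A rough way to see the KV-cache argument is to count bytes. The sketch below uses the standard estimate for a full-attention cache (keys plus values, per attention layer, per KV head, per token). Every architecture number in it is a hypothetical placeholder, including the layer count, KV head count, head dimension, and the one-in-four ratio of attention layers; none of them comes from the model card. The fixed-size state that linear-attention layers keep is ignored because it does not grow with context.

```python
def kv_cache_bytes(attn_layers: int, kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Keys + values for every attention layer, KV head, and cached token."""
    return 2 * attn_layers * kv_heads * head_dim * context_tokens * bytes_per_value


# Hypothetical configuration, for illustration only.
LAYERS = 40
KV_HEADS = 2
HEAD_DIM = 256
CONTEXT = 262_144  # native window quoted in the article

full = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT)
# Assume only one layer in four uses full attention; the rest are linear-attention
# layers whose state size is constant regardless of context length.
hybrid = kv_cache_bytes(LAYERS // 4, KV_HEADS, HEAD_DIM, CONTEXT)

GIB = 1024 ** 3
print(f"all layers full attention: {full / GIB:.1f} GiB")
print(f"one in four full attention: {hybrid / GIB:.1f} GiB")
```

Whatever the real ratio is, the cache shrinks in proportion to the share of layers that still use full attention, which is the saving the hybrid design is after.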

That is also why the "world model" label is doing real work here, and why some people push back on it (more on that below). If you want the deeper version, the architecture and the linear-attention trade-off are the same family of ideas we walk through in our quantization guide and the VRAM-sizing explainer.

The research behind it #

There is a real paper. arXiv:2606.24597, "Qwen-AgentWorld: Language World Models for General Agents" (Zuo, Xiao, Sheng, Huang and 29 co-authors at Alibaba, submitted June 23, 2026), lays out the world-model framing: predict environment dynamics from the current observation and action, and use that as a cognitive substrate for planning. The report covers both the 35B-A3B and the 397B-A17B and the CPT/SFT/RL recipe. We verified the paper resolves and read the abstract and framing; some of the deepest architecture details sit in the full PDF rather than the landing page, so we attribute the Gated DeltaNet specifics to the model card.

What the early community is saying #

It is four days old, so there is no settled verdict yet. The signal so far is split, which is the useful part.

On the positive side, the Hugging Face discussions tab has an "Awesome model" thread with real engagement, users reporting it works well dropped into agent frameworks, and people already swapping chat-template recipes for the agent loop. On Hacker News, one commenter called it "completely underrated news" and pointed at the practical angle: a cheap world model could help smaller agent models keep track of workflow state. Several people reported running quantized versions on gaming GPUs within days, with unsloth and other community quants landing fast.

The skeptics are worth listening to. On the Hacker News thread, one user argued Qwen "has decided to rebrand certain LLMs that were trained slightly differently as 'world models'," noting the term usually means something other than an LLM. Another suggested the result is "probably more of a data scale win than a world model breakthrough," crediting the 10M trajectories over the architecture. A third caught apparent errors in the marketing benchmark chart. And the single most common reaction, on both HN and Hugging Face, is a version of "I am struggling to understand where this fits in a workflow." That confusion is the real headline: this is a specialist component for agent builders, not a model you load to ask questions.

What it takes to run it #

Here is the part the spec sheet buries. Yes, only ~3B parameters are active per token, which makes generation fast. But all 35B have to be loaded into memory first. At the Q4_K_M sweet spot that is about 21 GB of weights, before the KV cache and overhead. So this is not a "runs on anything because it is 3B-active" model. It is a model that needs a 24 GB card or a unified-memory machine to hold, and then rewards you with 3B-class speed.
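The arithmetic behind that 21 GB figure is short enough to check yourself. This sketch assumes Q4_K_M averages about 4.8 bits per weight, which is an approximation (the real figure varies by model because different tensors get different quant types); the parameter counts are the ones quoted above.

```python
def weight_gb(params: float, bits_per_weight: float) -> float:
    """Size of quantized weights in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 35e9   # must all be resident in memory
ACTIVE_PARAMS = 3e9   # roughly what is read per generated token
Q4_K_M_BPW = 4.8      # assumed average bits per weight

resident = weight_gb(TOTAL_PARAMS, Q4_K_M_BPW)
touched = weight_gb(ACTIVE_PARAMS, Q4_K_M_BPW)
headroom_24gb = 24 - resident  # left over for KV cache and runtime overhead

print(f"weights held in memory: {resident:.1f} GB")
print(f"weights read per token: {touched:.1f} GB")
print(f"headroom on a 24 GB card: {headroom_24gb:.1f} GB")
```

The first number decides whether the model fits at all; the second is what drives generation speed.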

For real speeds, we lean on owner-measured numbers for the closely related 30-to-35B-A3B class (Qwen3-30B-A3B and the Qwen3.5-35B-A3B sibling), since AgentWorld itself is days old and has no public benchmarks yet. Every number below is a real owner report, linked, not a lab figure or a vendor claim:

| Machine | Fit at Q4_K_M (~21 GB) | Owner-measured tok/s, 30-35B-A3B class |
|---|---|---|
| RTX 5060 Ti / other 16 GB cards | Only with a smaller quant | ~47-51, IQ3_XXS at long context (glukhov.org) |
| Used RTX 3090 / RTX 4090 (24 GB) | | |
| Mac, 64 GB unified | | (llmcheck.net) |
| Strix Halo box, 128 GB unified | | (visorcraft) |
| Mac Studio, 128 GB unified | | (HN) |

The shape of it: a 24 GB card (a used 3090, or a 4090) is the natural home. It holds a Q4 with room for a useful context window, and its memory bandwidth is far higher than the unified-memory boxes above, so expect it at the fast end. A 64 GB-plus Mac or a Strix Halo box trades raw speed for the ability to hold the whole thing plus a long context without breaking a sweat. A 16 GB card can run it, but only by dropping to a smaller quant or offloading layers to system RAM, which is where the ~47-57 numbers come from. We do not have a verified owner run on AgentWorld itself on a 24 GB card yet, so do not guess: check your exact machine, context, and quant in the tools below before you download a 21 GB file.
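The bandwidth point can be turned into a back-of-envelope ceiling: during generation, each token requires reading roughly the active weights once, so memory bandwidth divided by bytes read per token bounds the speed. This is a sketch under stated assumptions, not a prediction. The 4.8 bits per weight is an assumed average for Q4_K_M, 936 GB/s is the RTX 3090's published memory bandwidth, and the 256 GB/s unified-memory figure is an illustrative value. Real throughput lands well below the ceiling because of KV-cache reads, kernel overhead, and imperfect bandwidth utilisation.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, active_params: float,
                         bits_per_weight: float) -> float:
    """Upper bound on tokens/s if each token reads the active weights once."""
    gb_per_token = active_params * bits_per_weight / 8 / 1e9
    return bandwidth_gb_s / gb_per_token


ACTIVE_PARAMS = 3e9  # roughly 3B active per token, per the article
BPW = 4.8            # assumed average bits per weight for Q4_K_M

machines = {
    "RTX 3090 (936 GB/s)": 936.0,
    "unified-memory box (256 GB/s, illustrative)": 256.0,
}

ceilings = {name: decode_ceiling_tok_s(bw, ACTIVE_PARAMS, BPW)
            for name, bw in machines.items()}
for name, tok_s in ceilings.items():
    print(f"{name}: at most ~{tok_s:.0f} tok/s")
```

The absolute numbers matter less than the ratio between them: under these assumptions, the discrete card's ceiling is several times that of the unified-memory machine.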

Is it worth downloading? #

For most people, not as a daily driver, and Qwen is not pretending otherwise. If you want a model to chat with or write code, a standard Qwen3 release is the better grab. AgentWorld earns its place if you are building agents and want a local component that can simulate or look ahead at what a tool call, a terminal command, or a browser action will do, cheaply and with long context, without sending every step to a frontier API. That is a narrow but real use case, and the fact that it runs on a single used GPU at 100-plus tok/s is what makes it interesting rather than academic. The skeptics are also right that "world model" is a generous label and that a lot of the win here is data scale. Both things can be true. Give it a few weeks for owner reports to firm up before you build anything important on it.

Sources and how we researched this #

We have not tested this model first-hand. This write-up is synthesized from: Qwen's Hugging Face model card and announcement; the technical report, arXiv:2606.24597 (verified to resolve); the Hugging Face discussions and the Hacker News thread for early community reaction; and owner-measured tok/s for the 30-to-35B-A3B class from the benchmark links in the table above. Claimed benchmarks are Qwen's own numbers on Qwen's own test. Speed figures are real owner reports, not lab runs, and your numbers will vary by quant, context, and runtime.

Can I run it? calculator, check whether the 35B-A3B fits your exact GPU or Mac

Quant picker, find the exact GGUF file to download for your machine

Cost calculator, buy vs rent vs API for running it

The plain-English quantization guide

source & further reading

vettedconsumer.com, original article