# Run GLM-5.2 Locally: The Open Model Nobody Can Ban

> Source: <https://dev.to/max_quimby/run-glm-52-locally-the-open-model-nobody-can-ban-pnb>
> Published: 2026-06-15 03:40:15+00:00

On June 9, Anthropic shipped Claude Fable 5 — the most capable coding model the industry had ever seen. [Three days later, the U.S. government ordered it offline for every user on Earth](https://www.tomshardware.com/tech-industry/artificial-intelligence/us-export-control-order-forces-anthropic-to-disable-claude-fable-5-and-mythos-5-worldwide). No warning. No transition period. One directive, and the frontier vanished overnight.

📖

[Read the full version with charts and embedded sources on ComputeLeap →]

The same week, Z.ai (Zhipu AI) [released GLM-5.2](https://codersera.com/blog/glm-5-2-release-1m-context-coding-2026/) — a 744-billion-parameter coding model with a one-million-token context window, MIT-licensed open weights arriving within days. The timing was not lost on the developer community.

ℹ️ The message landed clearly on Hacker News: as user Reubend put it, they're "grateful to Chinese labs for being open with their work" — especially after "the Fable 5 fiasco." Open weights aren't just a cost play anymore. They're insurance.

This guide walks you through actually running GLM-5.2 on your own hardware — the VRAM you need, the quantization that fits, and the exact commands for llama.cpp, Ollama, and LM Studio. No API keys. No cloud dependency. No one can pull the plug.

GLM-5.2 is the third major iteration in Z.ai's GLM-5 line, purpose-built for [agentic coding and long-horizon software engineering](https://github.com/zai-org/GLM-5). Here is what you are working with:

| Spec | Value |
|---|---|
Architecture |
Mixture-of-Experts (MoE) |
Total Parameters |
744 billion |
Active Parameters |
~40 billion per token |
Context Window |
1,000,000 tokens |
Max Output |
131,072 tokens |
Training Data |
28.5 trillion tokens |
License |
MIT (open weights) |
Thinking Modes |
High and Max |

The MoE architecture is the key to local viability. Only ~40 billion parameters fire per token — the rest sit idle. That is what makes aggressive quantization work: you are compressing 744B weights, but inference only touches a fraction of them at any given time.

GLM-5.2 supports two thinking-effort presets: High and Max. [Z.ai recommends Max as the default for coding work](https://www.buildfastwithai.com/blogs/glm-5-2-review-2026) — it produces longer reasoning chains before generating output.

The model [launched on June 13](https://codersera.com/blog/glm-5-2-release-1m-context-coding-2026/) on Z.ai's Coding Plan tiers (Lite at ~$18/month through Team), with the standalone API and MIT-licensed weights following within the week. It ships with first-day support for Claude Code, Cline, OpenCode, Roo Code, Goose, and several other agent harnesses — so you can slot it into your existing workflow without rebuilding anything.

**The benchmark caveat.** Z.ai published zero official GLM-5.2 benchmarks at launch. The numbers circulating — including the "#1 SWE-bench Pro" claim — are inherited from GLM-5.1, which scored 58.4 on SWE-bench Pro (ahead of Claude Opus 4.6's 57.3 at the time). Early Hacker News commenter LaurensBER offered a [more measured take](https://news.ycombinator.com/item?id=48518684): GLM-5.2 is "about 6 months behind the frontier labs — very similar to Opus in January." Strong for open weights, not yet matching Claude Opus 4.8 or GPT-5.5 on independently verified evals.

Let's be honest about what "run locally" means for a 744B-parameter model. The VRAM requirements scale dramatically with quantization level:

| Quantization | Disk Size | Minimum Memory | Practical Setup |
|---|---|---|---|
2-bit Dynamic (UD-IQ2_XXS) |
241 GB | 256 GB unified | M4 Ultra Mac Studio, or 1x24GB GPU + 256GB RAM |
1-bit Dynamic |
176 GB | 180 GB | High-RAM workstation + GPU offload |
Q2_K_XL (2-bit) |
~280 GB | 300 GB | 1x24GB GPU + 300GB system RAM |
Q4_K_M |
~476 GB | 500 GB+ | Multi-GPU (2xA100 80GB + large RAM) |
FP8 |
~754 GB | 800 GB+ | 8x H200 SXM5 or equivalent |
FP16 (full) |
~1,701 GB | 1.7 TB+ | Enterprise GPU cluster |

For most developers reading this, the realistic options are the 2-bit quants. The [Unsloth Dynamic 2-bit GGUF](https://unsloth.ai/docs/models/tutorials/glm-5) reduces the model to 241GB — an 85% compression from full precision. That fits on a 256GB unified-memory Mac (M4 Ultra Mac Studio or a maxed-out MacBook Pro) or a workstation with a mid-range GPU plus 256–300GB of system RAM using MoE offloading.

⚠️ "Fits in memory" and "runs fast" are different things. On consumer hardware with 2-bit quants, expect roughly 3–9 tokens per second depending on your setup. The DataCamp tutorial reports ~8.7 tok/s on an H200 with the Q2_K_XL variant. A Mac Studio will be slower. This is fine for batch coding tasks — not ideal for real-time chat.

**Don't have 256GB?** You are not locked out. Cloud GPU rentals ([RunPod](https://www.runpod.io/), Lambda, etc.) with H200 or A100 instances can run the 2-bit quant for a few dollars per hour. That is still cheaper than a Coding Plan subscription if you are running it intermittently — and the weights live on your disk, not someone else's server.

[llama.cpp](https://github.com/ggml-org/llama.cpp) is the foundational C++ inference engine that both Ollama and LM Studio build on. Running it directly gives you the most control over compilation flags, hardware-specific optimizations, and serving parameters.

The [DataCamp tutorial](https://www.datacamp.com/tutorial/run-glm-5-locally) and [Unsloth documentation](https://unsloth.ai/docs/models/tutorials/glm-5) both provide step-by-step walkthroughs. Here is the condensed version.

```
sudo apt-get update && sudo apt-get install -y \
  build-essential cmake curl libcurl4-openssl-dev pciutils

git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j \
    --clean-first --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp
```

On Mac (Metal), swap `-DGGML_CUDA=ON`

for `-DGGML_CUDA=OFF`

— Metal acceleration is enabled by default.

The [Unsloth quantized GGUFs](https://huggingface.co/zai-org/GLM-5) are the go-to for local deployment:

```
pip install -U "huggingface_hub[hf_xet]" hf-xet hf_transfer

huggingface-cli download unsloth/GLM-5-GGUF \
    --local-dir GLM-5-GGUF \
    --include "*UD-IQ2_XXS*"
```

With HF transfer acceleration, download speeds can hit ~1.2 GB/s.

```
./llama.cpp/llama-server \
  --model GLM-5-GGUF/UD-IQ2_XXS/GLM-5-UD-IQ2_XXS-00001-of-00006.gguf \
  --alias "GLM-5.2" \
  --host 0.0.0.0 --port 8080 \
  --jinja --fit on \
  --threads 32 \
  --ctx-size 16384 \
  --batch-size 512 \
  --ubatch-size 128 \
  --flash-attn auto \
  --temp 0.7 --top-p 0.95
```

Key flags: `--fit on`

maximizes GPU VRAM utilization before spilling to system RAM. `--flash-attn auto`

enables optimized attention kernels. `--ctx-size 16384`

sets a practical context window (push higher if memory allows).

Verify it is running:

```
curl -s http://127.0.0.1:8080/v1/models | jq
```

You now have an OpenAI-compatible API at `localhost:8080`

. Point Claude Code, Aider, or any other coding agent at it.

```
export OPENAI_API_BASE=http://127.0.0.1:8080/v1
export OPENAI_API_KEY=local

aider --model openai/GLM-5.2 --no-show-model-warnings
```

If you want to connect this to Claude Code or other tools, see our [guide to running Claude Code with Ollama and OpenRouter](https://computeleap.com/blog/run-claude-code-cheap-ollama-openrouter-guide-2026) — the same pattern applies to any OpenAI-compatible local endpoint.

If you want GLM-5.2 running in under five minutes, [Ollama](https://ollama.com) is the path. It wraps llama.cpp in a managed runtime with one-command model pulls.

```
curl -fsSL https://ollama.com/install.sh | sh

ollama pull glm5:latest

ollama run glm5
```

Ollama handles model downloading, VRAM allocation, and context management automatically. The trade-off: you lose the fine-grained control over batch sizes, thread counts, and quantization variants that llama.cpp provides. For most developers who want local inference without tuning knobs, that is the right deal.

You can also run Ollama as a persistent server and connect coding agents to it. It exposes an OpenAI-compatible API at `localhost:11434`

:

```
ollama serve &

export OPENAI_API_BASE=http://localhost:11434/v1
export OPENAI_API_KEY=ollama
```

For more on using Ollama as a local backend for coding agents, see our [guide to running Claude Code with Ollama](https://computeleap.com/blog/run-claude-code-cheap-ollama-openrouter-guide-2026).

[LM Studio](https://lmstudio.ai) wraps the same inference engine in a desktop application with a visual model browser, one-click downloads from Hugging Face, and a built-in chat interface.

LM Studio is the right choice if you prefer a graphical workflow and do not need the CLI flexibility of llama.cpp. It also makes switching between quantization variants easy — useful for experimenting with the quality-vs-speed trade-off.

For a walkthrough of the LM Studio setup pattern with another open model, see our [Qwen3 local Mac setup guide](https://computeleap.com/blog/qwen3-35b-a3b-local-mac-setup-lm-studio-open-source).

The quantization decision comes down to one question: how much memory do you have?

| Your Hardware | Recommended Quant | Why |
|---|---|---|
256GB Mac Studio / MacBook Pro |
UD-IQ2_XXS (2-bit, 241GB) | Fits in unified memory. Expect 3–5 tok/s |
Workstation + 24GB GPU + 256–300GB RAM |
UD-Q2_K_XL (2-bit, 280GB) | Slightly higher quality with MoE offloading |
Multi-GPU (2xA100/H100) |
Q4_K_M (~476GB) | Noticeable quality bump. Good for production |
Cloud rental (8xH200) |
FP8 (~754GB) | Near-lossless. Best for eval runs |
Budget / testing only |
1-bit Dynamic (176GB) | Minimum viable. "Does my pipeline work?" |

💡 Start with 2-bit. If you are doing serious development work and the output quality is not cutting it, move up to Q4. Most users running GLM-5.2 locally for coding tasks report that 2-bit is "surprisingly usable" — the MoE architecture means quantization errors are diluted across the inactive experts.

Let's set honest expectations. GLM-5.2 is not Claude Opus 4.8. It is not GPT-5.5. Here is where it actually stands.

**Where it is strong:**

**Where it falls short:**

**The honest framing:** GLM-5.2 at 2-bit quantization running locally gives you roughly "Opus-in-January" capability (per the Hacker News community assessment) that nobody can revoke. For many workflows — batch refactors, code generation, agentic loops where latency is less critical — that is more than enough.

The Fable 5 ban was an inflection point, not an aberration.

[VentureBeat's enterprise analysis](https://venturebeat.com/technology/anthropic-blocks-all-public-access-to-claude-fable-5-mythos-5-following-us-government-order-what-enterprises-should-do) recommended that companies "build intelligent routing layers that can dynamically switch from a frontier model to an open-weights fallback" to survive future disruptions. That is not paranoia — it is continuity planning. If the best model you depend on can disappear in 72 hours, you need a layer you actually own.

Open-weight models like GLM-5.2 provide that layer. Once you download the weights, they are yours. MIT license. No API key. No export control order can reach into your local disk. Multiple Hacker News commenters noted the practical advantage: open-weight models [can be downloaded and modified locally](https://news.ycombinator.com/item?id=48518684), circumventing any API-level restrictions.

The deeper question is not whether GLM-5.2 matches Claude Opus 4.8 on benchmarks (it does not). It is whether having a [capable, self-hosted fallback](https://asksurf.ai/pulse/en/glm-5-2-open-weights-pricing-pressure) is worth the hardware investment. After this week, a lot of teams are answering yes.

For a broader look at the local AI landscape, see our [comprehensive guide to running AI locally in 2026](https://computeleap.com/blog/how-to-run-ai-locally-2026) and our deep dive into [why local models are now good enough for real work](https://computeleap.com/blog/local-models-good-enough-stanford-71-percent-xiaomi-mimo-2026).

If you just want GLM-5.2 running as fast as possible:

`localhost`

The weights are MIT-licensed. The inference stack is open source. The hardware is yours. That is the whole point.

*Originally published at ComputeLeap*