GLM-5.2 – How to Run Locally

Z.ai released GLM-5.2, a 744B-parameter open model with 40B active parameters and a 1M context window, claiming it matches or exceeds proprietary models like Claude 4.8 Opus and GPT-5.5. Unsloth released dynamic GGUF quantizations reducing the model size by up to 86%, enabling local execution on consumer hardware such as a 256GB Mac or a single 24GB GPU with RAM offloading.

Run the new GLM-5.2 model by Z.ai on local hardware

GLM-5.2 is Z.ai's new open model, delivering SOTA performance across long-horizon coding, reasoning, and agentic tasks. With 744B parameters, 40B active parameters, and a 1M context window, it can now be run locally using Unsloth Dynamic GGUFs (/docs/basics/unsloth-dynamic-2.0-ggufs).

GLM-5.2 is the strongest open model to date, performing on par with Claude 4.8 Opus, GPT-5.5, and Gemini 3.1 Pro across Artificial Analysis and many other benchmarks.

The full model requires 1.51TB of disk space, while the Unsloth Dynamic 2-bit GGUF reduces this to 239GB (-84% size) by upcasting important layers to 8 or 16-bit. Dynamic 1-bit lowers this further to 217GB (-86%). Thanks to Z.ai for giving Unsloth day-zero access.

⚙️ Usage Guide

The 2-bit dynamic quant UD-IQ2_M uses 239GB of disk space. It fits directly on a 256GB unified memory Mac, and works well on a 1x24GB GPU with 256GB of RAM and MoE offloading. The 1-bit quant fits in 223GB of RAM, and 8-bit requires 810GB of RAM.

Table: Inference hardware requirements (units = total memory: RAM + VRAM, or unified memory)

  223 GB (1-bit)
  245 GB (2-bit, UD-IQ2_M)
  290-360 GB
  372-475 GB
  570 GB
  810 GB (8-bit)

For best performance, make sure your total available memory, including VRAM and system RAM, exceeds the quantized model file size by a comfortable margin.

Recommended Settings

GLM-5.2 has 3 thinking modes: non-thinking, plus thinking at two levels, High and Max. Use Max Thinking for complicated tasks. In Unsloth Studio you can toggle High Thinking, Max Thinking and non-thinking from the UI.

Use these settings for most use cases:

- temperature = 1.0, top_p = 0.95
- temperature = 1.0, top_p = 1.0

Maximum context window: 1,048,576 tokens.

GLM-5.2 uses thinking mode by default.
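The disk-size and memory figures quoted in the Usage Guide can be sanity-checked with a little arithmetic. The sketch below is illustrative only: it assumes 1.51TB means 1510GB, takes the 245GB and 810GB requirements from the hardware table, and uses hypothetical helper names (size_reduction_pct, fits_in_memory) that are not part of any Unsloth or llama.cpp API.

```python
# Sizes quoted in this guide, in GB. Treating 1.51TB as 1510GB is an assumption.
FULL_MODEL_GB = 1510
QUANT_DISK_GB = {"dynamic 2-bit (UD-IQ2_M)": 239, "dynamic 1-bit": 217}


def size_reduction_pct(full_gb: float, quant_gb: float) -> float:
    """Percentage of disk space saved relative to the full model."""
    return (1 - quant_gb / full_gb) * 100


def fits_in_memory(required_gb: float, ram_gb: float, vram_gb: float = 0.0) -> bool:
    """True when total memory (RAM + VRAM, or unified memory) covers the requirement."""
    return ram_gb + vram_gb >= required_gb


if __name__ == "__main__":
    for name, gb in QUANT_DISK_GB.items():
        print(f"{name}: about {size_reduction_pct(FULL_MODEL_GB, gb):.0f}% smaller")

    # 245GB is the total-memory figure listed for the 2-bit quant.
    print("256GB unified-memory Mac:", fits_in_memory(245, ram_gb=256))
    print("24GB GPU + 256GB RAM:", fits_in_memory(245, ram_gb=256, vram_gb=24))
    # 810GB is the figure listed for 8-bit, which the same machine cannot hold.
    print("8-bit on 24GB GPU + 256GB RAM:", fits_in_memory(810, ram_gb=256, vram_gb=24))
```

This only compares totals. It does not model the "comfortable margin" recommended above, nor the extra memory that a long context consumes, so treat a bare True as the minimum rather than a guarantee of good speed.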
GLM-5.2 supports reasoning effort set to "high" or "max", or thinking disabled entirely. To disable thinking, use --chat-template-kwargs '{"enable_thinking":false}'

If you're on Windows PowerShell, use: --chat-template-kwargs "{\"enable_thinking\":false}"

Set the value to true or false as needed. You can also use --reasoning on or --reasoning off in llama.cpp.

📈 Quantization analysis

We ran KLD (KL Divergence) measurements to gauge the accuracy of our quantizations of GLM-5.2-GGUF. Dynamic 4-bit (UD-Q4_K_XL) and dynamic 5-bit (UD-Q5_K_XL) are generally lossless, and smaller quants also work well.

On pure top-1 accuracy, dynamic 1-bit reaches around 76.2% while being 86% smaller, and dynamic 2-bit reaches around 82% while being 84% smaller.

The 99.9% KLD is also generally good. There is a larger uplift from 4-bit onwards though, so for heavily out-of-distribution tasks, dynamic 4-bit is probably best.

The mean KLD follows a clear monotonic trend against disk space, and shows that GLM-5.2 works well even at 1-bit.

Run GLM-5.2 Tutorials

You can now run GLM-5.2 in llama.cpp and Unsloth Studio. We will use the 239GB UD-IQ2_M quant (https://huggingface.co/unsloth/GLM-5.2-GGUF/tree/main/UD-IQ2_M) for the best balance of accessibility and accuracy.

🦥 Run GLM-5.2 in Unsloth Studio

GLM-5.2 can run in Unsloth Studio (/docs/new/studio), an open-source web UI for local AI. Unsloth Studio automatically offloads to RAM and detects multi-GPU setups. With Unsloth Studio, you can run models locally on MacOS, Windows and Linux, and:

- Search, download and run GGUFs and safetensor models
- Self-healing tool calling + web search
- Code execution (Python, Bash)
- Automatic inference parameter tuning (temp, top-p, etc.)
- Fast CPU + GPU inference via llama.cpp
- Train LLMs 2x faster with 70% less VRAM

Install and Launch Unsloth

To install, run the install command in your terminal; MacOS, Linux and WSL share one command and Windows PowerShell uses another. The launch command is the same on MacOS, Linux, WSL and Windows. Then open http://127.0.0.1:8888 (or your specific URL) in your browser.

Launch Unsloth securely with HTTPS and Cloudflare (NEW)

Unsloth now provides a secure way to launch Studio over HTTPS through a free Cloudflare tunnel. This works on Windows, Mac and Linux.

Search and download GLM-5.2

On first launch you will need to create a password to secure your account; you will use it to sign in again later. Then go to the Studio Chat tab (/docs/new/studio/chat), search for GLM-5.2 in the search bar, and download your desired model and quant. Ensure you have enough compute to run the model.

Run GLM-5.2

Inference parameters should be set automatically when using Unsloth Studio, but you can still change them manually. You can also edit the context length, chat template and other settings. For more information, see our Unsloth Studio inference guide (/docs/new/studio/chat).

🦙 Run GLM-5.2 in llama.cpp

For this guide we'll run the UD-IQ2_M quant, which requires at least 245GB of RAM. Feel free to change the quantization type. These tutorials use llama.cpp (https://github.com/ggml-org/llama.cpp) for fast local inference. GGUF: GLM-5.2-GGUF

Obtain the latest llama.cpp from GitHub (https://github.com/ggml-org/llama.cpp). When building it, change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference. For Apple Mac / Metal devices, set -DGGML_CUDA=OFF then continue as usual; Metal support is on by default.

You can now use llama.cpp directly to load and download models, just like ollama run. First, select the quantization type you want, such as UD-IQ2_M.
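Once a quantization type is picked, the llama.cpp invocation can be assembled programmatically. The sketch below is a hypothetical helper, not part of llama.cpp or Unsloth. It builds an argument list for llama-cli using the -hf repo:quant form, applies temperature = 1.0 and top_p = 0.95 from the recommended settings, and adds the --chat-template-kwargs switch only when thinking should be disabled. The binary name, the flag spellings and the function name build_llama_cli_args are assumptions to check against your own llama.cpp build.

```python
import json
import shlex
from typing import List


def build_llama_cli_args(
    quant: str = "UD-IQ2_M",
    repo: str = "unsloth/GLM-5.2-GGUF",
    temperature: float = 1.0,
    top_p: float = 0.95,
    enable_thinking: bool = True,
) -> List[str]:
    """Assemble a llama-cli argument list for the chosen quant (hypothetical helper)."""
    args = [
        "llama-cli",
        "-hf", f"{repo}:{quant}",  # download-and-run form: repo:quant
        "--temp", str(temperature),
        "--top-p", str(top_p),
    ]
    if not enable_thinking:
        # Thinking is on by default, so the template kwarg is only passed to turn it off.
        kwargs = json.dumps({"enable_thinking": False}, separators=(",", ":"))
        args += ["--chat-template-kwargs", kwargs]
    return args


if __name__ == "__main__":
    # shlex.join quotes the JSON so the printed line can be pasted into a POSIX shell.
    print(shlex.join(build_llama_cli_args(enable_thinking=False)))
```

Returning a list rather than a single string keeps the JSON argument intact when it is handed to subprocess.run, and avoids the shell-specific quoting differences noted earlier for Windows PowerShell.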
Also use export LLAMA_CACHE="unsloth/GLM-5.2-GGUF" to force llama.cpp to save to a specific location. Note that this download process might be very slow, so it is probably best to use the manual download process instead.

To download the model manually, which is much faster, first run pip install huggingface_hub and then download the quant you want. If downloads get stuck, see: Hugging Face Hub, XET debugging (/docs/basics/troubleshooting-and-faqs/hugging-face-hub-xet-debugging). The dynamic 1-bit quant (UD-IQ1_S) is also available.

Then run the model in conversation mode. Use unsloth/GLM-5.2-GGUF/UD-IQ2_M/GLM-5.2-UD-IQ2_M-00001-of-00006.gguf for 2-bit or unsloth/GLM-5.2-GGUF/UD-IQ1_S/GLM-5.2-UD-IQ1_S-00001-of-00006.gguf for 1-bit.

📐 Long context via KV Cache quantization

To use long context in llama.cpp, we need KV cache quantization to reduce memory usage. llama.cpp recently added higher-accuracy tricks for KV cache quantization; see https://github.com/ggml-org/llama.cpp/pull/21038 and other PRs.

llama.cpp currently supports several KV cache dtypes, including q4_0 and q4_1; by default f16 is used. If you use q4_0, which is around 4.5 bits per value, you can reach context lengths around 16 / 4.5 = 3.5x longer. So if your setup used to support 10K, 35K can be in reach. q4_1 is probably better since you also get a shifting parameter; it is 5 bits per value, so 3.2x longer contexts.

📊 Benchmarks

GLM-5.2 benchmarks in table format:

Reasoning
HLE                                        40.5  49.8  41.4  45    31    41.4  37    37.7
HLE w/ Tools                               54.7  57.9  52.2  51.4  52.3  53.5  -     48.2
CritPt                                     20.9  20.9  27.1  17.7  4.6   13.4  3.7   12.9
AIME 2026                                  99.2  95.7  98.3  98.2  95.3  97    -     94.6
HMMT Nov. 2025                             94.4  96.5  96.5  94.8  94    95    84.4  94.4
HMMT Feb. 2026                             92.5  96.7  96.7  87.3  82.6  97.1  84.4  95.2
IMOAnswerBench                             91.0  83.5  -     81    83.8  90    -     89.8
GPQA-Diamond                               91.2  93.6  93.6  94.3  86.2  90    93    90.1

Coding
SWE-bench Pro                              62.1  69.2  58.6  54.2  58.4  60.6  59    55.4
NL2Repo                                    48.9  69.7  50.7  33.4  42.7  47.2  42.1  35.5
DeepSWE                                    46.2  58    70    10    18    18    20    8
ProgramBench                               63.7  71.9  70.8  39.5  50.9  -     -     47.8
Terminal Bench 2.1 Terminus-2              81.0  85    84    74    63.5  75    65    64
Terminal Bench 2.1 Best Reported Harness   82.7  78.9  83.4  70.7  69    -     -     -
FrontierSWE Dominance                      74.4  75.1  72.6  39.6  30.5  -     -     29.0
PostTrainBench                             34.3  37.2  28.4  21.6  20.1  -     -     -
SWE-Marathon                               13.0  26.0  12.0  4.0   1.0   -     -     -

Agentic
MCP-Atlas Public Set                       76.8  77.8  75.3  69.2  71.8  76.4  74.2  73.6
Tool-Decathlon                             48.2  59.9  55.6  48.8  40.7  -     -     52.8
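Returning to the long-context section: the KV cache arithmetic there can be written out as a small calculation. This is a rough sketch under stated assumptions, namely that the bits-per-value figures are the approximate ones quoted in this guide, that both the K and V caches use the same dtype, and that KV cache memory is the only limit on context length. The names KV_BITS, context_multiplier and reachable_context are illustrative. In llama.cpp the cache dtype is typically chosen with the --cache-type-k and --cache-type-v options, but check your build's help output.

```python
# Approximate storage cost per cached value, in bits, as quoted in this guide.
KV_BITS = {"f16": 16.0, "q4_0": 4.5, "q4_1": 5.0}


def context_multiplier(cache_type: str, baseline: str = "f16") -> float:
    """How many times more tokens fit in the same KV cache memory."""
    return KV_BITS[baseline] / KV_BITS[cache_type]


def reachable_context(baseline_tokens: int, cache_type: str) -> int:
    """Tokens that fit after switching the cache from f16 to cache_type."""
    return int(baseline_tokens * context_multiplier(cache_type))


if __name__ == "__main__":
    for cache_type in ("q4_0", "q4_1"):
        factor = context_multiplier(cache_type)
        tokens = reachable_context(10_000, cache_type)
        print(f"{cache_type}: {factor:.2f}x, about {tokens} tokens from a 10K baseline")
```

The q4_0 ratio works out slightly above the 3.5x quoted in the text, since 16 / 4.5 is closer to 3.56. The model weights still take the same memory regardless of cache dtype, so the real gain depends on how much memory is left for the cache.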