{"slug": "glm-5-2-how-to-run-locally", "title": "GLM-5.2 – How to Run Locally", "summary": "Z.ai released GLM-5.2, a 744B-parameter open model with 40B active parameters and a 1M context window, claiming it matches or exceeds proprietary models like Claude 4.8 Opus and GPT-5.5. Unsloth released dynamic GGUF quantizations reducing the model size by up to 86%, enabling local execution on consumer hardware such as a 256GB Mac or a single 24GB GPU with RAM offloading.", "body_md": "# GLM-5.2 - How to Run Locally\n\nRun the new GLM-5.2 model by Z.ai on local hardware!\n\nGLM-5.2 is Z.ai’s new open model, delivering SOTA performance across long-horizon coding, reasoning, and agentic tasks. With **744B parameters**, 40B active parameters, and a **1M context** window, it can now be run locally using [Unsloth Dynamic](/docs/basics/unsloth-dynamic-2.0-ggufs) GGUFs. GLM-5.2 is the **strongest open model** to date, performing on par with Claude 4.8 Opus, GPT-5.5, and Gemini 3.1 Pro across Artificial Analysis and many other benchmarks.\n\nThe full model requires **1.51TB** of disk space, while the Unsloth Dynamic 2-bit GGUF reduces this to **239GB (-84% size)** by upcasting important layers to 8 or 16-bit. Dynamic 1-bit lowers this further to **217GB (-86%)**. Thanks to Z.ai for giving Unsloth day-zero access. **GLM-5.2-GGUF**\n\n[Run GLM-5.2 Tutorials](/docs/models/glm-5.2#run-glm-5.2-tutorials) [Quantization Results](/docs/models/glm-5.2#quantization-analysis)\n\n**⚙️ Usage Guide**\n\nThe 2-bit dynamic quant `UD-IQ2_M` uses **239GB** of disk space - this fits directly on a **256GB unified memory Mac** and works well with a **1x24GB GPU** and **256GB of RAM** using MoE offloading. 
The **1-bit** quant fits in 223GB of RAM, and 8-bit requires 810GB of RAM.\n\n**Table: Inference hardware requirements** (units = total memory: RAM + VRAM, or unified memory)\n\n| | | | | | |\n| --- | --- | --- | --- | --- | --- |\n| 223 GB | 245 GB | 290-360 GB | 372-475 GB | 570 GB | 810 GB |\n\nFor best performance, make sure your total available memory, including VRAM and system RAM, exceeds the quantized model file size by a comfortable margin.\n\n### Recommended Settings\n\nGLM-5.2 has **3 thinking modes**: non-thinking, plus thinking at two levels, **High** and **Max**. Use Max Thinking for complicated tasks. In [Unsloth Studio](/docs/models/glm-5.2#run-glm-5.2-in-unsloth-studio) you can easily toggle between High Thinking, Max Thinking, and non-thinking in the UI.\n\nUse these settings for most use cases:\n\n| | |\n| --- | --- |\n| `temperature` = 1.0 | `temperature` = 1.0 |\n| `top_p` = 0.95 | `top_p` = 1.0 |\n\n**Maximum context window:** `1,048,576`.\n\nGLM-5.2 uses thinking mode by default and supports `reasoning_effort` values of \"high\" and \"max\", as well as disabling thinking. To disable thinking, use `--chat-template-kwargs '{\"enable_thinking\":false}'`\n\nIf you're on **Windows** PowerShell, use: `--chat-template-kwargs \"{\\\"enable_thinking\\\":false}\"`\n\nSet the value to `true` or `false` as needed.\n\nYou can now also use `--reasoning on` or `--reasoning off` in llama.cpp!\n\n### 📈 Quantization analysis\n\nWe also ran KLD (KL Divergence) to gauge the accuracy of our quantizations of GLM-5.2-GGUF. Dynamic 4-bit UD-Q4_K_XL and dynamic 5-bit UD-Q5_K_XL are generally lossless, and smaller quants also work great!\n\nOn pure top-1% accuracy, **dynamic 1-bit gets around 76.2% accuracy while being 86% smaller**! 
Dynamic 2-bit gets around 82% accuracy whilst being 84% smaller.\n\n99.9% KLD is also generally good - there is a larger uplift from 4-bit onwards though, so for heavily out-of-distribution tasks, dynamic 4-bit is probably best.\n\nThe mean KLD follows a clear monotonic trend vs disk space, and shows that GLM-5.2 works well even at 1-bit!\n\n## Run GLM-5.2 Tutorials:\n\nYou can now run GLM-5.2 in [llama.cpp](/docs/models/glm-5.2#run-in-llama.cpp) and [Unsloth Studio](/docs/models/glm-5.2#run-glm-5.2-in-unsloth-studio). We will use the 239GB [UD-IQ2_M](https://huggingface.co/unsloth/GLM-5.2-GGUF/tree/main/UD-IQ2_M) quant for the best balance of accessibility and accuracy.\n\n### 🦥 Run GLM-5.2 in Unsloth Studio\n\nGLM-5.2 can run in [Unsloth Studio](/docs/new/studio), an open-source web UI for local AI. **Unsloth Studio automatically offloads to RAM and detects multi-GPU setups**. With Unsloth Studio, you can run models locally on **MacOS, Windows**, and Linux, and:\n\n- Search, download, [run GGUFs](/docs/new/studio#run-models-locally) and safetensor models\n- **Self-healing** tool calling + **web search**\n- **Code execution** (Python, Bash)\n- [Automatic inference](/docs/new/studio#model-arena) parameter tuning (temp, top-p, etc.)\n- Fast CPU + GPU inference via llama.cpp\n- [Train LLMs](/docs/new/studio#no-code-training) 2x faster with 70% less VRAM\n\n**Install and Launch Unsloth**\n\nTo install, run in your terminal:\n\nMacOS, Linux, WSL:\n\nWindows PowerShell:\n\n**Launch Unsloth**\n\nMacOS, Linux, WSL and Windows:\n\nThen open `http://127.0.0.1:8888` (or your specific URL) in your browser.\n\n**Launch Unsloth securely with HTTPS and Cloudflare**\n\n**NEW!** Unsloth now provides a secure way to launch Studio over HTTPS through a free Cloudflare tunnel. Use the command below (works on Windows, Mac & Linux):\n\n**Search and download GLM-5.2**\n\nUnsloth Studio automatically offloads to RAM and detects multi-GPU setups. 
On first launch you will need to create a password to secure your account and sign in again later.\n\nThen go to the [Studio Chat](/docs/new/studio/chat) tab, search for **GLM-5.2** in the search bar, and download your desired model and quant. Ensure you have enough compute to run the model.\n\n**Run GLM-5.2**\n\nInference parameters should be set automatically when using Unsloth Studio, but you can still change them manually. You can also edit the context length, chat template and other settings.\n\nFor more information, you can view our [Unsloth Studio inference guide](/docs/new/studio/chat).\n\n### 🦙 Run GLM-5.2 in llama.cpp\n\nFor this guide we'll be running the `UD-IQ2_M` quant, which requires at least 245GB of RAM. Feel free to change the quantization type. For these tutorials, we will use [llama.cpp](https://github.com/ggml-org/llama.cpp) for fast local inference. GGUF: **GLM-5.2-GGUF**\n\nObtain the latest `llama.cpp` on [GitHub here](https://github.com/ggml-org/llama.cpp). You can follow the build instructions below as well. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you don't have a GPU or just want CPU inference. **For Apple Mac / Metal devices**, set `-DGGML_CUDA=OFF` then continue as usual - Metal support is on by default.\n\nYou can now use `llama.cpp` directly to download and load models, just like `ollama run`. First, select the quantization type you want, like `UD-IQ2_M`. Also use `export LLAMA_CACHE=\"unsloth/GLM-5.2-GGUF\"` to force `llama.cpp` to save to a specific location. **Note this download process might be very slow**, so it's probably best to use the manual download process in the next section.\n\nTo download the model manually **(much faster!)**, use the code below (after running `pip install huggingface_hub`). 
If downloads get stuck, see: [Hugging Face Hub, XET debugging](/docs/basics/troubleshooting-and-faqs/hugging-face-hub-xet-debugging)\n\nIf you want to use the dynamic 1-bit quant, then do:\n\nThen run the model in conversation mode. Use `unsloth/GLM-5.2-GGUF/UD-IQ2_M/GLM-5.2-UD-IQ2_M-00001-of-00006.gguf` for 2-bit or `unsloth/GLM-5.2-GGUF/UD-IQ1_S/GLM-5.2-UD-IQ1_S-00001-of-00006.gguf` for 1-bit.\n\n### 📐 Long context via KV Cache quantization\n\nTo utilize long context in llama.cpp, we need to employ KV cache quantization to reduce memory usage. llama.cpp recently added higher-accuracy tricks for KV cache quantization - see [this PR](https://github.com/ggml-org/llama.cpp/pull/21038) and other PRs!\n\nCurrently, these KV cache dtypes are supported:\n\nBy default `f16` is used. If you use `q4_0`, which is around 4.5 bits per value, you can reach around 16 / 4.5 = **3.5x longer context lengths**! So if your model used to support 10K, 35K can be within reach! `q4_1` is probably better since you also get a shifting parameter, and it is 5 bits per value - so 3.2x longer contexts.\n\nUse it like below:\n\n## 📊 Benchmarks\n\nGLM-5.2 benchmarks in table format:\n\n| Benchmark | | | | | | | | |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| **Reasoning** | | | | | | | | |\n| HLE | 40.5 | 49.8* | 41.4* | 45 | 31 | 41.4 | 37 | 37.7 |\n| HLE (w/ Tools) | 54.7 | 57.9* | 52.2* | 51.4* | 52.3 | 53.5 | - | 48.2 |\n| CritPt | 20.9 | 20.9 | 27.1 | 17.7 | 4.6 | 13.4 | 3.7 | 12.9 |\n| AIME 2026 | 99.2 | 95.7 | 98.3 | 98.2 | 95.3 | 97 | - | 94.6 |\n| HMMT Nov. 2025 | 94.4 | 96.5 | 96.5 | 94.8 | 94 | 95 | 84.4 | 94.4 |\n| HMMT Feb. 2026 | 92.5 | 96.7 | 96.7 | 87.3 | 82.6 | 97.1 | 84.4 | 95.2 |\n| IMOAnswerBench | 91.0 | 83.5 | - | 81 | 83.8 | 90 | - | 89.8 |\n| GPQA-Diamond | 91.2 | 93.6 | 93.6 | 94.3 | 86.2 | 90 | 93 | 90.1 |\n| **Coding** | | | | | | | | |\n| SWE-bench Pro | 62.1 | 69.2 | 58.6 | 54.2 | 58.4 | 60.6 | 59 | 55.4 |\n| NL2Repo | 48.9 | 69.7 | 50.7 | 33.4 | 42.7 | 47.2 | 42.1 | 35.5 |\n| DeepSWE | 46.2 | 58 | 70 | 10 | 18 | 18 | 20 | 8 |\n| ProgramBench | 63.7 | 71.9 | 70.8 | 39.5 | 50.9 | - | - | 47.8 |\n| Terminal Bench 2.1 (Terminus-2) | 81.0 | 85 | 84 | 74 | 63.5 | 75 | 65 | 64 |\n| Terminal Bench 2.1 (Best Reported Harness) | 82.7 | 78.9 | 83.4 | 70.7 | 69 | - | - | - |\n| FrontierSWE (Dominance) | 74.4 | 75.1 | 72.6 | 39.6 | 30.5 | - | - | 29.0 |\n| PostTrainBench | 34.3 | 37.2 | 28.4 | 21.6 | 20.1 | - | - | - |\n| SWE-Marathon | 13.0 | 26.0 | 12.0 | 4.0 | 1.0 | - | - | - |\n| **Agentic** | | | | | | | | |\n| MCP-Atlas (Public Set) | 76.8 | 77.8 | 75.3 | 69.2 | 71.8 | 76.4 | 74.2 | 73.6 |\n| Tool-Decathlon | 48.2 | 59.9 | 55.6 | 48.8 | 40.7 | - | - | 52.8 |", "url": "https://wpnews.pro/news/glm-5-2-how-to-run-locally", "canonical_source": "https://unsloth.ai/docs/models/glm-5.2", "published_at": "2026-06-19 14:49:38+00:00", "updated_at": "2026-06-19 15:08:38.853923+00:00", "lang": "en", "topics": ["large-language-models", "ai-research", "ai-tools", "ai-infrastructure", "ai-products"], "entities": ["Z.ai", "GLM-5.2", "Unsloth", "Claude 4.8 Opus", "GPT-5.5", "Gemini 3.1 Pro", "Mac", "llama.cpp"], "alternates": {"html": "https://wpnews.pro/news/glm-5-2-how-to-run-locally", "markdown": "https://wpnews.pro/news/glm-5-2-how-to-run-locally.md", "text": "https://wpnews.pro/news/glm-5-2-how-to-run-locally.txt", "jsonld": "https://wpnews.pro/news/glm-5-2-how-to-run-locally.jsonld"}}