Z.ai shipped GLM-5.2 on June 13, 2026 and published exactly zero benchmark numbers. No SWE-bench Verified. No LiveCodeBench. No HumanEval. No AIDER polyglot scores. The model went live across every GLM Coding Plan tier β Lite, Pro, Max, Team β with a usable 1-million-token context window, a new dual thinking-effort system, and 131,072 max output tokens per response. Third-party evals are pending. That's the full picture as of June 15, 2026.
The benchmark silence is deliberate. Zhipu has been burned before by releasing numbers that competitors then picked apart. This time they're shipping the model first, letting developers run it on their own codebases, and releasing the standalone API and MIT open weights next week β around June 16β20. By the time those land, the community will have its own results.
Whether that strategy helps or hurts adoption is a separate question. What matters right now: GLM-5.2 is live, the context window is real, and the setup is three config lines. Here's everything you need to know.
GLM-5.2 runs on the same 744-billion-parameter Mixture-of-Experts foundation as GLM-5.1. Zhipu did not rebuild the architecture β they fine-tuned it toward coding tasks and extended the context window significantly. The numbers:
| Spec | GLM-5.1 | GLM-5.2 | |---|
| Total parameters | 744B MoE | 744B MoE |
| Context window | 200,000 tokens | 1,000,000 tokens |
| Max output tokens | ~26,000 | 131,072 |
| Thinking-effort levels | Single (default) | High, Max |
| Benchmarks published | SWE-bench Pro (verified) | None at launch |
| Open weights | MIT (released) | MIT (arriving ~June 16β20) |
The 1M-token context window is the headline change. GLM-5.1 shipped with 200K tokens β already competitive with the Claude Sonnet tier. GLM-5.2's 1M puts it alongside Claude Opus 4.8 and Gemini 3.1 Ultra on context length, though raw context size and the ability to reason well across a million tokens are different things. That second part is what the community benchmarks will tell us.
The 131,072 output token limit is also meaningful. Most coding tasks are bounded by input (reading a large codebase) rather than output, but generating complete multi-file refactors or long test suites in a single pass becomes viable at that output ceiling. GLM-5.1's ~26,000 output limit was the bottleneck on those tasks.
GLM-5.2 introduces a two-tier reasoning system: High and Max. Zhipu has not published latency or cost comparisons between the two tiers, which makes it hard to give a definitive recommendation. What they've said:
High β Default for most coding tasks. Uses structured chain-of-thought before responding. Equivalent in behavior to the reasoning budget on other models, but capped at a lower token count. Zhipu recommends it for routine code generation, refactoring, and bug fixes.
Max β Extended reasoning. Uses more tokens on the internal chain-of-thought before producing output. Recommended for architectural decisions, complex debugging across large codebases, and multi-step agentic tasks where correctness trumps speed.
The practical implication: start on High for your daily driver. Switch to Max when you're debugging a non-obvious issue or planning a large refactor. Neither mode changes the pricing tier on the Coding Plan β both count as a single prompt against your weekly allowance.
A rough token count guide for developers:
A 400-file Next.js codebase β approximately 200,000β400,000 tokens depending on file size
A full monorepo with tests and config β 500,000β900,000 tokens
A large legacy Java service with Javadoc β can hit 1M+ on its own
Until GLM-5.2, no model on a sub-$30/month coding plan could hold an entire mid-size codebase in context. The typical workaround was chunking: send files in batches, summarize state between calls, manage context windows manually. That's real engineering overhead that disappears when the context limit is removed as the binding constraint.
The catch, and it's real: does the model reason well across all 1M tokens? Context length and context utilization are different things. Several frontier models have demonstrated that retrieval quality degrades significantly past 128K tokens, even when context is nominally larger. Without published benchmark data, we don't know where GLM-5.2's effective reasoning ceiling sits. The community will figure this out in the next two weeks.
GLM-5.2 ships with an OpenAI-compatible API endpoint, which means any tool that supports custom model providers can use it. The Coding Plan API endpoint is already live for subscribers β the public API lands next week.
Three changes to your .claude/settings.json
:
{
"env": {
"ANTHROPIC_DEFAULT_SONNET_MODEL": "glm-5.2[1m]",
"ANTHROPIC_BASE_URL": "https://api.z.ai/v1",
"ANTHROPIC_API_KEY": "YOUR_Z_AI_CODING_PLAN_KEY",
"CLAUDE_CODE_AUTO_COMPACT_WINDOW": "1000000"
}
}
The glm-5.2[1m]
model ID enables the full 1M-token context. Without the [1m]
suffix, Z.ai routes requests to a smaller context variant. The CLAUDE_CODE_AUTO_COMPACT_WINDOW
bump prevents Claude Code from triggering its auto-compact heuristic prematurely β without it, Claude Code would compact at its default threshold and waste the context advantage.
One important note: Claude Code sends Anthropic-formatted messages. Z.ai's endpoint handles the translation, but tool-use formatting differences can surface on complex agentic tasks. If you see unexpected tool call failures, check whether Z.ai's message format translation is handling your specific tool schema correctly.
Provider: OpenAI Compatible
Base URL: https://api.z.ai/v1
API Key: YOUR_Z_AI_CODING_PLAN_KEY
Model: glm-5.2[1m]
Max Tokens: 131072
All three tools use the same OpenAI-compatible config pattern. Add a provider entry:
providers:
z-ai:
base_url: https://api.z.ai/v1
api_key: YOUR_Z_AI_CODING_PLAN_KEY
default_model: glm-5.2[1m]
Then reference z-ai/glm-5.2[1m]
in your routing rules. You can also use our AI Model Cost Calculator to compare the effective cost-per-prompt across these tools before committing.
GLM-5.2 is not token-priced β it uses prompt-based pricing through the Coding Plan. That's unusual, and worth understanding before you build on top of it.
| Tier | Price | Prompts/Week | Best For | |---|
| Lite | ~$18/month | ~400 | Individual developers, exploration |
| Pro | ~$45/month | ~2,000 | Full-time professional use |
| Max | ~$95/month | ~8,000 | Heavy agentic workloads |
| Team | Custom | Custom | Shared team access |
Prompt-based pricing is predictable but can be limiting. A single complex agentic task that makes 40 tool calls counts as 40 prompts β the Lite tier's 400 weekly prompts burns through quickly if you're running multi-step agents. For that use case, Pro or Max makes more economic sense.
The comparison that matters: Kimi K2.7-Code at $0.75/M tokens via OpenRouter, versus GLM-5.2 at ~$18/month on Lite. If you're running fewer than 400 prompts a week on small-to-medium tasks, the flat rate wins. If you're running token-heavy 1M-context jobs repeatedly, you want to model this carefully. Use the AI Prompt Cost Calculator to run the numbers for your specific usage pattern.
Z.ai confirmed that MIT-licensed open weights for GLM-5.2 arrive around June 16β20. That's meaningful for three groups:
Self-hosters β If you have the infrastructure to run a 744B MoE model (the memory footprint is substantial β expect similar requirements to Kimi K2.7-Code's 340GB checkpoint plus KV cache overhead for long-context inference), self-hosting removes the per-prompt cost floor entirely.
Fine-tuners β MIT license with no training restriction means you can fine-tune GLM-5.2 for domain-specific tasks β legal code review, financial analysis, proprietary frameworks β and redistribute the fine-tune. This is more permissive than Moonshot's modified MIT license on K2.7-Code, which prohibits training competing foundation models.
Researchers β Open weights enable independent benchmark runs. Expect SWE-bench Verified and LiveCodeBench numbers from the community within days of weights landing. Those results will either validate the "no benchmarks at launch" gamble or explain it.
Use GLM-5.2 now if your primary bottleneck is context window size β specifically, if you're working on a codebase that exceeds 200K tokens and you've been managing context manually. The 5x jump in context length is a real workflow improvement even without benchmark validation. You can verify context utilization yourself in 30 minutes by your codebase and asking GLM-5.2 to describe relationships between files that are far apart in the token sequence.
Wait if you need independent benchmark validation before switching production pipelines. The MIT weights and third-party SWE-bench results should arrive by June 20. Two weeks of patience gets you actual data on GLM-5.2's coding accuracy β which matters more than context size for most generation tasks.
Don't use GLM-5.2 as your only model for high-stakes code. The benchmark gap is real right now: Kimi K2.7-Code has MCPMark scores, Claude Opus 4.8 has SWE-bench numbers, GPT-5.5 has published evals. GLM-5.2 has a 1M-token context window and a week-old launch. For a two-model architecture that pairs a validated coder with GLM-5.2 for large-context retrieval tasks, read our Kimi K2.7-Code developer guide for the routing pattern.
Zhipu's decision to ship without benchmarks is a calculated risk. The benefit: no number to attack. When Kimi K2.7-Code launched with MCPMark numbers, the immediate community response was to find tasks where the benchmark didn't hold. Benchmarks are both a marketing asset and a liability.
The cost: every developer who wants to evaluate GLM-5.2 seriously has to run their own tests instead of reading a table. That raises the adoption friction, particularly for teams with procurement processes that require published evals.
Z.ai is betting that the 1M context window is compelling enough to drive experimentation on its own β and that independent benchmark results will land faster and carry more credibility than vendor-published numbers. That's a defensible position. It's also a test of whether the model is good enough to sell itself in the first two weeks.
The open weights will answer that question. In the meantime, run it on your codebase. Check whether the 1M-token context retrieves relationships correctly. Check whether Max thinking-effort outperforms High on your specific debugging tasks. Those are the data points that determine whether GLM-5.2 earns a slot in your production routing config β not vendor slides.
Originally published at wowhow.cloud