# GLM-5.2 Hallucinates 3x Less Than GPT-5.5 — Open Weight Wins

> Source: <https://byteiota.com/glm-5-2-hallucinates-3x-less-than-gpt-5-5-open-weight-wins/>
> Published: 2026-06-20 10:12:31+00:00

Z.ai released GLM-5.2 on June 16 under an MIT license — full weights, no restrictions, free to deploy. Within 48 hours, benchmarks revealed something that should make every team running GPT-5.5 uncomfortable: the free open-weight model outperforms OpenAI’s flagship on coding tasks and hallucinates at 28% versus GPT-5.5’s 86% on the [AA-Omniscience hallucination benchmark](https://artificialanalysis.ai/evaluations/omniscience). That’s a 3x reliability gap, in favor of the model you can run for free.

GLM-5.2 is a 753-billion-parameter mixture-of-experts model from Z.ai (Zhipu AI’s international brand, spun out of Tsinghua University), with 40 billion parameters active per token. It scored 62.1 on SWE-bench Pro — real GitHub issue resolution — compared to GPT-5.5’s 58.6. At $1.40 per million input tokens on OpenRouter, it costs roughly one-sixth what you’d pay for GPT-5.5 output tokens. The benchmark data is now on the [Hacker News front page](https://news.ycombinator.com/item?id=48600167) with 156 points and active debate.

## The Benchmark Gap Is Real

The coding numbers hold up across multiple evaluations. On FrontierSWE — which tests long-horizon task completion — GLM-5.2 scores 74.4% versus GPT-5.5’s 72.6%. On MCP-Atlas, which measures tool-use in agentic workflows, GLM-5.2 scores 77.0 against GPT-5.5’s 75.3. These aren’t rounding errors; the free model is consistently ahead of the paid one on the benchmarks that matter for production coding work.

However, the hallucination data is the bigger story. The AA-Omniscience benchmark, published by [Artificial Analysis](https://artificialanalysis.ai/evaluations/omniscience), inverts the usual incentive: most benchmarks reward a model for guessing, while this one penalizes confident wrong answers and rewards saying “I don’t know.” Under those conditions, the hallucination rates rank like this, worst first: DeepSeek V4 Pro at 94%, GPT-5.5 at 86%, Claude Fable 5 at 48%, Opus 4.8 at 36%, and GLM-5.2 at 28%. The largest models post the worst rates.
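To make the scoring idea concrete, here is a minimal sketch in Python. It is not Artificial Analysis’s published formula: the `Tally` type, both functions, and the two tallies are hypothetical, and the tallies were chosen only so that the resulting rates land on the 86% and 28% figures quoted above. The assumption is that a wrong answer costs a point, an abstention costs nothing, and the hallucination rate is the share of wrong answers among all questions the model did not get right.

```python
from dataclasses import dataclass


@dataclass
class Tally:
    """Hypothetical per-model counts on a 100-question run (not real data)."""
    correct: int
    wrong: int
    abstained: int


def penalized_score(t: Tally) -> float:
    """Assumed rule: +1 per correct answer, -1 per wrong answer, 0 per abstention."""
    total = t.correct + t.wrong + t.abstained
    return 100.0 * (t.correct - t.wrong) / total


def hallucination_rate(t: Tally) -> float:
    """Assumed definition: wrong answers as a share of all non-correct responses."""
    not_correct = t.wrong + t.abstained
    return 100.0 * t.wrong / not_correct if not_correct else 0.0


# Two invented models with identical accuracy (50 of 100 correct).
guesser = Tally(correct=50, wrong=43, abstained=7)    # guesses when unsure
cautious = Tally(correct=50, wrong=14, abstained=36)  # abstains when unsure

for name, tally in (("guesser", guesser), ("cautious", cautious)):
    print(name, hallucination_rate(tally), penalized_score(tally))
```

Both invented models answer the same number of questions correctly, yet the one that abstains when unsure ends up with the lower hallucination rate and the higher penalized score. That is the behavior a benchmark of this kind is built to surface.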

## The “I Don’t Know” Problem

An analyst at [arrowtsx.dev published a direct test](https://arrowtsx.dev/bigger-models/) on June 18 that illustrates why. The prompt: design a custom asyncio event loop policy in Python that overrides `get_child_watcher()`, with architectural constraints that are technically impossible to satisfy simultaneously. GLM-5.2 identified the logical impossibility in 12 seconds. DeepSeek V4 Pro (1.6 trillion parameters) spent four minutes generating convincingly wrong code. As the analyst put it: “Bigger models will actively convince you that a solution is correct.”

The mechanism is training dynamics. Models trained on massive factual datasets develop confidence rather than calibration. They learn to produce fluent, authoritative-sounding responses — which is exactly what gets rewarded during training. What doesn’t get rewarded is knowing when to stop. GPT-5.5’s 86% hallucination rate means that when it encounters something at the edge of its knowledge, it guesses confidently rather than abstaining. For an agent running unattended, that’s a silent failure mode.

## MIT License Changes the Math

GLM-5.2’s weights are released under MIT — the most permissive open-source license. That means you can run it locally (the full weights are 1.51TB), fine-tune it on proprietary code, deploy it on any infrastructure, and pay no royalties. Simon Willison called it [“probably the most powerful text-only open weights LLM”](https://simonwillison.net/2026/Jun/17/glm-52/) and the Artificial Analysis Intelligence Index ranks it first among all open-weights models. This isn’t just cheaper than GPT-5.5 — it’s structurally different. Zero lock-in, full data control, no usage caps.
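For anyone sizing a self-hosted deployment, the figures quoted in this article support a quick back-of-the-envelope check. The sketch below uses only the 753 billion total parameters, 40 billion active parameters, and 1.51TB weight size stated here. The 80 GB of memory per accelerator is a hypothetical value picked for illustration, and the estimate covers weights only, ignoring the KV cache, activations, and any quantization.

```python
import math

TOTAL_PARAMS = 753e9    # total parameters, as quoted in the article
ACTIVE_PARAMS = 40e9    # parameters active per token, as quoted
WEIGHTS_BYTES = 1.51e12  # 1.51 TB, read as decimal terabytes

# Hypothetical accelerator memory, chosen for illustration only.
DEVICE_MEMORY_BYTES = 80e9

# Storage per parameter implied by the quoted figures.
bytes_per_param = WEIGHTS_BYTES / TOTAL_PARAMS

# Share of the model that participates in any single token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Lower bound on devices needed just to hold the weights.
min_devices = math.ceil(WEIGHTS_BYTES / DEVICE_MEMORY_BYTES)

print(f"bytes per parameter: {bytes_per_param:.2f}")
print(f"active fraction per token: {active_fraction:.1%}")
print(f"devices for weights alone: {min_devices}")
```

The result of about two bytes per parameter is consistent with a 16-bit weight format, though the article does not state the format. Only around 5% of the parameters are active per token, but a mixture-of-experts model still needs all of its weights resident in memory, so the device count is driven by the full 1.51TB.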

The MIT license also sidesteps the main concern about Z.ai’s hosted API, which routes through Chinese servers subject to the National Intelligence Law. For regulated industries or sensitive codebases, that’s a non-starter. However, self-hosting on your own infrastructure or using OpenRouter’s Western endpoints eliminates that concern entirely. The model’s open license is what makes it a real alternative — not just a benchmark curiosity.

## What You’re Trading

The limitations are worth stating plainly. Average response time runs around 75 seconds, compared to sub-30 seconds for Opus 4.8 — that’s a meaningful latency difference for interactive use cases. GLM-5.2 also uses more output tokens per task (roughly 43K versus 24-37K for competitors), partially offsetting the cost advantage if your workload generates heavy output. And it’s text-only: no vision inputs, no multimodal tasks. If your agent stack depends on image understanding, look elsewhere.
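How much of the cost advantage the extra output tokens eat up can be estimated from the numbers already given. This is a rough sketch, assuming the article’s one-sixth price ratio applies to output tokens and that cost scales linearly with token count. The function name `relative_cost` and the per-task framing are illustrative, not taken from any benchmark.

```python
PRICE_RATIO = 1 / 6             # GLM-5.2 price relative to GPT-5.5, per the article
GLM_TOKENS = 43_000             # approximate output tokens per task, as quoted
COMPETITOR_TOKENS = (24_000, 37_000)  # quoted range for competitors


def relative_cost(glm_tokens: int, other_tokens: int, price_ratio: float) -> float:
    """Cost of one GLM-5.2 task as a fraction of the competitor's cost.

    Assumes both models are billed purely per output token.
    """
    return (glm_tokens * price_ratio) / other_tokens


# Worst case for GLM-5.2: the competitor is at the low end of its token range.
worst = relative_cost(GLM_TOKENS, COMPETITOR_TOKENS[0], PRICE_RATIO)
# Best case: the competitor is at the high end of its range.
best = relative_cost(GLM_TOKENS, COMPETITOR_TOKENS[1], PRICE_RATIO)

print(f"per-token price ratio: {PRICE_RATIO:.2f}")
print(f"per-task cost ratio: {best:.2f} to {worst:.2f}")
```

Under these assumptions the per-task cost lands at roughly 0.19 to 0.30 of the competitor’s, rather than the 0.17 that the per-token price alone suggests. The advantage shrinks but does not disappear.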

For long-horizon coding agents, large codebase analysis, and cost-sensitive production deployments, GLM-5.2 is now a credible first option — not a fallback. The 1-million-token context window has been tested on 400K-token monorepos without degradation. The SWE-bench Pro numbers reflect real-world software engineering performance. The hallucination advantage is documented and significant.

## Key Takeaways

- GLM-5.2 beats GPT-5.5 on SWE-bench Pro (62.1 vs 58.6), FrontierSWE, and MCP-Atlas, at one-sixth the cost per token
- On the AA-Omniscience hallucination benchmark, GPT-5.5 hallucinates at 86%; GLM-5.2 hallucinates at 28% — larger models consistently score worse on this metric
- The MIT license makes full deployment and fine-tuning genuinely viable; avoid Z.ai’s own API for sensitive workloads and use self-hosting or OpenRouter instead
- Real limitations: 75-second average response time, high output token usage, text-only (no vision) — rule it out if latency or multimodal matter to your stack

The era of paying a premium for worse reliability is ending. GLM-5.2 isn’t a benchmark curiosity — it’s a signal that the open-weight tier has caught up on reliability, not just capability.
