cd /news/large-language-models/glm-5-2-vs-claude-opus-what-the-numb… · home topics large-language-models article
[ARTICLE · art-41516] src=dev.to ↗ pub= topic=large-language-models verified=true sentiment=· neutral

GLM-5.2 vs Claude Opus: What the Numbers Actually Say for Developers

Z.ai released GLM-5.2, an open-weight text-only model with a 1 million token context window, priced at one-fifth the output cost of Claude Opus. Independent tests by James Daniel Whitford at TechStackups showed Claude Opus completed a complex WebGL game build in half the time, while GLM-5.2 struggled with visual tasks due to its lack of multimodality. The cost advantage of GLM-5.2 is significant for heavy API users, but subscription pricing can alter the comparison.

read7 min views1 publishedJun 27, 2026

GLM-5.2 from Z.ai dropped recently and the reaction was loud. Some called it the end of closed models. Others dismissed it as benchmark gaming. This article cuts through the noise with data from an independent hands-on test, benchmark numbers, and community discussion.

To be clear upfront: I did not run my own head-to-head test. This article synthesizes work by James Daniel Whitford at TechStackups, independent benchmarks from Artificial Analysis, and community discussion from Hacker News. All sources are cited at the end. The goal is to help you decide which model fits your workflow.

GLM-5.2 is Z.ai's latest flagship model, released under an MIT license as open weights. You can download it, run it locally, or call it through Z.ai's API. It ships with a 1 million token context window and is designed for long-horizon agentic tasks, the kind of multi-hour coding work that coding agents do.

One key limitation: GLM-5.2 is text-only. It cannot read images, parse screenshots, or understand diagrams. Claude Opus is multimodal. This difference turns out to matter a lot in practice.

Per 1 million tokens (source: TechStackups, citing Z.ai and Anthropic pricing):

Metric Claude Opus 4.8 GLM-5.2
Input $5.00 $1.40
Cache read $0.50 $0.26
Output $25.00 $4.40

On output tokens, GLM-5.2 costs roughly one-fifth of what Opus charges. If you run coding agents for hours every day, that difference compounds fast.

A Hacker News commenter raised a valid counterpoint: if you are on a $100/month Claude Max subscription and use it fully, the per-token cost difference shrinks considerably. Subscription pricing changes the math for heavy daily users.

James Daniel Whitford at TechStackups ran both models with the same one-shot prompt: build a third-person 3D platformer in raw WebGL with no libraries. The game needed a character controller, collision detection, a follow camera, a GLB model , GLSL shaders, and skinned animation.

This is not a "make me a landing page" test. A 3D engine in raw WebGL has layers of interdependent systems. If one piece is wrong, the whole thing breaks visibly.

Metric GLM-5.2 Claude Opus 4.8
Build time 1h 10m 40s 33m 30s
Output tokens 131,000 216,809
Cost $5.39 ~$21.92 (estimated)
Tool calls 128 153

Opus finished in half the time. GLM-5.2 cost a fraction of the price.

Opus shipped a cleaner game. The character had textures applied correctly. The spike hazard killed the player. There was a working win condition. The camera and controls felt right. Bugs were minor edge cases, like standing on thin air near platforms due to an overly generous coyote-time grace period.

GLM-5.2 shipped a rougher game. The character rendered as flat gray with missing textures. The spike hazard did nothing when you touched it. There was no win condition. The character model faced backwards the entire time. These were fundamental issues, not polish problems.

GLM-5.2 did nail one thing: a spring launch mechanic that let you bounce up to higher platforms. So it is not that the model cannot code. It struggles to hold a complex multi-file build together at the same level as Opus.

Both models were told to verify their work before stopping. Opus took a screenshot of the rendered game, looked at it, noticed it had left debug overlays on screen, and cleaned them up. It could see the result and catch visual problems.

GLM-5.2 cannot read images. Instead of looking at a screenshot, it wrote scripts to sample pixel colors from the saved frame. It checked whether the colors matched expectations: grass green, dirt brown, coin gold. The colors were there, so it declared the game finished.

But the character was gray with missing textures, and the debug overlay was still visible. GLM-5.2 never saw those problems because it was reading numbers instead of looking at the image.

On visual tasks, this is a real disadvantage. An agent that can inspect its own output catches bugs that a text-only model will ship blind.

The table below shows numbers from Z.ai's model card. An asterisk (*) marks self-reported scores (each vendor reports its own numbers). Independent results from Artificial Analysis broadly agree with these rankings.

Benchmark GLM-5.2 Opus 4.8* GPT-5.5* Gemini 3.1 Pro*
AIME 2026 99.2
95.7 98.3 98.2
GPQA-Diamond 91.2 93.6
93.6
94.3
SWE-bench Pro 62.1 69.2
58.6 54.2
Terminal Bench (Terminus-2) 81.0 85
84 74
SWE-Marathon 13.0 26.0
12.0 4.0

GLM-5.2 actually beats Opus on AIME 2026 (math competition). But Opus dominates the coding and long-horizon agentic benchmarks, especially SWE-Marathon where it doubles GLM-5.2's score. GPT-5.5 trails GLM-5.2 on coding benchmarks like SWE-bench Pro (58.6 vs 62.1) and SWE-Marathon (12.0 vs 13.0), but edges ahead on Terminal Bench (84 vs 81).

Independent benchmarking from Artificial Analysis ranks GLM-5.2 as the leading open-weights model with an Intelligence Index score of 51, ahead of MiniMax-M3 (44) and DeepSeek V4 Pro (44). They note it is token-hungry, using about 43k output tokens per task, more than any other leading open model.

Simon Willison, who has reviewed nearly every major model release, called GLM-5.2 "probably the most powerful text-only open weights LLM" on X. Nathan Lambert from the Allen Institute for AI noted that Chinese labs are reaching these scores on less compute, and the open-closed gap is closing faster than many expected.

The Hacker News discussion (170+ points, 149 comments) added practical ground truth:

The WebGL test is one data point from one prompt. Real development work is different. Here is how to think about the tradeoffs for everyday use.

For boilerplate and standard CRUD code, GLM-5.2 is likely sufficient. Writing a JPA repository, a REST controller, or a Kafka consumer configuration is well-trodden territory. At one-fifth the cost of Opus, GLM-5.2 makes economic sense for these tasks.

For debugging complex issues, Opus pulls ahead. When you have a Kafka rebalance storm caused by a subtle consumer group configuration issue, or a Redis cache invalidation race condition, the difference between SWE-bench Pro 69.2 and 62.1 could matter. Correctness matters more than cost when you are chasing a production bug.

The multimodal gap depends on your work. If you build UIs, run visual regression tests, or work with screenshots, Opus can inspect its own output. If your work is mostly text (stack traces, log files, SQL queries, configuration), GLM-5.2's text-only limitation is less of a problem.

The real value of open weights is operational. A closed model can have an outage, change its pricing, or restrict access. We saw Claude outages hit HN's front page multiple times already this year. GLM-5.2 running on your own hardware has none of those risks.

Both models are accessible through their official platforms:

Z.ai's platform supports an OpenAI-compatible SDK, so if you already use the OpenAI Python library, migration is minimal. Anthropic provides its own Python SDK. Both have free tiers or trial credits to get started.

Neither model wins everything.

Use Claude Opus when:

Use GLM-5.2 when: The smartest approach is to keep both in your toolkit. Use GLM-5.2 for the bulk of text-heavy work where the cost savings add up. Switch to Opus when you need visual judgment, maximum coding reliability, or the kind of long-horizon reasoning where it clearly leads.

The open weights gap is real, but it is narrowing. GLM-5.2 proves you no longer need to pay premium prices to get a genuinely capable coding model. It does not beat Opus yet, but it does not need to. It just needs to be good enough for most tasks, and cheap enough that the math works.

── more in #large-language-models 4 stories · sorted by recency
── more on @z.ai 3 stories trending now
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/glm-5-2-vs-claude-op…] indexed:0 read:7min 2026-06-27 ·