Z.ai's GLM-5.2 tops open-weight models on Artificial Analysis work benchmark

Z.ai's GLM-5.2 ranked as the top open-weight model and No. 3 overall on Artificial Analysis' GDPval-AA benchmark with a 1524 Elo score, placing it alongside proprietary frontier systems in multi-turn agent tasks. The model, priced at $1.40 per million input tokens, uses 43,000 output tokens per task, raising cost efficiency concerns despite its strong performance.

Z.ai's (@Zai_org, https://x.com/Zai_org) GLM-5.2 (https://z.ai/blog/glm-5.2) ranked as the leading open-weight model and No. 3 overall on Artificial Analysis' GDPval-AA benchmark, according to Artificial Analysis (@ArtificialAnlys, https://x.com/ArtificialAnlys) in a three-post thread on X (https://x.com/ArtificialAnlys/status/2069121548670406947) on Monday, June 22.

The score is the useful part: Artificial Analysis says GLM-5.2 posted a 1524 Elo on GDPval-AA, a benchmark built to test long-horizon, multi-turn agent work on economically valuable knowledge tasks. That puts an open-weight Chinese model in the same conversation as proprietary frontier systems on a workload category that is closer to how companies are beginning to use AI agents: not chat, but multi-step deliverables.

The result follows Artificial Analysis' broader June 17 benchmark note (https://artificialanalysis.ai/articles/glm-5-2-is-the-new-leading-open-weights-model-on-the-artificial-analysis-intelligence-index/), which ranked GLM-5.2 as the top open-weight model on its Intelligence Index v4.1 with a score of 51. Artificial Analysis placed it ahead of MiniMax-M3 and DeepSeek V4 Pro, both at 44, and Kimi K2.6 at 43. On GDPval-AA v2 specifically, Artificial Analysis reported GLM-5.2 at 1524, ahead of MiniMax-M3 at 1418 and DeepSeek V4 Pro at 1328, and effectively level with GPT-5.5 xhigh at 1514.

Z.ai is making a familiar open-weight argument with sharper economics: if an enterprise or developer team can get near-frontier agent performance without locking into a closed model provider, the deployment decision becomes less about raw benchmark rank and more about control, cost and operational flexibility.

Z.ai's pricing page (https://docs.z.ai/guides/overview/pricing) lists GLM-5.2 at $1.40 per 1 million input tokens, $0.26 per 1 million cached input tokens and $4.40 per 1 million output tokens.
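
Per-token list prices only become a budget figure once token counts per task are attached. The following is a minimal Python sketch of that arithmetic, not Z.ai's billing logic or Artificial Analysis' methodology. The prices and the 43,000 output tokens per task are the figures cited in this article; the `TokenPrices` class, the `task_cost` helper and the input-token counts in the second example are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

TOKENS_PER_MILLION = 1_000_000


@dataclass(frozen=True)
class TokenPrices:
    """List prices in USD per 1 million tokens."""
    input: float
    cached_input: float
    output: float


# GLM-5.2 list prices as quoted from Z.ai's pricing page in this article.
GLM_5_2 = TokenPrices(input=1.40, cached_input=0.26, output=4.40)


def task_cost(prices: TokenPrices, input_tokens: int,
              cached_input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one task from its token counts."""
    total = (input_tokens * prices.input
             + cached_input_tokens * prices.cached_input
             + output_tokens * prices.output)
    return total / TOKENS_PER_MILLION


# Output tokens only, at the 43,000 output tokens per task reported
# by Artificial Analysis. Input tokens are deliberately left at zero.
output_only = task_cost(GLM_5_2, 0, 0, 43_000)

# Hypothetical agent run: the input token counts below are assumptions,
# not figures from Z.ai or Artificial Analysis.
hypothetical_run = task_cost(
    GLM_5_2,
    input_tokens=50_000,
    cached_input_tokens=150_000,
    output_tokens=43_000,
)

print(f"output tokens only: ${output_only:.4f}")
print(f"hypothetical run:   ${hypothetical_run:.4f}")
```

At the quoted output price, 43,000 output tokens come to about $0.19 per task before any input tokens are counted. The sketch does not attempt to reproduce Artificial Analysis' own per-task cost calculation.
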
Artificial Analysis calculated GLM-5.2 at about $0.46 per Intelligence Index task, higher than several open-weight peers but low for its score band.

The catch is token efficiency. Artificial Analysis said GLM-5.2 used 43,000 output tokens per Intelligence Index task, up from 26,000 for GLM-5.1 and above MiniMax-M3, Kimi K2.6 and DeepSeek V4 Pro. That matters because per-token pricing understates the actual cost of agentic systems when the model has to reason, inspect, revise and produce files over many turns. A cheaper token can still become an expensive task if the model spends enough tokens getting there.

Z.ai introduced GLM-5.2 in a June 16 research post (https://z.ai/blog/glm-5.2) as a long-horizon model with a 1 million-token context window, an MIT open-source license and an architecture change called IndexShare, which Z.ai says reduces per-token FLOPs at long context lengths. Artificial Analysis lists the model at 744 billion total parameters with 40 billion active parameters, the same size profile as GLM-5.1, with the context window expanded from 200,000 tokens to 1 million.

The GDPval framing is important because it is not another multiple-choice exam. The original GDPval paper (https://arxiv.org/abs/2510.04374), published in October 2025, described a benchmark for real-world economically valuable tasks across 44 occupations and nine major U.S. GDP sectors, using work products created by experienced professionals. Artificial Analysis' GDPval-AA v2 adapts that line of evaluation for model comparison with a human baseline of 1000 Elo, a rotating panel of frontier-model judges and a higher turn limit for longer agent trajectories.

That design also sets limits on what the score proves. GDPval-style work is still scoped digital knowledge work, not the messy totality of a job. It excludes manual work, tacit organizational judgment, private data access and live collaboration.
The benchmark is better read as evidence that GLM-5.2 can produce competitive deliverables under controlled agent conditions, not that it can replace a professional workflow end to end.

Artificial Analysis' other new agent benchmark points in the same direction. In AA-Briefcase (https://artificialanalysis.ai/articles/aa-briefcase), a benchmark built around multi-week knowledge-work projects using thousands of source files, Artificial Analysis says GLM-5.2 max is the clear leader among open-weight models and ranks behind Claude Fable 5 and Claude Opus 4.8 max, while ahead of GPT-5.5 xhigh. The benchmark combines rubric pass rate, analytical quality and presentation quality, which is a tougher test for agent systems than answering a single prompt.

The competitive signal is straightforward: the open-weight frontier is moving from coding demos and leaderboard claims into the professional-work benchmarks that closed labs have used to justify premium pricing.

GLM-5.2 does not remove the tradeoffs. It appears less token-efficient than some open-weight peers, and its top-line benchmark rank still comes from Artificial Analysis' evaluation stack rather than broad production evidence. But for Z.ai, the GDPval-AA result gives GLM-5.2 a stronger enterprise-facing claim than the usual open-model pitch. It is not just open. On this benchmark, it is close enough to make closed-model procurement harder to defend without a task-specific test.
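
For a rough sense of what "close enough" means, the Elo gaps reported in this article can be converted to expected head-to-head preference rates. This sketch assumes GDPval-AA ratings follow the conventional logistic Elo model with a 400-point scale, which the article does not confirm. The `expected_score` helper is illustrative and is not part of any Artificial Analysis tooling.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the conventional Elo formula.

    Assumption: a logistic curve with a 400-point scale. Artificial
    Analysis' exact rating method is not described in this article.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# GDPval-AA v2 scores as reported in this article.
GLM_5_2 = 1524
OTHERS = {
    "GPT-5.5 xhigh": 1514,
    "MiniMax-M3": 1418,
    "DeepSeek V4 Pro": 1328,
    "Human baseline": 1000,
}

for name, rating in OTHERS.items():
    rate = expected_score(GLM_5_2, rating)
    print(f"GLM-5.2 vs {name}: {rate:.1%} expected preference rate")
```

Under that assumption, the 10-point lead over GPT-5.5 xhigh corresponds to an expected preference rate of roughly 51%, which is why the two are better described as level than as ranked.
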