GLM 5.2 playing text adventures

Source: entropicthoughts.com · Published: 2026-06-18T06:39Z · Topic: large-language-models

GLM 5.2, a new open-weights model, earned 15% fewer achievements than Gemini 3 Flash in text adventure games, a statistically significant difference. The benchmark, which cost $5.1, controlled for game difficulty and placed GLM 5.2 about 0.8 noise levels below the top performer.

I’ve heard some buzz around the new GLM 5.2 open-weights model. They say it’s very capable! I won’t run a full comparison benchmark, but I have some credits sloshing around on OpenRouter, so I figured I might compare GLM 5.2 to the similarly priced Gemini 3 Flash [1] and see where things land.

[1] The market currently runs inference on GLM 5.2 at $4.4 per million output tokens, whereas Google charges $3 per million output tokens for its model. I expect the price of GLM 5.2 to go down somewhat when people figure out how to deploy it more efficiently and/or the buzz dies down. That’s what happened with previous open-weight models I’ve tested.

This uses the same setup as the previous benchmark: each LLM gets a few attempts at playing each game, with each attempt limited to a fixed budget of around $0.15. The LLM doesn’t know it, but the harness tracks achievements for each game and counts how many the LLM earns in each attempt. Here is the number of attempts for each game in this run.

Game	Attempts per model
Lost Pig	4
Organ Grinder’s Monkey	2
Not All That Shimmers	3
Kill Wizard	3
9:05	5
Total	17
💸	$5.1
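
As a quick sanity check on the table, the arithmetic can be sketched as below. The attempt counts and the $0.15 budget come from the text; the factor of two models is an assumption (the text does not say whether the $5.1 covers both GLM 5.2 and Gemini 3 Flash), and the result is an upper bound since each attempt is only capped at "around" $0.15.

```python
# Attempts per model, copied from the table above.
ATTEMPTS_PER_MODEL = {
    "Lost Pig": 4,
    "Organ Grinder's Monkey": 2,
    "Not All That Shimmers": 3,
    "Kill Wizard": 3,
    "9:05": 5,
}

BUDGET_PER_ATTEMPT_USD = 0.15  # "around $0.15" per attempt, per the text
N_MODELS = 2  # assumption: both models were run in this batch


def total_attempts(attempts):
    """Sum the attempts one model gets across all games."""
    return sum(attempts.values())


def max_spend_usd(attempts, budget_per_attempt, n_models):
    """Upper bound on spend if every attempt uses its full budget."""
    return total_attempts(attempts) * n_models * budget_per_attempt


attempts_total = total_attempts(ATTEMPTS_PER_MODEL)
spend = max_spend_usd(ATTEMPTS_PER_MODEL, BUDGET_PER_ATTEMPT_USD, N_MODELS)
print(attempts_total, round(spend, 2))
```

Under that assumption the bound lines up with the reported total: 17 attempts per model, two models, $0.15 each.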

Then I did the stupid, silly thing and fitted a plain linear regression predicting the achievement count for each attempt, with the LLM as an explanatory fixed effect and the game as a random effect [2]. When controlling for game difficulty this way, Gemini 3 Flash earns just over eight achievements in a typical attempt. The new GLM 5.2 earns 15% fewer, and this is statistically significant at customary significance levels.

[2] Why didn’t I use random effects for game difficulty before? I should have! But I didn’t know about mixed-effects modeling then. I learn things.
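
To make "controlling for game difficulty" concrete, here is a minimal sketch in plain Python. It is not the mixed-effects fit described above: it swaps in per-game mean-centring (treating the game as a fixed effect), which is simpler and, for a balanced design where both models get the same number of attempts per game, should give a similar model contrast. The function name `fit_game_adjusted` and every number in `rows` are made up for illustration and are not the benchmark's data.

```python
import math
from statistics import mean

# Made-up (game, model, achievements) rows; NOT the article's results.
rows = [
    ("easy", "reference", 10), ("easy", "reference", 12),
    ("easy", "candidate", 9), ("easy", "candidate", 11),
    ("hard", "reference", 3), ("hard", "reference", 5),
    ("hard", "candidate", 2), ("hard", "candidate", 4),
]


def fit_game_adjusted(rows, reference, candidate):
    """Return (candidate - reference effect, residual SD) after
    removing each game's mean score. Assumes a balanced design
    and that rows contain only the two named models."""
    by_game = {}
    for game, _model, score in rows:
        by_game.setdefault(game, []).append(score)
    game_mean = {g: mean(scores) for g, scores in by_game.items()}

    # Centre every score on its game's mean to strip out difficulty.
    centred = {reference: [], candidate: []}
    for game, model, score in rows:
        centred[model].append(score - game_mean[game])

    model_mean = {m: mean(vals) for m, vals in centred.items()}
    effect = model_mean[candidate] - model_mean[reference]

    # Residual spread around the fitted values; degrees of freedom
    # subtract one intercept per game plus one model contrast.
    sse = sum((v - model_mean[m]) ** 2
              for m, vals in centred.items() for v in vals)
    dof = len(rows) - (len(game_mean) + 1)
    return effect, math.sqrt(sse / dof)


effect, residual_sd = fit_game_adjusted(rows, "reference", "candidate")
print(effect, residual_sd)
```

The point of the centring is that the large easy-versus-hard gap no longer inflates the residual spread, so the model contrast is judged against attempt-to-attempt noise only. A random-effects fit additionally estimates how much games vary and would be the better tool for unbalanced data.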

This does not tell us much: is 15% fewer achievements very bad or reasonable? Hard to tell without comparing to other models, but the gap is roughly the same magnitude as the standard deviation of the residual noise in the fitted model. Thus we can say GLM 5.2 is about 0.8 noise levels worse than the king of text-adventure-playing LLMs. That’s impressive. For example, it is definitely better than Gemini 2.5 Flash, which is 1.6 noise levels worse than Gemini 3 Flash.

(Due to the budget constraint, models like Sonnet 4.5 and GPT 5.2 are 2.5 and 3 noise levels worse, respectively.)
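
The noise-level figures can be turned back into rough achievement counts with a little arithmetic. This is a back-of-the-envelope sketch: it reads a "noise level" as the gap divided by the residual standard deviation, rounds "just over eight" down to 8.0, and the residual standard deviation it prints is implied by those rounded numbers rather than reported in the text.

```python
BASELINE_ACHIEVEMENTS = 8.0    # "just over eight", rounded down (approximation)
GLM_RELATIVE_DROP = 0.15       # GLM 5.2 earns 15% fewer achievements
GLM_NOISE_LEVELS = 0.8         # GLM 5.2 vs Gemini 3 Flash
GEMINI_25_NOISE_LEVELS = 1.6   # Gemini 2.5 Flash vs Gemini 3 Flash


def gap_in_achievements(baseline, relative_drop):
    """Absolute gap implied by a relative drop from the baseline."""
    return baseline * relative_drop


def implied_residual_sd(gap, noise_levels):
    """Residual SD implied by a gap expressed in noise levels."""
    return gap / noise_levels


glm_gap = gap_in_achievements(BASELINE_ACHIEVEMENTS, GLM_RELATIVE_DROP)
residual_sd = implied_residual_sd(glm_gap, GLM_NOISE_LEVELS)
gemini_25_gap = GEMINI_25_NOISE_LEVELS * residual_sd

print(round(glm_gap, 2), round(residual_sd, 2), round(gemini_25_gap, 2))
```

With these rounded inputs, GLM 5.2 trails by about 1.2 achievements per attempt, the implied residual standard deviation is about 1.5, and Gemini 2.5 Flash trails by about 2.4.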

source & further reading

entropicthoughts.com — original article
