{"slug": "glm-5-2-playing-text-adventures", "title": "GLM 5.2 playing text adventures", "summary": "GLM 5.2, a new open-weights model, earned 15% fewer achievements than Gemini 3 Flash in text adventure games, a statistically significant difference. The benchmark, costing $5.1, controlled for game difficulty and found GLM 5.2 about 0.8 noise levels worse than the top performer.", "body_md": "# GLM 5.2 playing text adventures\n\nI’ve heard some buzz around the new glm 5.2 open-weights model. They say it’s\nvery capable! I won’t run a full comparison benchmark, but I have some credits\nsloshing around on OpenRouter so I figured I might compare glm 5.2 to the\nsimilarly-priced Gemini 3 Flash, and see where things land. (The market\ncurrently infers with the glm 5.2 model at $4.4 per million output tokens,\nwhereas Google charges $3 per million output tokens for their model. I expect\nthe price of the glm model to go down somewhat when people figure out how to\ndeploy it more efficiently and/or the buzz dies down. That’s what happened with\nprevious open-weight models I’ve tested.)\n\nThis uses the same setup as [the previous benchmark](updated-llm-benchmark): each llm gets a few\nattempts at playing the game, with each attempt being limited to a fixed budget\nof around $0.15. The llm doesn’t know it, but the harness tracks achievements\nfor each game, and counts how many the llm earns in each attempt.\n\nHere is the number of attempts for each game in this run.\n\n| Game | Attempts per model |\n|---|---|\n| Lost Pig | 4 |\n| Organ Grinder’s Monkey | 2 |\n| Not All That Shimmers | 3 |\n| Kill Wizard | 3 |\n| 9:05 | 5 |\n| Total | 17 |\n| 💸 | $5.1 |\n\nThen I did the stupid, silly thing and fitted a plain linear regression\npredicting the achievement count for each attempt, with the llm model as an\nexplanatory fixed effect, and the game as a random effect. Why didn’t I use\nrandom effects for game difficulty before? I should have! 
But I didn’t know\nabout mixed-effects modeling then. I learn things. When thusly controlling for\ngame difficulty, Gemini 3 Flash earns just over eight achievements in a typical\nattempt. The new glm 5.2 earns 15 % fewer, and this is statistically\nsignificant at customary significance levels.\n\nThis does not tell us much – is 15 % fewer achievements very bad or reasonable? Hard to tell without comparing to other models, but it’s roughly the same magnitude as the standard deviation of the residual noise in the fitted model. Thus we can say it’s about 0.8 noise levels worse than the king of text adventure playing llms. That’s impressive. For example, it is definitely better than Gemini 2.5 Flash, which is 1.6 noise levels worse than Gemini 3 Flash.\n\n(Due to the budget constraint, models like Sonnet 4.5 or gpt 5.2 are 2.5 and 3 noise levels worse, respectively.)", "url": "https://wpnews.pro/news/glm-5-2-playing-text-adventures", "canonical_source": "https://entropicthoughts.com/glm-5-2-playing-text-adventures", "published_at": "2026-06-18 06:39:33+00:00", "updated_at": "2026-06-18 06:52:40.798731+00:00", "lang": "en", "topics": ["large-language-models", "ai-research", "ai-products"], "entities": ["GLM 5.2", "Gemini 3 Flash", "OpenRouter", "Google", "Gemini 2.5 Flash", "Sonnet 4.5", "GPT 5.2"], "alternates": {"html": "https://wpnews.pro/news/glm-5-2-playing-text-adventures", "markdown": "https://wpnews.pro/news/glm-5-2-playing-text-adventures.md", "text": "https://wpnews.pro/news/glm-5-2-playing-text-adventures.txt", "jsonld": "https://wpnews.pro/news/glm-5-2-playing-text-adventures.jsonld"}}