{"slug": "glm-5-2-reasoning-effort-is-the-cost-lever", "title": "GLM 5.2: Reasoning Effort Is the Cost Lever", "summary": "Zhipu's GLM 5.2 open-weight model, available on Synthorai at roughly one-sixth of frontier per-token prices, achieves frontier-level benchmarks but its per-task cost varies by over an order of magnitude depending on the reasoning effort setting. Testing shows that with the right setting (reasoning off for easy tasks, a capped effort level for hard ones), GLM 5.2 is correct and cheaper than frontier models on both easy and hard coding tasks, while the default unbounded reasoning makes the same answer twenty times more expensive and takes minutes.", "body_md": "GLM 5.2 is now on Synthorai at about a sixth of frontier per-token prices, and the open-weight, frontier-benchmark headline is real. But the per-token price is the wrong number to anchor on. What a coding task actually costs on GLM 5.2 swings by more than an order of magnitude depending on a single knob, reasoning effort, and the default leaves that knob in the worst position. Set it well and GLM 5.2 is correct and cheaper than frontier on both easy and hard work. Leave it on the default and the same answer costs twenty times more and takes minutes. We measured it.\n\nGLM 5.2 is Zhipu's open-weight frontier model, released 2026-06-13: a mixture-of-experts network (~744B total, ~40B active), a usable 1M-token context, and an MIT license you can self-host. It targets coding and agentic work, with strong published benchmarks (SWE-bench Pro 62.1, Terminal-Bench 2.1 81.0, AIME 2026 99.2, GPQA Diamond 91.2). On Synthorai it's `glm-5.2`, priced at $1.40 per million input tokens and $4.40 per million output.\n\nThe detail that drives everything below: it is a reasoning model, and how much it reasons is something you set.\n\nOn per-token listing price, GLM 5.2 sits well below the Western frontier and in the same range as the Chinese models. 
Synthorai's rates for a representative set:\n\n| Model | Input ($/M) | Output ($/M) | Cache read ($/M) |\n|---|---|---|---|\n| `deepseek-v4-pro` | 0.44 | 0.87 | 0.0036 |\n| `kimi-k2.5` | 0.57 | 3.01 | 0.12 |\n| `glm-5.2` | 1.40 | 4.40 | 0.26 |\n| `qwen3-max` | 1.20 | 6.00 | 0.36 |\n| `gemini-3.1-pro` | 2.00 | 12.00 | 0.20 |\n| `claude-opus-4-8` | 5.00 | 25.00 | 0.50 |\n| `gpt-5.5` | 5.00 | 30.00 | 0.50 |\n\nIts $4.40 output rate is about a seventh of `gpt-5.5` and a sixth of `claude-opus-4-8`, though `deepseek-v4-pro` and `kimi-k2.5` undercut it. So GLM 5.2 is frontier-class capability at roughly Chinese-model prices, not the absolute floor. There is no separate cache-write charge: a cache write bills at the input rate, and only the cache read is discounted to the rate above. The discount varies by vendor, with GLM 5.2's cache read about a fifth of its input rate and the frontier models (`gpt-5.5`, `claude-opus-4-8`, `gemini-3.1-pro`) discounting reads to roughly a tenth.\n\nIt is also a step up from its own predecessors. The previous GLM generation was extraordinarily cheap; the GLM 5 line raised prices, and GLM 5.2 lands at about 3x the input rate of GLM-4.6 (Zhipu's official rates):\n\n| GLM model | Released | Input ($/M) | Output ($/M) |\n|---|---|---|---|\n| GLM-4.5 | 2025-07 | 0.60 | 2.20 |\n| GLM-4.6 | 2025-09 | 0.43 | 1.74 |\n| GLM-5 | 2026 | 1.00 | 3.20 |\n| GLM-5.2 | 2026-06 | 1.40 | 4.40 |\n\nThat buys the 1M context and the frontier benchmarks. But the per-token rate is only the headline. What you actually pay per task is set by the reasoning effort.\n\nGLM 5.2's reasoning is a dial, not a switch. You can turn it off (`enable_thinking: false`), set `reasoning_effort` to low, medium, or high, or leave it on the default, which runs reasoning unbounded. That setting changes cost and latency by far more than the price does. 
We ran one easy and one hard coding task across the settings, checking every answer against a reference on hundreds of randomized cases.\n\nWeighted interval scheduling, a moderate dynamic-programming problem:\n\n| Mode | Reasoning tokens | Answer tokens | Cost | Latency | Correct |\n|---|---|---|---|---|---|\n| `glm-5.2`, thinking off | 0 | 169 | $0.0008 | ≈5s | yes |\n| `glm-5.2`, `reasoning_effort: low` | 1,563 | 150 | $0.0076 | 39s | yes |\n| `glm-5.2`, unbounded default | ≈6,290 | ≈150 | $0.0285 | 137s | yes |\n| `gpt-5.5` (reference) | 59 | 141 | $0.0064 | 4.8s | yes |\n| `claude-opus-4-8` (reference) | 0 | 201 | $0.0057 | 3.3s | yes |\n\nTwo things stand out. Thinking off is correct and the cheapest thing on the board, about 8x under the frontier models, and every step up the dial just adds cost for the same answer. And the bill tracks the reasoning, not the answer: the code GLM returns is roughly 150 tokens every time, while the reasoning in front of it grows from nothing to about 6,300, billed at the same $4.40/M output rate. The unbounded default spends that reasoning to reach the same answer thinking off produced with none, and the gap is the entire cost difference. The frontier models answer here with little or no reported reasoning: `gpt-5.5` spends 59 reasoning tokens, and `claude-opus-4-8`'s usage reports none.\n\nWildcard string matching (`?` and `*`), the classic problem that is easy to get subtly wrong. Here thinking off broke. 
It returned a memoized recursion:\n\n``` python\ndef is_match(s, p):\n    memo = {}\n    def match(i, j):\n        if (i, j) in memo:\n            return memo[(i, j)]\n        if j == len(p):\n            result = i == len(s)\n        elif i < len(s) and p[j] in (s[i], '?'):\n            result = match(i + 1, j + 1)\n        elif p[j] == '*':\n            result = match(i + 1, j) or match(i, j + 1)\n        else:\n            result = False\n        memo[(i, j)] = result\n        return result\n    return match(0, 0)\n```\n\nIt looks right, and the memo even suggests some care. But the `*` branch recurses `match(i + 1, j)` without bounding `i`. Once the string is consumed and the pattern still has a `*`, `i` climbs forever and the stack overflows. Fast, cheap, and wrong.\n\nTurn the dial up and it returns the correct iterative two-pointer algorithm, which backtracks to the last `*` instead of recursing:\n\n``` python\ndef is_match(s, p):\n    s_idx, p_idx, star_idx, match_idx = 0, 0, -1, 0\n    while s_idx < len(s):\n        if p_idx < len(p) and (p[p_idx] == '?' 
or p[p_idx] == s[s_idx]):\n            s_idx += 1\n            p_idx += 1\n        elif p_idx < len(p) and p[p_idx] == '*':\n            star_idx = p_idx\n            match_idx = s_idx\n            p_idx += 1\n        elif star_idx != -1:\n            p_idx = star_idx + 1\n            match_idx += 1\n            s_idx = match_idx\n        else:\n            return False\n    while p_idx < len(p) and p[p_idx] == '*':\n        p_idx += 1\n    return p_idx == len(p)\n```\n\nThe full dial on this task:\n\n| GLM 5.2 setting | Cost | Latency | Correct |\n|---|---|---|---|\n| thinking off | $0.0007 | 6s | no (stack overflow) |\n| `reasoning_effort: high` | $0.0031 | 13s | yes |\n| `reasoning_effort: medium` | $0.0032 | 16s | yes |\n| `reasoning_effort: low` | $0.0068 | 40s | yes |\n| unbounded default | $0.062 | 405s | yes |\n| `gpt-5.5` (reference) | $0.0064 | 5.4s | yes |\n| `claude-opus-4-8` (reference) | $0.0069 | 4.6s | yes |\n\nEvery explicit effort level solved it. `reasoning_effort: high` did it for $0.0031 in 13 seconds, about twenty times cheaper and thirty times faster than the unbounded default for the same answer, and it undercuts the frontier models on cost while running only a few seconds slower. One quirk worth knowing: GLM's `low` produced more reasoning than `high`, consistently across both tasks, so the names don't track token count. Medium and high were the cheap, fast settings.\n\nThe unbounded default is the one setting to avoid. It is the worst of both worlds: it buys reasoning the task may not need and takes minutes to do it, reaching the same answer `reasoning_effort: high` gave for twenty times the cost.\n\nThe lever is the reasoning effort, and the right setting belongs to the task, not the model:\n\n- Simple work: thinking off (`enable_thinking: false`). Correct and about 8x under frontier.\n- Harder problems: `reasoning_effort: medium` or `high`. Correct, around $0.003 a task, under frontier on cost and only a few seconds slower.\n\nIf you cannot tell in advance whether a task needs reasoning, `reasoning_effort: high` is a safe default: it was cheap, it solved both tasks, and it never ran away.\n\nGLM 5.2 supports caching on the gateway, and it helps where you'd expect. We sent a 1,494-token shared prefix (a code module to review) with several different questions:\n\n| Call | Prompt tokens | Cached | Output | Cost | Latency |\n|---|---|---|---|---|---|\n| new question, prefix not yet cached | 1,493 | 0 | 120 | $0.0026 | 6.5s |\n| new question, prefix cached | 1,494 | 1,472 | 120 | $0.0009 | 5.1s |\n| exact repeat (semantic hit) | 1,494 | 1,494 | 120 | $0.0009 | 1.0s |\n\nOnce a large prefix has been seen, it caches. The cached input tokens bill at roughly a fifth of the normal input rate, which cut an otherwise identical request from $0.0026 to $0.0009, about 64%. An exact repeat is served straight from the semantic cache: the same answer at the same cost as the cached call, but back in about a second instead of five.\n\nThe catch is the same one the dial taught: caching discounts the input, and the moment reasoning is on, the cost and latency live in the reasoning output, which is not cached. So caching is a real win for thinking-off, high-context work (the same system prompt or codebase on every call), and a small one once reasoning is on.\n\n`glm-5.2` is live on the gateway. Three practical notes from our testing:\n\n- Set `enable_thinking: false` for simple work and `reasoning_effort: medium` or `high` for harder problems. The one thing to avoid is leaving reasoning on with no effort cap (the unbounded default), which is the $0.06, seven-minute trap.\n- Set `stream: true` and you get incremental output and the full result.\n- Pricing is $1.40 / $4.40 per million tokens, and the gateway returns a `cost` field per call so you can see exactly what each request cost.\n\nGLM 5.2 is a genuinely cheap, capable coding model, and configured well it beats frontier prices on both easy and hard work. The catch is the configuration. Its reasoning is a dial, and the default leaves it unbounded, which is how a task that should cost $0.003 becomes a $0.06, seven-minute call. Set `enable_thinking: false` for simple work and `reasoning_effort: medium` or `high` for the rest, and GLM 5.2 is cheap and correct across the board. Leave reasoning on its default, and it is the slowest, priciest option you could have picked.\n\n(Synthorai listing prices above are this platform's rates as of 2026-06-24; GLM generational rates are Zhipu's official list.)\n\n*Costs measured on Synthorai on 2026-06-24 (`glm-5.2` at $1.40 / $4.40 per M tokens); verify current pricing before relying on it.*", "url": "https://wpnews.pro/news/glm-5-2-reasoning-effort-is-the-cost-lever", "canonical_source": "https://dev.to/synthorai/glm-52-reasoning-effort-is-the-cost-lever-5ddm", "published_at": "2026-06-24 15:19:00+00:00", "updated_at": "2026-06-24 15:39:48.609298+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "ai-products", "developer-tools"], "entities": ["Zhipu", "GLM 5.2", "Synthorai", "DeepSeek", "Kimi", "Qwen", "Gemini", "Claude"], "alternates": {"html": "https://wpnews.pro/news/glm-5-2-reasoning-effort-is-the-cost-lever", "markdown": "https://wpnews.pro/news/glm-5-2-reasoning-effort-is-the-cost-lever.md", "text": "https://wpnews.pro/news/glm-5-2-reasoning-effort-is-the-cost-lever.txt", "jsonld": "https://wpnews.pro/news/glm-5-2-reasoning-effort-is-the-cost-lever.jsonld"}}