GLM 5.2: Reasoning Effort Is the Cost Lever

Zhipu's GLM 5.2 open-weight model, available on Synthorai at roughly one-sixth of frontier per-token prices, achieves frontier-level benchmarks but its per-task cost varies by over an order of magnitude depending on the reasoning effort setting. Testing shows that with reasoning turned off, GLM 5.2 is correct and cheaper than frontier models on both easy and hard coding tasks, while the default unbounded reasoning makes the same answer twenty times more expensive and takes minutes.

GLM 5.2 is now on Synthorai at about a sixth of frontier per-token prices, and the open-weight, frontier-benchmark headline is real. But the per-token price is the wrong number to anchor on. What a coding task actually costs on GLM 5.2 swings by more than an order of magnitude depending on a single knob, reasoning effort, and the default leaves that knob in the worst position. Set it well and GLM 5.2 is correct and cheaper than frontier on both easy and hard work. Leave it on the default and the same answer costs twenty times more and takes minutes. We measured it. GLM 5.2 is Zhipu's open-weight frontier model, released 2026-06-13: a mixture-of-experts network ~744B total, ~40B active , a usable 1M-token context, and an MIT license you can self-host. It targets coding and agentic work, with strong published benchmarks SWE-bench Pro 62.1, Terminal-Bench 2.1 81.0, AIME 2026 99.2, GPQA Diamond 91.2 . On Synthorai it's glm-5.2 , priced at $1.40 per million input tokens and $4.40 per million output. The detail that drives everything below: it is a reasoning model, and how much it reasons is something you set. On per-token listing price, GLM 5.2 sits well below the Western frontier and among the cheaper Chinese models. Synthorai's rates for a representative set: | Model | Input $/M | Output $/M | Cache read $/M | |---|---|---|---| deepseek-v4-pro | 0.44 | 0.87 | 0.0036 | kimi-k2.5 | 0.57 | 3.01 | 0.12 | glm-5.2 | 1.40 | 4.40 | 0.26 | qwen3-max | 1.20 | 6.00 | 0.36 | gemini-3.1-pro | 2.00 | 12.00 | 0.20 | claude-opus-4-8 | 5.00 | 25.00 | 0.50 | gpt-5.5 | 5.00 | 30.00 | 0.50 | Its $4.40 output rate is about a seventh of gpt-5.5 and a sixth of claude-opus-4-8 , though deepseek-v4-pro and kimi-k2.5 undercut it. So GLM 5.2 is frontier-class capability at roughly Chinese-model prices, not the absolute floor. There is no separate cache-write charge: a cache write bills at the input rate, and only the cache read is discounted to the rate above. The discount varies by vendor, with GLM 5.2's cache read about a fifth of its input rate and the frontier models gpt-5.5 , claude-opus-4-8 , gemini-3.1-pro discounting reads to roughly a tenth. It is also a step up from its own predecessors. The previous GLM generation was extraordinarily cheap; the GLM 5 line raised prices, and GLM 5.2 lands at about 3x the input rate of GLM-4.6 Zhipu's official rates : | GLM model | Released | Input $/M | Output $/M | |---|---|---|---| | GLM-4.5 | 2025-07 | 0.60 | 2.20 | | GLM-4.6 | 2025-09 | 0.43 | 1.74 | | GLM-5 | 2026 | 1.00 | 3.20 | GLM-5.2 | 2026-06 | 1.40 | 4.40 | That buys the 1M context and the frontier benchmarks. But the per-token rate is only the headline. What you actually pay per task is set by the reasoning effort. GLM 5.2's reasoning is a dial, not a switch. You can turn it off enable thinking: false , set reasoning effort to low, medium, or high, or leave it on the default, which runs reasoning unbounded. That setting changes cost and latency by far more than the price does. We ran one easy and one hard coding task across the settings, checking every answer against a reference on hundreds of randomized cases. Weighted interval scheduling, a moderate dynamic-programming problem: | Mode | Reasoning tokens | Answer tokens | Cost | Latency | Correct | |---|---|---|---|---|---| glm-5.2 , thinking off | 0 | 169 | $0.0008 | ≈5s | yes | glm-5.2 , reasoning effort: low | 1,563 | 150 | $0.0076 | 39s | yes | glm-5.2 , unbounded default | ≈6,290 | ≈150 | $0.0285 | 137s | yes | gpt-5.5 reference | 59 | 141 | $0.0064 | 4.8s | yes | claude-opus-4-8 reference | 0 | 201 | $0.0057 | 3.3s | yes | Two things stand out. Thinking off is correct and the cheapest thing on the board, about 8x under the frontier models, and every step up the dial just adds cost for the same answer. And the bill tracks the reasoning, not the answer: the code GLM returns is roughly 150 tokens every time, while the reasoning in front of it grows from nothing to about 6,300, billed at the same $4.40/M output rate. The unbounded default spends that reasoning to reach the same answer thinking off produced with none, and the gap is the entire cost difference. The frontier models answer here with little or no reported reasoning: gpt-5.5 spends 59 reasoning tokens, and claude-opus-4-8 's usage reports none. Wildcard string matching ? and , the classic problem that is easy to get subtly wrong. Here thinking off broke. It returned a memoized recursion: python def is match s, p : memo = {} def match i, j : if i, j in memo: return memo i, j if j == len p : result = i == len s elif i < len s and p j in s i , '?' : result = match i + 1, j + 1 elif p j == ' ': result = match i + 1, j or match i, j + 1 else: result = False memo i, j = result return result return match 0, 0 It looks right, and the memo even suggests some care. But the branch recurses match i + 1, j without bounding i . Once the string is consumed and the pattern still has a , i climbs forever and the stack overflows. Fast, cheap, and wrong. Turn the dial up and it returns the correct iterative two-pointer algorithm, which backtracks to the last instead of recursing: python def is match s, p : s idx, p idx, star idx, match idx = 0, 0, -1, 0 while s idx < len s : if p idx < len p and p p idx == '?' or p p idx == s s idx : s idx += 1 p idx += 1 elif p idx < len p and p p idx == ' ': star idx = p idx match idx = s idx p idx += 1 elif star idx = -1: p idx = star idx + 1 match idx += 1 s idx = match idx else: return False while p idx < len p and p p idx == ' ': p idx += 1 return p idx == len p The full dial on this task: | GLM 5.2 setting | Cost | Latency | Correct | |---|---|---|---| | thinking off | $0.0007 | 6s | no stack overflow | reasoning effort: high | $0.0031 | 13s | yes | reasoning effort: medium | $0.0032 | 16s | yes | reasoning effort: low | $0.0068 | 40s | yes | | unbounded default | $0.062 | 405s | yes | gpt-5.5 reference | $0.0064 | 5.4s | yes | claude-opus-4-8 reference | $0.0069 | 4.6s | yes | Every explicit effort level solved it. reasoning effort: high did it for $0.0031 in 13 seconds, about twenty times cheaper and thirty times faster than the unbounded default for the same answer, and it undercuts the frontier models on cost, just a few seconds slower. One quirk worth knowing: GLM's low produced more reasoning than high , consistently across both tasks, so the names don't track token count. Medium and high were the cheap, fast settings. The unbounded default is the one setting to avoid. It is the worst of both worlds: it buys reasoning the task may not need and takes minutes to do it, reaching the same answer reasoning effort: high gave for twenty times the cost. The lever is the reasoning effort, and the right setting belongs to the task, not the model: enable thinking: false . Correct and about 8x under frontier. reasoning effort: medium or high . Correct, around $0.003 a task, under frontier on cost and only a few seconds slower.If you cannot tell in advance whether a task needs reasoning, reasoning effort: high is a safe default: it was cheap, it solved both tasks, and it never ran away. GLM 5.2 supports caching on the gateway, and it helps where you'd expect. We sent a 1,494-token shared prefix a code module to review with several different questions: | Call | Prompt tokens | Cached | Output | Cost | Latency | |---|---|---|---|---|---| | new question, prefix not yet cached | 1,493 | 0 | 120 | $0.0026 | 6.5s | | new question, prefix cached | 1,494 | 1,472 | 120 | $0.0009 | 5.1s | | exact repeat semantic hit | 1,494 | 1,494 | 120 | $0.0009 | 1.0s | Once a large prefix has been seen, it caches. The cached input tokens bill at roughly a fifth of the normal input rate, which cut an otherwise identical request from $0.0026 to $0.0009, about 64%. An exact repeat is served straight from the semantic cache: the same answer at the same cost as the cached call, but back in about a second instead of five. The catch is the same one the dial taught: caching discounts the input, and the moment reasoning is on, the cost and latency live in the reasoning output, which is not cached. So caching is a real win for thinking-off, high-context work the same system prompt or codebase on every call , and a small one once reasoning is on. glm-5.2 is live on the gateway. Three practical notes from our testing: enable thinking: false for simple work and reasoning effort: medium or high for harder problems. The one thing to avoid is leaving reasoning on with no effort cap the unbounded default , which is the $0.06, seven-minute trap. stream: true and you get incremental output and the full result.Pricing is $1.40 / $4.40 per million tokens, and the gateway returns a cost field per call so you can see exactly what each request cost. GLM 5.2 is a genuinely cheap, capable coding model, and configured well it beats frontier prices on both easy and hard work. The catch is the configuration. Its reasoning is a dial, and the default leaves it unbounded, which is how a task that should cost $0.003 becomes a $0.06, seven-minute call. Set enable thinking: false for simple work and reasoning effort: medium or high for the rest, and GLM 5.2 is cheap and correct across the board. Leave reasoning on its default, and it is the slowest, priciest option you could have picked. Synthorai listing prices above are this platform's rates as of 2026-06-24; GLM generational rates are Zhipu's official list. Costs measured on Synthorai on 2026-06-24 glm-5.2 at $1.40 / $4.40 per M tokens ; verify current pricing before relying on it.