GLM 5.2: Reasoning Effort Is the Cost Lever

wpnews.pro

GLM 5.2 is now on Synthorai at about a sixth of frontier per-token prices, and the open-weight, frontier-benchmark headline is real. But the per-token price is the wrong number to anchor on. What a coding task actually costs on GLM 5.2 swings by more than an order of magnitude depending on a single knob, reasoning effort, and the default leaves that knob in the worst position. Set it well and GLM 5.2 is correct and cheaper than frontier on both easy and hard work. Leave it on the default and the same answer costs twenty times more and takes minutes. We measured it.

GLM 5.2 is Zhipu's open-weight frontier model, released 2026-06-13: a mixture-of-experts network (~744B total, ~40B active), a usable 1M-token context, and an MIT license you can self-host. It targets coding and agentic work, with strong published benchmarks (SWE-bench Pro 62.1, Terminal-Bench 2.1 81.0, AIME 2026 99.2, GPQA Diamond 91.2). On Synthorai it's glm-5.2

, priced at $1.40 per million input tokens and $4.40 per million output.

The detail that drives everything below: it is a reasoning model, and how much it reasons is something you set.

On per-token listing price, GLM 5.2 sits well below the Western frontier and among the cheaper Chinese models. Synthorai's rates for a representative set:

Model	Input ($/M)	Output ($/M)
`deepseek-v4-pro`
0.44	0.87	0.0036
`kimi-k2.5`
0.57	3.01	0.12
`glm-5.2`
1.40
4.40
0.26
`qwen3-max`
1.20	6.00	0.36
`gemini-3.1-pro`
2.00	12.00	0.20
`claude-opus-4-8`
5.00	25.00	0.50
`gpt-5.5`
5.00	30.00	0.50

Its $4.40 output rate is about a seventh of gpt-5.5

and a sixth of claude-opus-4-8

, though deepseek-v4-pro

and kimi-k2.5

undercut it. So GLM 5.2 is frontier-class capability at roughly Chinese-model prices, not the absolute floor. There is no separate cache-write charge: a cache write bills at the input rate, and only the cache read is discounted to the rate above. The discount varies by vendor, with GLM 5.2's cache read about a fifth of its input rate and the frontier models (gpt-5.5

, claude-opus-4-8

, gemini-3.1-pro

) discounting reads to roughly a tenth.

It is also a step up from its own predecessors. The previous GLM generation was extraordinarily cheap; the GLM 5 line raised prices, and GLM 5.2 lands at about 3x the input rate of GLM-4.6 (Zhipu's official rates):

GLM model	Released	Input ($/M)	Output ($/M)
GLM-4.5	2025-07	0.60	2.20
GLM-4.6	2025-09	0.43	1.74
GLM-5	2026	1.00	3.20
GLM-5.2
2026-06	1.40
4.40

That buys the 1M context and the frontier benchmarks. But the per-token rate is only the headline. What you actually pay per task is set by the reasoning effort.

GLM 5.2's reasoning is a dial, not a switch. You can turn it off (enable_thinking: false

), set reasoning_effort

to low, medium, or high, or leave it on the default, which runs reasoning unbounded. That setting changes cost and latency by far more than the price does. We ran one easy and one hard coding task across the settings, checking every answer against a reference on hundreds of randomized cases.

Weighted interval scheduling, a moderate dynamic-programming problem:

Mode	Reasoning tokens	Answer tokens	Cost	Latency
`glm-5.2` , thinking off
0	169	$0.0008
≈5s	yes
`glm-5.2` , `reasoning_effort: low`

1,563	150	$0.0076	39s	yes
`glm-5.2` , unbounded default
≈6,290	≈150	$0.0285	137s	yes
`gpt-5.5` (reference)
59	141	$0.0064	4.8s	yes
`claude-opus-4-8` (reference)
0	201	$0.0057	3.3s	yes

Two things stand out. Thinking off is correct and the cheapest thing on the board, about 8x under the frontier models, and every step up the dial just adds cost for the same answer. And the bill tracks the reasoning, not the answer: the code GLM returns is roughly 150 tokens every time, while the reasoning in front of it grows from nothing to about 6,300, billed at the same $4.40/M output rate. The unbounded default spends that reasoning to reach the same answer thinking off produced with none, and the gap is the entire cost difference. The frontier models answer here with little or no reported reasoning: gpt-5.5

spends 59 reasoning tokens, and claude-opus-4-8

's usage reports none.

Wildcard string matching (?

and *

), the classic problem that is easy to get subtly wrong. Here thinking off broke. It returned a memoized recursion:

def is_match(s, p):
    memo = {}
    def match(i, j):
        if (i, j) in memo:
            return memo[(i, j)]
        if j == len(p):
            result = i == len(s)
        elif i < len(s) and p[j] in (s[i], '?'):
            result = match(i + 1, j + 1)
        elif p[j] == '*':
            result = match(i + 1, j) or match(i, j + 1)
        else:
            result = False
        memo[(i, j)] = result
        return result
    return match(0, 0)

It looks right, and the memo even suggests some care. But the *

branch recurses match(i + 1, j)

without bounding i

. Once the string is consumed and the pattern still has a *

, i

climbs forever and the stack overflows. Fast, cheap, and wrong.

Turn the dial up and it returns the correct iterative two-pointer algorithm, which backtracks to the last *

instead of recursing:

def is_match(s, p):
    s_idx, p_idx, star_idx, match_idx = 0, 0, -1, 0
    while s_idx < len(s):
        if p_idx < len(p) and (p[p_idx] == '?' or p[p_idx] == s[s_idx]):
            s_idx += 1
            p_idx += 1
        elif p_idx < len(p) and p[p_idx] == '*':
            star_idx = p_idx
            match_idx = s_idx
            p_idx += 1
        elif star_idx != -1:
            p_idx = star_idx + 1
            match_idx += 1
            s_idx = match_idx
        else:
            return False
    while p_idx < len(p) and p[p_idx] == '*':
        p_idx += 1
    return p_idx == len(p)

The full dial on this task:

GLM 5.2 setting	Cost	Latency	Correct
thinking off	$0.0007	6s	no (stack overflow)
`reasoning_effort: high`
$0.0031
13s	yes
`reasoning_effort: medium`
$0.0032	16s	yes
`reasoning_effort: low`
$0.0068	40s	yes
unbounded default	$0.062	405s	yes
`gpt-5.5` (reference)
$0.0064	5.4s	yes
`claude-opus-4-8` (reference)
$0.0069	4.6s	yes

Every explicit effort level solved it. reasoning_effort: high

did it for $0.0031 in 13 seconds, about twenty times cheaper and thirty times faster than the unbounded default for the same answer, and it undercuts the frontier models on cost, just a few seconds slower. One quirk worth knowing: GLM's low

produced more reasoning than high

, consistently across both tasks, so the names don't track token count. Medium and high were the cheap, fast settings.

The unbounded default is the one setting to avoid. It is the worst of both worlds: it buys reasoning the task may not need and takes minutes to do it, reaching the same answer reasoning_effort: high

gave for twenty times the cost.

The lever is the reasoning effort, and the right setting belongs to the task, not the model:

enable_thinking: false

). Correct and about 8x under frontier.reasoning_effort: medium

or high

. Correct, around $0.003 a task, under frontier on cost and only a few seconds slower.If you cannot tell in advance whether a task needs reasoning, reasoning_effort: high

is a safe default: it was cheap, it solved both tasks, and it never ran away.

GLM 5.2 supports caching on the gateway, and it helps where you'd expect. We sent a 1,494-token shared prefix (a code module to review) with several different questions:

Call	Prompt tokens	Cached	Output	Cost	Latency
new question, prefix not yet cached	1,493	0	120	$0.0026	6.5s
new question, prefix cached	1,494	1,472	120	$0.0009	5.1s
exact repeat (semantic hit)	1,494	1,494	120	$0.0009	1.0s

Once a large prefix has been seen, it caches. The cached input tokens bill at roughly a fifth of the normal input rate, which cut an otherwise identical request from $0.0026 to $0.0009, about 64%. An exact repeat is served straight from the semantic cache: the same answer at the same cost as the cached call, but back in about a second instead of five.

The catch is the same one the dial taught: caching discounts the input, and the moment reasoning is on, the cost and latency live in the reasoning output, which is not cached. So caching is a real win for thinking-off, high-context work (the same system prompt or codebase on every call), and a small one once reasoning is on.

glm-5.2

is live on the gateway. Three practical notes from our testing:

enable_thinking: false

for simple work and reasoning_effort: medium

or high

for harder problems. The one thing to avoid is leaving reasoning on with no effort cap (the unbounded default), which is the $0.06, seven-minute trap.stream: true

and you get incremental output and the full result.Pricing is $1.40 / $4.40 per million tokens, and the gateway returns a cost

field per call so you can see exactly what each request cost.

GLM 5.2 is a genuinely cheap, capable coding model, and configured well it beats frontier prices on both easy and hard work. The catch is the configuration. Its reasoning is a dial, and the default leaves it unbounded, which is how a task that should cost $0.003 becomes a $0.06, seven-minute call. Set enable_thinking: false

for simple work and reasoning_effort: medium

or high

for the rest, and GLM 5.2 is cheap and correct across the board. Leave reasoning on its default, and it is the slowest, priciest option you could have picked.

(Synthorai listing prices above are this platform's rates as of 2026-06-24; GLM generational rates are Zhipu's official list.)

Costs measured on Synthorai on 2026-06-24 ( glm-5.2 at $1.40 / $4.40 per M tokens); verify current pricing before relying on it.

source & further reading

dev.to — original article The Best Free Sports Data APIs in 2025: A Developer's Practical Review Did you want more Claude on Cloud? ☁️ Let's talk about Securing Agents at Scale Issue-Orchestrator: A Software Engineering Control Plane for Coding Agents

GLM 5.2: Reasoning Effort Is the Cost Lever

Run your AI side-project on zahid.host