# GLM 5.2: Reasoning Effort Is the Cost Lever

> Source: <https://dev.to/synthorai/glm-52-reasoning-effort-is-the-cost-lever-5ddm>
> Published: 2026-06-24 15:19:00+00:00

GLM 5.2 is now on Synthorai at about a sixth of frontier per-token prices, and the open-weight, frontier-benchmark headline is real. But the per-token price is the wrong number to anchor on. What a coding task actually costs on GLM 5.2 swings by more than an order of magnitude depending on a single knob, reasoning effort, and the default leaves that knob in the worst position. Set it well and GLM 5.2 is correct and cheaper than frontier on both easy and hard work. Leave it on the default and the same answer costs twenty times more and takes minutes. We measured it.

GLM 5.2 is Zhipu's open-weight frontier model, released 2026-06-13: a mixture-of-experts network (~744B total, ~40B active), a usable 1M-token context, and an MIT license you can self-host. It targets coding and agentic work, with strong published benchmarks (SWE-bench Pro 62.1, Terminal-Bench 2.1 81.0, AIME 2026 99.2, GPQA Diamond 91.2). On Synthorai it's `glm-5.2`

, priced at $1.40 per million input tokens and $4.40 per million output.

The detail that drives everything below: it is a reasoning model, and how much it reasons is something you set.

On per-token listing price, GLM 5.2 sits well below the Western frontier and among the cheaper Chinese models. Synthorai's rates for a representative set:

| Model | Input ($/M) | Output ($/M) | Cache read ($/M) |
|---|---|---|---|
`deepseek-v4-pro` |
0.44 | 0.87 | 0.0036 |
`kimi-k2.5` |
0.57 | 3.01 | 0.12 |
`glm-5.2` |
1.40 |
4.40 |
0.26 |
`qwen3-max` |
1.20 | 6.00 | 0.36 |
`gemini-3.1-pro` |
2.00 | 12.00 | 0.20 |
`claude-opus-4-8` |
5.00 | 25.00 | 0.50 |
`gpt-5.5` |
5.00 | 30.00 | 0.50 |

Its $4.40 output rate is about a seventh of `gpt-5.5`

and a sixth of `claude-opus-4-8`

, though `deepseek-v4-pro`

and `kimi-k2.5`

undercut it. So GLM 5.2 is frontier-class capability at roughly Chinese-model prices, not the absolute floor. There is no separate cache-write charge: a cache write bills at the input rate, and only the cache read is discounted to the rate above. The discount varies by vendor, with GLM 5.2's cache read about a fifth of its input rate and the frontier models (`gpt-5.5`

, `claude-opus-4-8`

, `gemini-3.1-pro`

) discounting reads to roughly a tenth.

It is also a step up from its own predecessors. The previous GLM generation was extraordinarily cheap; the GLM 5 line raised prices, and GLM 5.2 lands at about 3x the input rate of GLM-4.6 (Zhipu's official rates):

| GLM model | Released | Input ($/M) | Output ($/M) |
|---|---|---|---|
| GLM-4.5 | 2025-07 | 0.60 | 2.20 |
| GLM-4.6 | 2025-09 | 0.43 | 1.74 |
| GLM-5 | 2026 | 1.00 | 3.20 |
GLM-5.2 |
2026-06 | 1.40 |
4.40 |

That buys the 1M context and the frontier benchmarks. But the per-token rate is only the headline. What you actually pay per task is set by the reasoning effort.

GLM 5.2's reasoning is a dial, not a switch. You can turn it off (`enable_thinking: false`

), set `reasoning_effort`

to low, medium, or high, or leave it on the default, which runs reasoning unbounded. That setting changes cost and latency by far more than the price does. We ran one easy and one hard coding task across the settings, checking every answer against a reference on hundreds of randomized cases.

Weighted interval scheduling, a moderate dynamic-programming problem:

| Mode | Reasoning tokens | Answer tokens | Cost | Latency | Correct |
|---|---|---|---|---|---|
`glm-5.2` , thinking off |
0 | 169 | $0.0008 |
≈5s | yes |
`glm-5.2` , `reasoning_effort: low`
|
1,563 | 150 | $0.0076 | 39s | yes |
`glm-5.2` , unbounded default |
≈6,290 | ≈150 | $0.0285 | 137s | yes |
`gpt-5.5` (reference) |
59 | 141 | $0.0064 | 4.8s | yes |
`claude-opus-4-8` (reference) |
0 | 201 | $0.0057 | 3.3s | yes |

Two things stand out. Thinking off is correct and the cheapest thing on the board, about 8x under the frontier models, and every step up the dial just adds cost for the same answer. And the bill tracks the reasoning, not the answer: the code GLM returns is roughly 150 tokens every time, while the reasoning in front of it grows from nothing to about 6,300, billed at the same $4.40/M output rate. The unbounded default spends that reasoning to reach the same answer thinking off produced with none, and the gap is the entire cost difference. The frontier models answer here with little or no reported reasoning: `gpt-5.5`

spends 59 reasoning tokens, and `claude-opus-4-8`

's usage reports none.

Wildcard string matching (`?`

and `*`

), the classic problem that is easy to get subtly wrong. Here thinking off broke. It returned a memoized recursion:

``` python
def is_match(s, p):
    memo = {}
    def match(i, j):
        if (i, j) in memo:
            return memo[(i, j)]
        if j == len(p):
            result = i == len(s)
        elif i < len(s) and p[j] in (s[i], '?'):
            result = match(i + 1, j + 1)
        elif p[j] == '*':
            result = match(i + 1, j) or match(i, j + 1)
        else:
            result = False
        memo[(i, j)] = result
        return result
    return match(0, 0)
```

It looks right, and the memo even suggests some care. But the `*`

branch recurses `match(i + 1, j)`

without bounding `i`

. Once the string is consumed and the pattern still has a `*`

, `i`

climbs forever and the stack overflows. Fast, cheap, and wrong.

Turn the dial up and it returns the correct iterative two-pointer algorithm, which backtracks to the last `*`

instead of recursing:

``` python
def is_match(s, p):
    s_idx, p_idx, star_idx, match_idx = 0, 0, -1, 0
    while s_idx < len(s):
        if p_idx < len(p) and (p[p_idx] == '?' or p[p_idx] == s[s_idx]):
            s_idx += 1
            p_idx += 1
        elif p_idx < len(p) and p[p_idx] == '*':
            star_idx = p_idx
            match_idx = s_idx
            p_idx += 1
        elif star_idx != -1:
            p_idx = star_idx + 1
            match_idx += 1
            s_idx = match_idx
        else:
            return False
    while p_idx < len(p) and p[p_idx] == '*':
        p_idx += 1
    return p_idx == len(p)
```

The full dial on this task:

| GLM 5.2 setting | Cost | Latency | Correct |
|---|---|---|---|
| thinking off | $0.0007 | 6s | no (stack overflow) |
`reasoning_effort: high` |
$0.0031 |
13s | yes |
`reasoning_effort: medium` |
$0.0032 | 16s | yes |
`reasoning_effort: low` |
$0.0068 | 40s | yes |
| unbounded default | $0.062 | 405s | yes |
`gpt-5.5` (reference) |
$0.0064 | 5.4s | yes |
`claude-opus-4-8` (reference) |
$0.0069 | 4.6s | yes |

Every explicit effort level solved it. `reasoning_effort: high`

did it for $0.0031 in 13 seconds, about twenty times cheaper and thirty times faster than the unbounded default for the same answer, and it undercuts the frontier models on cost, just a few seconds slower. One quirk worth knowing: GLM's `low`

produced more reasoning than `high`

, consistently across both tasks, so the names don't track token count. Medium and high were the cheap, fast settings.

The unbounded default is the one setting to avoid. It is the worst of both worlds: it buys reasoning the task may not need and takes minutes to do it, reaching the same answer `reasoning_effort: high`

gave for twenty times the cost.

The lever is the reasoning effort, and the right setting belongs to the task, not the model:

`enable_thinking: false`

). Correct and about 8x under frontier.`reasoning_effort: medium`

or `high`

. Correct, around $0.003 a task, under frontier on cost and only a few seconds slower.If you cannot tell in advance whether a task needs reasoning, `reasoning_effort: high`

is a safe default: it was cheap, it solved both tasks, and it never ran away.

GLM 5.2 supports caching on the gateway, and it helps where you'd expect. We sent a 1,494-token shared prefix (a code module to review) with several different questions:

| Call | Prompt tokens | Cached | Output | Cost | Latency |
|---|---|---|---|---|---|
| new question, prefix not yet cached | 1,493 | 0 | 120 | $0.0026 | 6.5s |
| new question, prefix cached | 1,494 | 1,472 | 120 | $0.0009 | 5.1s |
| exact repeat (semantic hit) | 1,494 | 1,494 | 120 | $0.0009 | 1.0s |

Once a large prefix has been seen, it caches. The cached input tokens bill at roughly a fifth of the normal input rate, which cut an otherwise identical request from $0.0026 to $0.0009, about 64%. An exact repeat is served straight from the semantic cache: the same answer at the same cost as the cached call, but back in about a second instead of five.

The catch is the same one the dial taught: caching discounts the input, and the moment reasoning is on, the cost and latency live in the reasoning output, which is not cached. So caching is a real win for thinking-off, high-context work (the same system prompt or codebase on every call), and a small one once reasoning is on.

`glm-5.2`

is live on the gateway. Three practical notes from our testing:

`enable_thinking: false`

for simple work and `reasoning_effort: medium`

or `high`

for harder problems. The one thing to avoid is leaving reasoning on with no effort cap (the unbounded default), which is the $0.06, seven-minute trap.`stream: true`

and you get incremental output and the full result.Pricing is $1.40 / $4.40 per million tokens, and the gateway returns a `cost`

field per call so you can see exactly what each request cost.

GLM 5.2 is a genuinely cheap, capable coding model, and configured well it beats frontier prices on both easy and hard work. The catch is the configuration. Its reasoning is a dial, and the default leaves it unbounded, which is how a task that should cost $0.003 becomes a $0.06, seven-minute call. Set `enable_thinking: false`

for simple work and `reasoning_effort: medium`

or `high`

for the rest, and GLM 5.2 is cheap and correct across the board. Leave reasoning on its default, and it is the slowest, priciest option you could have picked.

(Synthorai listing prices above are this platform's rates as of 2026-06-24; GLM generational rates are Zhipu's official list.)

*Costs measured on Synthorai on 2026-06-24 ( glm-5.2 at $1.40 / $4.40 per M tokens); verify current pricing before relying on it.*