# GLM 5.2 on 50 real Go and Rust PRs: last on quality, and not the cheapest

> Source: <https://www.stet.sh/blog/glm-5-2-passes-tests-fails-review>
> Published: 2026-06-19 12:00:00+00:00

GLM 5.2 has been getting a lot of hype for being a "frontier killer". We can evaluate it two ways: how cheap is it, and is it good enough for most work?

On tasks from two open source repos (graphql-go-tools and sqlparser-rs), it is neither. It costs roughly twice as much as Composer 2.5 for lower quality on both dimensions that matter.

**Route GLM to first-draft work with a stronger model or a human supervising. Don't run it unattended against production-grade repositories.**

The setup: fifty real merged pull requests across two repos — graphql-go-tools in Go and sqlparser-rs in Rust — replayed against frozen snapshots so nothing leaks, one attempt per task. Every patch is graded beyond pass/fail: a craft score (0–4, code quality and idiom) and equivalence (0–1, how closely the patch reproduces the merged human PR's actual behavior). [Stet](/methodology), the local eval harness I build, runs and grades them; a blinded gpt-5.4 judge scores independently of the runner. GLM ran at medium reasoning. Does GLM belong at this table, and if not, where does it actually belong?

n=50 slice: graphql-go-tools (Go, n=25) plus sqlparser-rs (Rust, n=25), blinded GPT-5.4 judge. GLM 5.2 (run at medium reasoning) lands last in the field on craft and equivalence in both repos — cheaper than the premium arms, but pricier than Composer and last on quality. Below: where it stands, how it behaves, and what it costs.

calibrated standing · GLM 5.2 medium

Last on craft and equivalence, in both repos

craft = 8-grader mean (0–4); equivalence = how closely the patch reproduces the merged human PR's behavior (0–1). GLM is decision-grade behind the whole premium field on both, in both repos — a gap big enough to survive the statistics at this sample size. Against the budget arm Composer 2.5 it is a noise-band peer (too close to call) on Go and decision-grade behind on Rust equivalence — while costing about twice as much on Rust.

graphql-go-tools (Go), n=25

| Arm | craft | equiv | $/task |
|---|---|---|---|
| Opus 4.8 high | 2.90 | 0.73 | $3.98 |
| GPT-5.5 high | 2.72 | 0.73 | $4.69 |
| Opus 4.7 xhigh | 2.63 | 0.68 | $5.93 |
| Composer 2.5 | 2.48 | 0.60 | $0.71 |
| GLM 5.2 medium | 2.38 | 0.47 | $1.40 |

sqlparser-rs (Rust), n=25

| Arm | craft | equiv | $/task |
|---|---|---|---|
| Opus 4.8 high | 3.28 | 0.98 | $3.02 |
| Opus 4.7 xhigh | 2.98 | 0.97 | $3.55 |
| GPT-5.5 high | 2.94 | 0.96 | $3.41 |
| Composer 2.5 | 2.84 | 0.95 | $0.53 |
| GLM 5.2 medium | 2.69 | 0.78 | $1.04 |

GLM ran at medium reasoning. Composer's Go cost is recovered from raw Cursor logs (directional); all other figures are from the calibrated per-repo panels.

how GLM works

It writes more than the human, then misses the change

Median patch vs Composer 2.5 and the merged human PR. GLM and Composer both write more than the human, but GLM barely deletes (−16 / −12 vs Composer's −42 / −35 and the human's −47 / −17) — it bolts new code on rather than editing the existing path. That deletion-aversion is plausibly why its equivalence lags: the real change often needed existing behavior modified, and GLM adds around it.

Chart panels (median patch size, lines added / deleted): graphql-go-tools (Go) and sqlparser-rs (Rust).

| Repo | Median turns | Median tokens |
|---|---|---|
| graphql-go-tools | 135 | 3.9M |
| sqlparser-rs | 122 | 2.5M |

Medians per task. GLM produced a patch on all 50 tasks. It runs more agent turns than the premium arms (Rust: 122 vs Opus 4.8's 113, GPT-5.5's 94) at a fraction of the cost — it grinds longer, cheaply. Turns and tokens are GLM's own; Composer ran on a different harness, so its turn counts aren't directly comparable. Variance is wide: the worst Go task burned 14.1M tokens and $4.07 in a single 326-turn loop and still failed.

glm 5.2 comparison

## cost vs local score

Compare weighted Stet quality against rollout spend.

- GLM 5.2 medium: repo-balanced local score 64.6, cost per task $1.22
- Composer 2.5: repo-balanced local score 71.6, cost per task $0.62
- GPT-5.5 high: repo-balanced local score 75.6, cost per task $4.05
- Opus 4.8 high: repo-balanced local score 80.2, cost per task $3.50
- Opus 4.7 xhigh: repo-balanced local score 75.9, cost per task $4.74

Lower cost is better; higher local score is better. Repo-mean pooled, both repos. GLM lands last on the blended local score (64.6), below Composer (71.6); it is cheaper than the premium arms but about 2x Composer in both repos. GLM ran at medium reasoning.
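The pooled cost figures can be checked by hand. Here is a minimal sketch, assuming "repo-mean pooled" means an unweighted mean of the two per-repo $/task figures (my reading of the caption, not a documented Stet formula). The dictionary and function names are illustrative only.

```python
# Per-repo $/task figures copied from the two tables in this post.
COST_PER_TASK = {
    "GLM 5.2 medium": {"go": 1.40, "rust": 1.04},
    "Composer 2.5": {"go": 0.71, "rust": 0.53},
    "GPT-5.5 high": {"go": 4.69, "rust": 3.41},
    "Opus 4.8 high": {"go": 3.98, "rust": 3.02},
    "Opus 4.7 xhigh": {"go": 5.93, "rust": 3.55},
}


def repo_balanced(per_repo: dict[str, float]) -> float:
    """Unweighted mean across repos, so each repo counts equally (assumed pooling rule)."""
    return sum(per_repo.values()) / len(per_repo)


pooled = {arm: round(repo_balanced(costs), 2) for arm, costs in COST_PER_TASK.items()}

# Print the arms from cheapest to most expensive.
for arm, cost in sorted(pooled.items(), key=lambda kv: kv[1]):
    print(f"{arm:16s} ${cost:.2f}")
```

Under that assumption the output matches the pooled figures listed above ($0.62, $1.22, $3.50, $4.05, $4.74). Because both repos have n=25, the mean of repo means is also the plain per-task mean here.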

local score formula

A single repo-balanced display score: 5% tests + 30% equivalence + 25% code review + 25% craft + 15% footprint. GLM 5.2 lands last at 64.6, but only 7 points behind Composer, because footprint is the one component where GLM doesn't trail, and it props the blended number up. The calibrated per-axis read above is harsher. GLM ran at medium reasoning.
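To make the weighting concrete, here is a rough sketch of how a blended score of this shape could be computed. The weights are the ones stated above. Everything else is an assumption: the post does not say how each component is normalized, so the sketch assumes every component is already on a 0–100 scale, and the example component values are hypothetical, not measured results.

```python
# Weights as stated in the formula above.
WEIGHTS = {
    "tests": 0.05,
    "equivalence": 0.30,
    "code_review": 0.25,
    "craft": 0.25,
    "footprint": 0.15,
}


def local_score(components: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of components, each assumed to be normalized to 0-100."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    if set(components) != set(weights):
        raise ValueError("components and weights must use the same keys")
    return sum(weights[name] * components[name] for name in weights)


# Hypothetical component scores for illustration only; NOT values from this eval.
example = {
    "tests": 80.0,
    "equivalence": 50.0,
    "code_review": 40.0,
    "craft": 60.0,
    "footprint": 90.0,
}

print(round(local_score(example), 1))
# Footprint's share of the total for this hypothetical arm.
print(round(WEIGHTS["footprint"] * example["footprint"], 1))
```

With these made-up inputs, a strong footprint score contributes 13.5 of 57.5 points, which is the kind of propping-up effect described above: one good component can narrow a blended gap that the per-axis numbers do not support.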

same task, two models

Where Composer shipped it, GLM shipped a partial

Two sqlparser-rs tasks Composer solved cleanly. GLM passed the same tests and came back non-equivalent — tests green, behavior incomplete. The same plausible-partial shape shows up across the slice.

| Task | Change | Composer 2.5 | GLM 5.2 |
|---|---|---|---|
| sqlparser-rs #1472: Hive `!` negation vs PostgreSQL `!` factorial | Disambiguate the bang operator by dialect. | Equivalent patch, cleared review, craft 3.68. | Passed the same tests, non-equivalent: it enabled both bang forms for GenericDialect. |
| sqlparser-rs #1493: JSON_TABLE FOR ORDINALITY with NESTED PATH | Add the ordinality column syntax. | Equivalent and review-clean, craft 3.53. | Passed tests, non-equivalent: “misses the core real-world ordinality syntax — FOR ORDINALITY requires a data type.” |

green tests, failed review

| Task | Cost | Area | Grader note |
|---|---|---|---|
| graphql-go-tools #859 | $0.36 | planner-path optimization | Tests passed, non-equivalent: “plausible partial… misses the datasource metadata membership side.” |
| sqlparser-rs #1398 | $0.73 | interval qualifiers | Tests passed, non-equivalent: “fails to enforce required qualifiers; regression coverage removed without visible replacement.” |

## It passes tests – so does everyone

GLM passed 38 of 50 tasks. That sounds usable. Composer 2.5 passed 44/50. GPT-5.5 also 44/50. Opus 4.8 hit 47/50.

The test gate has saturated as a discriminator — all four models cluster above 75%. A model can write a structurally plausible patch, leave the core behavior incomplete, and still go green, because, on this dataset, test suites rarely cover every edge the human PR was fixing. GLM does exactly that, repeatedly, as the task contrasts below show.

**Routing decisions built on pass/fail won't separate GLM from Opus 4.8. The axes that diverge are equivalence and craft.**
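For reference, the pass rates behind that claim, as a small sketch using only the counts quoted in this section (variable names are illustrative):

```python
TASKS = 50

# Tasks passed out of 50, as quoted above.
passed = {"GLM 5.2": 38, "Composer 2.5": 44, "GPT-5.5": 44, "Opus 4.8": 47}

pass_rate = {arm: count / TASKS for arm, count in passed.items()}

for arm, rate in pass_rate.items():
    print(f"{arm:13s} {rate:.0%}")

# Every arm clears 75%, so a pass/fail gate admits all four.
print(all(rate > 0.75 for rate in pass_rate.values()))
```

The rates run from 76% to 94%. A gate set anywhere at or below 75% treats all four models as interchangeable.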

## Last on quality, in both repos

GLM trails the entire field on craft and equivalence in both repos. Against the premium arms that gap is decision-grade — large enough to survive the statistics at n=50. Against Composer the picture splits by repo. On Go the craft and equivalence gaps are −0.10 and −0.14: noise-band, too close to call. On Rust craft stays close but equivalence falls 0.17 short, which is decision-grade. On the Rust repo every other arm clears 0.95 equivalence; GLM sits at 0.78, the clear outlier.

**GLM is last on quality. Composer is also cheap, and it's better.**
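The GLM-versus-Composer gaps can be recomputed from the rounded table values. This is a sketch only: it reproduces the differences, not the decision-grade or noise-band labels, which come from the calibrated panels and cannot be derived from two means. The names are illustrative.

```python
# Craft and equivalence means copied from the per-repo tables in this post.
SCORES = {
    "go": {
        "Composer 2.5": {"craft": 2.48, "equiv": 0.60},
        "GLM 5.2": {"craft": 2.38, "equiv": 0.47},
    },
    "rust": {
        "Composer 2.5": {"craft": 2.84, "equiv": 0.95},
        "GLM 5.2": {"craft": 2.69, "equiv": 0.78},
    },
}


def gap(repo: str, metric: str) -> float:
    """GLM minus Composer on one metric; negative means GLM trails."""
    arms = SCORES[repo]
    return round(arms["GLM 5.2"][metric] - arms["Composer 2.5"][metric], 2)


for repo in SCORES:
    print(repo, "craft", gap(repo, "craft"), "equiv", gap(repo, "equiv"))
```

From the rounded table values the Go equivalence gap comes out at −0.13 rather than the −0.14 quoted above; presumably the quoted figure was computed from unrounded scores.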

## How GLM actually works

The quality gap has a mechanism:

### Turns and tokens

GLM grinds. It ran a median of roughly 135 agent turns per task on Go and 122 on Rust. Opus 4.8 ran a median of ~113 turns on Rust, GPT-5.5 ~94. GLM takes more turns than the premium models and produces worse output; its cost edge comes from cheap per-token pricing, not from efficiency. Median token spend ran ~3.9M per Go task and ~2.5M on Rust.

### It writes more than the human, then misses the change

The patch shape is the tell. GLM writes roughly 1.8x the human's churn. On Go the median GLM patch added 222 lines and deleted 16 across 4 files; the human PR was +111 / −47. On Rust, GLM +284 / −12 across 6 files against the human's +110 / −17. GLM bolts new code alongside the existing path and rarely edits or deletes what's there. That deletion-aversion is plausibly why equivalence lags: the real change often needed existing behavior modified, and GLM adds around the problem instead of through it.
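A quick way to see the patch shape in numbers is sketched below. It assumes "churn" means lines added plus lines deleted, which the post does not define, and it uses only the median figures quoted above. The names are illustrative.

```python
# Median (added, deleted) line counts quoted above.
MEDIAN_PATCH = {
    "go": {"glm": (222, 16), "human": (111, 47)},
    "rust": {"glm": (284, 12), "human": (110, 17)},
}


def churn(added: int, deleted: int) -> int:
    """Total lines touched (assumed definition: added + deleted)."""
    return added + deleted


def deletion_share(added: int, deleted: int) -> float:
    """Fraction of the churn that is deletions."""
    return deleted / churn(added, deleted)


def churn_ratio(repo: str) -> float:
    """GLM churn relative to the human PR's churn."""
    patches = MEDIAN_PATCH[repo]
    return churn(*patches["glm"]) / churn(*patches["human"])


for repo, patches in MEDIAN_PATCH.items():
    print(
        repo,
        f"churn ratio {churn_ratio(repo):.2f}x,",
        f"GLM deletions {deletion_share(*patches['glm']):.0%},",
        f"human deletions {deletion_share(*patches['human']):.0%}",
    )
```

On these medians the ratio is about 1.5x on Go and 2.3x on Rust, which brackets the roughly 1.8x figure. The deletion share makes the point more directly: about 7% and 4% of GLM's churn is deletions, against about 30% and 13% for the human PRs.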

### Wildly variable

The variance is wide. Go #1308 burned 14.1M tokens and $4.07 across 326 turns of flail and still failed. Go #1034 produced a +522 / −6 patch at twice the human's line count — more code, same miss, craft 1.94. GLM can also land it cleanly: Rust #2174 reached an equivalent patch in roughly half the human's lines, and Rust #1538 got the behavior right but ground through 11.3M tokens and $3.21 to do it.

### Same-task contrasts with Composer

The sharpest window is running both models on identical tasks. sqlparser-rs #1472 (Hive `!` negation vs PostgreSQL `!` factorial): Composer shipped an equivalent, review-clean patch, craft 3.68. GLM passed the same tests but wasn't equivalent: it enabled both bang forms for GenericDialect, which is not what the PR did. sqlparser-rs #1493 (JSON_TABLE FOR ORDINALITY): Composer again equivalent and review-clean, craft 3.53; GLM non-equivalent, graded as missing the core real-world ordinality syntax. Two more in the same shape: Go #859 at $0.36 ("plausible partial… misses the datasource metadata membership side") and Rust #1398 at $0.73 ("fails to enforce required qualifiers").

**The pattern is a plausible partial: tests go green, behavior stays incomplete — which is exactly the failure a test-only gate cannot see.**

## Cheaper than frontier, pricier than Composer

GLM at $1.40/task on Go and $1.04 on Rust runs about 0.3x what the premium arms cost: Opus 4.8 $3.02–$3.98, GPT-5.5 $3.41–$4.69, Opus 4.7 $3.55–$5.93. Measured upward, the cost position is real.

Measured against the model directly below it, it isn't. Composer costs $0.71 on Go and $0.53 on Rust — GLM is roughly 2x Composer in both repos, for worse quality. The blended local score makes GLM look closer than it is: Opus 4.8 80.2, GPT-5.5 75.6, Composer 71.6, GLM 64.6. GLM sits only ~7 behind Composer because code footprint is the one component where it doesn't trail, and it props the composite up. The per-axis read is harsher and more accurate: decision-grade behind the entire premium field on craft and equivalence, decision-grade behind Composer on Rust equivalence.
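Both cost claims can be recomputed from the per-repo $/task figures. A minimal sketch; the dictionary layout and function name are illustrative:

```python
# $/task figures copied from the per-repo tables in this post.
COST = {
    "go": {
        "GLM 5.2 medium": 1.40,
        "Composer 2.5": 0.71,
        "Opus 4.8 high": 3.98,
        "GPT-5.5 high": 4.69,
        "Opus 4.7 xhigh": 5.93,
    },
    "rust": {
        "GLM 5.2 medium": 1.04,
        "Composer 2.5": 0.53,
        "Opus 4.8 high": 3.02,
        "GPT-5.5 high": 3.41,
        "Opus 4.7 xhigh": 3.55,
    },
}


def glm_ratio(repo: str, other: str) -> float:
    """GLM's cost per task as a multiple of another arm's cost per task."""
    return COST[repo]["GLM 5.2 medium"] / COST[repo][other]


for repo, arms in COST.items():
    for other in arms:
        if other != "GLM 5.2 medium":
            print(f"{repo:4s} GLM vs {other:14s} {glm_ratio(repo, other):.2f}x")
```

The ratios against the premium arms fall between roughly 0.24x and 0.35x, and the ratio against Composer is about 1.97x on Go and 1.96x on Rust.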

**The "cheap" story only holds if you ignore the model sitting right below GLM in the standings.**

On lived cost: running the same fifty-task eval consumed 100% of a week's usage on GLM's $60/week plan. The same tasks on Composer used about 30% of a month's usage on its $20/month plan — and GLM's tighter caps fragment a long run into pieces the way Composer's don't.
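As a back-of-envelope check on those plan figures, the sketch below assumes that a share of a plan's usage converts linearly into a share of its price. That is a simplification, and "about 30%" is itself approximate.

```python
TASKS = 50


def plan_spend(plan_price_usd: float, share_used: float) -> float:
    """Dollar value of the plan share consumed, assuming linear conversion."""
    return plan_price_usd * share_used


glm_spend = plan_spend(60.0, 1.00)       # 100% of a $60/week plan
composer_spend = plan_spend(20.0, 0.30)  # about 30% of a $20/month plan

print(f"GLM:      ${glm_spend:.2f} total, ${glm_spend / TASKS:.2f}/task")
print(f"Composer: ${composer_spend:.2f} total, ${composer_spend / TASKS:.2f}/task")
```

Under that assumption the run works out to about $60 on GLM's plan against about $6 on Composer's, or roughly $1.20 versus $0.12 per task.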

## Route by behavior, not the test gate

If you already run Composer 2.5, GLM buys you nothing on these workloads — more turns, more cost, lower quality. The routing comparison isn't close.

GLM's honest slot is supervised first-draft generation: work where a human or frontier model is tightly guiding it. The deletion-aversion and the plausible-partial pattern are manageable with eyes on the output; they are not manageable at scale without them. Tighter turn and cost caps contain the flail cases like #1308's 326-turn, $4.07 failure, but caps limit the damage — they don't fix the incompleteness.
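A turn-and-cost cap can be as simple as the sketch below. It is hypothetical: `RunBudget` is an invented name, not part of Stet or any agent harness, and the cap values (150 turns, $2.00) are arbitrary illustrations. The simulated run spreads #1308's $4.07 evenly over its 326 turns, which is a simplification.

```python
from dataclasses import dataclass


@dataclass
class RunBudget:
    """Hypothetical per-task budget guard for an agent loop."""
    max_turns: int
    max_cost_usd: float
    turns: int = 0
    cost_usd: float = 0.0

    def record_turn(self, turn_cost_usd: float) -> None:
        """Account for one completed agent turn."""
        self.turns += 1
        self.cost_usd += turn_cost_usd

    def exhausted(self) -> bool:
        """True once either cap has been reached."""
        return self.turns >= self.max_turns or self.cost_usd >= self.max_cost_usd


# Illustrative caps, not recommendations from this eval.
budget = RunBudget(max_turns=150, max_cost_usd=2.00)

# Simulate a flail shaped like Go #1308: $4.07 over 326 turns, spread evenly.
per_turn_cost = 4.07 / 326
while not budget.exhausted():
    budget.record_turn(per_turn_cost)

print(budget.turns, round(budget.cost_usd, 2))
```

With these caps the simulated run stops at the turn limit after spending a little under $1.90 instead of $4.07. The patch it would have produced is no more complete; the guard only bounds the loss.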

**GLM is a draft generator with a reviewer attached, not a merge trigger.**

## What this is and isn't

Fifty tasks per model, two repos — graphql-go-tools in Go and sqlparser-rs in Rust — June 2026. Grader: blinded gpt-5.4, independent of the runner. Single seed, one attempt per task. Contamination audited clean on both repos. Costs are cache-aware. GLM ran at medium reasoning.

Per-repo replication is the confidence mechanism: when a finding holds in Go and in Rust, that's a stronger signal than one repo alone. Pooled numbers are directional, not definitive.

This is not a general benchmark — two repos, one domain cluster (parsers and query planning), one judge model, one agent scaffold. Other code domains, task distributions, and harness configurations may rank these models differently. I can't build a routing rule for your codebase out of these numbers.

**Measure your own harness on your own code.** GLM's behavior on sqlparser-rs says something specific about dialect-sensitive parser work; whether that carries to your service layer, your migrations, your infrastructure code is an open question. What would these standings look like on the three repos you actually maintain?

## FAQ

### Is GLM 5.2 good enough to replace a premium coding model?

On this 50-task slice, no. GLM finished last on craft and equivalence in both repos, decision-grade behind the entire premium field. Its honest slot is cheap, supervised first-draft work where a human reviews the patch, not unattended production work.

### GLM 5.2 vs Composer 2.5 — which cheap model is better?

On this slice, Composer. They are a noise-band tie on Go craft and equivalence, but Composer is decision-grade ahead on Rust equivalence and costs about half as much in both repos ($0.53/task Rust, $0.71 Go vs GLM's $1.04 and $1.40). GLM is cheaper than the premium field but not the cheapest arm.

### Is GLM 5.2 cheap to run?

Cheaper than the premium field (~$1.40/task on Go, $1.04 on Rust, about 0.3x Opus 4.8 and GPT-5.5) but roughly 2x Composer 2.5 in both repos. Token volume drives its cost: it runs more agent turns than the premium models and writes more code than the human PR.

### How does GLM 5.2 fail when it fails?

It ships a plausible partial: a test-passing patch that bolts on new code but misses the change the PR actually made. It writes about 1.8x the human's churn while rarely deleting, and it cleared code review on only 5 of 50 tasks. On sqlparser-rs #1472 and #1493, where Composer shipped equivalent, review-clean patches, GLM passed the same tests but came back non-equivalent.

### Where is GLM 5.2 safe to use?

Cheap, supervised first-draft work where a human reviews before merge and the spec is narrow. Not auto-merge-on-green production work, and not unattended batch runs — its five-hour and weekly usage caps fragment long runs and its plausible-partial patches need human eyes.
