GLM 5.2 on 50 real Go and Rust PRs: last on quality, and not the cheapest

GLM 5.2 ranked last in code quality and equivalence among AI models tested on 50 real pull requests from Go and Rust repositories, while costing roughly twice as much as Composer 2.5. The model produced larger patches with fewer deletions, leading to lower equivalence scores, and its performance gap was statistically significant against premium models.

GLM 5.2 has been getting a lot of hype for being a "frontier killer". We can evaluate it two ways: how cheap is it, and is it good enough for most work? On tasks from these two open source repos, it is neither. It costs roughly twice what Composer 2.5 does and delivers lower quality on both dimensions that matter. Route GLM to first-draft work with a smarter model supervising. Don't run it unattended against production-grade repositories.

The setup: fifty real merged pull requests across two repos, graphql-go-tools in Go and sqlparser-rs in Rust, replayed against frozen snapshots so nothing leaks, one attempt per task. Every patch is graded beyond pass/fail: a craft score (0–4, code quality and idiom) and equivalence (0–1, how closely the patch reproduces the merged human PR's actual behavior). Stet (/methodology), the local eval harness I build, runs and grades them; a blinded gpt-5.4 judge scores independently of the runner. GLM ran at medium reasoning.

Does GLM belong at this table, and if not, where does it actually belong?

n=50 slice: graphql-go-tools (Go, n=25) plus sqlparser-rs (Rust, n=25), blinded GPT-5.4 judge. GLM 5.2 run at medium reasoning lands last in the field on craft and equivalence in both repos: cheaper than the premium arms, but pricier than Composer and last on quality. Below: where it stands, how it behaves, and what it costs.

calibrated standing · GLM 5.2 medium

Last on craft and equivalence, in both repos

craft = 8-grader mean (0–4); equivalence = how closely the patch reproduces the merged human PR's behavior (0–1). GLM is decision-grade behind the whole premium field on both, in both repos: a gap big enough to survive the statistics at this sample size. Against the budget arm Composer 2.5 it is a noise-band peer (too close to call) on Go and decision-grade behind on Rust equivalence, while costing about twice as much on Rust.
graphql-go-tools (Go), n=25

| Arm | craft | equiv | $/task |
|---|---|---|---|
| Opus 4.8 high | 2.90 | 0.73 | $3.98 |
| GPT-5.5 high | 2.72 | 0.73 | $4.69 |
| Opus 4.7 xhigh | 2.63 | 0.68 | $5.93 |
| Composer 2.5 | 2.48 | 0.60 | $0.71 |
| GLM 5.2 medium | 2.38 | 0.47 | $1.40 |

sqlparser-rs (Rust), n=25

| Arm | craft | equiv | $/task |
|---|---|---|---|
| Opus 4.8 high | 3.28 | 0.98 | $3.02 |
| Opus 4.7 xhigh | 2.98 | 0.97 | $3.55 |
| GPT-5.5 high | 2.94 | 0.96 | $3.41 |
| Composer 2.5 | 2.84 | 0.95 | $0.53 |
| GLM 5.2 medium | 2.69 | 0.78 | $1.04 |

GLM ran at medium reasoning. Composer's Go cost is recovered from raw Cursor logs (directional); all other figures are from the calibrated per-repo panels.

how GLM works

It writes more than the human, then misses the change

Median patch vs Composer 2.5 and the merged human PR. GLM and Composer both write more than the human, but GLM barely deletes (−16 / −12 vs Composer's −42 / −35 and the human's −47 / −17): it bolts new code on rather than editing the existing path. That deletion-aversion is plausibly why its equivalence lags: the real change often needed existing behavior modified, and GLM adds around it.

| Repo | turns | tokens |
|---|---|---|
| graphql-go-tools (Go) | 135 | 3.9M |
| sqlparser-rs (Rust) | 122 | 2.5M |

Medians per task. GLM produced a patch on all 50 tasks. It runs more agent turns than the premium arms (Rust: 122 vs Opus 4.8's 113, GPT-5.5's 94) at a fraction of the cost: it grinds longer, cheaply. Turns and tokens are GLM's own; Composer ran on a different harness, so its turn counts aren't directly comparable. Variance is wide: the worst Go task burned 14.1M tokens and $4.07 in a single 326-turn loop and still failed.

GLM 5.2 comparison: cost vs local score. Compare weighted Stet quality against rollout spend.
- GLM 5.2 medium: repo-balanced local score 64.6, cost per task $1.22
- Composer 2.5: repo-balanced local score 71.6, cost per task $0.62
- GPT-5.5 high: repo-balanced local score 75.6, cost per task $4.05
- Opus 4.8 high: repo-balanced local score 80.2, cost per task $3.50
- Opus 4.7 xhigh: repo-balanced local score 75.9, cost per task $4.74

Lower cost is better. Higher local score is better. Repo-mean pooled, both repos. GLM lands last on the blended local score (64.6, below Composer's 71.6); cheaper than the premium arms but ~2x Composer in both repos.

local score formula

A single repo-balanced display score: 5% tests + 30% equivalence + 25% code review + 25% craft + 15% footprint. GLM 5.2 lands last at 64.6, but only 7 points behind Composer, because footprint is the one component where GLM doesn't trail, and it props the blended number up. The calibrated per-axis read above is harsher.

same task, two models

Where Composer shipped it, GLM shipped a partial

Two sqlparser-rs tasks Composer solved cleanly. GLM passed the same tests and came back non-equivalent: tests green, behavior incomplete. The same plausible-partial shape shows up across the slice.

sqlparser-rs 1472: Hive negation vs PostgreSQL factorial. Disambiguate the bang operator by dialect.
Composer 2.5: Equivalent patch, cleared review, craft 3.68.
GLM 5.2: Passed the same tests, non-equivalent: it enabled both bang forms for GenericDialect.

sqlparser-rs 1493: JSON TABLE FOR ORDINALITY with NESTED PATH. Add the ordinality column syntax.
Composer 2.5: Equivalent and review-clean, craft 3.53.
GLM 5.2: Passed tests, non-equivalent: “misses the core real-world ordinality syntax: FOR ORDINALITY requires a data type.”

green tests, failed review

graphql-go-tools 859 ($0.36 · non-equivalent): planner-path optimization. Tests passed, non-equivalent: “plausible partial… misses the datasource metadata membership side.”

sqlparser-rs 1398 ($0.73 · non-equivalent): interval qualifiers. Tests passed, non-equivalent: “fails to enforce required qualifiers; regression coverage removed without visible replacement.”

It passes tests – so does everyone

GLM passed 38 of 50 tasks. That sounds usable. Composer 2.5 passed 44/50. GPT-5.5 also 44/50. Opus 4.8 hit 47/50. The test gate has saturated as a discriminator: all four models cluster above 75%. A model can write a structurally plausible patch, leave the core behavior incomplete, and still go green, because, on this dataset, test suites rarely cover every edge the human PR was fixing. GLM does exactly that, repeatedly, as the task contrasts below show. Routing decisions built on pass/fail won't separate GLM from Opus 4.8. The axes that diverge are equivalence and craft.

Last on quality, in both repos

GLM trails the entire field on craft and equivalence in both repos. Against the premium arms that gap is decision-grade: large enough to survive the statistics at n=50. Against Composer the picture splits by repo. On Go the craft and equivalence gaps are −0.10 and −0.14: noise-band, too close to call. On Rust craft stays close but equivalence falls 0.17 short, which is decision-grade. On the Rust repo every other arm reaches 0.95 equivalence or better; GLM sits at 0.78, the clear outlier. GLM is last on quality. Composer is also cheap, and it's better.

How GLM actually works

The quality gap has a mechanism.

Turns and tokens

GLM grinds. It ran a median of roughly 135 agent turns per task on Go and 122 on Rust. Opus 4.8 ran a median of ~113 turns on Rust, GPT-5.5 ~94.
GLM takes more turns than the premium models and produces worse output; its cost edge comes from cheap per-token pricing, not from efficiency. Median token spend ran ~3.9M per Go task and ~2.5M on Rust.

It writes more than the human, then misses the change

The patch shape is the tell. GLM writes roughly 1.8x the human's churn. On Go the median GLM patch added 222 lines and deleted 16 across 4 files; the human PR was +111 / −47. On Rust, the median GLM patch was +284 / −12 across 6 files against the human's +110 / −17. GLM bolts new code alongside the existing path and rarely edits or deletes what's there. That deletion-aversion is plausibly why equivalence lags: the real change often needed existing behavior modified, and GLM adds around the problem instead of through it.

Wildly variable

The variance is wide. Go 1308 burned 14.1M tokens and $4.07 across 326 turns of flail and still failed. Go 1034 produced a +522 / −6 patch at twice the human's line count: more code, same miss, craft 1.94. GLM can also land it cleanly: Rust 2174 reached an equivalent patch in roughly half the human's lines, and Rust 1538 got the behavior right but ground through 11.3M tokens and $3.21 to do it.

Same-task contrasts with Composer

The sharpest window is running both models on identical tasks. sqlparser-rs 1472 (Hive negation vs PostgreSQL factorial): Composer shipped an equivalent, review-clean patch, craft 3.68. GLM passed the same tests but wasn't equivalent: it enabled both bang forms for GenericDialect, which is not what the PR did. sqlparser-rs 1493 (JSON TABLE FOR ORDINALITY): Composer again equivalent and review-clean, craft 3.53; GLM non-equivalent, graded as missing the core real-world ordinality syntax. Two more in the same shape: Go 859 at $0.36 ("plausible partial… misses the datasource metadata membership side") and Rust 1398 at $0.73 ("fails to enforce required qualifiers").
The pattern is a plausible partial: tests go green, behavior stays incomplete, which is exactly the failure a test-only gate cannot see.

Cheaper than frontier, pricier than Composer

GLM at $1.40/task on Go and $1.04 on Rust runs about 0.3x what the premium arms cost: Opus 4.8 $3.02–$3.98, GPT-5.5 $3.41–$4.69, Opus 4.7 $3.55–$5.93. Measured upward, the cost position is real. Measured against the model priced directly below it, it isn't. Composer costs $0.71 on Go and $0.53 on Rust, so GLM is roughly 2x Composer in both repos, for worse quality.

The blended local score makes GLM look closer than it is: Opus 4.8 80.2, GPT-5.5 75.6, Composer 71.6, GLM 64.6. GLM sits only ~7 behind Composer because code footprint is the one component where it doesn't trail, and it props the composite up. The per-axis read is harsher and more accurate: decision-grade behind the entire premium field on craft and equivalence, decision-grade behind Composer on Rust equivalence. The "cheap" story only holds if you ignore the model priced right below GLM.

On lived cost: running the same fifty-task eval consumed 100% of a week's usage on GLM's $60/week plan. The same tasks on Composer used about 30% of a month's usage on its $20/month plan, and GLM's tighter caps fragment a long run into pieces in a way Composer's don't.

Route by behavior, not the test gate

If you already run Composer 2.5, GLM buys you nothing on these workloads: more turns, more cost, lower quality. The routing comparison isn't close. GLM's honest slot is supervised first-draft generation: work where a human or frontier model is tightly guiding it. The deletion-aversion and the plausible-partial pattern are manageable with eyes on the output; they are not manageable at scale without them. Tighter turn and cost caps contain the flail cases like 1308's 326-turn, $4.07 failure, but caps only limit the damage; they don't fix the incompleteness. GLM is a draft generator with a reviewer attached, not a merge trigger.
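The routing advice above can be sketched in a few lines. This is a minimal illustration and not part of Stet: `RunCaps`, `should_abort`, and `can_merge` are hypothetical names, and the cap values (250 turns, $3.00) are placeholders picked only so that a run like Go 1308 gets stopped early.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunCaps:
    """Hypothetical per-task limits; the numbers are placeholders, not tuned values."""
    max_turns: int = 250
    max_cost_usd: float = 3.00


def should_abort(turns: int, cost_usd: float, caps: RunCaps = RunCaps()) -> bool:
    """Stop a run once it exceeds either cap. This limits spend, not incompleteness."""
    return turns > caps.max_turns or cost_usd > caps.max_cost_usd


def can_merge(tests_passed: bool, reviewer_approved: bool) -> bool:
    """Green tests alone never merge; a human or stronger-model review must also approve."""
    return tests_passed and reviewer_approved


# Go 1308 from the article: 326 turns and $4.07 exceed both placeholder caps.
print(should_abort(turns=326, cost_usd=4.07))
# A plausible partial: tests green, review not cleared, so it stays a draft.
print(can_merge(tests_passed=True, reviewer_approved=False))
```

The trade-off is visible in the article's own numbers: a $3.00 cost cap would also have cut off Rust 1538, which reached the right behavior at $3.21. Caps bound the worst case; the review requirement is what catches the plausible partial.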
What this is and isn't

Fifty tasks per model, two repos (graphql-go-tools in Go and sqlparser-rs in Rust), June 2026. Grader: blinded gpt-5.4, independent of the runner. Single seed, one attempt per task. Contamination audited clean on both repos. Costs are cache-aware. GLM ran at medium reasoning.

Per-repo replication is the confidence mechanism: when a finding holds in Go and in Rust, that's a stronger signal than one repo alone. Pooled numbers are directional, not definitive. This is not a general benchmark: two repos, one domain cluster (parsers and query planning), one judge model, one agent scaffold. Other code domains, task distributions, and harness configurations may rank these models differently.

I can't build a routing rule for your codebase out of these numbers. Measure your own harness on your own code. GLM's behavior on sqlparser-rs says something specific about dialect-sensitive parser work; whether that carries to your service layer, your migrations, your infrastructure code is an open question. What would these standings look like on the three repos you actually maintain?

FAQ

Is GLM 5.2 good enough to replace a premium coding model?
On this 50-task slice, no. GLM finished last on craft and equivalence in both repos, decision-grade behind the entire premium field. Its honest slot is cheap, supervised first-draft work where a human reviews the patch, not unattended production work.

GLM 5.2 vs Composer 2.5: which cheap model is better?
On this slice, Composer. They are a noise-band tie on Go craft and equivalence, but Composer is decision-grade ahead on Rust equivalence and costs about half as much in both repos ($0.53/task Rust, $0.71 Go vs GLM's $1.04 and $1.40). GLM is cheaper than the premium field but not the cheapest arm.

Is GLM 5.2 cheap to run?
Cheaper than the premium field (~$1.40/task on Go, $1.04 on Rust, about 0.3x Opus 4.8 and GPT-5.5) but roughly 2x Composer 2.5 in both repos. Token volume drives its cost: it runs more agent turns than the premium models and writes more code than the human PR.

How does GLM 5.2 fail when it fails?
It ships a plausible partial: a test-passing patch that bolts on new code but misses the change the PR actually made. It writes about 1.8x the human's churn while rarely deleting, and it cleared code review on only 5 of 50 tasks. On sqlparser-rs 1472 and 1493, where Composer shipped equivalent, review-clean patches, GLM passed the same tests but came back non-equivalent.

Where is GLM 5.2 safe to use?
Cheap, supervised first-draft work where a human reviews before merge and the spec is narrow. Not auto-merge-on-green production work, and not unattended batch runs: its five-hour and weekly usage caps fragment long runs and its plausible-partial patches need human eyes.
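The cost ratios quoted in this FAQ can be rechecked from the per-task costs in the per-repo tables. The snippet below is a small arithmetic check, not part of the eval harness; the dictionary layout and the `ratio` helper are illustrative names, and Composer's Go cost is the directional figure noted under the tables.

```python
# Per-task costs in USD, copied from the per-repo tables in this article.
COST = {
    "go": {"GLM 5.2": 1.40, "Composer 2.5": 0.71, "Opus 4.8": 3.98, "GPT-5.5": 4.69},
    "rust": {"GLM 5.2": 1.04, "Composer 2.5": 0.53, "Opus 4.8": 3.02, "GPT-5.5": 3.41},
}


def ratio(repo: str, other: str) -> float:
    """GLM's cost per task as a multiple of another arm's, rounded to 2 places."""
    return round(COST[repo]["GLM 5.2"] / COST[repo][other], 2)


for repo in COST:
    for other in ("Composer 2.5", "Opus 4.8", "GPT-5.5"):
        print(f"{repo}: GLM costs {ratio(repo, other)}x {other}")
```

The Composer ratios come out just under 2x in both repos, and the Opus 4.8 and GPT-5.5 ratios fall between 0.3x and 0.35x, consistent with the "roughly 2x" and "about 0.3x" figures quoted above.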