Source: gist.github.com · topic: large-language-models

Opus vs GLM-5.2 in a coding-agent pipeline — paired-run findings

A controlled A/B test comparing Claude Opus and GLM-5.2 in a coding-agent pipeline revealed qualitative differences in engineering behavior. Using the same paper-implementation pipeline across 10 repository forks, Anthropic's model produced scoping discussions while GLM-5.2 generated complete pull requests with code, tests, and documentation. The findings highlight distinct temperaments: one model defaults to analysis, the other to action.

8 min read · published Jun 29, 2026

A controlled A/B across 10 repository forks × 2 model providers, running an identical paper-implementation pipeline (remyxai/outrider, which uses Claude Code under the hood, with glm-5.2 routed at z.ai's Coding Plan endpoint vs default Anthropic). Same paper pinned to each repo, same chain, same prompt-set: model is the only variable. The interesting findings aren't quantitative; they're qualitative differences in how each agent behaves when asked to do real engineering work.

The action under test is remyxai/outrider; the remyxai-cli installs the workflow on a target fork and dispatches pinned-paper runs:

remyxai outrider init --repo your-fork/repo --interest-id <uuid>

remyxai outrider set-provider-secret \
  --repo your-fork/repo --provider zai --key-from ~/zai-key

remyxai outrider trigger --repo your-fork/repo --pin-method 2606.27369v1 \
  --provider anthropic --model claude-opus-4-7
remyxai outrider trigger --repo your-fork/repo --pin-method 2606.27369v1 \
  --provider zai --model glm-5.2

Pinning bypasses candidate-selection, so the same paper lands on every fork, and the comparison reduces to "model vs model on identical input." --provider picks the company / API endpoint; --model picks the specific model from that provider's catalog.
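The paired dispatch can be scripted. The following is a minimal sketch that assumes the CLI flags behave exactly as in the commands above; the helper name `paired_trigger_commands` and the fork name are hypothetical, and the script only builds argument lists rather than invoking `remyxai`.

```python
from typing import Dict, List

# Provider/model pairs copied from the trigger commands in this write-up.
ARMS = [("anthropic", "claude-opus-4-7"), ("zai", "glm-5.2")]


def paired_trigger_commands(forks: List[str], paper_id: str) -> Dict[str, List[List[str]]]:
    """Build one `remyxai outrider trigger` argv per (fork, provider) pair.

    Hypothetical helper: it constructs the commands but does not run them.
    """
    commands: Dict[str, List[List[str]]] = {}
    for fork in forks:
        commands[fork] = [
            [
                "remyxai", "outrider", "trigger",
                "--repo", fork,
                "--pin-method", paper_id,
                "--provider", provider,
                "--model", model,
            ]
            for provider, model in ARMS
        ]
    return commands


if __name__ == "__main__":
    # "your-fork/repo" is a placeholder, not one of the forks in the study.
    plan = paired_trigger_commands(["your-fork/repo"], "2606.27369v1")
    for fork, argvs in plan.items():
        for argv in argvs:
            print(" ".join(argv))
```

Keeping the paper id and fork list in one place makes it harder for the two arms to drift apart, which matters when the model is meant to be the only variable.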

Same paper (arXiv:2606.27369v1 — RiVER, an RL-without-ground-truth method), same repo (fork of NousResearch/atropos), same pipeline. Two different verdicts.

  • Anthropic drafted an implementation attempt, ran a self-review pass on its own output, decided the slice was too narrow to claim "implements this paper," and downgraded to Issue #7, a thoughtful scoping discussion of why this isn't mergeable yet.
  • GLM-5.2 drafted the same kind of attempt, ran the same self-review pass, and shipped it as a draft PR #8: +462 / -1 across 4 files, including a new 175-LOC reward-function module, 120 LOC of unit tests, module registration, and a directory README.

A careful read of GLM's diff shows it's not a stub. The module docstring correctly names the paper, identifies the two failure modes RiVER analyzes (scale dominance, frequency dominance), and explains how its min-max calibration + emphasis exponent counters each. The tests aren't smoke — two of them specifically verify the paper's failure-mode claims (100× scaled instance → identical rewards; rare strong solver among many mediocre → only strong gets max reward), with the emphasis math worked out correctly. The PR body is honest about scope: "starting point — no benchmarking experiment is bundled yet."
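To make the two tested properties concrete, here is a minimal sketch of a min-max calibrated reward with an emphasis exponent. This is not GLM's module and not the paper's exact formula: the functional form, the default exponent, and the name `calibrated_rewards` are all assumptions chosen only to illustrate scale invariance and the rare-strong-solver case described above.

```python
from typing import List


def calibrated_rewards(scores: List[float], emphasis: float = 2.0) -> List[float]:
    """Min-max calibrate scores within one instance, then sharpen with an exponent.

    Assumed form (illustrative only): r_i = ((s_i - min) / (max - min)) ** emphasis
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # No spread within the instance: nothing to rank, so give no signal.
        return [0.0] * len(scores)
    return [((s - lo) / (hi - lo)) ** emphasis for s in scores]


if __name__ == "__main__":
    # Scale dominance: multiplying every score by 100 leaves the rewards unchanged.
    base = [1.0, 2.0, 4.0]
    scaled = [100.0 * s for s in base]
    print(calibrated_rewards(base))
    print(calibrated_rewards(scaled))

    # Frequency dominance: many mediocre solvers, one rare strong one.
    group = [0.2, 0.5, 0.5, 0.5, 0.5, 0.9]
    print(calibrated_rewards(group))
```

Under this assumed form, only the top scorer in an instance reaches a reward of 1.0, and the exponent pushes mid-range scores further below it.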

Two legitimate outputs from identical input. Neither is wrong. They embody different temperaments around the same engineering question.

GLM-5.2 defaults to action. Where Anthropic asks scoping questions, GLM proposes a specific plan. On a kernel-port paper, GLM said "profile the existing kernel on skinny shapes to confirm the bottleneck before porting"; Anthropic listed three open questions. On a deterministic-control-plane paper, GLM proposed "a BaseMiddleware subclass with a conformance test"; Anthropic listed three scoping questions. On an active-perception paper, GLM proposed a thin video-eval harness consuming the released trajectories; Anthropic proposed a control-pattern-only prototype, explicitly dropping the eval question.

GLM commits even when the commitment rests on assumptions the maintainer might not endorse. Its failure mode is confabulation under uncertainty. On one preflight pass against an observability platform's repo (opik on the GLM side vs opik on the Anthropic side), GLM asserted that "the repo layout supplied for routing is empty". That is factually wrong: the repo has hundreds of relevant Python files, and Anthropic correctly named specific classes (BaseOptimizer.optimize_prompt(), OptimizableAgent, LiteLLMAgent) and module paths. When context-grounding falters, GLM reaches for an unsupported assertion rather than slowing down.

When GLM does ship code, it ships more of it: longer Issue bodies, more sub-headings, more structure (TL;DR / Suggested experiment / Engineering analysis / What blocks / How to unblock is GLM's house style), wider artifact scope.

Anthropic defaults to scoping. Where GLM commits to "let's try this specific thing," Anthropic answers a different question: "should we try this at all, and how do we know?" Its Issue bodies are tighter, name specific files and line numbers, and lean on open questions over proposed experiments.

Its failure mode is under-engagement with meta-framing. On two paired Issues, GLM caught sharp scope-defining observations Anthropic missed:

  • On LiveKit Agents (GLM Issue), commenting on a video-understanding paper: "the repo's existing 'action' concept is LLM tool-calling within a voice turn, not perception actions over a video timeline." That's a precise dissolution of the recommendation's premise: the two concepts share a word but not a structure.
  • On AG2 (GLM Issue), commenting on a control-plane-for-coding-agents paper: "the paper argues governance should NOT be delegated to LLM orchestration, whereas AG2 IS an LLM-orchestration framework — even the intent sits beside, not inside, AG2." That observation reframes the entire premise.

Anthropic answered the question the candidate-pool put in front of it. GLM sometimes questioned it.

Anthropic's self-review pass is also stricter. On atropos, given roughly equivalent drafted output, Anthropic's self-review voted to downgrade to Issue while GLM's cleared. Either GLM's draft happened to clear Anthropic's higher bar (and the gap is in self-review calibration, not draft quality), or GLM's self-review is just laxer. Both interpretations are consistent with the data; disambiguating would need n>1 paired PR-attempts.

8 of 9 paired completions matched on the routing verdict, including the deeper refinement-chain gates, not just preflight.

The routing decision is largely model-insensitive. When the evidence forces a particular verdict, both models see it — at preflight and at the deeper validator gates. If a workflow just needs "which paper for this repo, and PR or Issue," the two models look interchangeable for that purpose.

The substance only diverges in how the verdict is justified, what evidence each model exploits, and what the body recommends doing next.

Across 10 paired runs (with 3 caveats — 2 GLM runs hit an envelope-parsing bug that dropped their token counts, 1 hit the timeout described below; the clean comparison below is on the 7 fully-recorded pairs):

Metric               Anthropic Opus  GLM-5.2   Ratio
Total spend          $11.11          $1.20     Anthropic 9.2× more expensive
Total input tokens   54,858          407,441   GLM uses 7.4× MORE input
Total output tokens  144,147         235,269   GLM uses 1.6× more output
Total wall-clock     3,430s          6,396s    GLM 1.9× slower overall

The striking number is input tokens. GLM consumes 7× more context to do the same task — not just verbose at output, it reads substantially more per run (more tool-use turns, weaker caching, or just less efficient context management depending on how you read it). It's still ~9× cheaper because the unit-price gap between the two providers is large enough to swamp the token gap.
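The arithmetic behind "the price gap swamps the token gap" can be checked directly from the totals reported in this write-up. The sketch below derives a blended cost per million tokens for each provider; that blended figure is a derived quantity, not a published price, and it ignores caching and the separate input/output rates.

```python
# Totals as reported in this write-up for the fully-recorded pairs.
RUNS = {
    "anthropic": {"spend_usd": 11.11, "input_tokens": 54_858, "output_tokens": 144_147},
    "glm": {"spend_usd": 1.20, "input_tokens": 407_441, "output_tokens": 235_269},
}


def total_tokens(run: dict) -> int:
    return run["input_tokens"] + run["output_tokens"]


def blended_usd_per_mtok(run: dict) -> float:
    """Spend divided by all tokens, per million tokens (a derived, blended rate)."""
    return run["spend_usd"] / total_tokens(run) * 1_000_000


a, g = RUNS["anthropic"], RUNS["glm"]

spend_ratio = a["spend_usd"] / g["spend_usd"]                    # Anthropic vs GLM spend
token_gap = total_tokens(g) / total_tokens(a)                    # GLM vs Anthropic tokens
price_gap = blended_usd_per_mtok(a) / blended_usd_per_mtok(g)    # Anthropic vs GLM unit cost

if __name__ == "__main__":
    print(f"spend ratio: {spend_ratio:.2f}x")
    print(f"token gap:   {token_gap:.2f}x")
    print(f"price gap:   {price_gap:.2f}x")
    # By construction, spend_ratio equals price_gap / token_gap.
    print(f"price gap / token gap: {price_gap / token_gap:.2f}x")
```

On these totals GLM processes roughly three times as many tokens overall, while the blended unit cost differs by roughly thirty times, which is why the spend ratio still lands near nine.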

Wall-clock varies a lot per fork, and the variance maps to the behavioral profiles above:

Fork Anthropic GLM-5.2 GLM ratio
ultralytics 491s 376s 0.77× (GLM faster)
mlx 259s 289s 1.12×
opik 180s 283s 1.57×
OLMo-core 541s 884s 1.63×
neural-steering 583s 1069s 1.83×
lm-evaluation-harness 135s 289s 2.14×
ag2 97s 261s 2.69×
agents 94s 253s 2.69×
atropos 380s 1582s 4.16×
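The ratio column can be recomputed from the two timing columns. A minimal sketch using only the figures in the table above; the variable and function names are illustrative.

```python
# Wall-clock seconds per fork: (Anthropic, GLM-5.2), copied from the table above.
WALL_CLOCK = {
    "ultralytics": (491, 376),
    "mlx": (259, 289),
    "opik": (180, 283),
    "OLMo-core": (541, 884),
    "neural-steering": (583, 1069),
    "lm-evaluation-harness": (135, 289),
    "ag2": (97, 261),
    "agents": (94, 253),
    "atropos": (380, 1582),
}


def glm_ratio(fork: str) -> float:
    """GLM wall-clock divided by Anthropic wall-clock for one fork."""
    anthropic_s, glm_s = WALL_CLOCK[fork]
    return glm_s / anthropic_s


if __name__ == "__main__":
    # Sort from smallest to largest gap, matching the table's ordering.
    for fork in sorted(WALL_CLOCK, key=glm_ratio):
        anthropic_s, glm_s = WALL_CLOCK[fork]
        print(f"{fork:<24}{anthropic_s:>6}s{glm_s:>7}s{glm_ratio(fork):>7.2f}x")
```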

Two patterns:

  • Anthropic short-circuits faster on cheap-decision runs. Its 94s / 97s preflight Issues (agents, ag2) explain the 2.7× gap on those forks: Anthropic ends quickly when the verdict is obvious; GLM consistently spends 250–290s regardless.
  • The committer pays for committing. The 4.16× outlier on atropos is the same fork where GLM shipped real code and Anthropic self-reviewed-out. Writing 462 lines of well-tested module takes longer than deciding not to. The wall-clock cost directly mirrors the temperament difference.

One paired comparison hits an upstream wall: glm-5.2 on open-instruct (~50K files) returns HTTP 529 from z.ai across multiple attempts. This is not the Outrider timeout: raising --claude-timeout to 1500s let preflight run well past its old 180s ceiling, and z.ai still rejects. The sibling glm-4.6 run on the same paper+repo completed cleanly in 96s with Issue #4 (paper-grounded; names the OLMo-core trainer attention-mask interface as the integration blocker), so the constraint is z.ai's service capacity for glm-5.2 on a prompt this size, not anything on the Outrider side. The headline glm-5.2 vs Anthropic paired-PR result on this fork stays open until z.ai capacity recovers.
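For readers reproducing the run, a bounded retry makes a persistent 529 visible as an upstream rejection rather than a hang. This is a generic sketch, not how Outrider or the z.ai client handles retries: `call_with_backoff`, the `send` callable, and the delay schedule are all hypothetical.

```python
import time
from typing import Callable, Optional

OVERLOADED = 529  # status code reported from z.ai in this write-up


def call_with_backoff(
    send: Callable[[], int],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Call `send` until it returns something other than 529 or attempts run out.

    `send` is a hypothetical zero-argument callable returning an HTTP status code.
    Returns the last status seen, so a persistent 529 is surfaced to the caller.
    """
    status: Optional[int] = None
    for attempt in range(max_attempts):
        status = send()
        if status != OVERLOADED:
            return status
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
    return status


if __name__ == "__main__":
    # Simulated endpoint that is always overloaded; no real sleeping.
    waits = []
    final = call_with_backoff(lambda: OVERLOADED, sleep=waits.append)
    print(final, waits)
```

Injecting `sleep` keeps the function testable without real delays.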

  • The two agents have different temperaments, not different competence at this kind of work. Both produce legitimate artifacts on the same input; they differ in how willing each is to commit, how strict each is about "this isn't ready to ship," and how readily each questions the premise.
  • Pick the failure mode you'd rather absorb. GLM-5.2 errs confident-but-occasionally-wrong; Anthropic errs careful-but-occasionally-under-engaged with the framing. Neither is universally better.
  • Routing convergence is robust enough that lightweight "pick the paper, pick the route" workflows can use either provider. Substantive artifact quality is where the choice starts to matter.

For each fork, both the GLM-5.2 and Anthropic Outrider outputs are public:

Fork Paper Anthropic GLM-5.2
atropos RiVER (RL w/o ground-truth)

PR #8, Issue #5, Issue #6, Issue #5, Issue #6, Issue #6, Issue #7, Issue #6, Issue #7, Issue #5, Issue #6, Issue #3, Issue #4, Issue #3, Issue #4, PR #3 (glm-5.2 unrecoverable: z.ai 529; sibling
