Opus vs GLM-5.2 in a coding-agent pipeline — paired-run findings


A controlled A/B across 10 repository forks × 2 model providers, running an identical paper-implementation pipeline (remyxai/outrider: Claude Code under the hood, with glm-5.2 routed to z.ai's Coding Plan endpoint vs. default Anthropic). Same paper pinned to each repo, same chain, same prompt set; the model is the only variable. The interesting findings aren't quantitative; they're qualitative differences in how each agent behaves when asked to do real engineering work.

The action under test is remyxai/outrider; the remyxai-cli installs the workflow on a target fork and dispatches pinned-paper runs:

remyxai outrider init --repo your-fork/repo --interest-id <uuid>

remyxai outrider set-provider-secret \
  --repo your-fork/repo --provider zai --key-from ~/zai-key

remyxai outrider trigger --repo your-fork/repo --pin-method 2606.27369v1 \
  --provider anthropic --model claude-opus-4-7
remyxai outrider trigger --repo your-fork/repo --pin-method 2606.27369v1 \
  --provider zai --model glm-5.2

Pinning bypasses candidate-selection, so the same paper lands on every fork, and the comparison reduces to "model vs model on identical input." --provider picks the company / API endpoint; --model picks the specific model from that provider's catalog.
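The paired dispatch can be scripted. The sketch below only builds and prints the two trigger commands per fork (a dry run), using the flags shown above. The fork name in RUNS is a placeholder, and the helper names are hypothetical, not part of remyxai-cli.

```python
import shlex

# (fork, pinned arXiv ID) pairs. The fork name is a placeholder; the paper ID
# is the one used in the commands above. Add further pairs as needed.
RUNS = [("your-fork/repo", "2606.27369v1")]

# The two arms of the A/B, taken from the trigger commands above.
ARMS = [("anthropic", "claude-opus-4-7"), ("zai", "glm-5.2")]


def trigger_command(repo, paper, provider, model):
    """Build the argv for one pinned run, using only flags shown in the article."""
    return [
        "remyxai", "outrider", "trigger",
        "--repo", repo,
        "--pin-method", paper,
        "--provider", provider,
        "--model", model,
    ]


def paired_commands(runs, arms=ARMS):
    """One command per (fork, arm): every fork gets the same paper on both arms."""
    return [
        trigger_command(repo, paper, provider, model)
        for repo, paper in runs
        for provider, model in arms
    ]


if __name__ == "__main__":
    for argv in paired_commands(RUNS):
        # Dry run: print the command instead of executing it.
        print(shlex.join(argv))
```

Printing rather than executing keeps the sketch safe to run anywhere; passing each argv to subprocess.run would dispatch the runs for real, assuming the CLI is installed and authenticated.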

Same paper (arXiv:2606.27369v1 — RiVER, an RL-without-ground-truth method), same repo (fork of NousResearch/atropos), same pipeline. Two different verdicts.

- Anthropic drafted an implementation attempt, ran a self-review pass on its own output, decided the slice was too narrow to claim "implements this paper," and downgraded to Issue #7: a thoughtful scoping discussion of why this isn't mergeable yet.
- GLM-5.2 drafted the same kind of attempt, ran the same self-review pass, and shipped it as draft PR #8: +462 / -1 across 4 files, including a new 175-LOC reward-function module, 120 LOC of unit tests, module registration, and a directory README.

A careful read of GLM's diff shows it's not a stub. The module docstring correctly names the paper, identifies the two failure modes RiVER analyzes (scale dominance, frequency dominance), and explains how its min-max calibration + emphasis exponent counters each. The tests aren't smoke — two of them specifically verify the paper's failure-mode claims (100× scaled instance → identical rewards; rare strong solver among many mediocre → only strong gets max reward), with the emphasis math worked out correctly. The PR body is honest about scope: "starting point — no benchmarking experiment is bundled yet."
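To make the two failure-mode checks concrete, here is a minimal sketch of a per-instance reward with min-max calibration and an emphasis exponent. It is an assumption-laden illustration: the exact formula is not given here, so this is one plausible reading of the description, not the paper's definition and not the code in GLM's PR. The function name and the default exponent are hypothetical.

```python
def calibrated_rewards(scores, emphasis=2.0):
    """Min-max calibrate one instance's raw scores to [0, 1], then sharpen.

    ASSUMPTION: reward = ((s - min) / (max - min)) ** emphasis. This is an
    illustrative reading of "min-max calibration + emphasis exponent", not
    the formula from the paper or from the PR.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # No spread within the instance, so there is nothing to rank.
        return [0.0 for _ in scores]
    return [((s - lo) / (hi - lo)) ** emphasis for s in scores]


if __name__ == "__main__":
    # Scale dominance: multiplying every score by 100 leaves rewards unchanged,
    # because the calibration divides out the instance's own range.
    print(calibrated_rewards([1.0, 2.0, 4.0]))
    print(calibrated_rewards([100.0, 200.0, 400.0]))

    # Frequency dominance: one strong solver among many mediocre ones.
    # Only the strong solver reaches the maximum reward; the exponent pushes
    # the mediocre majority further down.
    print(calibrated_rewards([0.1] + [0.5] * 8 + [0.9]))
```

Under this reading, calibration handles scale dominance and the exponent handles frequency dominance, which matches the two properties the PR's tests are described as checking.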

Two legitimate outputs from identical input. Neither is wrong. They embody different temperaments around the same engineering question.

GLM-5.2 defaults to action. Where Anthropic asks scoping questions, GLM proposes a specific plan. On a kernel-port paper, GLM said "profile the existing kernel on skinny shapes to confirm the bottleneck before porting"; Anthropic listed three open questions. On a deterministic-control-plane paper, GLM proposed "a BaseMiddleware subclass with a conformance test"; Anthropic listed three scoping questions. On an active-perception paper, GLM proposed a thin video-eval harness consuming the released trajectories; Anthropic proposed a control-pattern-only prototype, explicitly dropping the eval question.

GLM commits even when the commitment rests on assumptions the maintainer might not endorse. Its failure mode is confabulation under uncertainty. On one preflight pass against an observability platform's repo (the opik fork on each side), GLM asserted that "the repo layout supplied for routing is empty." That is factually wrong: the repo has hundreds of relevant Python files, and Anthropic correctly named specific classes (BaseOptimizer.optimize_prompt(), OptimizableAgent, LiteLLMAgent) and module paths. When context-grounding falters, GLM reaches for an unsupported assertion rather than slowing down.

When GLM does ship code, it ships more of it: longer Issue bodies, more sub-headings, more structure (TL;DR / Suggested experiment / Engineering analysis / What blocks / How to unblock is GLM's house style), and wider artifact scope.

Anthropic defaults to scoping. Where GLM commits to "let's try this specific thing," Anthropic answers a different question: "should we try this at all, and how do we know?" Its Issue bodies are tighter, name specific files and line numbers, and lean on open questions over proposed experiments.

Its failure mode is under-engagement with meta-framing. On two paired Issues, GLM caught sharp scope-defining observations Anthropic missed:

- On LiveKit Agents (GLM Issue), commenting on a video-understanding paper: "the repo's existing 'action' concept is LLM tool-calling within a voice turn, not perception actions over a video timeline." That's a precise dissolution of the recommendation's premise: the two concepts share a word but not a structure.
- On AG2 (GLM Issue), commenting on a control-plane-for-coding-agents paper: "the paper argues governance should NOT be delegated to LLM orchestration, whereas AG2 IS an LLM-orchestration framework — even the intent sits beside, not inside, AG2." That observation reframes the entire premise.

Anthropic answered the question the candidate-pool put in front of it. GLM sometimes questioned it.

Anthropic's self-review pass is also stricter. On atropos, given roughly equivalent drafted output, Anthropic's self-review voted to downgrade to Issue while GLM's cleared. Either GLM's draft happened to clear Anthropic's higher bar (and the gap is in self-review calibration, not draft quality), or GLM's self-review is just laxer. Both interpretations are consistent with the data; disambiguating would need n>1 paired PR-attempts.

8 of 9 paired completions matched on the routing verdict — including the deeper refinement-chain gates, not just preflight:

- Both providers picked Issue when the repo lacked the infrastructure the paper depends on (no Triton runtime for a Triton kernel paper; no MAS orchestration for a multi-agent benchmark; no trainer for an RL recipe; no config-supply-chain for a governance paper).
- Both fired the same downstream gates when the chain ran: diff-risk downgrade on OLMo-core (GLM) ↔ same on the Anthropic side; no-integration on neural-steering (GLM) ↔ same on the Anthropic side; preflight Issue on lm-evaluation-harness (GLM) ↔ same on the Anthropic side.

The routing decision is largely model-insensitive. When the evidence forces a particular verdict, both models see it — at preflight and at the deeper validator gates. If a workflow just needs "which paper for this repo, and PR or Issue," the two models look interchangeable for that purpose.

The substance only diverges in how the verdict is justified, what evidence each model exploits, and what the body recommends doing next.

Across 10 paired runs (with 3 caveats — 2 GLM runs hit an envelope-parsing bug that dropped their token counts, 1 hit the timeout described below; the clean comparison below is on the 7 fully-recorded pairs):

Metric	Anthropic Opus	GLM-5.2	Ratio
Total spend	$11.11	$1.20	Anthropic 9.2× more expensive
Total input tokens	54,858	407,441	GLM uses 7.4× more input
Total output tokens	144,147	235,269	GLM uses 1.6× more output
Total wall-clock	3,430s	6,396s	GLM 1.9× slower overall

The striking number is input tokens. GLM consumes 7× more context to do the same task — not just verbose at output, it reads substantially more per run (more tool-use turns, weaker caching, or just less efficient context management depending on how you read it). It's still ~9× cheaper because the unit-price gap between the two providers is large enough to swamp the token gap.
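The arithmetic behind "cheaper despite more tokens" can be checked directly from the totals in the table. The sketch below derives a blended price per million tokens for each provider. The blended figure is an assumption-heavy simplification: it lumps input and output together and ignores cache reads and per-direction pricing, so treat it as an order-of-magnitude check rather than either provider's actual rate card.

```python
# Totals copied from the table above (the 7 fully recorded pairs).
TOTALS = {
    "anthropic": {"usd": 11.11, "input": 54_858, "output": 144_147},
    "glm-5.2": {"usd": 1.20, "input": 407_441, "output": 235_269},
}


def blended_usd_per_mtok(t):
    """Spend divided by all recorded tokens, scaled to one million tokens."""
    return t["usd"] / (t["input"] + t["output"]) * 1_000_000


def ratio(key, numerator, denominator):
    """Ratio of one recorded total between two providers."""
    return TOTALS[numerator][key] / TOTALS[denominator][key]


spend_ratio = ratio("usd", "anthropic", "glm-5.2")    # Anthropic spend / GLM spend
input_ratio = ratio("input", "glm-5.2", "anthropic")  # GLM input / Anthropic input
price_ratio = (
    blended_usd_per_mtok(TOTALS["anthropic"])
    / blended_usd_per_mtok(TOTALS["glm-5.2"])
)

if __name__ == "__main__":
    print(f"spend ratio (Anthropic / GLM):  {spend_ratio:.2f}x")
    print(f"input-token ratio (GLM / Anth): {input_ratio:.2f}x")
    print(f"blended $/Mtok ratio:           {price_ratio:.1f}x")
```

On these totals the blended unit price differs by roughly 30×, which is why a roughly 3× gap in total tokens still leaves GLM about 9× cheaper overall.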

Wall-clock varies a lot per fork, and the variance maps to the behavioral profiles above:

Fork	Anthropic	GLM-5.2	GLM ratio
ultralytics	491s	376s	0.77× (GLM faster)
mlx	259s	289s	1.12×
opik	180s	283s	1.57×
OLMo-core	541s	884s	1.63×
neural-steering	583s	1069s	1.83×
lm-evaluation-harness	135s	289s	2.14×
ag2	97s	261s	2.69×
agents	94s	253s	2.69×
atropos	380s	1582s	4.16×

Two patterns:

- Anthropic short-circuits faster on cheap-decision runs. Its 94s / 97s preflight Issues (agents, ag2) explain the 2.7× gap on those forks: Anthropic ends quickly when the verdict is obvious, while GLM consistently spends 250–290s regardless.
- The committer pays for committing. The 4.16× outlier on atropos is the same fork where GLM shipped real code and Anthropic self-reviewed-out. Writing 462 lines of well-tested module takes longer than deciding not to. The wall-clock cost directly mirrors the temperament difference.

One paired comparison hits an upstream wall: glm-5.2 on open-instruct (~50K files) returns HTTP 529 from z.ai across multiple attempts. This is not the Outrider timeout: raising --claude-timeout to 1500s let preflight run well past its old 180s ceiling, and z.ai still rejects the request. The sibling glm-4.6 run on the same paper and repo completed cleanly in 96s with Issue #4 (paper-grounded; names the OLMo-core trainer attention-mask interface as the integration blocker), so the constraint is z.ai's service capacity for glm-5.2 on a prompt this size, not anything on the Outrider side. The headline glm-5.2 vs Anthropic paired-PR result on this fork stays open until z.ai capacity recovers.

- The two agents have different temperaments, not different competence at this kind of work. Both produce legitimate artifacts on the same input; they differ in how willing each is to commit, how strict each is about "this isn't ready to ship," and how readily each questions the premise.
- Pick the failure mode you'd rather absorb. GLM-5.2 errs confident-but-occasionally-wrong; Anthropic errs careful-but-occasionally-under-engaged with the framing. Neither is universally better.
- Routing convergence is robust enough that lightweight "pick the paper, pick the route" workflows can use either provider. Substantive artifact quality is where the choice starts to matter.

For each fork, both the GLM-5.2 and Anthropic Outrider outputs are public:

Fork	Paper	Anthropic	GLM-5.2
atropos	RiVER (RL w/o ground-truth)

PR #8 Issue #5 Issue #6 Issue #5 Issue #6 Issue #6 Issue #7 Issue #6 Issue #7 Issue #5 Issue #6 Issue #3 Issue #4 Issue #3 Issue #4 PR #3(glm-5.2 unrecoverable — z.ai 529; sibling

Source: gist.github.com (original article)
