{"slug": "opus-vs-glm-5-2-in-a-coding-agent-pipeline-paired-run-findings", "title": "Opus vs GLM-5.2 in a coding-agent pipeline — paired-run findings", "summary": "A controlled A/B test comparing Claude Opus and GLM-5.2 in a coding-agent pipeline revealed qualitative differences in engineering behavior. Using the same paper-implementation pipeline across 10 repository forks, Anthropic's model produced scoping discussions while GLM-5.2 generated complete pull requests with code, tests, and documentation. The findings highlight distinct temperaments: one model defaults to analysis, the other to action.", "body_md": "A controlled A/B across 10 repository forks × 2 model providers, running an identical paper-implementation pipeline ([remyxai/outrider](https://github.com/remyxai/outrider): Claude Code under the hood, with `glm-5.2` routed to z.ai's Coding Plan endpoint vs. default Anthropic). Same paper pinned to each repo, same chain, same prompt set; the model is the only variable. The interesting findings aren't quantitative; they're qualitative differences in *how each agent behaves when asked to do real engineering work*.\n\nThe action under test is [remyxai/outrider](https://github.com/remyxai/outrider); the [`remyxai-cli`](https://github.com/remyxai/remyxai-cli) installs the workflow on a target fork and dispatches pinned-paper runs:\n\n```\n# Install Outrider on the target fork (one-time setup).\nremyxai outrider init --repo your-fork/repo --interest-id <uuid>\n\n# Drop the alternate provider's API key into the repo's secrets.\nremyxai outrider set-provider-secret \\\n  --repo your-fork/repo --provider zai --key-from ~/zai-key\n\n# A/B the same paper across providers + models.\nremyxai outrider trigger --repo your-fork/repo --pin-method 2606.27369v1 \\\n  --provider anthropic --model claude-opus-4-7\nremyxai outrider trigger --repo your-fork/repo --pin-method 2606.27369v1 \\\n  --provider zai --model glm-5.2\n```\n\nPinning bypasses 
candidate-selection, so the same paper lands on every fork, and the comparison reduces to \"model vs model on identical input.\" `--provider` picks the company / API endpoint; `--model` picks the specific model from that provider's catalog.\n\nSame paper ([arXiv:2606.27369v1](https://arxiv.org/abs/2606.27369v1): RiVER, an RL-without-ground-truth method), same repo (fork of [NousResearch/atropos](https://github.com/NousResearch/atropos)), same pipeline. **Two different verdicts.**\n\n- **Anthropic** drafted an implementation attempt, ran a self-review pass on its own output, decided the slice was too narrow to claim \"implements this paper,\" and downgraded to [Issue #7](https://github.com/smellslikeml/atropos/issues/7), a thoughtful scoping discussion of why this isn't mergeable yet.\n- **GLM-5.2** drafted the same kind of attempt, ran the same self-review pass, and shipped it as a [draft PR #8](https://github.com/smellslikeml/atropos/pull/8): +462 / -1 across [4 files](https://github.com/smellslikeml/atropos/pull/8/files), including a new 175-LOC reward-function module, 120 LOC of unit tests, module registration, and a directory README.\n\nA careful read of [GLM's diff](https://github.com/smellslikeml/atropos/pull/8/files) shows it's not a stub. The module docstring correctly names the paper, identifies the two failure modes RiVER analyzes (scale dominance, frequency dominance), and explains how its min-max calibration + emphasis exponent counters each. The tests aren't smoke tests: two of them specifically verify the paper's failure-mode claims (100× scaled instance → identical rewards; rare strong solver among many mediocre → only strong gets max reward), with the emphasis math worked out correctly. The PR body is honest about scope: \"starting point — no benchmarking experiment is bundled yet.\"\n\n**Two legitimate outputs from identical input.** Neither is wrong. They embody different *temperaments* around the same engineering question.\n\nDefaults to **action**. 
Where Anthropic asks scoping questions, GLM proposes a specific plan. On a kernel-port paper, GLM said \"profile the existing kernel on skinny shapes to confirm the bottleneck before porting\"; Anthropic listed three open questions. On a deterministic-control-plane paper, GLM proposed \"a `BaseMiddleware` subclass with a conformance test\"; Anthropic listed three scoping questions. On an active-perception paper, GLM proposed a thin video-eval harness consuming the released trajectories; Anthropic proposed a control-pattern-only prototype, explicitly dropping the eval question.\n\nGLM commits even when the commitment rests on assumptions the maintainer might not endorse. Its **failure mode is confabulation under uncertainty**. On one preflight pass against an observability platform's repo ([opik](https://github.com/smellslikeml/opik/issues/6) on the GLM side vs [opik](https://github.com/smellslikeml/opik/issues/5) on the Anthropic side), GLM asserted that \"the repo layout supplied for routing is empty,\" which is factually wrong: the repo has hundreds of relevant Python files, and Anthropic correctly named specific classes (`BaseOptimizer.optimize_prompt()`, `OptimizableAgent`, `LiteLLMAgent`) and module paths. When context-grounding falters, GLM reaches for an unsupported assertion rather than slowing down.\n\nWhen GLM does ship code, it ships *more* of it: longer Issue bodies, more sub-headings, more structure (`TL;DR / Suggested experiment / Engineering analysis / What blocks / How to unblock` is GLM's house style), wider artifact scope.\n\nDefaults to **scoping**. 
Where GLM commits to \"let's try this specific thing,\" Anthropic answers a different question: \"should we try this at all, and how do we know?\" Its Issue bodies are tighter, name specific files and line numbers, and lean on open questions over proposed experiments.\n\nIts **failure mode is under-engagement with meta-framing.** On two paired Issues, GLM caught sharp scope-defining observations Anthropic missed:\n\n- On [LiveKit Agents](https://github.com/smellslikeml/agents/issues/7) (GLM Issue), commenting on a video-understanding paper: *\"the repo's existing 'action' concept is LLM tool-calling within a voice turn, not perception actions over a video timeline.\"* That's a precise dissolution of the recommendation's premise: the two concepts share a word but not a structure.\n- On [AG2](https://github.com/smellslikeml/ag2/issues/7) (GLM Issue), commenting on a control-plane-for-coding-agents paper: *\"the paper argues governance should NOT be delegated to LLM orchestration, whereas AG2 IS an LLM-orchestration framework — even the intent sits beside, not inside, AG2.\"* That observation reframes the entire premise.\n\nAnthropic [answered the question](https://github.com/smellslikeml/agents/issues/6) the candidate-pool put in front of it. GLM sometimes questioned it.\n\nAnthropic's self-review pass is *also* stricter. On atropos, given roughly equivalent drafted output, Anthropic's self-review voted to downgrade to Issue while GLM's cleared. Either GLM's draft happened to clear Anthropic's higher bar (and the gap is in self-review calibration, not draft quality), or GLM's self-review is just laxer. 
Both interpretations are consistent with the data; disambiguating would need n>1 paired PR-attempts.\n\n8 of 9 paired completions matched on the routing verdict, including the deeper refinement-chain gates, not just preflight:\n\n- Both providers picked Issue when the repo lacked the infrastructure the paper depends on (no Triton runtime for [a Triton kernel paper](https://github.com/smellslikeml/mlx/issues/6); no MAS orchestration for [a multi-agent benchmark](https://github.com/smellslikeml/opik/issues/6); no trainer for [an RL recipe](https://github.com/smellslikeml/agents/issues/7); no config-supply-chain for [a governance paper](https://github.com/smellslikeml/ag2/issues/7)).\n- Both fired the same downstream gates when the chain ran: [diff-risk downgrade on OLMo-core](https://github.com/smellslikeml/OLMo-core/issues/4) (GLM) ↔ [same on Anthropic side](https://github.com/smellslikeml/OLMo-core/issues/3); [no-integration on neural-steering](https://github.com/smellslikeml/neural-steering/issues/4) (GLM) ↔ [same on Anthropic side](https://github.com/smellslikeml/neural-steering/issues/3); [preflight Issue on lm-evaluation-harness](https://github.com/smellslikeml/lm-evaluation-harness/issues/4) (GLM) ↔ [same on Anthropic side](https://github.com/smellslikeml/lm-evaluation-harness/issues/3).\n\n**The routing decision is largely model-insensitive.** When the evidence forces a particular verdict, both models see it, at preflight *and* at the deeper validator gates. 
If a workflow just needs \"which paper for this repo, and PR or Issue,\" the two models look interchangeable for that purpose.\n\nThe substance only diverges in *how* the verdict is justified, what evidence each model exploits, and what the body recommends doing next.\n\nAcross 10 paired runs (with 3 caveats: 2 GLM runs hit an envelope-parsing bug that dropped their token counts, and 1 hit the timeout described below; the clean comparison below is on the 7 fully-recorded pairs):\n\n| | Anthropic Opus | GLM-5.2 | Ratio |\n|---|---|---|---|\n| Total spend | $11.11 | $1.20 | Anthropic 9.2× more expensive |\n| Total input tokens | 54,858 | 407,441 | GLM uses 7.4× MORE input |\n| Total output tokens | 144,147 | 235,269 | GLM uses 1.6× more output |\n| Total wall-clock | 3,430s | 6,396s | GLM 1.9× slower overall |\n\nThe striking number is **input tokens**. GLM consumes roughly 7× more context to do the same task. It is not just verbose at output; it *reads* substantially more per run (more tool-use turns, weaker caching, or just less efficient context management depending on how you read it). 
It's still ~9× cheaper because the unit-price gap between the two providers is large enough to swamp the token gap.\n\nWall-clock varies a lot per fork, and the variance maps to the behavioral profiles above:\n\n| Fork | Anthropic | GLM-5.2 | GLM ratio |\n|---|---|---|---|\n| ultralytics | 491s | 376s | 0.77× (GLM faster) |\n| mlx | 259s | 289s | 1.12× |\n| opik | 180s | 283s | 1.57× |\n| OLMo-core | 541s | 884s | 1.63× |\n| neural-steering | 583s | 1069s | 1.83× |\n| lm-evaluation-harness | 135s | 289s | 2.14× |\n| ag2 | 97s | 261s | 2.69× |\n| agents | 94s | 253s | 2.69× |\n| atropos | 380s | 1582s | 4.16× |\n\nTwo patterns:\n\n- **Anthropic short-circuits faster on cheap-decision runs.** Its 94s / 97s preflight Issues (agents, ag2) explain the 2.7× gap on those forks: Anthropic ends quickly when the verdict is obvious, while GLM consistently spends 250–290s regardless.\n- **The committer pays for committing.** The 4.16× outlier on atropos is the same fork where GLM shipped real code and Anthropic self-reviewed-out. Writing 462 lines of well-tested module takes longer than deciding not to. The wall-clock cost directly mirrors the temperament difference.\n\nOne paired comparison hits an upstream wall: glm-5.2 on [open-instruct](https://github.com/smellslikeml/open-instruct) (~50K files) returns HTTP 529 from z.ai across multiple attempts. This is not the Outrider timeout: raising `--claude-timeout` to 1500s let preflight run well past its old 180s ceiling, and z.ai still rejects. The sibling glm-4.6 run on the same paper+repo completed cleanly in 96s with [Issue #4](https://github.com/smellslikeml/open-instruct/issues/4) (paper-grounded; names the OLMo-core trainer attention-mask interface as the integration blocker), so the constraint is z.ai's service capacity for glm-5.2 on a prompt this size, not anything on the Outrider side. 
The headline glm-5.2 vs Anthropic paired-PR result on this fork stays open until z.ai capacity recovers.\n\n- **The two agents have different temperaments, not different competence at this kind of work.** Both produce legitimate artifacts on the same input; they differ in how willing each is to commit, how strict each is about \"this isn't ready to ship,\" and how readily each questions the premise.\n- **Pick the failure mode you'd rather absorb.** GLM-5.2 errs confident-but-occasionally-wrong; Anthropic errs careful-but-occasionally-under-engaged with the framing. Neither is universally better.\n- **Routing convergence is robust enough that lightweight \"pick the paper, pick the route\" workflows can use either provider.** Substantive artifact quality is where the choice starts to matter.\n\nFor each fork, both the GLM-5.2 and Anthropic Outrider outputs are public:\n\n| Fork | Paper | Anthropic | GLM-5.2 |\n|---|---|---|---|\n| atropos | RiVER (RL w/o ground-truth) | [Issue #7](https://github.com/smellslikeml/atropos/issues/7) | [PR #8](https://github.com/smellslikeml/atropos/pull/8) |\n| mlx | | [Issue #5](https://github.com/smellslikeml/mlx/issues/5) | [Issue #6](https://github.com/smellslikeml/mlx/issues/6) |\n| opik | | [Issue #5](https://github.com/smellslikeml/opik/issues/5) | [Issue #6](https://github.com/smellslikeml/opik/issues/6) |\n| agents | | [Issue #6](https://github.com/smellslikeml/agents/issues/6) | [Issue #7](https://github.com/smellslikeml/agents/issues/7) |\n| ag2 | | [Issue #6](https://github.com/smellslikeml/ag2/issues/6) | [Issue #7](https://github.com/smellslikeml/ag2/issues/7) |\n| ultralytics | | [Issue #5](https://github.com/smellslikeml/ultralytics/issues/5) | [Issue #6](https://github.com/smellslikeml/ultralytics/issues/6) |\n| lm-evaluation-harness | | [Issue #3](https://github.com/smellslikeml/lm-evaluation-harness/issues/3) | [Issue #4](https://github.com/smellslikeml/lm-evaluation-harness/issues/4) |\n| OLMo-core | | [Issue #3](https://github.com/smellslikeml/OLMo-core/issues/3) | [Issue #4](https://github.com/smellslikeml/OLMo-core/issues/4) |\n| open-instruct | | [PR #3](https://github.com/smellslikeml/open-instruct/pull/3) | *(glm-5.2 unrecoverable — z.ai 529; sibling* |", 
"url": "https://wpnews.pro/news/opus-vs-glm-5-2-in-a-coding-agent-pipeline-paired-run-findings", "canonical_source": "https://gist.github.com/smellslikeml/36bf4939d76f0f84d113e2ddde5e6d3c", "published_at": "2026-06-29 22:36:05+00:00", "updated_at": "2026-06-30 03:48:35.631328+00:00", "lang": "en", "topics": ["large-language-models", "ai-agents", "developer-tools", "ai-research"], "entities": ["Anthropic", "GLM-5.2", "Claude Opus", "z.ai", "NousResearch", "remyxai", "arXiv"], "alternates": {"html": "https://wpnews.pro/news/opus-vs-glm-5-2-in-a-coding-agent-pipeline-paired-run-findings", "markdown": "https://wpnews.pro/news/opus-vs-glm-5-2-in-a-coding-agent-pipeline-paired-run-findings.md", "text": "https://wpnews.pro/news/opus-vs-glm-5-2-in-a-coding-agent-pipeline-paired-run-findings.txt", "jsonld": "https://wpnews.pro/news/opus-vs-glm-5-2-in-a-coding-agent-pipeline-paired-run-findings.jsonld"}}