A controlled A/B across 10 repository forks × 2 model providers, running an identical paper-implementation pipeline (remyxai/outrider, which runs Claude Code under the hood, with `glm-5.2` routed at z.ai's Coding Plan endpoint vs. default Anthropic). The same paper is pinned to each repo, with the same chain and the same prompt set, so the model is the only variable. The interesting findings aren't quantitative; they're qualitative differences in how each agent behaves when asked to do real engineering work.
The action under test is remyxai/outrider; the `remyxai-cli` installs the workflow on a target fork and dispatches pinned-paper runs:

    remyxai outrider init --repo your-fork/repo --interest-id <uuid>
    remyxai outrider set-provider-secret \
      --repo your-fork/repo --provider zai --key-from ~/zai-key
    remyxai outrider trigger --repo your-fork/repo --pin-method 2606.27369v1 \
      --provider anthropic --model claude-opus-4-7
    remyxai outrider trigger --repo your-fork/repo --pin-method 2606.27369v1 \
      --provider zai --model glm-5.2

Pinning bypasses candidate selection, so the same paper lands on every fork and the comparison reduces to "model vs. model on identical input." `--provider` picks the company / API endpoint; `--model` picks the specific model from that provider's catalog.
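
To make the paired design concrete, here is a minimal dry-run sketch that builds the two `trigger` invocations for one fork, one per provider arm. It only uses the flags shown in the commands above; the names `ARMS` and `trigger_commands` are hypothetical, and the script prints the commands rather than executing them.

```python
import shlex

# Provider/model pairs copied from the trigger commands in this post.
ARMS = [("anthropic", "claude-opus-4-7"), ("zai", "glm-5.2")]


def trigger_commands(repo: str, paper_id: str) -> list[list[str]]:
    """Build one `remyxai outrider trigger` argv per provider arm for a fork.

    Everything except --provider / --model is held constant, so the two
    commands differ only in the model under test.
    """
    return [
        [
            "remyxai", "outrider", "trigger",
            "--repo", repo,
            "--pin-method", paper_id,
            "--provider", provider,
            "--model", model,
        ]
        for provider, model in ARMS
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of dispatching them.
    for argv in trigger_commands("your-fork/repo", "2606.27369v1"):
        print(shlex.join(argv))
```

Keeping the pinned paper ID and repo as the only function arguments makes it hard to accidentally vary anything other than the provider arm between the two runs.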
Same paper (arXiv:2606.27369v1 — RiVER, an RL-without-ground-truth method), same repo (fork of NousResearch/atropos), same pipeline. Two different verdicts.
- Anthropic drafted an implementation attempt, ran a self-review pass on its own output, decided the slice was too narrow to claim "implements this paper," and downgraded to Issue #7: a thoughtful scoping discussion of why this isn't mergeable yet.
- GLM-5.2 drafted the same kind of attempt, ran the same self-review pass, and shipped it as draft PR #8: +462 / -1 across 4 files, including a new 175-LOC reward-function module, 120 LOC of unit tests, module registration, and a directory README.

A careful read of GLM's diff shows it's not a stub. The module docstring correctly names the paper, identifies the two failure modes RiVER analyzes (scale dominance, frequency dominance), and explains how its min-max calibration + emphasis exponent counters each. The tests aren't smoke — two of them specifically verify the paper's failure-mode claims (100× scaled instance → identical rewards; rare strong solver among many mediocre → only strong gets max reward), with the emphasis math worked out correctly. The PR body is honest about scope: "starting point — no benchmarking experiment is bundled yet."
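
The two failure-mode tests described above can be illustrated with a small sketch. This is not the code from the PR and not the paper's exact formula: it assumes a per-instance min-max calibration followed by a power-law emphasis exponent, and the function name `calibrated_rewards` and its `emphasis` parameter are hypothetical.

```python
from typing import Sequence


def calibrated_rewards(scores: Sequence[float], emphasis: float = 2.0) -> list[float]:
    """Min-max calibrate raw scores within one instance, then sharpen them.

    Assumed form (illustrative only): r_i = ((s_i - min) / (max - min)) ** emphasis.
    Min-max calibration removes the instance's score scale; an exponent > 1
    pushes mediocre scores toward 0 so a rare strong solver stands out.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # No spread within the instance, so there is no ranking signal.
        return [0.0 for _ in scores]
    return [((s - lo) / (hi - lo)) ** emphasis for s in scores]


if __name__ == "__main__":
    # Scale dominance: a 100x-scaled instance yields the same rewards.
    print(calibrated_rewards([1.0, 2.0, 5.0]))
    print(calibrated_rewards([100.0, 200.0, 500.0]))
    # Frequency dominance: many mediocre solvers, one strong one.
    print(calibrated_rewards([0.30, 0.32, 0.31, 0.29, 0.90]))
```

Under these assumptions the first two calls return identical lists, and in the third only the last solver reaches the maximum reward of 1.0 while the mediocre ones stay close to 0.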
Two legitimate outputs from identical input. Neither is wrong. They embody different temperaments around the same engineering question.
GLM-5.2 defaults to action. Where Anthropic asks scoping questions, GLM proposes a specific plan. On a kernel-port paper, GLM said "profile the existing kernel on skinny shapes to confirm the bottleneck before porting"; Anthropic listed three open questions. On a deterministic-control-plane paper, GLM proposed "a `BaseMiddleware` subclass with a conformance test"; Anthropic listed three scoping questions. On an active-perception paper, GLM proposed a thin video-eval harness consuming the released trajectories; Anthropic proposed a control-pattern-only prototype, explicitly dropping the eval question.
GLM commits even when the commitment rests on assumptions the maintainer might not endorse. Its failure mode is confabulation under uncertainty. On one preflight pass against an observability platform's repo (the opik fork, on both the GLM and Anthropic sides), GLM asserted that "the repo layout supplied for routing is empty." That is factually wrong: the repo has hundreds of relevant Python files, and Anthropic correctly named specific classes (`BaseOptimizer.optimize_prompt()`, `OptimizableAgent`, `LiteLLMAgent`) and module paths. When context-grounding falters, GLM reaches for an unsupported assertion rather than slowing down.
When GLM does ship code, it ships more of it: longer Issue bodies, more sub-headings, more structure (`TL;DR / Suggested experiment / Engineering analysis / What blocks / How to unblock` is GLM's house style), wider artifact scope.
Anthropic defaults to scoping. Where GLM commits to "let's try this specific thing," Anthropic answers a different question: "should we try this at all, and how do we know?" Its Issue bodies are tighter, name specific files and line numbers, and lean on open questions over proposed experiments.
Its failure mode is under-engagement with meta-framing. On two paired Issues, GLM caught sharp scope-defining observations Anthropic missed:
- On LiveKit Agents (GLM Issue), commenting on a video-understanding paper: *"the repo's existing 'action' concept is LLM tool-calling within a voice turn, not perception actions over a video timeline."* That's a precise dissolution of the recommendation's premise: the two concepts share a word but not a structure.
- On AG2 (GLM Issue), commenting on a control-plane-for-coding-agents paper: *"the paper argues governance should NOT be delegated to LLM orchestration, whereas AG2 IS an LLM-orchestration framework — even the intent sits beside, not inside, AG2."* That observation reframes the entire premise.

Anthropic answered the question the candidate-pool put in front of it. GLM sometimes questioned it.
Anthropic's self-review pass is also stricter. On atropos, given roughly equivalent drafted output, Anthropic's self-review voted to downgrade to Issue while GLM's cleared. Either GLM's draft happened to clear Anthropic's higher bar (and the gap is in self-review calibration, not draft quality), or GLM's self-review is just laxer. Both interpretations are consistent with the data; disambiguating would need n>1 paired PR-attempts.
8 of 9 paired completions matched on the routing verdict — including the deeper refinement-chain gates, not just preflight:
- Both providers picked Issue when the repo lacked the infrastructure the paper depends on (no Triton runtime for a Triton kernel paper; no MAS orchestration for a multi-agent benchmark; no trainer for an RL recipe; no config-supply-chain for a governance paper).
- Both fired the same downstream gates when the chain ran: diff-risk downgrade on OLMo-core (GLM) ↔ same on Anthropic side; no-integration on neural-steering (GLM) ↔ same on Anthropic side; preflight Issue on lm-evaluation-harness (GLM) ↔ same on Anthropic side.

The routing decision is largely model-insensitive. When the evidence forces a particular verdict, both models see it — at preflight and at the deeper validator gates. If a workflow just needs "which paper for this repo, and PR or Issue," the two models look interchangeable for that purpose.
The substance only diverges in how the verdict is justified, what evidence each model exploits, and what the body recommends doing next.
Across 10 paired runs (with 3 caveats — 2 GLM runs hit an envelope-parsing bug that dropped their token counts, 1 hit the timeout described below; the clean comparison below is on the 7 fully-recorded pairs):
| Metric | Anthropic Opus | GLM-5.2 | Ratio |
|---|---|---|---|
| Total spend | $11.11 | $1.20 | Anthropic 9.2× more expensive |
| Total input tokens | 54,858 | 407,441 | GLM uses 7.4× more input |
| Total output tokens | 144,147 | 235,269 | GLM uses 1.6× more output |
| Total wall-clock | 3,430s | 6,396s | GLM 1.9× slower overall |

The striking number is input tokens. GLM consumes 7× more context to do the same task — not just verbose at output, it reads substantially more per run (more tool-use turns, weaker caching, or just less efficient context management depending on how you read it). It's still ~9× cheaper because the unit-price gap between the two providers is large enough to swamp the token gap.
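
The "unit-price gap swamps the token gap" claim can be checked with back-of-the-envelope arithmetic on the totals in the table. The sketch below computes a blended dollars-per-million-tokens figure for each provider. That blended rate is an illustrative simplification, not either provider's price list: it lumps input and output tokens together and assumes the recorded token counts are the full billing basis (cached reads, for example, may not be included). The `Totals` class is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Totals:
    """Aggregate figures for one provider, copied from the table."""
    spend_usd: float
    input_tokens: int
    output_tokens: int

    @property
    def tokens(self) -> int:
        return self.input_tokens + self.output_tokens

    @property
    def usd_per_million(self) -> float:
        # Blended rate: total spend divided by all recorded tokens.
        return self.spend_usd / self.tokens * 1_000_000


anthropic = Totals(spend_usd=11.11, input_tokens=54_858, output_tokens=144_147)
glm = Totals(spend_usd=1.20, input_tokens=407_441, output_tokens=235_269)

spend_ratio = anthropic.spend_usd / glm.spend_usd             # Anthropic costs more
token_ratio = glm.tokens / anthropic.tokens                   # GLM uses more tokens
price_ratio = anthropic.usd_per_million / glm.usd_per_million  # blended price gap

if __name__ == "__main__":
    print(f"spend ratio (Anthropic/GLM):   {spend_ratio:.2f}x")
    print(f"token ratio (GLM/Anthropic):   {token_ratio:.2f}x")
    print(f"blended price ratio (A/GLM):   {price_ratio:.1f}x")
```

By construction the spend ratio equals the blended price ratio divided by the token ratio. On the rounded table figures, a blended price gap of roughly 30× divided by a total-token gap of roughly 3.2× leaves Anthropic a little over 9× more expensive, in line with the table's 9.2× (the small difference is presumably rounding of the dollar amounts).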
Wall-clock varies a lot per fork, and the variance maps to the behavioral profiles above:
| Fork | Anthropic | GLM-5.2 | GLM ratio |
|---|---|---|---|
| ultralytics | 491s | 376s | 0.77× (GLM faster) |
| mlx | 259s | 289s | 1.12× |
| opik | 180s | 283s | 1.57× |
| OLMo-core | 541s | 884s | 1.63× |
| neural-steering | 583s | 1069s | 1.83× |
| lm-evaluation-harness | 135s | 289s | 2.14× |
| ag2 | 97s | 261s | 2.69× |
| agents | 94s | 253s | 2.69× |
| atropos | 380s | 1582s | 4.16× |

Two patterns:
- Anthropic short-circuits faster on cheap-decision runs. Its 94s / 97s preflight Issues (agents, ag2) explain the 2.7× gap on those forks: Anthropic ends quickly when the verdict is obvious, while GLM consistently spends 250–290s regardless.
- The committer pays for committing. The 4.16× outlier on atropos is the same fork where GLM shipped real code and Anthropic self-reviewed-out. Writing 462 lines of well-tested module takes longer than deciding not to. The wall-clock cost directly mirrors the temperament difference.

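
Both patterns can be read straight off the per-fork table. The sketch below recomputes the GLM/Anthropic ratio for each fork and isolates the runs where Anthropic finished quickly. The 100-second cut-off for "short-circuited" is a hypothetical threshold chosen for illustration, not something the pipeline defines.

```python
from statistics import median

# (Anthropic seconds, GLM-5.2 seconds) per fork, copied from the table.
WALL_CLOCK = {
    "ultralytics": (491, 376),
    "mlx": (259, 289),
    "opik": (180, 283),
    "OLMo-core": (541, 884),
    "neural-steering": (583, 1069),
    "lm-evaluation-harness": (135, 289),
    "ag2": (97, 261),
    "agents": (94, 253),
    "atropos": (380, 1582),
}

# GLM wall-clock as a multiple of Anthropic's, per fork.
ratios = {fork: glm / ant for fork, (ant, glm) in WALL_CLOCK.items()}

# Hypothetical threshold: treat Anthropic runs under 100s as short-circuited.
SHORT_CIRCUIT_SECONDS = 100
short_circuit = [f for f, (ant, _) in WALL_CLOCK.items() if ant < SHORT_CIRCUIT_SECONDS]

slowest_fork = max(ratios, key=ratios.get)

if __name__ == "__main__":
    for fork, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
        print(f"{fork:24s} {ratio:.2f}x")
    print("median ratio:", round(median(ratios.values()), 2))
    print("Anthropic short-circuited on:", short_circuit)
    print("largest gap:", slowest_fork)
```

With this threshold the short-circuited set is exactly agents and ag2, and the largest ratio falls on atropos, matching the two patterns described above.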
One paired comparison hits an upstream wall: glm-5.2 on open-instruct (~50K files) returns HTTP 529 from z.ai across multiple attempts. This is not the Outrider timeout: raising `--claude-timeout` to 1500s let preflight run well past its old 180s ceiling, and z.ai still rejects. The sibling glm-4.6 run on the same paper+repo completed cleanly in 96s with Issue #4 (paper-grounded; names the OLMo-core trainer attention-mask interface as the integration blocker), so the constraint is z.ai's service capacity for glm-5.2 on a prompt this size, not anything on the Outrider side. The headline glm-5.2 vs Anthropic paired-PR result on this fork stays open until z.ai capacity recovers.
- **The two agents have different temperaments, not different competence at this kind of work.** Both produce legitimate artifacts on the same input; they differ in how willing each is to commit, how strict each is about "this isn't ready to ship," and how readily each questions the premise.
- **Pick the failure mode you'd rather absorb.** GLM-5.2 errs confident-but-occasionally-wrong; Anthropic errs careful-but-occasionally-under-engaged with the framing. Neither is universally better.
- **Routing convergence is robust enough that lightweight "pick the paper, pick the route" workflows can use either provider.** Substantive artifact quality is where the choice starts to matter.

For each fork, both the GLM-5.2 and Anthropic Outrider outputs are public:
| Fork | Paper | Anthropic | GLM-5.2 |
|---|---|---|---|
| atropos | RiVER (RL w/o ground-truth) | ||
PR #8, Issue #5, Issue #6, Issue #5, Issue #6, Issue #6, Issue #7, Issue #6, Issue #7, Issue #5, Issue #6, Issue #3, Issue #4, Issue #3, Issue #4, PR #3 (glm-5.2 unrecoverable: z.ai 529; sibling