Composer 2.5 was 6.5x cheaper on 50 real Rust and Go PRs. Here's where it's actually safe.

wpnews.pro

Composer 2.5 is 6.47x cheaper on Rust and about 7x on Go. It runs roughly twice as fast (254s vs 489s on Rust). It passes the same test gate: 44 of 50 pooled, tied with GPT-5.5, while Opus 4.8 hits 47. The test gate is saturated and doesn't separate them. Then Opus 4.8 beats Composer on craft, and the margin survives multiplicity correction in both languages. Composer ranks last on both repos.

Composer saves real money on both Rust and Go. That saving holds only for task classes where tests actually exercise the obligation: bounded, proportional changes. Parser work, planner work, and new API surface are a different class: route those to the premium model. A leaderboard won't show you that line. You find it by measuring on your own code.

Each task is a real merged PR, replayed against a frozen repo snapshot with one attempt, then graded beyond pass/fail by Stet, the local eval harness I build: is the patch behaviorally equivalent to the human's, would a reviewer accept it, how large is the diff, and is the craft there?

n=50 slice: sqlparser-rs (Rust, n=25) plus graphql-go-tools (Go, n=25), four arms, judge GPT-5.4. Composer 2.5 sits in the cheap corner and last on the repo-balanced local score (and last on craft in both repos); Opus 4.8 leads quality at materially lower cost than the other premium arms. The frontier defaults to local score, which can be reweighted with the formula just below; the Y axis can also be switched to craft. The calibrated per-pair strength is in the grid further down.

composer 2.5 comparison

cost vs local score #

Compare weighted Stet quality against rollout spend.

Opus 4.8 high: repo-balanced local score 74.4, cost per task $3.72
GPT-5.5 high: repo-balanced local score 70.0, cost per task $4.05
Opus 4.7 xhigh: repo-balanced local score 66.1, cost per task $4.74
Composer 2.5: repo-balanced local score 64.6, cost per task $0.62

Lower cost is better.

Higher local score is better.

Repo-balanced local score (the reweightable display score below), pooled across both repos. The chart's Y axis can be switched between local score, craft, code review, equivalence, tests, and footprint.

local score formula

A single repo-balanced display score: 5% tests + 30% equivalence + 25% code review + 25% craft + 15% footprint. Tests are downweighted because they saturate on this slice: every arm cleared 84% or more. This is a compact display score, not a substitute for the calibrated pairwise statistics below. The weights are adjustable, and the ranking moves with them. Opus 4.8 leads at the default weights; Composer is last unless you weight cost, which this quality score deliberately excludes, so cost stays the chart's second axis.
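As a rough illustration of how a weighted display score like this could be computed, here is a minimal sketch. It assumes every axis has already been normalized to a 0-100 scale where higher is better; the article does not say how Stet normalizes each axis, and the function name, dictionary keys, and example values are hypothetical.

```python
# Default weights quoted in the article; they are meant to sum to 1.0.
DEFAULT_WEIGHTS = {
    "tests": 0.05,
    "equivalence": 0.30,
    "code_review": 0.25,
    "craft": 0.25,
    "footprint": 0.15,
}


def local_score(axes, weights=DEFAULT_WEIGHTS):
    """Weighted average of per-axis scores.

    Assumption: every axis value is already on a 0-100 scale, higher is
    better. How Stet actually normalizes each axis is not stated.
    """
    if set(axes) != set(weights):
        raise ValueError("axes and weights must name the same dimensions")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Divide by the total so adjusted weights need not sum to exactly 1.
    return sum(weights[name] * axes[name] for name in weights) / total


# Hypothetical arm, invented for illustration: strong on tests,
# weaker on the axes that tests do not see.
example = {
    "tests": 90.0,
    "equivalence": 50.0,
    "code_review": 70.0,
    "craft": 60.0,
    "footprint": 80.0,
}
print(round(local_score(example), 1))
```

Because tests carry only 5% of the weight, a saturated test axis barely moves the result; most of the score comes from equivalence, review, and craft.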

model comparison

per-axis breakdown, all four arms

Best value in each row is highlighted. Quality rows: higher is better. Cost rows: lower is better. Composer wins both cost rows and is last or near-last on the quality rows. Opus 4.8 quality values are post-decontamination.

how confident are these comparisons?

statistical strength of each pairwise read, after correcting for multiple comparisons

Opus 4.8 vs Composer 2.5

Opus 4.8's craft lead survives BH-FDR multiplicity correction at q=0.05 in both repos (Rust 7/11 quality-family graders, Go 10/11). The clearest result in the slice.

Composer is cheaper: 6.47x on Rust (decision-grade, CI excludes 1.0); ~7x on Go (recovered from logs, directional).

Opus 4.8 vs GPT-5.5

Opus 4.8 clears BH-FDR on Rust craft against GPT-5.5 (4/11 quality-family graders at q=0.05); the Go craft gap is weak/directional.

Opus 4.8 wins on tokens (decision-grade) in both repos; cost is decision-grade on Rust (0.81x) and noise-band on Go (0.83x) after a cache-pricing fix.

Opus 4.8 vs Opus 4.7-xhigh

Opus 4.8 ahead on craft and equivalence directionally; Opus 4.7 edges code-review-overall on Rust.

Opus 4.8 is cheaper on Go (decision-grade); Rust is a statistical wash.

Decision-grade (DG) = survives BH-FDR multiplicity correction at q < 0.05. Directional = consistent across graders but doesn't clear that bar. Noise-band = indistinguishable at this sample size.

trace example

one model, two task classes, two outcomes #

Same Composer 2.5. On a bounded sqlparser-rs tokenizer change it ships clean and cheap. On a graphql-go-tools planner change it passes tests and fails review. The task class decides.

agent.patch

requires_provides.go (new)

@@ new file v2/pkg/engine/plan/requires_provides.go (+130) @@
func requiresSatisfiedByUpstreamProvides(
	parentPath string,
	dsConfig DataSource,
	nodeSuggestions *NodeSuggestions,
) bool {
	for path != "" { // walk every ancestor path
		ss, ok := providesSelectionSetForPath(path, dsConfig, nodeSuggestions)
		if ok && selectionSetCoversRequires(ss, ...) { return true }
		path = parentPathPrefix(path)
	}
	return false
}

The cost case is real #

Composer 2.5 beats Opus 4.8 on cost in Rust by a decision-grade margin: 6.47x cheaper on the geometric mean, CI excludes 1.0. On Go it points toward roughly 7x cheaper ($17.71 Composer vs $110.27 Opus over 25 tasks), recovered from raw Cursor logs. Composer is also faster on the captured Rust arm: about 254 seconds vs ~489 for Opus, roughly half the wall-clock time. Go duration was not captured, and the flattened Composer run leaves GraphQL behavior and reliability deltas unavailable.
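A cost multiple depends on how it is aggregated. The sketch below contrasts a ratio of total spend with a geometric mean of per-task ratios, the statistic quoted for Rust. Dividing the quoted Go totals gives about 6.2x, so the "roughly 7x" Go figure is presumably a per-task statistic as well, though the article does not say. The function names and the per-task costs are hypothetical; only the two Go totals come from the text.

```python
import math


def ratio_of_totals(premium_costs, cheap_costs):
    """Total premium spend divided by total cheap spend."""
    return sum(premium_costs) / sum(cheap_costs)


def geometric_mean_ratio(premium_costs, cheap_costs):
    """Geometric mean of per-task premium/cheap cost ratios, paired by task."""
    if len(premium_costs) != len(cheap_costs) or not premium_costs:
        raise ValueError("need the same non-zero number of paired tasks")
    logs = [math.log(p / c) for p, c in zip(premium_costs, cheap_costs)]
    return math.exp(sum(logs) / len(logs))


# Go totals quoted above: $110.27 (Opus 4.8) vs $17.71 (Composer), 25 tasks.
print(round(ratio_of_totals([110.27], [17.71]), 2))    # 6.23

# Hypothetical per-task costs, invented for illustration only.
premium = [1.00, 2.00, 12.00]
cheap = [0.10, 0.25, 4.00]
print(round(ratio_of_totals(premium, cheap), 2))       # 3.45
print(round(geometric_mean_ratio(premium, cheap), 2))  # 6.21
```

In the invented data one expensive task dominates the totals, so the two statistics disagree. The geometric mean weights every task equally, which makes it less sensitive to a single outlier task.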

Tests agree. Composer ties the test gate: 44/50 pooled, matching GPT-5.5 at 44/50; Opus 4.8 sits at 47/50. The gate is near-saturated and does not discriminate. This is a dataset property, not a model property. On a tighter gate, the order would shift. The cost win is genuine, and the Rust speed win is captured. The test behavior is consistent with the field.

The test gate did not catch the quality gap #

Tests can tell you whether a cheap agent satisfied the harness. They cannot reliably tell you whether the cheap pass is safe enough to keep.

graphql-go-tools #1380 is the clarifying case. Tests passed. Review killed it. The patch is non-equivalent: Composer added a new 130-line ancestor-walk helper and sprawled to +282 / −6 across 6 files, craft 2.14. On the same task Opus 4.8 stayed behaviorally equivalent in roughly half the diff, though review flagged both. Tests do not catch a wrong-but-green patch, and they do not warn you about the review gate waiting downstream.

graphql-go-tools #1260 runs the other direction. Composer sprawled to +272 / −26 across 8 files and the suite broke. Craft 2.54. The sprawl broke the build. We do not credit a "green tests" win on a task where tests failed. But the pattern is the same: local reasoning that looked complete at the model level shipped gaps the harness did not catch upstream.

Test pass barely separates anything here. Every Rust patch cleared the suite, 25 of 25. Go reached 19 of 25, passable until you ask the equivalence grader: it marks 15 of those 19 Go patches as not doing what the human did. Correct enough to pass, wrong enough to matter. The sharpest case was a patch that shipped a multi-megabyte build artifact and still cleared every test; a cheap size guardrail caught it, not the suite. I can't build a routing rule out of pass/fail alone. The signal lives above the gate, in equivalence, in review, and in craft.

Last on craft, and it replicates #

Opus 4.8 beats Composer on craft in both languages, and it replicates. On Rust, 7 of 11 quality-family graders show the lead at q=0.05; on Go, 10 of 11.

The absolute standings, directional: Rust craft means are Opus 4.8 at 3.28, Opus 4.7 at 2.98, GPT-5.5 at 2.94, Composer 2.5 at 2.84 (last). Go craft: Opus 4.8 at 2.90, GPT-5.5 at 2.72, Opus 4.7 at 2.63, Composer 2.5 at 2.48 (last). The Composer-Opus 4.8 delta is the load-bearing result. No single craft grader is individually decisive at n=25, and the detection floor (~0.34–0.49 on the 0–4 scale) is larger than most deltas. The signal is cross-grader sign consistency: when 8–11 grader dimensions lean the same way, the direction is real.
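For readers unfamiliar with the correction named above, here is a minimal sketch of the standard Benjamini-Hochberg step-up procedure. The article does not describe which tests produce the per-grader p-values or how Stet implements the correction, so the function name and the p-values below are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per hypothesis: True where it is rejected.

    Standard BH step-up: sort the p-values, find the largest rank k with
    p_(k) <= (k / m) * q, and reject every hypothesis ranked k or better.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank  # keep the largest rank that qualifies
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for five graders, invented for illustration.
p = [0.028, 0.60, 0.001, 0.035, 0.025]
print(benjamini_hochberg(p))  # [True, False, True, True, True]
```

Note the step-up behavior: 0.025 is above its own rank-2 threshold of 0.02, but it is still rejected because a larger p-value (0.035 at rank 4) clears its threshold of 0.04.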

Equivalence and code review tell the same story. Composer's equivalence on Rust is 0.80, Go 0.28; Opus 4.8 is 0.92 Rust, 0.40 Go. Code review: Opus 4.8 is 3.32 Rust, 2.76 Go; Composer 2.5 is 3.20 Rust, 2.25 Go. The craft standings replicate across both metrics and both languages.

Three buckets by task shape #

I classified Composer's 50 patches by task shape and measured outcome. Composer's safe zone has a sharp shape, but a repo-specific one. Rust had a real replacement zone in bounded dialect and tokenizer work, plus a middle band of useful drafts that still needed review. Go did not produce a clean replace-the-premium-model zone on this slice: planner and federation work kept falling into review, equivalence, or scope trouble.

Same model, opposite answer by repo. A real replacement zone in Rust dialect and tokenizer work, and none in Go planner and federation work. The zone doesn't travel between codebases, which is the whole reason you measure your own instead of trusting someone else's leaderboard.

Replacement zone (Composer is safe). These are the local dialect and tokenizer changes where the test suite exercises the obligation and the patch stays proportional. sqlparser-rs #1620: SQLite dollar-placeholder tokenizer, equivalent, code review passed, craft 3.74, +54 lines / −2 lines across 4 files. #1472: Hive ! negation vs PostgreSQL ! factorial disambiguation, equivalent, passed review, craft 3.68. #1493: JSON_TABLE FOR ORDINALITY with NESTED PATH, equivalent, passed review, craft 3.53. (Review flagged a public-AST-shape maintainability point here, and the gate held: the "keep the review gate honest" example.) The fingerprint: bounded scope, tests that cover the real obligation, proportional changes, no contract drift.

Composer-first-then-grade (Go, careful). These tasks passed tests and hit equivalent, but code review caught issues that tests did not surface. #1001 (HTTP observability): passed tests, equivalent, but code review failed because response metadata was omitted on body-read failure and request metadata was captured before redirects. #859 (planner): passed tests, equivalent, code review failed on lazy metadata caches that can go stale without invalidation. Useful first drafts. Not merge-ready on green CI.

Escalate immediately to Opus or GPT. Parser and planner behavior changes. Public API shape drift. Checked-in expected test outputs. Exact source-text fidelity. Default and null behavior. Cross-package contracts. These are the boundaries where the test gate does not reach and where Composer's craft gap bites. Composer can pass these tests and still ship code that review rejects or code that compiles but does the wrong thing downstream.

Route Composer by task shape #

The obvious routing rule dies on the oracle problem. You want to send a task to the cheap model when the cheap model would have gotten it right, but knowing that means grading the diff against the merged PR, after the fact. At dispatch you don't have the outcome. What you do have is the shape of the task: the subsystem, whether the obligation is additive or structural, and the blast radius. All of that is visible before a single token runs.
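To make the idea concrete, here is a minimal sketch of a dispatch-time router built only from the shape features named here and the three buckets above. The field names, subsystem labels, and arm names are all hypothetical, and the subsystem list reflects this one slice rather than a general rule.

```python
from dataclasses import dataclass

# Subsystems where the cheap model fell into review, equivalence, or scope
# trouble on this slice. Hypothetical labels; your own map will differ.
ESCALATE_SUBSYSTEMS = {"planner", "resolver", "federation"}


@dataclass(frozen=True)
class TaskShape:
    subsystem: str              # e.g. "tokenizer", "dialect", "planner"
    obligation: str             # "additive" or "structural" (invariant, refactor)
    wide_blast_radius: bool     # public API, defaults/nulls, golden outputs, cross-package
    review_finds_defects: bool  # does review keep finding edge defects in this class?


def route(task: TaskShape) -> str:
    """Pick an arm using only features visible before any tokens are spent."""
    if task.wide_blast_radius:
        return "premium"
    if task.subsystem in ESCALATE_SUBSYSTEMS or task.obligation == "structural":
        return "premium"
    if task.review_finds_defects:
        return "cheap-then-review"
    return "cheap"


print(route(TaskShape("tokenizer", "additive", False, False)))  # cheap
print(route(TaskShape("planner", "structural", False, True)))   # premium
```

The order of the checks matters: blast radius and obligation override everything else, so a bounded-looking task that touches a public contract still goes to the premium arm.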

That's where Stet comes in. Replay your own merged PRs, grade above the pass gate, and you get a shape-to-outcome map: where the cheap model lands, where it misfires. This post is that workflow run on 50 PRs across two repos. Four principles fell out of it.

Route by subsystem and obligation. Composer earned its place on bounded, additive grammar and tokenizer changes with one clear integration point, the #1620 shape: "add support for syntax X." It failed on planner, resolver, and federation work, and on obligations phrased as invariants or refactors ("represent X in the existing path," "preserve ordering," "consolidate the duplicated logic"), where it adds the new thing and quietly drops the constraint. That is the #1380 shape, and how the obligation is phrased predicts the failure better than the subsystem alone.

Escalate on blast radius. Public API or type changes, defaults and null behavior, checked-in golden outputs, cross-package contracts, anything spanning many files or layers: premium model, whatever the tests say.

Gate the output cheaply. A size check and a diff-proportionality check catch the worst patches without an oracle. If a 40-line change comes back as 200 lines across triple the files, something went sideways. Tests stay a floor, not the decision.

Send green-but-bounded work to human review where review keeps finding edge defects. Composer is a fast first draft there.
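The cheap output gate can be sketched in a few lines. This is an illustration under stated assumptions, not Stet's implementation: the expected size is assumed to come from some prior estimate (for example, similar merged PRs), and the function name, thresholds, and byte limit are hypothetical.

```python
def gate_patch(expected_lines, expected_files, actual_lines, actual_files,
               largest_file_bytes, max_line_ratio=3.0, max_file_ratio=2.0,
               max_file_bytes=1_000_000):
    """Return a list of reasons to hold the patch; an empty list means pass.

    All thresholds are hypothetical defaults; tune them on your own repo.
    """
    reasons = []
    # Size guardrail: catches checked-in build artifacts and similar debris.
    if largest_file_bytes > max_file_bytes:
        reasons.append("oversized file in patch")
    # Proportionality: compare the patch against the expected change size.
    if actual_lines > max_line_ratio * max(expected_lines, 1):
        reasons.append("changed lines out of proportion")
    if actual_files > max_file_ratio * max(expected_files, 1):
        reasons.append("files touched out of proportion")
    return reasons


# The example from the text: a 40-line change comes back as 200 lines
# across triple the files.
print(gate_patch(40, 2, 200, 6, 20_000))
# ['changed lines out of proportion', 'files touched out of proportion']

# A proportional patch passes.
print(gate_patch(40, 2, 56, 4, 20_000))  # []
```

A gate like this needs no reference patch, so it can run before review. It only flags patches for escalation; it does not say a passing patch is correct.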

The map keeps moving. The Rust-versus-Go split shows it doesn't transfer between codebases, and model updates plus a shifting task mix mean it won't hold still inside one repo either. So you re-measure and let routing follow the evidence. That's the difference between a leaderboard, which tells you who won last month, and a measurement layer, which tells you where to send tomorrow's task.

You can do this today by hand, before you automate anything. The automated version, classifying at dispatch and re-measuring on a schedule, is where it goes next.

What this is and isn't #

This is a local slice: 50 tasks across 2 open-source repos, run in late May 2026, judged by a separate model, GPT-5.4, blinded to model labels, single seed, one attempt per task.

This is not a general benchmark. Composite leaderboards that average across all task classes will tell you Composer ranks last. What matters is the partition: where is it safe for your code, and where does it leak cost into review? Stet is the policy layer for AI coding agents.

Composer is a cost lever inside the replacement zone and a liability outside it. The test gate is saturated, so it won't tell you which class a change belongs to; you have to measure that yourself on your own code. Default to the premium model for the hard classes (parser, planner, API shape, defaults, golden outputs), and route the cheap model only for bounded, test-covered work. The numbers here show the shape of the gap. Your numbers decide the deploy.

Are you routing by task type, or are you routing by token budget and hoping the tests catch whatever breaks?

FAQ #

Is Composer 2.5 cheaper than Opus 4.8?

Yes. On this 50-task slice Composer was 6.47x cheaper than Opus 4.8 on Rust (decision-grade, CI excludes 1.0) and about 7x cheaper on Go (recovered from raw Cursor logs, directional). Rust duration also favors Composer; Go duration was not captured.

Is Composer 2.5 good enough to replace a premium model?

It depends on the task class. Composer ties the test gate but finishes last on craft in both repos, and Opus 4.8's craft lead over it survives statistical correction in both. It is safe on bounded, test-covered changes in this local sqlparser-rs/graphql-go-tools slice and a trap on parser, planner, and public-API work.

Where is Composer 2.5 safe to use?

In this local sqlparser-rs/graphql-go-tools slice, the replacement zone was bounded local changes the test suite actually exercises, like dialect adds and tokenizer tweaks, where Composer shipped equivalent, review-passing patches cheaply. Escalate parser and planner behavior, public API shape, and checked-in expected outputs to a premium model.

Did Composer beat the premium models on anything besides cost?

It tied on test pass rate (44/50, matching GPT-5.5). On craft, code review, and equivalence it was last or near-last in both repos. The test gate is saturated on this slice, so passing tests did not separate the models.

source & further reading

stet.sh: original article

GLM 5.2 on 50 real Go and Rust PRs: last on quality, and not the cheapest

When Fable 5 Is Worth the Premium

Opus 4.8 vs Opus 4.7 vs GPT-5.5 vs Composer 2.5 - 50 Real PRs in Go and Rust