{"slug": "opus-4-7-low-vs-medium-vs-high-vs-xhigh-vs-max-the-reasoning-curve-on-29-real-an", "title": "Opus 4.7 Low Vs Medium Vs High Vs Xhigh Vs Max: the Reasoning Curve on 29 Real Tasks from an Open Source Repo", "summary": "An evaluation of Anthropic's Opus 4.7 model across five reasoning effort settings on 29 real-world tasks from an open source GraphQL repository found that medium reasoning effort produced the best results, outperforming higher settings on test pass rate, patch equivalence, code-review pass rate, and code quality. Higher reasoning settings (high, xhigh, and max) cost more time and money without improving outcomes, while low reasoning sacrificed too much correctness. The non-monotonic reasoning curve contrasts with OpenAI's GPT-5.5, which showed the expected pattern where more reasoning effort yielded better results.", "body_md": "# Opus 4.7 Low Vs Medium Vs High Vs Xhigh Vs Max: the Reasoning Curve on 29 Real Tasks from an Open Source Repo\n\nI ran Opus 4.7 in Claude Code at all reasoning effort settings (low, medium, high, xhigh, and max) on the same 29 tasks from an open source repo (GraphQL-go-tools, in Go).\n\n**On this slice, Opus 4.7 did not behave like a model where more reasoning effort had a linear correlation with more intelligence. In fact, the curve appears to peak at medium.**\n\nIf you think this is weird, I agree! This was the follow-up to a Zod run where Opus also looked non-monotonic. I reran the question on GraphQL-go-tools because I wanted a more discriminating repo slice and didn’t trust the fact that more reasoning != better outcomes. Running on the GraphQL repo helped clarify the result: Opus still did not show a simple higher-reasoning-is-better curve.\n\nThe contrast is GPT-5.5 in Codex, which overall *did* show the intuitive curve: more reasoning bought more semantic/review quality. 
That post is here: [https://www.stet.sh/blog/gpt-55-codex-graphql-reasoning-curve](https://www.stet.sh/blog/gpt-55-codex-graphql-reasoning-curve)\n\nMedium has the best test pass rate, highest equivalence with the original human-authored changes, the best code-review pass rate, and the best aggregate craft/discipline rate. Low is cheaper and faster, but it drops too much correctness. High, xhigh, and max spend more time and money without beating medium on the metrics that matter.\n\nMore reasoning effort doesn't only cost more - it changes the way Claude works, but without reliably improving judgment. Xhigh inflates the test/fixture surface most. Max is busier overall and has the largest implementation-line footprint. But even though both are supposedly thinking more, neither produces \"better\" patches than medium.\n\nOne likely reason: Opus 4.7 uses adaptive thinking - the model already picks its own reasoning budget per task, so the effort knob biases an already-adaptive policy rather than buying more intelligence. More on this below.\n\nAn illuminating example is PR #1260. After retry, medium recovered into a real patch. High and xhigh used their extra reasoning budget to dig up commit hashes from prior PRs and confidently declare \"no work needed\" - voluntarily ending the turn with no patch. Medium and max read the literal control flow and made the fix.\n\nOne broader takeaway for me: this should not have to be a one-off manual benchmark. 
If reasoning level changes the kind of patch an agent writes, the natural next step is to let the agent test and improve its own setup on real repo work.\n\n*For this post, \"equivalent\" means the patch matched the intent of the merged human PR; \"code-review pass\" means an AI reviewer judged it acceptable; craft/discipline is a 0-4 maintainability/style rubric; footprint risk is how much extra code the agent touched relative to the human patch.*\n\nThe data:\n\nClaude Opus 4.7 on GraphQL-go-tools: medium peaks across pass rate (28/29), equivalence (48%), review pass (35%), and aggregate craft/discipline. High, xhigh, and max each cost more without beating medium on any primary quality metric. The curve is non-monotonic, unlike the GPT-5.5 Codex run on the same repo.\n\nInspect-grade five-arm curve on a 29-task GraphQL-go-tools matched slice. Each arm is a candidate-arm run on the same slice; the medium-vs-high, medium-vs-xhigh, low-vs-medium, and medium-vs-max compares are stitched against the same medium baseline. Max is decision-grade for inspect/readout after targeted no-patch retry and infra repair.\n\nCost authority is the source-arm summary for each level. No-patch rows reduce publishable denominators for low, high, and xhigh, and built-in equivalence / code-review coverage is partial on those rows. Max no longer has infra/no-patch failures after targeted repair.\n\nCode-review rubric means use the flattened RubricScores from each level's source summary. 
Low has two no-patch rows that drop out of patch-intrinsic rubrics; medium recovered stet-pr-1260 after retry while high and xhigh did not.\n\nReproduce from the source summaries listed in the raw JSON; regenerate the chart data with `cd leaderboard && npm exec tsx scripts/build-opus47-graphql-reasoning-curve.mjs`.\n\nPrior Zod signal\n\nThe earlier 28-task Zod run was the reason to rerun on GraphQL: tests were flat, while equivalence and review moved around above the gate. Interesting, but not clean enough for the default-setting claim.\n\nfour arms only - no max\n\nRead: Zod made the non-monotonic Opus behavior visible first. GraphQL is the cleaner follow-up because it uses the same 29-task slice across all five Opus efforts and medium wins the behavioral table outright.\n\n## Why I Ran This\n\nAfter my last post comparing GPT-5.5 vs 5.4 vs Opus 4.7, I was curious how intra-model performance varied with reasoning effort. Doing research online, it's very hard to gauge what *actual experience* is like when varying the reasoning levels, and how that applies to the work that I'm doing.\n\nI first ran this on Zod, and the result looked strange: tests were flat across low, medium, high, and xhigh, while the above-test quality signals moved around in mixed ways. Low, medium, high, and xhigh all landed at 12/28 test passes. But equivalence moved from 10/28 on low to 16/28 on medium, 13/28 on high, and 19/28 on xhigh; code-review pass moved from 4/27 to 10/27, 10/27, and 11/27. That was interesting, but not clean enough to make a default-setting claim. It could have been a Zod-specific artifact, or a sign that Opus 4.7 does not have a simple \"turn reasoning up\" curve.\n\nSo I reran the question on GraphQL-go-tools. 
To separate vibes from reality, and figure out where the cost/performance sweet spot is for Opus 4.7, I wanted the same reasoning-effort question on a more discriminating repo slice.\n\nThis is not meant to be a universal benchmark result - I don't have the funds or time to generate statistically significant data. The purpose is closer to \"how should I choose the reasoning setting for real repo work?\", with `GraphQL-Go-Tools` as the example repo.\n\nPublic benchmarks flatten the reviewer question that most SWEs actually care about: would I actually merge the patch, and do I want to maintain it? That's why I ran this test - to gain more insight, at a small scale, into how coding agents perform on real-world tasks.\n\nTerminal-Bench consists of esoteric problems that mostly aren't encountered in day-to-day software, SWE-bench Verified is contaminated (as in models already have answers baked in), and SWE-bench Pro is useful, but generic. That is not a knock on SWE-bench or Terminal-Bench. Standardized benchmarks are useful, but they mostly answer a binary task-outcome question.\n\nThe question I care about day to day is narrower and more annoying: did the agent make the same kind of change a human merged in my codebase, and would I want to own the patch afterward?\n\n## Experimental Setup\n\nEach task is derived from a real merged PR or commit. The model gets a frozen repo snapshot, a prompt describing the change, and one attempt to produce a patch in a Docker container. 
Stet then applies the patch and runs the task's tests in an isolated container to check if it passed/failed.\n\nThen Stet grades the result beyond pass/fail:\n\n- **Equivalence:** does the candidate patch accomplish the same behavioral change as the original human patch?\n- **Code review:** would a reviewer accept the patch, considering correctness, introduced-bug risk, maintainability, and edge cases?\n- **Footprint risk:** how much additional code did the agent touch when compared with the human patch?\n- **Craft/discipline rubrics:** attempt to capture the non-correctness aspects of code. Basically, would a reviewer want to maintain this code? The categories are clarity, simplicity, coherence, intentionality, robustness, instruction adherence, scope discipline, and diff minimality.\n\nThose metrics exist because tests alone do not answer the thing I actually care about: would this patch be something I want to merge and maintain?\n\nEvery model ran once per task with a single seed. The LLM-as-a-judge model was GPT-5.4. Each patch was scored independently - the judge sees the patch and the task, and was blinded to the model/effort that produced the patch. I also manually inspected representative examples as sanity checks. There was no human calibration pass on this task set, so I would trust the direction of the deltas more than any single absolute score.\n\nAs an aside, I've also been using these evaluations as an \"autoresearch\" optimization loop, not just a benchmark. 
I tell my agent something like \"make `AGENTS.md` better for this repo\"; it proposes an edit, runs Stet on historical tasks, figures out where the candidate was better / worse and why, and iterates to improve the evaluation numbers.\n\nDetails:\n\n- Model: Opus 4.7\n- Harness: Claude Code 2.1.126-2.1.138 (varied across arms by run date; npm-installed latest at each run)\n- Dataset: 29 real GraphQL-go-tools tasks.\n  - Yes, this is small; however, running even this used most of my weekly 20x quota.\n- Main metrics:\n  - test pass\n  - semantic equivalence\n  - code-review pass\n  - footprint risk\n  - craft/discipline custom graders\n  - cost and runtime\n\n## Low: Cheaper, Shallower, and Incomplete\n\nLow appears to drive Opus 4.7 to work through most issues on a surface level. It is faster, cheaper, and lower-footprint (touching fewer files relative to the human-authored change), but misses important pieces of the task, leaving gaps in correctness.\n\nIn practice, low is superseded by medium, with just a ~26% increase in cost ($2.50 → $3.15) and noticeably better performance across the board.\n\nExample: PR #1230 fixes two GraphQL federation query-planner bugs and adds an empty-selection-set guard on the GraphQL datasource print path.\n\n- Task: tighten the planner's parent-chain selection and add the right-shape validation guard.\n- Lower-effort failure mode: low worked in the wrong boundary, inlining hand-rolled recursive AST helpers directly into `graphql_datasource.go` rather than registering a planner-scoped validation rule. The unique-node selection logic stayed eager, tests failed, and the patch was non-equivalent with the human PR.\n- Higher-effort change: medium did the same job at the right boundary - a dedicated validation rule wired into the planner's `printKitPool` - and matched the two-pass planner shape the human PR used.\n- Lesson: low does work, but at the wrong level of abstraction. 
It tends to inline behavior into the file it happens to be reading rather than picking the package boundary the task is actually about.\n\n## Medium: Balance of Restraint and Correctness\n\nMedium appears to be the level that does enough repo modeling without drifting into prior-PR rationalization, no-op stories, or oversized patch surface.\n\nIt has the best test pass count, is the most equivalent with the human patches, passes code review at a higher rate, and performs the best on the craft/discipline rubrics.\n\nWhen looking at the original Zod slice, medium improved over low, but the higher-effort signal was mixed: xhigh had the best equivalence rate, high had the best discipline average, and tests stayed flat. GraphQL is the cleaner medium-wins read.\n\nMedium spends its extra effort productively - looking at the agent trajectories, it runs more tests than high/xhigh while avoiding the bloated time/tokens from max. On this slice, medium looks like the local optimum: enough reasoning to execute the user's intent, without going down too many rabbit holes.\n\nExample: PR #1260 makes GraphQL subscription query plans include trigger metadata (subgraph name/ID, trigger query), and lets a `SkipLoader` query-plan introspection request return the plan for a subscription without opening the upstream stream.\n\n- Task: make the existing `SkipLoader` early-return reachable for plan-only requests, then surface trigger metadata in the printed plan. The repo already contained partial scaffolding from PR #1008, which is the trap. 
\n- Lower-effort failure mode: low got confused by the partial pre-existing code and asked the operator for the diff - \"I can't proceed without knowing what specifically PR #1260 changes.\" End of turn, no patch.\n- Higher-effort failure mode: high and xhigh used their extra reasoning budget to dig up commit hashes (`34cc4fa8`, `69485dfe`), conclude the feature had already been shipped in earlier PRs, and stop with `end_turn` and no patch. Not a timeout, not a refusal - a confidently-wrong no-op. Xhigh's final message: *\"This work was originally added in commit `34cc4fa8` (PR #1008) and refined by `69485dfe` (PR #1120). No code changes are needed; nothing left to implement.\"*\n- Medium's win: read the literal control flow, saw that the existing `SkipLoader` branch sat *after* a `Trigger.Source == nil` guard and was therefore unreachable for plan-only requests, and made the minimum hoist-and-extract fix. Tests passed. (Max made the same fix plus an added regression test.)\n- Lesson: on tasks where the repo already contains adjacent prior work, more reasoning amplifies the temptation to rationalize a no-op. The extra budget doesn't go into running the code - it goes into building a more sophisticated story for why running the code isn't necessary.\n\n## High: the Limits of More Thinking\n\nAt high, we begin to see signs of \"overthinking\".\n\nHigh costs $5.01/task versus medium's $3.15/task and runs 716.4s/task versus medium's 450.7s/task. It also makes more shell calls and tool calls than medium. But its pass rate falls to 26/29, equivalence falls to 12/29, review pass falls to 7/29, review-rubric mean falls to 2.509, and aggregate custom quality falls to 2.670.\n\nThat pattern suggests extra effort is not strictly adding more intelligence and discovering more correct implementation paths. 
It may be spending additional work on larger or less focused paths, with no corresponding improvement in semantic judgment.\n\nAlso note that this is still a small sample, so a rerun may slightly change the curve. The point is more practical than statistical: the observed deltas point the wrong way for a paid upgrade. Using more reasoning might actually *increase* risk by steering the model toward more complex, convoluted changes.\n\nExample: PR #1293 refactors planner/resolve metadata into a centralized `FetchInfo`, adds an opt-in `BuildFetchReasons` planner switch, replaces `KeyConditionCoordinate` with a reusable `FieldCoordinate` - **and** bumps `go.work`'s `toolchain go1.25` to `go1.25.1` (a one-character change) plus trims `--config ../.golangci.yml` from two Makefiles.\n\n- Task: a real refactor plus a small bundle of boring build-plumbing fixes.\n- Higher-effort failure mode: high, xhigh, and max all skipped `go.work` and the Makefile fixes entirely. They produced smaller, more elegant refactor-only diffs (11-13 files vs medium's 18) - but the toolchain pin stayed broken (`go1.25` is \"a language version but not a toolchain version\"), so `go test` aborted at toolchain resolution before any Go code ran. The reviewer also flagged the refactor itself as half-done - the old `RequireFetchReasons(typeName, fieldName)` API was left alive next to the new `FieldCoordinate` one.\n- Medium's win: medium produced the *largest* diff (18 files, 462+/288−) because it did the full job, including the boring one-character bump. Stet's equivalence rescue actually flagged high/xhigh/max as \"likely equivalent\" - but review wasn't a clean stylistic pass, because the refactor was half-finished.\n- Lesson: more reasoning narrowed the diff toward the \"interesting\" code and pruned away one-line build-plumbing fixes that were actually load-bearing. 
Conceptual elegance is not the same as PR scope completeness.\n\n## Xhigh: Larger Surface, Worse Results\n\nXhigh may be the most counterintuitive arm if we expect reasoning effort to monotonically improve outcomes. It's also Claude Code's default for Opus 4.7, and Anthropic’s stated “best option” for coding.\n\nIt costs $6.51/task, runs 803.8s/task, touches the most files, and has the highest test/fixture share of added lines. It adds 7,764 lines, with 47.5% in test/fixture surface. But xhigh does not run more tests than medium, does not use more tools than medium, and does not edit more iteratively than medium.\n\nAdditionally, the quality signal is weaker than medium's almost everywhere, indicating that these additional edits don't contribute to overall patch quality.\n\nInterpreting the behavior, xhigh makes more elaborate changes, with more tests, without being more correct / aligned with the original human intent. It may write more code, fixtures, or tests, but that does not consistently translate to positive outcomes.\n\nExample: PR #859 replaces O(n) linear scans in GraphQL planning hot paths (added-path lookups, datasource root/child node checks) with map-backed O(1) indexes.\n\n- Task: swap two hot-path lookups for map-backed indexes. That's it.\n- Medium's patch: 2 files, 85 added lines, both in the hot-path files the task named. Tests pass.\n- Xhigh's patch: 5 files, 263 added lines (3.1x medium) - including a brand-new 170-line `federation_metadata.go` caching interface-implementor and entity-interface membership that the task didn't ask for. Tests still pass.\n- The tradeoff: code review flipped from fail to pass on xhigh, but `footprint_risk` *degraded* from \"low\" to \"medium,\" and `scope_discipline` / `diff_minimality` moved only 0.1-0.2 points despite 3x the surface. 
The reviewer explicitly flagged the broader cached surface: *\"The patch expands beyond the minimal node/path indexes into federation metadata caching and changes multiple planner conditionals. That broader cached surface increases the chance of stale-index or semantic drift.\"*\n- Lesson: xhigh used the extra reasoning budget to invent a second-order refactor, not to write a tighter patch. More surface, similar outcome, worse footprint risk.\n\n## Max: much Busier, but Still not Better than Medium\n\nMax is a useful stress test of \"does more reasoning monotonically buy quality?\" - and the answer here is no. The max arm is decision-grade after targeted repair, but it is not a magic escape hatch from the same curve.\n\nMax ran 294 test commands vs medium's 132, made 1,153 shell calls vs 582, and produced 3,719 assistant turns vs 2,042. It also added 8,102 lines vs medium's 6,700 across patches, with the largest implementation-line footprint of any arm.\n\nBut none of that effort translated into better outcomes. Max came closest to medium on pass count (27/29 vs 28/29) but still trailed on equivalence, code-review pass, code-review rubric mean, and aggregate craft/discipline. At $8.84/task vs $3.15/task, max costs ~2.8x medium and produces roughly a third as many equivalent patches per dollar (0.051 vs 0.153).\n\nMax changed the *shape* of the work - more validation loops, more shell exploration, more implementation lines - without reliably improving the model's judgment.\n\nExample: PR #1076 is a concurrency-heavy rewrite of GraphQL subscription handling - replace shared `sync.Mutex` + `semaphore.Weighted` coordination with per-subscription serialized writer goroutines, move heartbeat ticking onto the writer path, fix WebSocket close semantics so only server-initiated close signals `updater.Done`, and enable `-race` by default in CI. 
This is the clearest showcased task where max paid off over medium.\n\n- Task: preserve a write-ordering invariant across a global concurrency refactor.\n- Lower-effort failure modes: low produced an empty patch. Medium left the old `triggerEventsSem` / shared event-loop coexisting with a new worker channel, so the `should_successfully_delete_multiple_finished_subscriptions` test failed deterministically - writes still raced teardown.\n- Xhigh's failure: equivalence-grader marked all five task obligations met (xhigh had the highest `instruction_adherence` of the bunch), but xhigh's worker dispatch used a `select / default: go func(){ ch <- f }()` overflow path that spawns unbounded goroutines and reorders writes. The same test failed for a different reason. Xhigh also edited four CI surfaces when the task only required one.\n- Max's win: max fully retired the shared coordinator like high did, *and* added a `MaxSubscriptionFetchTimeout` default plus a per-trigger shutdown wait - robustness graded `3.3` vs everyone else at `1.0-1.2`.\n- Lesson: on this slice, this is the clearest max-over-medium win, and even then it's not monotonic - xhigh elaborated itself into an unbounded-goroutine bug that medium's smaller diff didn't have room to introduce. Max wins by doing the same shared-coordinator cleanup high did, then adding extra safety guards on top. 
But this is 1 task out of 29; the other 28 tell a different story.\n\n## Craft And Discipline\n\nThe custom graders tell the same story as the headline metrics: medium leads, and more reasoning does not catch up.\n\nThe interesting split is that higher reasoning can make a patch look more deliberate without making it easier to own:\n\n- **Medium wins on the dimensions reviewers actually flag in PRs:** simplicity (3.034), robustness (2.266), instruction adherence (2.338), and scope discipline (2.934).\n- **High/xhigh/max pull ahead on intentionality and coherence** - the \"did the agent know what it was doing?\" dimensions. More reasoning makes the patch *look* more deliberate.\n- **But that deliberateness does not pay off downstream.** Scope discipline drops from 2.934 (medium) to 2.697 (xhigh). Robustness drops from 2.266 (medium) to 1.932 (xhigh). The model thinks more about what it's doing, then does more of it, and the result is harder to maintain.\n\nThat is the headline read in miniature: higher reasoning effort changes the *kind* of work, but not the quality of judgment.\n\n## Cost And Runtime\n\nCost-adjusted quality is where the story gets blunt:\n\n- Medium produces **0.153 equivalent patches per dollar.**\n- High: 0.083.\n- Xhigh: 0.058.\n- Max: 0.051.\n\nMedium is ~3x more cost-efficient at producing patches that match human intent than max. Even if max were equal to medium on quality (it isn't), it would be hard to justify the spend.\n\nUnlike the GPT-5.5 Codex curve, where each step up bought measurable quality, Opus 4.7's cost scaling buys you a busier agent, not a better one.\n\n## Why This Might Happen\n\nOne plausible explanation is adaptive reasoning - on Opus 4.7, the model is already adapting reasoning to the task on its own.\n\n[Anthropic's docs](https://platform.claude.com/docs/en/build-with-claude/adaptive-thinking) say adaptive thinking is the *only* supported mode on Opus 4.7 - fixed token budgets are no longer accepted. 
The model \"dynamically determine[s] when and how much to use extended thinking based on the complexity of each request.\" Reasoning effort influences the adaptive policy, but doesn’t cap it.\n\nThat framing fits the data here. If Claude is already picking a reasonable internal budget per task, forcing higher effort doesn't unlock new intelligence. Instead, it amplifies a policy that was already roughly right at medium. This is just a hypothesis, but it matches the observed data better than simply stating \"more tokens always buys better judgment.\"\n\nAnthropic itself acknowledges the risk. The Claude Code [model-config docs](https://code.claude.com/docs/en/model-config) warn that max \"may show diminishing returns and is prone to overthinking. Test before adopting broadly.\" Their separate [inverse-scaling research](https://alignment.anthropic.com/2025/inverse-scaling/) shows that extended reasoning can actively *deteriorate* outputs on certain task families - though that paper isn't coding-specific.\n\nIt’s worth noting that Anthropic's recommended Claude Code default for coding is xhigh, so medium winning here runs counter to their own guidance.\n\n## GPT-5.5 Contrast\n\nThe GPT-5.5 GraphQL run is the important contrast. On the same repo family, GPT-5.5 behaved much closer to the intuitive \"more reasoning buys more intelligence\" story (see [https://www.stet.sh/blog/gpt-55-codex-graphql-reasoning-curve](https://www.stet.sh/blog/gpt-55-codex-graphql-reasoning-curve)).\n\nWhen I ran the same broad experiment shape on GraphQL with GPT-5.5, equivalence, review pass, and craft/discipline quality moved strongly upward as reasoning increased. It was not perfectly monotonic on tests because xhigh lost one test pass versus high, and xhigh was much more expensive, but the above-test quality curve was mostly monotonic and very clear.\n\nOpus 4.7 did not do that on GraphQL. 
The same repo family and same kind of reasoning-effort intervention produced a different model behavior curve, one which peaked/flattened after medium.\n\n## Limitations\n\nI am not pretending that this is a statistically significant result, or that this result will carry over to your repo. That's ok - as long as we're aware that this is just one run, at one point in time, on one repo, it's still useful for thinking about our own reasoning settings.\n\nSpecific limitations / methodology gaps:\n\n- Single seed per task.\n- 29 matched real GraphQL-go-tools tasks, plus the original 28 Zod tasks as context.\n- LLM-as-judge was GPT-5.4; judge saw patch and task, but was blinded to the model/effort label.\n- No grader calibration on this task set.\n- No-patch rows reduce publishable denominators for low, high, and xhigh, and built-in equivalence / code-review coverage is partial on those rows. I treat that as part of the model/harness signal after retry, not an infra reason to discard the run.\n- Max is decision-grade for this inspect/readout, but this is still an inspect result rather than a promote result because the metrics are mixed and worse than medium on the primary dimensions.\n\n## Conclusion\n\nOn this slice, the practical answer is clear: use medium. That being said - read this as directional rather than absolute.\n\nPersonally, here's what I'll be trying moving forward:\n\n- Use medium as the daily driver for most tasks\n- Consider xhigh or max selectively for exploratory, complex, or cross-cutting tasks, then measure whether it actually helped\n\nReasoning effort clearly matters, but the curve is not smooth enough to provide a broad recommendation.\n\n**However, your results may vary. This is why teams should measure their own harnesses, on their own tasks, rather than copying global benchmark defaults.**\n\n*Disclosure: I am building Stet.sh, the local eval tool I used to run this. 
The product version is that you can ask your coding agent to improve its own setup - for example, make AGENTS.md better - and it uses Stet to test candidate changes against historical repo tasks. If your team is already using coding agents heavily and has a concrete decision in front of you - high vs xhigh, Codex vs Claude Code, an AGENTS.md update, or which tasks are safe to delegate - I am looking for a few teams to run repo-specific trials with. Stet runs entirely locally, using your LLM subscriptions. Join the waitlist at* [https://www.stet.sh/private](https://www.stet.sh/private) or reach out to me directly.\n\nData is great, but I'm also interested in anecdotal experience. How have people here been finding the behavior of Opus 4.7 at various reasoning efforts? Which one is your default? And if you have changed team defaults based on evidence instead of vibes, I especially want to hear how you measured it.", "url": "https://wpnews.pro/news/opus-4-7-low-vs-medium-vs-high-vs-xhigh-vs-max-the-reasoning-curve-on-29-real-an", "canonical_source": "https://www.stet.sh/blog/opus-47-graphql-reasoning-curve", "published_at": "2026-05-12 12:00:00+00:00", "updated_at": "2026-05-27 20:53:30.108957+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "ai-research", "ai-products", "ai-tools"], "entities": ["Opus 4.7", "Claude Code", "GPT-5.5", "Codex", "GraphQL-go-tools", "Zod"], "alternates": {"html": "https://wpnews.pro/news/opus-4-7-low-vs-medium-vs-high-vs-xhigh-vs-max-the-reasoning-curve-on-29-real-an", "markdown": "https://wpnews.pro/news/opus-4-7-low-vs-medium-vs-high-vs-xhigh-vs-max-the-reasoning-curve-on-29-real-an.md", "text": "https://wpnews.pro/news/opus-4-7-low-vs-medium-vs-high-vs-xhigh-vs-max-the-reasoning-curve-on-29-real-an.txt", "jsonld": "https://wpnews.pro/news/opus-4-7-low-vs-medium-vs-high-vs-xhigh-vs-max-the-reasoning-curve-on-29-real-an.jsonld"}}