[ARTICLE · art-34655] src=dev.to ↗ pub= topic=large-language-models verified=true sentiment=↑ positive

I Ran Claude Code on Every New Claude Model. Here's What Actually Ships.

An engineer spent a month routing all of Anthropic's 2026 Claude models—Haiku, Sonnet 4.6, Opus 4.8, Fable 5, and Mythos 5—through Claude Code across real codebases. The developer found that Sonnet 4.6 should be the default for everyday coding, Opus 4.8 excels at judgment-heavy tasks, and Fable 5 is best for long-horizon migrations. The key insight is that routing tasks to the right model doubled throughput and avoided wasted token costs.

13 min read · published Jun 20, 2026

Fable, Mythos, Opus 4.8, Sonnet 4.6, Haiku — Anthropic's 2026 lineup is no longer "one model you talk to." It's a fleet you route between. I spent a month inside Claude Code orchestrating all of them across real codebases. Here's which model to reach for, when, and the routing playbook that quietly doubled my throughput.

Last time I wrote about Claude Skills and called Claude Code the killer host for them. Since then, two things happened that changed how I work day to day.

First, the models got genuinely strange-good. In the span of a few months Anthropic shipped Sonnet 4.6, Opus 4.8, and then an entirely new tier above Opus — the Mythos class — released to the public as Claude Fable 5. We went from "the AI suggested a decent diff" to Stripe reporting that Fable 5 ran a codebase-wide migration on a 50-million-line Ruby codebase in a single day — work that would've taken a team over two months by hand.

Second, Claude Code stopped being a single-model tool. With a fleet of models at different price/speed/intelligence points, the highest-leverage skill in 2026 isn't prompting — it's routing. Knowing which model to put on which task is the difference between burning $200 of tokens on a typo fix and one-shotting a multi-service refactor.

So I did the obvious thing: I wired all of them into Claude Code and ran them against real work for a month — bug fixes, migrations, greenfield features, test suites, the boring stuff and the scary stuff. This is what I learned.

Forget "Claude" as one thing. In 2026 it's a graded ladder, and each rung exists for a reason.

| Model | Class | Sweet spot | Price (in / out per M tokens) |
|---|---|---|---|
| Haiku | Fast tier | High-volume, latency-sensitive, cheap glue work | Lowest |
| Sonnet 4.6 | Workhorse | Everyday coding, agents, 1M context | $3 / $15 |
| Opus 4.8 | Heavy lifter | Architecture, refactors, judgment-heavy work | $5 / $25 ($10 / $50 fast mode) |
| Fable 5 | Mythos-class (safe) | Long-horizon, frontier coding, vision, research | $10 / $50 |
| Mythos 5 | Mythos-class (restricted) | Cyber defense, life sciences; vetted access only | $10 / $50 |
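To make the price column concrete, here is a minimal sketch that turns the per-million-token prices into a per-run estimate. The model labels are illustrative dictionary keys, not official API identifiers, and the 2M-in / 200k-out workload is a made-up example. Haiku is left out because the table only lists it as "Lowest".

```python
# Prices in USD per million tokens (input, output), copied from the table above.
# The keys are illustrative labels, not official model identifiers.
PRICES = {
    "sonnet-4.6": (3.00, 15.00),
    "opus-4.8": (5.00, 25.00),
    "opus-4.8-fast": (10.00, 50.00),
    "fable-5": (10.00, 50.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one run on the given tier."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 2M input tokens and 200k output tokens.
    for model in PRICES:
        cost = estimate_cost(model, 2_000_000, 200_000)
        print(f"{model:14s} ${cost:.2f}")
```

On that hypothetical workload the same job costs $9 on Sonnet 4.6, $15 on Opus 4.8, and $30 on Fable 5 or Opus fast mode, which is the gap that routing is meant to exploit.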

A few things worth knowing about how these actually relate:

Here's the mental model I settled on after a month. Think of it as a triage flow:

```mermaid
flowchart TD
    A[New task] --> B{How long-horizon<br/>and how risky?}
    B -->|Quick edit, glue,<br/>bulk text| H[Haiku]
    B -->|Everyday coding,<br/>most PRs| S[Sonnet 4.6]
    B -->|Architecture, refactor,<br/>needs judgment| O[Opus 4.8]
    B -->|Multi-hour migration,<br/>frontier reasoning| F[Fable 5]
    O -->|Scale it out| D[Dynamic workflows:<br/>100s of subagents]
```

1. Start at Sonnet 4.6. Always.

This is the single most important habit. Sonnet 4.6 now benchmarks near Opus-level on the coding tasks most teams actually care about, with a 1M-token context window and a price point that makes running multiple instances in parallel economically trivial. Several teams I trust have publicly moved the majority of their traffic here. Start here, and only climb the ladder when Sonnet visibly struggles.

2. Climb to Opus 4.8 when judgment matters.

The moment a task needs taste — a cross-service refactor, an API redesign, "should we even do it this way?" — Opus 4.8 earns its premium. The standout improvement isn't raw smarts, it's honesty: Opus 4.8 is roughly four times less likely than its predecessor to let a flaw in its own code pass unremarked. It flags uncertainty instead of confidently shipping a landmine. For unattended, long-running work, that's worth more than a benchmark point.

3. Reach for Fable 5 on the long-horizon stuff.

When the task is genuinely big — a migration across hundreds of thousands of lines, rebuilding an app's source from screenshots, reasoning that spans millions of tokens — Fable 5 is the one I reach for to get past a wall. It stays focused across enormous contexts and improves its own outputs using file-based memory. It's also more token-efficient than past models, which softens the higher per-token price.

4. Drop to Haiku for the boring glue.

Bulk renames, log parsing, commit-message generation, simple codegen. Don't pay Opus prices to reformat JSON.
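The four rules above can be sketched as a small triage function. This is only one possible encoding of the playbook: the `Task` fields, the `route` function, and the model labels are hypothetical names chosen for illustration, not anything Claude Code exposes.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical description of a unit of work to be routed."""
    description: str
    bulk_or_glue: bool = False     # renames, log parsing, commit messages
    needs_judgment: bool = False   # architecture, API redesign, cross-service refactors
    long_horizon: bool = False     # multi-hour migrations, very large contexts


def route(task: Task) -> str:
    """Pick a tier following the playbook: Sonnet by default, climb or drop only when needed."""
    if task.long_horizon:
        return "fable-5"
    if task.needs_judgment:
        return "opus-4.8"
    if task.bulk_or_glue:
        return "haiku"
    return "sonnet-4.6"  # rule 1: the default


if __name__ == "__main__":
    print(route(Task("fix a null check")))
    print(route(Task("redesign the billing API", needs_judgment=True)))
```

The ordering of the checks is a design choice: the most expensive tiers are tested first so a task flagged as both long-horizon and judgment-heavy lands on the top rung rather than being undersized.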

A model fleet only pays off if the host lets you orchestrate it. Four features did the heavy lifting for me:

Launched alongside Opus 4.8, dynamic workflows let Claude plan a task and then fan out across tens to hundreds of parallel subagents in a single session — then verify its own outputs before reporting back. This is what turns "codebase-scale migration" from a slide into a Tuesday. Claude Code with Opus 4.8 can now take a six-figure-line migration from kickoff to merge, using your existing test suite as the bar. Available on Enterprise, Team, and Max plans.

Routines (shipped April 2026) let you configure a Claude Code workflow once and trigger it on a schedule, via API, or in response to an event. Nightly dependency upgrades, auto-triage of new GitHub issues, on-merge changelog generation. Pair a routine with the right model — Sonnet for triage, Opus for the actual fix — and you've replaced a pile of brittle CI scripts with one agent that improves over time.

When you're keeping "as many instances of Claude Code busy as possible" (Notion's co-founder isn't joking — that's literally the workflow now), you need a cockpit. Agent View gives you one place to manage every running session across surfaces. It's the unglamorous feature that makes parallel agent work sane.

Claude Code now opens your apps, drives your browser, and runs your dev tools to complete tasks end-to-end. Combined with Fable 5's state-of-the-art vision (it beat Pokémon FireRed from raw screenshots with a vision-only harness), the "AI that can actually operate your machine" future is quietly here.

And it meets you everywhere: terminal, VS Code / Cursor / JetBrains extensions, desktop app, web, mobile, and Slack — same agent, same context, same models, wherever you happen to be working.

The newer models expose an effort control, and it's the cheapest performance lever you have. Opus 4.8 defaults to high, but you can push it to extra (xhigh in Claude Code) or max for hard problems and long async runs. On lower effort it answers faster and sips your rate limits; on higher effort it thinks more and self-validates.

My rule: low/standard effort for interactive back-and-forth, high/extra for anything you're going to walk away from. The extra thinking pays for itself precisely when you're not watching.

There's also fast mode for Opus 4.8 — 2.5× the speed at a higher per-token cost. Great for tight interactive loops where you're paying in wall-clock attention, not just dollars.
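A rough way to reason about fast mode is to price your own attention. The sketch below uses the two figures from this article (2.5× speed, and $10 / $50 versus $5 / $25 per million tokens, i.e. double the token price). It assumes that token usage is the same in both modes and that the speedup applies to the whole wall-clock time; both are simplifications. The hourly rate is a hypothetical input.

```python
# Both constants come from the article: fast mode is 2.5x faster, and the
# pricing table lists it at double the standard Opus 4.8 token price.
FAST_SPEEDUP = 2.5
FAST_PRICE_MULTIPLIER = 2.0


def fast_mode_tradeoff(standard_minutes: float, standard_cost_usd: float,
                       hourly_rate_usd: float) -> dict:
    """Compare one run in standard mode against the same run in fast mode.

    Assumption: identical token usage in both modes, and the speedup
    applies to the entire wall-clock duration.
    """
    fast_minutes = standard_minutes / FAST_SPEEDUP
    fast_cost = standard_cost_usd * FAST_PRICE_MULTIPLIER
    time_value = (standard_minutes - fast_minutes) / 60 * hourly_rate_usd
    extra_cost = fast_cost - standard_cost_usd
    return {
        "fast_minutes": fast_minutes,
        "fast_cost_usd": fast_cost,
        "extra_cost_usd": extra_cost,
        "time_value_usd": time_value,
        "worth_it": time_value > extra_cost,
    }


if __name__ == "__main__":
    # Hypothetical run: 10 minutes and $4 in standard mode, engineer time at $120/hour.
    print(fast_mode_tradeoff(10, 4.0, 120))
```

Under these assumptions fast mode wins whenever you are actively waiting on the result, and loses for unattended runs where the saved minutes are worth nothing to you.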

Routing doesn't have to stop at Claude's borders. A few honest observations from running mixed fleets:

The takeaway isn't "Claude beats everyone." It's that multi-model routing is now a first-class engineering decision, and Claude Code is the most mature place to actually do it.

Benchmarks are fine. But what convinced me — and what I think convinces most engineers — is watching the thing land a PR you'd have spent a day on. Here are the use cases I ran (and the public results that back them up), organized by the kind of work you actually do.

The task: Migrate a large service off a deprecated framework — the kind of ticket that sits in the backlog for two quarters because nobody has a free week.

The setup: Opus 4.8 (or Fable 5 where available) + dynamic workflows, with the existing test suite as the pass/fail bar. Claude plans the migration, fans out across hundreds of parallel subagents, each handling a slice, then verifies against the tests before reporting back.

The result: Stripe reported Fable 5 performing a codebase-wide migration on a 50-million-line Ruby codebase in a single day — work estimated at two-plus months for a team by hand. In my own (far smaller) runs, a multi-thousand-file framework bump that I'd scoped at three days came back green in an afternoon, with a clean diff and a summary of every non-trivial decision.

Takeaway: Long-horizon migrations are the single highest-ROI use case for the frontier tier. The longer and more mechanical the migration, the more absurd the time savings.
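The plan, fan out, verify shape described in this use case can be illustrated with plain Python. This is not how dynamic workflows are implemented; it is a toy sketch in which `migrate_slice` stands in for a subagent and `verify` stands in for running your test suite. All function names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def plan_slices(files: List[str], slice_size: int) -> List[List[str]]:
    """Plan step: split the file list into independent slices."""
    return [files[i:i + slice_size] for i in range(0, len(files), slice_size)]


def migrate_slice(paths: List[str]) -> Dict[str, str]:
    """Placeholder for one subagent; a real worker would edit the files."""
    return {path: "migrated" for path in paths}


def verify(results: Dict[str, str], expected: List[str]) -> bool:
    """Placeholder for the test suite: every file handled and marked migrated."""
    return set(results) == set(expected) and all(
        status == "migrated" for status in results.values()
    )


def run_migration(files: List[str], slice_size: int = 50,
                  max_workers: int = 8) -> Dict[str, str]:
    merged: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Fan out: each slice is handled by its own worker in parallel.
        for partial in pool.map(migrate_slice, plan_slices(files, slice_size)):
            merged.update(partial)
    # Verify before reporting back; a failed check must not look like success.
    if not verify(merged, files):
        raise RuntimeError("verification failed; not reporting success")
    return merged


if __name__ == "__main__":
    files = [f"src/mod_{i}.rb" for i in range(120)]
    print(len(run_migration(files)), "files migrated and verified")
```

The point of the sketch is the final gate: results are merged only after the fan-out completes, and nothing is reported until the shared pass/fail bar has been checked.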

The task: Turn an exploratory notebook (pull data, train a model, eval with basic metrics) into a real, scheduled production pipeline.

The setup: Sonnet 4.6 as the driver; this is bread-and-butter work that doesn't need Opus. Point it at the notebook and your pipeline framework's conventions in CLAUDE.md.

The result: Ramp's staff engineer reported this exact workflow — notebook to Metaflow pipeline — saving 1–2 days of routine work per model. That's not a demo; that's a recurring tax on every ML engineer's week, quietly removed.

Takeaway: The boring-but-skilled translation work (notebook→pipeline, script→service, prototype→prod) is where Sonnet 4.6 pays for itself daily.

The task: A GitHub issue comes in. Read it, reproduce, write the fix, add a test, open the PR.

The setup: Claude Code's GitHub/GitLab integration. Sonnet 4.6 for triage and the common case; escalate to Opus 4.8 when the bug touches architecture or the root cause is non-obvious.

The result: This is the loop teams at GitHub, Cognition, and CodeRabbit have publicly leaned into: Sonnet 4.6 "punches way above its weight class for the vast majority of real-world PRs," with double-digit-point gains on the hardest bug-finding problems over Sonnet 4.5. In practice, most issues never reach me as anything but a PR to review.

Takeaway: Wire the cheap model to the front door, reserve the expensive model for the hard 10%. Don't pay Opus to fix a null check.
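One way to encode "cheap model at the front door, expensive model for the hard cases" is an escalation loop gated by the test suite. In this sketch, `attempt_fix` and `tests_pass` are hypothetical callables you would supply yourself (for example, wrappers around an agent run and your CI), and the ladder labels are illustrative.

```python
from typing import Callable, Optional, Tuple

# Cheapest tier first; climb only when the previous attempt fails the tests.
LADDER = ["sonnet-4.6", "opus-4.8"]


def fix_with_escalation(
    issue: str,
    attempt_fix: Callable[[str, str], str],   # (model, issue) -> patch text
    tests_pass: Callable[[str], bool],        # patch text -> did the suite pass?
) -> Optional[Tuple[str, str]]:
    """Return (model, patch) for the first tier whose patch passes, else None."""
    for model in LADDER:
        patch = attempt_fix(model, issue)
        if tests_pass(patch):
            return model, patch
    return None  # nothing passed: hand the issue to a human


if __name__ == "__main__":
    # Fake callables for demonstration: only the Opus patch "passes".
    fake_fix = lambda model, issue: f"{model}: patch for {issue}"
    fake_tests = lambda patch: patch.startswith("opus")
    print(fix_with_escalation("issue 123", fake_fix, fake_tests))
```

Returning `None` rather than the last failed patch is deliberate: an unresolved issue should surface to a person instead of arriving as a PR that looks finished.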

The task: "Here's a screenshot of the dashboard. Rebuild it." No source, no spec — just pixels.

The setup: Fable 5, the current state-of-the-art vision model. It can extract precise numbers from scientific figures and reconstruct a web app's source code from screenshots alone.

The result: Anthropic's own demo had Fable 5 beating Pokémon FireRed from raw game screenshots with a vision-only harness — something earlier Claude models couldn't do even with navigation aids. Translated to dev work: design-to-code from a Figma export or a competitor's UI screenshot, with far less hand-holding than anything before it.

Takeaway: Vision is no longer a party trick. "Rebuild this from a picture" is a real, reliable workflow now.

The task: Dependency upgrades, flaky-test triage, changelog generation — the chores that rot a codebase when ignored.

The setup: Routines. Configure once, trigger on a schedule. Sonnet 4.6 does the nightly sweep; anything genuinely broken gets escalated to an Opus 4.8 fix with a draft PR waiting in the morning.

The result: Replaced a folder of brittle cron + bash scripts with a single agent that understands why a test failed instead of just reporting that it did. The win isn't speed — it's that the maintenance actually happens now, every night, without a human remembering to do it.

Takeaway: Skills + Routines + model routing is the combo that turns "we should automate that" into "it ran at 2am."

The task: Catch the confidently-wrong bug before it ships.

The setup: Primary model writes the diff; a different model (via MCP — could be another Claude tier, GPT-5.5, or Gemini 3.5) reviews it adversarially. Opus 4.8's honesty gains help here too: it's ~4× less likely than its predecessor to let a flaw in its own code pass unremarked.

The result: Cognition reported Sonnet 4.6 "meaningfully closed the gap with Opus on bug detection," letting them run more reviewers in parallel and catch a wider variety of bugs without increasing cost. A second, independent model catches the class of mistakes self-review structurally can't.

Takeaway: Two cheap reviewers beat one expensive author. Parallel, multi-model review is now economically obvious.
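The key property of adversarial review is that the reviewer is not the author. A minimal sketch of that constraint follows; the reviewer callables are hypothetical stand-ins for whatever you wire up over MCP, and the names are illustrative.

```python
from typing import Callable, Dict, List

# A reviewer takes a diff and returns a list of findings (empty if none).
Reviewer = Callable[[str], List[str]]


def adversarial_review(diff: str, author_model: str,
                       reviewers: Dict[str, Reviewer]) -> Dict[str, List[str]]:
    """Run every reviewer except the author and collect non-empty findings."""
    independent = {
        name: review for name, review in reviewers.items() if name != author_model
    }
    if not independent:
        raise ValueError("need at least one reviewer that is not the author")
    findings: Dict[str, List[str]] = {}
    for name, review in independent.items():
        issues = review(diff)
        if issues:
            findings[name] = issues
    return findings


if __name__ == "__main__":
    reviewers = {
        "sonnet-4.6": lambda d: [],  # the author; skipped below
        "reviewer-b": lambda d: ["possible off-by-one"] if "n + 1" in d else [],
    }
    print(adversarial_review("for i in range(n + 1):", "sonnet-4.6", reviewers))
```

An empty result means no independent reviewer objected; anything else is a reason to hold the merge until a human has looked.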

| Use case | Model(s) | Reported / observed result |
|---|---|---|
| 50M-line framework migration | Fable 5 + dynamic workflows | ~2 months → 1 day (Stripe) |
| Notebook → prod pipeline | Sonnet 4.6 | 1–2 days saved per model (Ramp) |
| Issue → PR | Sonnet 4.6 → Opus 4.8 | Most issues arrive as review-ready PRs |
| Screenshot → app | Fable 5 (vision) | Source rebuilt from pixels alone |
| Nightly maintenance | Sonnet 4.6 + Routines | Chores that actually happen, unattended |
| Adversarial review | Multi-model via MCP | More bugs caught, parallel, no cost increase |

The pattern across all six: match the model to the shape of the task, let Claude Code orchestrate, and verify with tests or a second model. That's the whole game.

A few hard-won habits that separated my good weeks from my great ones:

CLAUDE.md, once. [...] legacy/." Every model in the fleet inherits it. This single file is the highest-leverage 20 minutes you'll spend.

The meta-lesson: agentic coding rewards engineers who think like tech leads. You decide what and why; the fleet handles how. The bottleneck moved from typing speed to judgment, which is exactly where you want it.

The Mythos class crossed a capability threshold that made Anthropic genuinely nervous — and they were right to be. These models excel at discovering and exploiting software vulnerabilities and at agentic hacking (recon, lateral movement, the works). That's exactly why:

For your own work, the same discipline as ever applies: sandbox agent execution, restrict file-system and network egress, review diffs before they merge, and never let an autonomous agent push to anything you can't roll back. A more capable model raises the stakes of a bad instruction, not just a good one.
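As one small example of restricting file-system access, a wrapper around agent tool calls could refuse any write that resolves outside the project workspace. This is a sketch of the idea only, not a complete sandbox (it does nothing about network egress or process isolation), and `is_write_allowed` is a hypothetical helper name.

```python
from pathlib import Path


def is_write_allowed(target: str, workspace: str) -> bool:
    """Return True only if target resolves to a path inside workspace.

    Resolving first collapses ".." segments (and any symlinks that exist),
    so traversal such as "../etc/passwd" is rejected. Absolute targets are
    checked as given.
    """
    root = Path(workspace).resolve()
    candidate = (root / target).resolve()
    return candidate == root or root in candidate.parents


if __name__ == "__main__":
    print(is_write_allowed("src/app.py", "/tmp/ws"))      # inside the workspace
    print(is_write_allowed("../etc/passwd", "/tmp/ws"))   # escapes via ".."
    print(is_write_allowed("/etc/passwd", "/tmp/ws"))     # absolute path outside
```

A check like this belongs in the layer that executes the agent's actions, so that it holds regardless of which model issued the instruction.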

Install Claude Code (one-liner):

irm https://claude.ai/install.ps1 | iex          # Windows

Pick your plan. Claude Code is bundled into Pro ($17–$20/mo), Max 5x ($100/mo), and Max 20x ($200/mo). For "keep three branches alive while I review the fourth," Max is the honest entry point.

Switch models per task. Inside a session, select the model that matches the job: Sonnet for the PR, Opus for the architecture call, Fable for the migration (where available). Use a CLAUDE.md file to encode your project's conventions once so every model inherits them.

Promote winners to Routines. Once a model-plus-workflow combo proves itself, schedule it. Nightly Sonnet-powered issue triage that escalates real bugs to an Opus fix is the kind of thing that runs while you sleep.

Wire in a second opinion via MCP. Let a different model adversarially review high-stakes diffs. Cheap insurance against confident-but-wrong.
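The "where available" caveat in the model-switching step suggests keeping an explicit fallback order for each kind of job. A minimal sketch, assuming you maintain your own set of currently accessible tiers; the labels and the `first_available` helper are hypothetical.

```python
from typing import Iterable, Set


def first_available(preferred: Iterable[str], available: Set[str]) -> str:
    """Return the first model in preference order that is currently accessible."""
    for model in preferred:
        if model in available:
            return model
    raise LookupError("no model from the preference list is available")


if __name__ == "__main__":
    # Hypothetical preference order for a long migration.
    migration_order = ["fable-5", "opus-4.8", "sonnet-4.6"]
    # Pretend the frontier tier is temporarily unavailable.
    print(first_available(migration_order, {"opus-4.8", "sonnet-4.6"}))
```

Writing the order down per job type means a frontier-tier outage degrades the workflow to the next rung instead of stopping it.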

A year ago the question was "is the AI good enough to write this code?" In 2026 the answer is yes — across an entire ladder of models, each tuned for a different shape of problem. The new skill, the one that separates a 1.2× productivity bump from a 3× one, is knowing which model to put on which task and letting Claude Code orchestrate the fleet.

Start at Sonnet 4.6. Climb to Opus 4.8 when judgment matters. Reach for Fable 5 on the long-horizon work — when you can get it. Wire in a second model for adversarial review. Promote your wins to Routines. And keep a fallback path for the frontier models, because as June 2026 reminded everyone, the most capable model is also the one most likely to get pulled out from under you for a week.

Tools give agents capability. Skills give them competence. Models give them intelligence at the right price — and Claude Code, in 2026, is where you conduct the whole orchestra.

Suraj Khaitan — Gen AI Architect | Building scalable platforms and secure cloud-native systems


Which Claude model has become your default — and what finally made you climb the ladder? Drop it in the comments. I'm always refining the routing playbook.

Sources & further reading: Anthropic's announcements for Claude Fable 5 & Mythos 5, Claude Opus 4.8, and Claude Sonnet 4.6; the Claude Code product page; and the Fable/Mythos access statement. Benchmarks and pricing reflect Anthropic's published figures as of June 2026 and are subject to change.
