cd /news/large-language-models/claude-code-costs-act-ii-where-the-b… · home topics large-language-models article
[ARTICLE · art-40376] src=dev.to ↗ pub= topic=large-language-models verified=true sentiment=· neutral

Claude Code Costs, Act II — Where the big hidden costs are

An engineer at Anthropic analyzed the hidden costs of switching models in Claude Code sessions, finding that model switches invalidate prompt caches, leading to duplicated cache writes that can increase costs by up to 85,113 tokens in a mixed session. The analysis shows that while switching models can save costs if done correctly, accidental switches via routers or manual changes are the most expensive, and reasoning blocks are re-billed or stripped on switch, affecting behavior rather than cost.

read17 min views1 publishedJun 26, 2026

A single-model session that stays well-cached is cheap. The biggest swing in a multi-model bill comes from one move — switching models — because the prompt cache belongs to a single model. The instant you switch, the cache you already paid for is thrown away.

Whether that helps or hurts comes down to how you switch:

Our 25-turn run shows both ends: bouncing 20% of turns to Sonnet lost ~2%, while keeping the whole run on Sonnet saved ~53% (on Haiku, ~85%) — see Act III. So the goal isn't to avoid switching; it's to switch the right way. The costly version is the one that sneaks in by accident: a router that picks a model per request, or a manual swap partway through a conversation.

What about the model's prior reasoning — its thinking blocks? It's tempting to count that as a second switching cost, but mechanically it's just context: the client re-sends it every turn, switch or not (Mental model 1). A switch only changes what the new model does with it — it either re-bills the reasoning as input (a cost), or strips it before the model sees it (a behavioral change).

The strip case is the one to watch, and it's about behavior, not money. Reasoning travels as an opaque, encrypted signature: you can't read it or edit it — you can only carry it whole or lose it. If a switch drops it, the model continues without its earlier chain of thought, and may no longer behave the way the previous turns set it up to.

So this part covers the cache cost of a switch first, then what happens to reasoning across one.

A cache entry belongs to exactly one model. Another model cannot read it.

This is the rule that makes "just route the easy turns to a cheaper model" so often backfire. There's really one fundamental reason a switch can't reuse the cache you already paid for — proven below: the cache key is model-scoped. And even once you accept that, a switch costs more than a clean cold start would, because the token counts shift between models — a separate effect that compounds the bill, covered after.

It falls straight out of What the cache stores (Mental model 2). A cache hit reuses the key/value vectors a model computed for the prefix — and those vectors are produced by that model's own weights. Run the identical tokens through a different model and you get different queries, keys, and values, a different attention computation, and therefore a different KV state. So a cached entry is meaningful only to the exact model that produced it: hand Sonnet's KV cache to Haiku and it's noise. That's why the cache can't carry across a switch — not a policy choice but a consequence of attention itself; the saved state simply isn't the state the new model would have computed.

The clean proof — a byte-identical 63K-token prompt, varying only the model:

| call | model | cache_read | cache_creation | |---|---|---|---| | sonnet #1 | sonnet-4-6 | 0 | 63,422 | | sonnet #2 | sonnet-4-6 | 63,422 | 0 | | haiku #1 | haiku-4-5 | 0 | 64,031 | | haiku #2 | haiku-4-5 | 64,031 | 0 |

Sonnet's second call reads its own warm entry. Haiku's first call — same bytes — reads 0 and cold-writes its own copy. The two models cannot share an entry.

The live consequence — duplicated cache writes in a mixed session (a real 50/50 Sonnet/Haiku session, per-model totals) [measured]:

| model | requests | cache_creation | cache_read | |---|---|---|---| | sonnet | 9 | 27,374 | 536,662 | | haiku | 6 | 57,739 | 321,744 |

Haiku cold-wrote a 57,739-token duplicate of the shared prefix it could never read from Sonnet — about 85,113 total cache_creation tokens of pure duplication a single-model session would never pay. At the 2× write rate you pay the write premium

The bytes happen to differ too — diffing the system prompt across models, two lines change (the model-name line and the knowledge-cutoff line, e.g. Opus 4.7 / January 2026

Haiku 4.5 / February 2025

). But that's moot for caching: the model-scoped key already settled it. No amount of byte-matching would let Haiku read Sonnet's KV state — so don't think of the differing text as a second cause; it's the same wall.

This one isn't about the cache at all — it's a separate cost effect that rides along with every switch. Token counts shift between models, so even "the same text" bills as a different number of tokens:

Anthropic publishes no official ratio (use count_tokens

per model); it documents the 4.7/4.8/Fable tokenizer as ~1×–1.35× an older one, putting older models at roughly 0.74–1.0× of Opus.

The good news is that a switch is not cold forever:

Only the

firstcall to a given model is fully cold. Eachsubsequentcall to that model ispartially warm: it reads that model's own cache and cold-writes only the catch-up diff — the content added by intervening turns on theothermodel. Cost = one cold start + a recurring catch-up write per re-entry,nota cold start every call.

Two bounds apply: the TTL (1 hour, refreshed on each read of that model's entry) and the 20-block lookback (~7–10 turns). Beyond the lookback the tail can't re-link, but the front breakpoints (tools+system) still hit — so you re-read the system prefix warm and only cold-write the message history. At the 2× write rate, both the cold start and the catch-up writes hurt twice as much, which is exactly what flips routing economics in Act III.

⚠️ Mistake — routing per turn to save money.Bouncing between models mid-conversation pays a cold prefix write on the first switchplusa catch-up write on every re-entry. For a higher-write-rate model (Sonnet), this can costmorethan just staying on Opus.

✅ Fix— Routesticky: pick a model perconversationor persub-agentand stay there. (Act III quantifies it: 20%-Sonnet per-turn bouncingloses2.1%, while all-Sonnet stickysaves53%.)

What to do: Default to treating the model as a per-conversation decision, not a per-turn one — and if you use multiple tiers, isolate them into separate sub-agents/conversations so each keeps its own warm cache. Plenty of commercial routers and gateways will route per request for you automatically, and they can save money — but the win is workload-dependent, and (as the numbers above show) per-request routing can quietly cost more, especially for a higher-write-rate model. So don't switch one on blindly. Adopt it only once you can see the evidence on your traffic: measure cache_read

vs cache_creation

and your actual billed cost, understand why the cache is model-scoped, and confirm a real net saving before you rely on it. And weigh more than cost: a cheaper tier can carry an older knowledge cutoff (e.g. Haiku 4.5's is ~11 months behind Opus's) — a behavioral difference that routing-for-price quietly inherits.

Act I established that the client re-sends everything each turn, including the model's prior reasoning — the thinking blocks. Carrying them is ordinary context, but a switch forces a choice with two kinds of consequence: they're either re-rendered into the target model's prompt and billed as input (a cost), or stripped before they get there (a behavioral change — the new model loses the prior chain of thought), depending on the target model's class. To weigh either, you first need to know what a thinking block is.

If you haven't worked with reasoning models, start here; if you have, skip to Mental model 4.

A reasoning model doesn't answer immediately. Given a hard prompt, it first generates a run of intermediate tokens — working the problem out step by step — and only then writes its reply. That working-out is the model's thinking (also called reasoning, extended thinking, or chain of thought): a scratch pad, the "let me work through this" pass a person makes before answering, except the model does it by emitting tokens.

Why it does this: spending tokens on reasoning before answering measurably improves accuracy on anything multi-step — math, code, planning — where one wrong early step dooms the result. It's test-time compute — trade tokens (and latency, and money) for a better answer. Modern models use adaptive thinking: the model decides per request whether a problem is worth thinking about and how hard, so a trivial lookup gets none and a hard puzzle gets thousands of tokens.

In the response, thinking isn't blended into the answer. The reply is a list of typed content blocks, and thinking is its own block type, emitted before the text

answer — the wire keeps the model's private working-out separate from the words meant for the user. That separate block is what gets carried back each turn, and what a model switch has to make a decision about. The next question is what's inside it.

A thinking block is not readable text you're carrying around. It's a sealed envelope.

The model seals its reasoning into an encrypted

and hands you an envelope you can't open. You carry it back each turn (stateless —signature

youhold it, not the server). The server has the key: it decrypts the signature to reconstruct the reasoning for the model. The result isprivate(you can't read it),stateless(the content rides in your request), andcontinuous(the server reconstructs it each turn).

It wasn't always sealed. The first generation of extended thinking handed the chain of thought back as plain, readable text — you got the model's working-out verbatim and could log it, diff it, even hand-edit it before resending. That openness is gone. Current Claude 4.x returns reasoning only in protected form: a summary written by a separate model, or nothing but the encrypted signature. The motive is anti-distillation — a raw chain of thought is exactly the training signal a competitor needs to clone the reasoning into their own model, so the readable text was replaced by a signature

you can carry but not inspect. (summarized

is the protected form, not a peek behind it — see Why you can't just read the chain of thought below.)

What you can still control — and what you can't. A few knobs shape the envelope; none of them open it:

thinking

parameter (adaptive

, or enabled

with a budget_tokens

ceiling) plus the effort setting decide whether a block is produced and how long the reasoning runs. More reasoning → a bigger signature (the depth table below shows ~45× across difficulty).display

has exactly two values — "summarized"

(a paraphrase) and "omitted"

(empty text, signature only) — and the default flips by model (newer models default to omitted

). Neither returns the raw reasoning, and display

is visibility-only: it 400

. You take the envelope whole, or not at all.Carry-over: required in some places, impossible in others. Because the content is sealed and integrity-locked, several moves that were fair game when thinking was open text are now off the table:

200

— silently breaks continuity and can convert cheap cache reads into cold writes.The measured detail behind each of these — what an omitted block contains, how signature size tracks reasoning depth, how it's billed, and what survives a switch — follows.

Opus 4.7 with display:"omitted"

emits: { "type":"thinking", "thinking":"" (empty), "signature": "<360–732 chars>" }

. Nothing else. The readable thinking text is empty; the signature is the payload.

Verbatim from Anthropic's extended-thinking documentation:

"Thesignature

field still carries the encrypted full thinking for multi-turn continuity."

"The server decrypts thesignature

to reconstruct the original thinking for prompt construction."

It also enforces integrity — blocks may not be edited or reordered:

"the entire sequence of consecutivethinking

blocks must match the outputs generated by the model… you can't rearrange or modify the sequence of these blocks."

Modifying a block returns 400 invalid_request_error

(" thinking … blocks in the latest assistant message cannot be modified").

What this shows: a thinking block isn't a fixed-size tag — it grows with how hard the model actually thought. Same model and settings (Sonnet 4.6, forced thinking, display:"summarized"

), five prompts from trivial to hard. The column that matters is the last one, signature size:

| prompt | output_tokens | summary text (chars) | signature (chars) | |---|---|---|---| | trivial | 20 | 1 | 276 | | easy | 35 | 45 | 332 | | medium | 444 | 34 | 320 | | hard (12-coin puzzle) | 5,595 | 3,352 | 12,524 | | very_hard | 64 | 129 | 448 |

Read down the signature column: the hard puzzle's signature is ~45× the trivial one, and larger than the visible summary (12,524 vs 3,352 chars). So the signature carries the full thinking; the summary is just a condensation. (Adaptive thinking decides per prompt whether to think at all, which is why the jump tracks actual reasoning — note very_hard

happened to reason little — not the difficulty label.)

display:"omitted"

and you never see the text.display

.display

does not change billing.omitted

, summarized

, or full.There are only two allowed display

values, and neither exposes the raw reasoning:

"summarized"

— a "omitted"

— empty text; the signature carries the encrypted full thinking.There is no value that returns verbatim chain of thought ("In rare cases where you need access to full thinking output for Claude 4 models, contact Anthropic sales."). So switching to summarized

does not bypass anti-distillation — summarized

is the protected form.

Which models hide the thinking text out of the box: newer models default to omitted (signature only — Opus 4.7, Opus 4.8, Fable 5, Mythos 5, Mythos Preview), while

summarized

The key cost question for switching: when the next turn goes to a different model, does it re-read the previous model's thinking blocks (and bill them as input), or silently drop them? The API never tells you which — so the test is to send the same request twice, once with the blocks kept and once with them removed, and diff the prompt-token count. Costs more with them kept → rendered and billed. Identical → dropped.

One concrete run, 3 Sonnet blocks replayed to Haiku: kept = 73,252 tokens vs removed = 63,997 — a 9,255-token gap (~3,085 per block), so they were rendered into Haiku's prompt and billed. (A naive model-only swap 400

s first — Haiku rejects Sonnet's adaptive

thinking param — so the params have to be fixed before the keep-vs-removed comparison is even valid.)

Run that same keep-minus-removed diff across every Opus/Sonnet/Haiku pairing and the verdict is uniform — always rendered, never dropped. Each row is one source model's blocks replayed to one target; the number is the extra tokens billed when you keep them (positive = billed):

source blocks → target keep − removed verdict
Sonnet (visible text) Opus 4.7 +10,950 rendered & billed
Sonnet Haiku 4.5 +9,253 rendered & billed
Sonnet Sonnet (control) +9,077 rendered & billed
Opus (omitted/empty text) Sonnet 4.6 +125 rendered & billed
Opus Haiku 4.5 +308 rendered & billed
Opus Opus (control) +170 rendered & billed

The Sonnet rows are large (~3,085 per block of real visible reasoning); the Opus rows are tiny (+125 to +308) — but that gap is block size, not display: those captured Opus blocks were shallow (short signatures), not cheap because their text is empty. Measured on a deep prompt [2026-06-25]: an Opus 4.8 omitted block — zero readable text, 4,300-char signature — cost +1,522 tokens to replay (pure signature), and the same prompt on summarized Sonnet cost +6,106. The replay bill tracks serialized block size, dominated by the signature, not whether you can read it. Every number above is positive — nothing was dropped.

The Fable/Mythos exception [docs]:the Fable/Mythos family's thinking blocks aredropped (unbilled)when replayed to a different model. Not reproducible here (those models 404 on the test host), but the contrast is documented: the entire Opus/Sonnet/Haiku family replays freely.

Key distinction:omittingthe thinking text (Opus 4.7/4.8 + Fable) isnotthe same asdropping blocks cross-model(Fable/Mythos only). Encryption isorthogonalto transfer: a sealed, empty-text Opus block still replays across models — confirmed live by resuming an Opus session on Sonnet, where the sealed Opus block was carried verbatim, accepted, and billed. Caveat:carried + accepted + billedis what's measured — it doesnotprove the target model semantically reuses another model's reasoning; only that the block crosses the boundary and costs you.

There is a documented rule about when previous thinking blocks are stripped from context [docs]:

"When a non-tool-result user block is included: on Opus 4.5+ and Sonnet 4.6+, previous thinking blocks are kept; on earlier Opus/Sonnet models and all Haiku models, all previous thinking blocks are ignored and stripped from context."

So by default: stripped on all Haiku + earlier Opus/Sonnet; kept on Opus 4.5+ and Sonnet 4.6+. The trigger is a normal (non-tool-result) user turn; inside tool loops, thinking is kept either way.

Does that strip disturb the cache? Only if the model counts thinking as part of its cached prefix in the first place — and that differs by model. How to read the next table: the same warm cache, measured with thinking kept vs forcibly stripped, on each model. Identical rows = thinking was never in that model's cache key (stripping is free); a cache_read

that collapses on STRIP = it was in the key (stripping re-keys it) [measured]:

| model | variant | cache_read | cache_create | reading | |---|---|---|---|---| | Haiku 4.5 | KEEP | 64,202 | 0 | rows identical → thinking isn't in Haiku's cache… | | Haiku 4.5 | STRIP | 64,202 | 0 | …so stripping it costs nothing | | Sonnet 4.6 | KEEP | 72,499 | 15 | cache_read collapses on STRIP → thinking is in Sonnet's cache… | | Sonnet 4.6 | STRIP | 54,080 | 9,342 | …so stripping re-keys ~9.3K tokens (read drops ~18K) |

Haiku strips thinking before it reaches the cache or the bill, so a strip is free. Sonnet keeps it in the prefix, so removing it cold-rewrites that slice — on a keep-model, stripping thinking hurts.

But inside real Claude Code, neither default bites — because Claude Code always sends one field [measured]:

"context_management": {"edits": [{"type": "clear_thinking_20251015", "keep": "all"}]}

keep:"all"

overrides Haiku's default strip and forces every model to retain thinking:

| default API | inside Claude Code (keep:"all" ) | | |---|---|---| | Haiku 4.5 | strips (excluded from cache + billing) | keeps (in cache + billed) | | Sonnet 4.6 | keeps | keeps |

So in genuine Claude Code usage every model keeps thinking; the Haiku-strip behavior only appears in raw API usage that omits keep:"all"

.

If you run a proxy that rewrites or strips thinking blocks, the cost depends sharply on where the touched block sits — because everything before the edit stays cached and everything from the edit onward must be rewritten. What this shows: on a keep-model, warm the cache, then remove a single thinking block from a 13-message conversation and watch how much warm cache_read

survives, depending on the removed block's position:

| variant | block removed at | cache_read | cache lost vs baseline | |---|---|---|---| | baseline | — | 82,065 | — | | strip last | msg 11 of 13 | 81,625 | 440 | | strip first | msg 1 of 13 | 73,961 | 8,104 |

Same total prompt size either way — the loss is purely cheap cache_read

turning into expensive cache_creation

. Touch the last block and you lose almost nothing (440); touch the first and you re-key 8,104 tokens. The earlier the edit, the bigger the bill (~18× here).

⚠️ Mistake — assuming HTTP 200 means your cache survived.A 200 means "valid request,"not"cache preserved." A proxy that removes a thinking block is accepted (200) but can silently convert tens of thousands of cheap cache reads into expensive cold writes. (Editing a block's content or signature is outright rejected — 400, integrity — butremovingis tolerated and quietly costly.)

✅ Fix— Never mutate the prefix in a proxy. If you must touch messages, do surgical byte-fragment replacement on thetailonly and confirmcache_read

is unchanged on the next request.

What to do (end of Act II): Switching models and carrying thinking are the two transitions that blow up a bill. Both reward the same discipline: pick a model per conversation and leave the request body — including its thinking blocks and its interlocked thinking/effort/context-management parameters — exactly as Claude Code assembled it.

── more in #large-language-models 4 stories · sorted by recency
── more on @anthropic 3 stories trending now
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/claude-code-costs-ac…] indexed:0 read:17min 2026-06-26 ·