# Claude Code Costs, Act II — Where the big hidden costs are

> Source: <https://dev.to/sumedhbala/claude-code-costs-act-ii-where-the-big-hidden-costs-are-4gf1>
> Published: 2026-06-26 05:59:40+00:00

A single-model session that stays well-cached is cheap. The biggest swing in a multi-model bill comes from one move — **switching models** — because the prompt cache belongs to a single model. The instant you switch, the cache you already paid for is thrown away.

Whether that *helps or hurts* comes down to **how** you switch:

Our 25-turn run shows both ends: bouncing 20% of turns to Sonnet *lost* ~2%, while keeping the whole run on Sonnet *saved* ~53% (on Haiku, ~85%) — see Act III. So the goal isn't to avoid switching; it's to switch the right way. The costly version is the one that sneaks in by accident: a router that picks a model per request, or a manual swap partway through a conversation.

What about the model's prior **reasoning** — its thinking blocks? It's tempting to count that as a second switching cost, but mechanically it's *just context*: the client re-sends it every turn, switch or not (Mental model 1). A switch only changes what the *new* model does with it — it either **re-bills** the reasoning as input (a cost), or **strips** it before the model sees it (a behavioral change).

The strip case is the one to watch, and it's about *behavior*, not money. Reasoning travels as an opaque, encrypted signature: you can't read it or edit it — you can only carry it whole or lose it. If a switch drops it, the model continues *without* its earlier chain of thought, and may no longer behave the way the previous turns set it up to.

So this part covers the cache cost of a switch first, then what happens to reasoning across one.

A cache entry belongs to exactly one model. Another model cannot read it.

This is the rule that makes "just route the easy turns to a cheaper model" so often backfire. There's really **one fundamental reason** a switch can't reuse the cache you already paid for — proven below: the cache key is model-scoped. And even once you accept that, a switch costs *more* than a clean cold start would, because the **token counts shift between models** — a separate effect that compounds the bill, covered after.

It falls straight out of *What the cache stores* (Mental model 2). A cache hit reuses the **key/value vectors** a model computed for the prefix — and those vectors are produced by that model's *own weights*. Run the identical tokens through a *different* model and you get different queries, keys, and values, a different attention computation, and therefore a different KV state. So a cached entry is meaningful only to the exact model that produced it: hand Sonnet's KV cache to Haiku and it's noise. **That's why the cache can't carry across a switch** — not a policy choice but a consequence of attention itself; the saved state simply isn't the state the new model would have computed.

The clean proof — a byte-identical 63K-token prompt, varying *only* the model:

| call | model | `cache_read` |
`cache_creation` |
|---|---|---|---|
| sonnet #1 | sonnet-4-6 | 0 | 63,422 |
| sonnet #2 | sonnet-4-6 | 63,422 | 0 |
| haiku #1 | haiku-4-5 | 0 | 64,031 |
| haiku #2 | haiku-4-5 | 64,031 | 0 |

Sonnet's second call reads its own warm entry. Haiku's *first* call — same bytes — reads **0** and cold-writes its own copy. The two models cannot share an entry.

**The live consequence — duplicated cache writes in a mixed session** (a real 50/50 Sonnet/Haiku session, per-model totals) **[measured]**:

| model | requests | `cache_creation` |
`cache_read` |
|---|---|---|---|
| sonnet | 9 | 27,374 | 536,662 |
| haiku | 6 | 57,739 | 321,744 |

Haiku cold-wrote a **57,739-token duplicate** of the shared prefix it could never read from Sonnet — about **85,113 total cache_creation tokens of pure duplication** a single-model session would never pay. At the 2× write rate you pay the write premium

The bytes happen to differ too — diffing the system prompt across models, two lines change (the model-name line and the knowledge-cutoff line, e.g. `Opus 4.7 / January 2026`

→ `Haiku 4.5 / February 2025`

). But that's moot for caching: the model-scoped key already settled it. No amount of byte-matching would let Haiku read Sonnet's KV state — so don't think of the differing text as a *second* cause; it's the same wall.

This one isn't about the cache at all — it's a separate cost effect that rides along with every switch. Token counts shift between models, so even "the same text" bills as a different number of tokens:

Anthropic publishes no official ratio (use `count_tokens`

per model); it documents the 4.7/4.8/Fable tokenizer as ~1×–1.35× an older one, putting older models at roughly 0.74–1.0× of Opus.

The good news is that a switch is **not cold forever:**

Only the

firstcall to a given model is fully cold. Eachsubsequentcall to that model ispartially warm: it reads that model's own cache and cold-writes only the catch-up diff — the content added by intervening turns on theothermodel. Cost = one cold start + a recurring catch-up write per re-entry,nota cold start every call.

Two bounds apply: the TTL (1 hour, refreshed on each read of that model's entry) and the 20-block lookback (~7–10 turns). Beyond the lookback the *tail* can't re-link, but the front breakpoints (tools+system) still hit — so you re-read the system prefix warm and only cold-write the message history. At the 2× write rate, both the cold start and the catch-up writes hurt twice as much, which is exactly what flips routing economics in Act III.

⚠️ Mistake — routing per turn to save money.Bouncing between models mid-conversation pays a cold prefix write on the first switchplusa catch-up write on every re-entry. For a higher-write-rate model (Sonnet), this can costmorethan just staying on Opus.

✅ Fix— Routesticky: pick a model perconversationor persub-agentand stay there. (Act III quantifies it: 20%-Sonnet per-turn bouncingloses2.1%, while all-Sonnet stickysaves53%.)

**What to do:** Default to treating the model as a *per-conversation* decision, not a per-turn one — and if you use multiple tiers, isolate them into separate sub-agents/conversations so each keeps its own warm cache. Plenty of commercial routers and gateways will route per request for you automatically, and they *can* save money — but the win is workload-dependent, and (as the numbers above show) per-request routing can quietly cost *more*, especially for a higher-write-rate model. So don't switch one on blindly. Adopt it only once you can see the evidence on **your** traffic: measure `cache_read`

vs `cache_creation`

and your actual billed cost, understand *why* the cache is model-scoped, and confirm a real net saving before you rely on it. And weigh more than cost: a cheaper tier can carry an older knowledge cutoff (e.g. Haiku 4.5's is ~11 months behind Opus's) — a *behavioral* difference that routing-for-price quietly inherits.

Act I established that the client re-sends everything each turn, including the model's prior reasoning — the **thinking blocks.** Carrying them is ordinary context, but a switch forces a choice with two kinds of consequence: they're either re-rendered into the target model's prompt and **billed as input** (a cost), or **stripped** before they get there (a *behavioral* change — the new model loses the prior chain of thought), depending on the target model's class. To weigh either, you first need to know what a thinking block *is*.

If you haven't worked with reasoning models, start here; if you have, skip to Mental model 4.

A reasoning model doesn't answer immediately. Given a hard prompt, it first generates a run of intermediate tokens — working the problem out step by step — and only then writes its reply. That working-out is the model's **thinking** (also called *reasoning*, *extended thinking*, or *chain of thought*): a scratch pad, the "let me work through this" pass a person makes before answering, except the model does it by emitting tokens.

Why it does this: spending tokens on reasoning before answering measurably improves accuracy on anything multi-step — math, code, planning — where one wrong early step dooms the result. It's *test-time compute* — trade tokens (and latency, and money) for a better answer. Modern models use **adaptive thinking**: the model decides per request whether a problem is worth thinking about and how hard, so a trivial lookup gets none and a hard puzzle gets thousands of tokens.

In the response, thinking isn't blended into the answer. The reply is a list of typed **content blocks**, and thinking is its own block type, emitted *before* the `text`

answer — the wire keeps the model's private working-out separate from the words meant for the user. That separate block is what gets carried back each turn, and what a model switch has to make a decision about. The next question is what's inside it.

A thinking block is not readable text you're carrying around. It's a sealed envelope.

The model seals its reasoning into an encrypted

and hands you an envelope you can't open. You carry it back each turn (stateless —`signature`

youhold it, not the server). The server has the key: it decrypts the signature to reconstruct the reasoning for the model. The result isprivate(you can't read it),stateless(the content rides in your request), andcontinuous(the server reconstructs it each turn).

**It wasn't always sealed.** The first generation of extended thinking handed the chain of thought back as plain, readable text — you got the model's working-out verbatim and could log it, diff it, even hand-edit it before resending. That openness is gone. Current Claude 4.x returns reasoning only in protected form: a *summary* written by a separate model, or nothing but the encrypted signature. The motive is **anti-distillation** — a raw chain of thought is exactly the training signal a competitor needs to clone the reasoning into their own model, so the readable text was replaced by a `signature`

you can carry but not inspect. (`summarized`

is the protected form, not a peek behind it — see *Why you can't just read the chain of thought* below.)

**What you can still control — and what you can't.** A few knobs shape the envelope; none of them open it:

`thinking`

parameter (`adaptive`

, or `enabled`

with a `budget_tokens`

ceiling) plus the effort setting decide whether a block is produced and how long the reasoning runs. More reasoning → a bigger signature (the depth table below shows ~45× across difficulty).`display`

has exactly two values — `"summarized"`

(a paraphrase) and `"omitted"`

(empty text, signature only) — and the default flips by model (newer models default to `omitted`

). Neither returns the raw reasoning, and `display`

is visibility-only: it `400`

. You take the envelope whole, or not at all.**Carry-over: required in some places, impossible in others.** Because the content is sealed *and* integrity-locked, several moves that were fair game when thinking was open text are now off the table:

`200`

— silently breaks continuity and can convert cheap cache reads into cold writes.The measured detail behind each of these — what an omitted block contains, how signature size tracks reasoning depth, how it's billed, and what survives a switch — follows.

Opus 4.7 with `display:"omitted"`

emits: `{ "type":"thinking", "thinking":"" (empty), "signature": "<360–732 chars>" }`

. Nothing else. The readable thinking text is empty; the signature is the payload.

Verbatim from Anthropic's extended-thinking documentation:

"The`signature`

field still carries the encrypted full thinking for multi-turn continuity."

"The server decrypts the`signature`

to reconstruct the original thinking for prompt construction."

It also enforces **integrity** — blocks may not be edited or reordered:

"the entire sequence of consecutive`thinking`

blocks must match the outputs generated by the model… you can't rearrange or modify the sequence of these blocks."

Modifying a block returns `400 invalid_request_error`

(*" thinking … blocks in the latest assistant message cannot be modified"*).

*What this shows:* a thinking block isn't a fixed-size tag — it grows with how hard the model actually thought. Same model and settings (Sonnet 4.6, forced thinking, `display:"summarized"`

), five prompts from trivial to hard. The column that matters is the last one, signature size:

| prompt | `output_tokens` |
summary text (chars) | signature (chars) |
|---|---|---|---|
| trivial | 20 | 1 | 276 |
| easy | 35 | 45 | 332 |
| medium | 444 | 34 | 320 |
| hard (12-coin puzzle) | 5,595 | 3,352 | 12,524 |
| very_hard | 64 | 129 | 448 |

Read down the signature column: the hard puzzle's signature is **~45×** the trivial one, and **larger than the visible summary** (12,524 vs 3,352 chars). So the signature carries the *full* thinking; the summary is just a condensation. (Adaptive thinking decides per prompt whether to think at all, which is why the jump tracks *actual reasoning* — note `very_hard`

happened to reason little — not the difficulty *label*.)

`display:"omitted"`

and you never see the text.`display`

.`display`

does not change billing.`omitted`

, `summarized`

, or full.There are only **two** allowed `display`

values, and neither exposes the raw reasoning:

`"summarized"`

— a `"omitted"`

— empty text; the signature carries the encrypted full thinking.There is no value that returns verbatim chain of thought (*"In rare cases where you need access to full thinking output for Claude 4 models, contact Anthropic sales."*). So switching to `summarized`

does **not** bypass anti-distillation — `summarized`

*is* the protected form.

Which models hide the thinking text out of the box: **newer models default to omitted** (signature only — Opus 4.7, Opus 4.8, Fable 5, Mythos 5, Mythos Preview), while

`summarized`

The key cost question for switching: when the next turn goes to a *different* model, does it re-read the previous model's thinking blocks (and bill them as input), or silently drop them? The API never tells you which — so the test is to send the same request **twice**, once with the blocks kept and once with them removed, and diff the prompt-token count. Costs more with them kept → rendered and billed. Identical → dropped.

One concrete run, 3 Sonnet blocks replayed to Haiku: **kept = 73,252** tokens vs **removed = 63,997** — a 9,255-token gap (~3,085 per block), so they were rendered into Haiku's prompt and billed. (A naive model-only swap `400`

s first — Haiku rejects Sonnet's `adaptive`

thinking param — so the params have to be fixed before the keep-vs-removed comparison is even valid.)

Run that same keep-minus-removed diff across every Opus/Sonnet/Haiku pairing and the verdict is uniform — **always rendered, never dropped**. Each row is one source model's blocks replayed to one target; the number is the extra tokens billed when you keep them (positive = billed):

| source blocks | → target | keep − removed | verdict |
|---|---|---|---|
| Sonnet (visible text) | Opus 4.7 | +10,950 | rendered & billed |
| Sonnet | Haiku 4.5 | +9,253 | rendered & billed |
| Sonnet | Sonnet (control) | +9,077 | rendered & billed |
| Opus (omitted/empty text) | Sonnet 4.6 | +125 | rendered & billed |
| Opus | Haiku 4.5 | +308 | rendered & billed |
| Opus | Opus (control) | +170 | rendered & billed |

The Sonnet rows are large (~3,085 per block of real visible reasoning); the Opus rows are tiny (+125 to +308) — but that gap is **block size, not display**: those captured Opus blocks were *shallow* (short signatures), not cheap *because* their text is empty. Measured on a deep prompt **[2026-06-25]**: an Opus 4.8 omitted block — **zero** readable text, 4,300-char signature — cost **+1,522 tokens** to replay (pure signature), and the same prompt on summarized Sonnet cost +6,106. The replay bill tracks serialized block size, dominated by the signature, not whether you can read it. Every number above is positive — nothing was dropped.

The Fable/Mythos exception [docs]:the Fable/Mythos family's thinking blocks aredropped (unbilled)when replayed to a different model. Not reproducible here (those models 404 on the test host), but the contrast is documented: the entire Opus/Sonnet/Haiku family replays freely.

Key distinction:omittingthe thinking text (Opus 4.7/4.8 + Fable) isnotthe same asdropping blocks cross-model(Fable/Mythos only). Encryption isorthogonalto transfer: a sealed, empty-text Opus block still replays across models — confirmed live by resuming an Opus session on Sonnet, where the sealed Opus block was carried verbatim, accepted, and billed. Caveat:carried + accepted + billedis what's measured — it doesnotprove the target model semantically reuses another model's reasoning; only that the block crosses the boundary and costs you.

There is a documented rule about when previous thinking blocks are *stripped* from context **[docs]**:

"When a non-tool-result user block is included: on Opus 4.5+ and Sonnet 4.6+, previous thinking blocks are kept; on earlier Opus/Sonnet models and all Haiku models, all previous thinking blocks are ignored and stripped from context."

So by default: **stripped** on all Haiku + earlier Opus/Sonnet; **kept** on Opus 4.5+ and Sonnet 4.6+. The trigger is a normal (non-tool-result) user turn; inside tool loops, thinking is kept either way.

Does that strip disturb the cache? Only if the model counts thinking as part of its cached prefix in the first place — and that differs by model. *How to read the next table:* the same warm cache, measured with thinking **kept** vs forcibly **stripped**, on each model. Identical rows = thinking was never in that model's cache key (stripping is free); a `cache_read`

that collapses on STRIP = it was in the key (stripping re-keys it) **[measured]**:

| model | variant | `cache_read` |
`cache_create` |
reading |
|---|---|---|---|---|
| Haiku 4.5 | KEEP | 64,202 | 0 | rows identical → thinking isn't in Haiku's cache… |
| Haiku 4.5 | STRIP | 64,202 | 0 | …so stripping it costs nothing |
| Sonnet 4.6 | KEEP | 72,499 | 15 |
`cache_read` collapses on STRIP → thinking is in Sonnet's cache… |
| Sonnet 4.6 | STRIP | 54,080 | 9,342 | …so stripping re-keys ~9.3K tokens (read drops ~18K) |

Haiku strips thinking *before* it reaches the cache or the bill, so a strip is free. Sonnet keeps it in the prefix, so removing it cold-rewrites that slice — on a keep-model, stripping thinking *hurts*.

**But inside real Claude Code, neither default bites** — because Claude Code always sends one field **[measured]**:

```
"context_management": {"edits": [{"type": "clear_thinking_20251015", "keep": "all"}]}
```

`keep:"all"`

overrides Haiku's default strip and forces *every* model to retain thinking:

| default API | inside Claude Code (`keep:"all"` ) |
|
|---|---|---|
| Haiku 4.5 | strips (excluded from cache + billing) | keeps (in cache + billed) |
| Sonnet 4.6 | keeps | keeps |

So in genuine Claude Code usage every model keeps thinking; the Haiku-strip behavior only appears in raw API usage that omits `keep:"all"`

.

If you run a proxy that rewrites or strips thinking blocks, the cost depends sharply on *where* the touched block sits — because everything before the edit stays cached and everything from the edit onward must be rewritten. *What this shows:* on a keep-model, warm the cache, then remove a single thinking block from a 13-message conversation and watch how much warm `cache_read`

survives, depending on the removed block's position:

| variant | block removed at | `cache_read` |
cache lost vs baseline |
|---|---|---|---|
| baseline | — | 82,065 | — |
| strip last | msg 11 of 13 | 81,625 | 440 |
| strip first | msg 1 of 13 | 73,961 | 8,104 |

Same total prompt size either way — the loss is purely cheap `cache_read`

turning into expensive `cache_creation`

. Touch the **last** block and you lose almost nothing (440); touch the **first** and you re-key 8,104 tokens. The earlier the edit, the bigger the bill (~18× here).

⚠️ Mistake — assuming HTTP 200 means your cache survived.A 200 means "valid request,"not"cache preserved." A proxy that removes a thinking block is accepted (200) but can silently convert tens of thousands of cheap cache reads into expensive cold writes. (Editing a block's content or signature is outright rejected — 400, integrity — butremovingis tolerated and quietly costly.)

✅ Fix— Never mutate the prefix in a proxy. If you must touch messages, do surgical byte-fragment replacement on thetailonly and confirm`cache_read`

is unchanged on the next request.

**What to do (end of Act II):** Switching models and carrying thinking are the two transitions that blow up a bill. Both reward the same discipline: pick a model per conversation and leave the request body — including its thinking blocks and its interlocked thinking/effort/context-management parameters — exactly as Claude Code assembled it.
