{"slug": "claude-code-costs-act-ii-where-the-big-hidden-costs-are", "title": "Claude Code Costs, Act II — Where the big hidden costs are", "summary": "An engineer at Anthropic analyzed the hidden costs of switching models in Claude Code sessions, finding that model switches invalidate prompt caches, leading to duplicated cache writes that can increase costs by up to 85,113 tokens in a mixed session. The analysis shows that while switching models can save costs if done correctly, accidental switches via routers or manual changes are the most expensive, and reasoning blocks are re-billed or stripped on switch, affecting behavior rather than cost.", "body_md": "A single-model session that stays well-cached is cheap. The biggest swing in a multi-model bill comes from one move — **switching models** — because the prompt cache belongs to a single model. The instant you switch, the cache you already paid for is thrown away.\n\nWhether that *helps or hurts* comes down to **how** you switch:\n\nOur 25-turn run shows both ends: bouncing 20% of turns to Sonnet *lost* ~2%, while keeping the whole run on Sonnet *saved* ~53% (on Haiku, ~85%) — see Act III. So the goal isn't to avoid switching; it's to switch the right way. The costly version is the one that sneaks in by accident: a router that picks a model per request, or a manual swap partway through a conversation.\n\nWhat about the model's prior **reasoning** — its thinking blocks? It's tempting to count that as a second switching cost, but mechanically it's *just context*: the client re-sends it every turn, switch or not (Mental model 1). A switch only changes what the *new* model does with it — it either **re-bills** the reasoning as input (a cost), or **strips** it before the model sees it (a behavioral change).\n\nThe strip case is the one to watch, and it's about *behavior*, not money. Reasoning travels as an opaque, encrypted signature: you can't read it or edit it — you can only carry it whole or lose it. 
If a switch drops it, the model continues *without* its earlier chain of thought, and may no longer behave the way the previous turns set it up to.\n\nSo this part covers the cache cost of a switch first, then what happens to reasoning across one.\n\nA cache entry belongs to exactly one model. Another model cannot read it.\n\nThis is the rule that makes \"just route the easy turns to a cheaper model\" so often backfire. There's really **one fundamental reason** a switch can't reuse the cache you already paid for — proven below: the cache key is model-scoped. And even once you accept that, a switch costs *more* than a clean cold start would, because the **token counts shift between models** — a separate effect that compounds the bill, covered after.\n\nIt falls straight out of *What the cache stores* (Mental model 2). A cache hit reuses the **key/value vectors** a model computed for the prefix — and those vectors are produced by that model's *own weights*. Run the identical tokens through a *different* model and you get different queries, keys, and values, a different attention computation, and therefore a different KV state. So a cached entry is meaningful only to the exact model that produced it: hand Sonnet's KV cache to Haiku and it's noise. **That's why the cache can't carry across a switch** — not a policy choice but a consequence of attention itself; the saved state simply isn't the state the new model would have computed.\n\nThe clean proof — a byte-identical 63K-token prompt, varying *only* the model:\n\n| call | model | `cache_read` | `cache_creation` |\n|---|---|---|---|\n| sonnet #1 | sonnet-4-6 | 0 | 63,422 |\n| sonnet #2 | sonnet-4-6 | 63,422 | 0 |\n| haiku #1 | haiku-4-5 | 0 | 64,031 |\n| haiku #2 | haiku-4-5 | 64,031 | 0 |\n\nSonnet's second call reads its own warm entry. Haiku's *first* call — same bytes — reads **0** and cold-writes its own copy. 
The two models cannot share an entry.\n\n**The live consequence — duplicated cache writes in a mixed session** (a real 50/50 Sonnet/Haiku session, per-model totals) **[measured]**:\n\n| model | requests | `cache_creation` | `cache_read` |\n|---|---|---|---|\n| sonnet | 9 | 27,374 | 536,662 |\n| haiku | 6 | 57,739 | 321,744 |\n\nHaiku cold-wrote a **57,739-token duplicate** of the shared prefix it could never read from Sonnet, bringing the session to **85,113 total `cache_creation` tokens** (27,374 + 57,739), with Haiku's share being duplication a single-model session would never pay. At the 2× write rate you pay the write premium\n\nThe bytes happen to differ too — diffing the system prompt across models, two lines change (the model-name line and the knowledge-cutoff line, e.g. `Opus 4.7 / January 2026` → `Haiku 4.5 / February 2025`). But that's moot for caching: the model-scoped key already settled it. No amount of byte-matching would let Haiku read Sonnet's KV state — so don't think of the differing text as a *second* cause; it's the same wall.\n\nThis one isn't about the cache at all — it's a separate cost effect that rides along with every switch. Token counts shift between models, so even \"the same text\" bills as a different number of tokens:\n\nAnthropic publishes no official ratio (use `count_tokens` per model); it documents the 4.7/4.8/Fable tokenizer as ~1×–1.35× an older one, putting older models at roughly 0.74–1.0× of Opus.\n\nThe good news is that a switch is **not cold forever:**\n\nOnly the *first* call to a given model is fully cold. Each *subsequent* call to that model is *partially warm*: it reads that model's own cache and cold-writes only the catch-up diff — the content added by intervening turns on the *other* model. Cost = one cold start + a recurring catch-up write per re-entry, *not* a cold start every call.\n\nTwo bounds apply: the TTL (1 hour, refreshed on each read of that model's entry) and the 20-block lookback (~7–10 turns). 
Beyond the lookback the *tail* can't re-link, but the front breakpoints (tools+system) still hit — so you re-read the system prefix warm and only cold-write the message history. At the 2× write rate, both the cold start and the catch-up writes hurt twice as much, which is exactly what flips routing economics in Act III.\n\n**⚠️ Mistake — routing per turn to save money.** Bouncing between models mid-conversation pays a cold prefix write on the first switch *plus* a catch-up write on every re-entry. For a higher-write-rate model (Sonnet), this can cost *more* than just staying on Opus.\n\n**✅ Fix** — Route *sticky*: pick a model per *conversation* or per *sub-agent* and stay there. (Act III quantifies it: 20%-Sonnet per-turn bouncing *loses* 2.1%, while all-Sonnet sticky *saves* 53%.)\n\n**What to do:** Default to treating the model as a *per-conversation* decision, not a per-turn one — and if you use multiple tiers, isolate them into separate sub-agents/conversations so each keeps its own warm cache. Plenty of commercial routers and gateways will route per request for you automatically, and they *can* save money — but the win is workload-dependent, and (as the numbers above show) per-request routing can quietly cost *more*, especially for a higher-write-rate model. So don't switch one on blindly. Adopt it only once you can see the evidence on **your** traffic: measure `cache_read` vs `cache_creation` and your actual billed cost, understand *why* the cache is model-scoped, and confirm a real net saving before you rely on it. And weigh more than cost: a cheaper tier can carry an older knowledge cutoff (e.g. 
Haiku 4.5's is ~11 months behind Opus's) — a *behavioral* difference that routing-for-price quietly inherits.\n\nAct I established that the client re-sends everything each turn, including the model's prior reasoning — the **thinking blocks.** Carrying them is ordinary context, but a switch forces a choice with two kinds of consequence: they're either re-rendered into the target model's prompt and **billed as input** (a cost), or **stripped** before they get there (a *behavioral* change — the new model loses the prior chain of thought), depending on the target model's class. To weigh either, you first need to know what a thinking block *is*.\n\nIf you haven't worked with reasoning models, start here; if you have, skip to Mental model 4.\n\nA reasoning model doesn't answer immediately. Given a hard prompt, it first generates a run of intermediate tokens — working the problem out step by step — and only then writes its reply. That working-out is the model's **thinking** (also called *reasoning*, *extended thinking*, or *chain of thought*): a scratch pad, the \"let me work through this\" pass a person makes before answering, except the model does it by emitting tokens.\n\nWhy it does this: spending tokens on reasoning before answering measurably improves accuracy on anything multi-step — math, code, planning — where one wrong early step dooms the result. It's *test-time compute* — trade tokens (and latency, and money) for a better answer. Modern models use **adaptive thinking**: the model decides per request whether a problem is worth thinking about and how hard, so a trivial lookup gets none and a hard puzzle gets thousands of tokens.\n\nIn the response, thinking isn't blended into the answer. The reply is a list of typed **content blocks**, and thinking is its own block type, emitted *before* the `text` answer — the wire keeps the model's private working-out separate from the words meant for the user. 
That separate block is what gets carried back each turn, and what a model switch has to make a decision about. The next question is what's inside it.\n\nA thinking block is not readable text you're carrying around. It's a sealed envelope.\n\nThe model seals its reasoning into an encrypted `signature` and hands you an envelope you can't open. You carry it back each turn (stateless — *you* hold it, not the server). The server has the key: it decrypts the signature to reconstruct the reasoning for the model. The result is *private* (you can't read it), *stateless* (the content rides in your request), and *continuous* (the server reconstructs it each turn).\n\n**It wasn't always sealed.** The first generation of extended thinking handed the chain of thought back as plain, readable text — you got the model's working-out verbatim and could log it, diff it, even hand-edit it before resending. That openness is gone. Current Claude 4.x returns reasoning only in protected form: a *summary* written by a separate model, or nothing but the encrypted signature. The motive is **anti-distillation** — a raw chain of thought is exactly the training signal a competitor needs to clone the reasoning into their own model, so the readable text was replaced by a `signature` you can carry but not inspect. (`summarized` is the protected form, not a peek behind it — see *Why you can't just read the chain of thought* below.)\n\n**What you can still control — and what you can't.** A few knobs shape the envelope; none of them open it:\n\n- The `thinking` parameter (`adaptive`, or `enabled` with a `budget_tokens` ceiling) plus the effort setting decide whether a block is produced and how long the reasoning runs. More reasoning → a bigger signature (the depth table below shows ~45× across difficulty).\n- `display` has exactly two values — `\"summarized\"` (a paraphrase) and `\"omitted\"` (empty text, signature only) — and the default flips by model (newer models default to `omitted`). 
Neither returns the raw reasoning, and `display` is visibility-only: it\n\n`400`. You take the envelope whole, or not at all.\n\n**Carry-over: required in some places, impossible in others.** Because the content is sealed *and* integrity-locked, several moves that were fair game when thinking was open text are now off the table:\n\n`200` — silently breaks continuity and can convert cheap cache reads into cold writes.\n\nThe measured detail behind each of these — what an omitted block contains, how signature size tracks reasoning depth, how it's billed, and what survives a switch — follows.\n\nOpus 4.7 with `display:\"omitted\"` emits: `{ \"type\":\"thinking\", \"thinking\":\"\" (empty), \"signature\": \"<360–732 chars>\" }`. Nothing else. The readable thinking text is empty; the signature is the payload.\n\nVerbatim from Anthropic's extended-thinking documentation:\n\n\"The `signature` field still carries the encrypted full thinking for multi-turn continuity.\"\n\n\"The server decrypts the `signature` to reconstruct the original thinking for prompt construction.\"\n\nIt also enforces **integrity** — blocks may not be edited or reordered:\n\n\"the entire sequence of consecutive `thinking` blocks must match the outputs generated by the model… you can't rearrange or modify the sequence of these blocks.\"\n\nModifying a block returns `400 invalid_request_error` (*\"thinking … blocks in the latest assistant message cannot be modified\"*).\n\n*What this shows:* a thinking block isn't a fixed-size tag — it grows with how hard the model actually thought. Same model and settings (Sonnet 4.6, forced thinking, `display:\"summarized\"`), five prompts from trivial to hard. 
The column that matters is the last one, signature size:\n\n| prompt | `output_tokens` | summary text (chars) | signature (chars) |\n|---|---|---|---|\n| trivial | 20 | 1 | 276 |\n| easy | 35 | 45 | 332 |\n| medium | 444 | 34 | 320 |\n| hard (12-coin puzzle) | 5,595 | 3,352 | 12,524 |\n| very_hard | 64 | 129 | 448 |\n\nRead down the signature column: the hard puzzle's signature is **~45×** the trivial one, and **larger than the visible summary** (12,524 vs 3,352 chars). So the signature carries the *full* thinking; the summary is just a condensation. (Adaptive thinking decides per prompt whether to think at all, which is why the jump tracks *actual reasoning* — note `very_hard` happened to reason little — not the difficulty *label*.)\n\n`display` does not change billing: the charge is the same for `omitted`, `summarized`, or full, and with `display:\"omitted\"` you never see the text.\n\nThere are only **two** allowed `display` values, and neither exposes the raw reasoning:\n\n- `\"summarized\"` — a paraphrase of the reasoning.\n- `\"omitted\"` — empty text; the signature carries the encrypted full thinking.\n\nThere is no value that returns verbatim chain of thought (*\"In rare cases where you need access to full thinking output for Claude 4 models, contact Anthropic sales.\"*). So switching to `summarized` does **not** bypass anti-distillation — `summarized` *is* the protected form.\n\nWhich models hide the thinking text out of the box: **newer models default to omitted** (signature only — Opus 4.7, Opus 4.8, Fable 5, Mythos 5, Mythos Preview), while the other models default to `summarized`.\n\nThe key cost question for switching: when the next turn goes to a *different* model, does it re-read the previous model's thinking blocks (and bill them as input), or silently drop them? The API never tells you which — so the test is to send the same request **twice**, once with the blocks kept and once with them removed, and diff the prompt-token count. Costs more with them kept → rendered and billed. 
Identical → dropped.\n\nOne concrete run, 3 Sonnet blocks replayed to Haiku: **kept = 73,252** tokens vs **removed = 63,997** — a 9,255-token gap (~3,085 per block), so they were rendered into Haiku's prompt and billed. (A naive model-only swap `400`s first — Haiku rejects Sonnet's `adaptive` thinking param — so the params have to be fixed before the keep-vs-removed comparison is even valid.)\n\nRun that same keep-minus-removed diff across every Opus/Sonnet/Haiku pairing and the verdict is uniform — **always rendered, never dropped**. Each row is one source model's blocks replayed to one target; the number is the extra tokens billed when you keep them (positive = billed):\n\n| source blocks | → target | keep − removed | verdict |\n|---|---|---|---|\n| Sonnet (visible text) | Opus 4.7 | +10,950 | rendered & billed |\n| Sonnet | Haiku 4.5 | +9,253 | rendered & billed |\n| Sonnet | Sonnet (control) | +9,077 | rendered & billed |\n| Opus (omitted/empty text) | Sonnet 4.6 | +125 | rendered & billed |\n| Opus | Haiku 4.5 | +308 | rendered & billed |\n| Opus | Opus (control) | +170 | rendered & billed |\n\nThe Sonnet rows are large (~3,085 per block of real visible reasoning); the Opus rows are tiny (+125 to +308) — but that gap is **block size, not display**: those captured Opus blocks were *shallow* (short signatures), not cheap *because* their text is empty. Measured on a deep prompt **[2026-06-25]**: an Opus 4.8 omitted block — **zero** readable text, 4,300-char signature — cost **+1,522 tokens** to replay (pure signature), and the same prompt on summarized Sonnet cost +6,106. The replay bill tracks serialized block size, dominated by the signature, not whether you can read it. Every number above is positive — nothing was dropped.\n\n**The Fable/Mythos exception [docs]:** the Fable/Mythos family's thinking blocks are *dropped (unbilled)* when replayed to a different model. 
Not reproducible here (those models 404 on the test host), but the contrast is documented: the entire Opus/Sonnet/Haiku family replays freely.\n\n**Key distinction:** *omitting* the thinking text (Opus 4.7/4.8 + Fable) is *not* the same as *dropping blocks cross-model* (Fable/Mythos only). Encryption is *orthogonal* to transfer: a sealed, empty-text Opus block still replays across models — confirmed live by resuming an Opus session on Sonnet, where the sealed Opus block was carried verbatim, accepted, and billed. **Caveat:** *carried + accepted + billed* is what's measured — it does *not* prove the target model semantically reuses another model's reasoning; only that the block crosses the boundary and costs you.\n\nThere is a documented rule about when previous thinking blocks are *stripped* from context **[docs]**:\n\n\"When a non-tool-result user block is included: on Opus 4.5+ and Sonnet 4.6+, previous thinking blocks are kept; on earlier Opus/Sonnet models and all Haiku models, all previous thinking blocks are ignored and stripped from context.\"\n\nSo by default: **stripped** on all Haiku + earlier Opus/Sonnet; **kept** on Opus 4.5+ and Sonnet 4.6+. The trigger is a normal (non-tool-result) user turn; inside tool loops, thinking is kept either way.\n\nDoes that strip disturb the cache? Only if the model counts thinking as part of its cached prefix in the first place — and that differs by model. *How to read the next table:* the same warm cache, measured with thinking **kept** vs forcibly **stripped**, on each model. 
Identical rows = thinking was never in that model's cache key (stripping is free); a `cache_read` that collapses on STRIP = it was in the key (stripping re-keys it) **[measured]**:\n\n| model | variant | `cache_read` | `cache_create` | reading |\n|---|---|---|---|---|\n| Haiku 4.5 | KEEP | 64,202 | 0 | rows identical → thinking isn't in Haiku's cache… |\n| Haiku 4.5 | STRIP | 64,202 | 0 | …so stripping it costs nothing |\n| Sonnet 4.6 | KEEP | 72,499 | 15 | `cache_read` collapses on STRIP → thinking is in Sonnet's cache… |\n| Sonnet 4.6 | STRIP | 54,080 | 9,342 | …so stripping re-keys ~9.3K tokens (read drops ~18K) |\n\nHaiku strips thinking *before* it reaches the cache or the bill, so a strip is free. Sonnet keeps it in the prefix, so removing it cold-rewrites that slice — on a keep-model, stripping thinking *hurts*.\n\n**But inside real Claude Code, neither default bites** — because Claude Code always sends one field **[measured]**:\n\n```\n\"context_management\": {\"edits\": [{\"type\": \"clear_thinking_20251015\", \"keep\": \"all\"}]}\n```\n\n`keep:\"all\"` overrides Haiku's default strip and forces *every* model to retain thinking:\n\n| model | default API | inside Claude Code (`keep:\"all\"`) |\n|---|---|---|\n| Haiku 4.5 | strips (excluded from cache + billing) | keeps (in cache + billed) |\n| Sonnet 4.6 | keeps | keeps |\n\nSo in genuine Claude Code usage every model keeps thinking; the Haiku-strip behavior only appears in raw API usage that omits `keep:\"all\"`.\n\nIf you run a proxy that rewrites or strips thinking blocks, the cost depends sharply on *where* the touched block sits — because everything before the edit stays cached and everything from the edit onward must be rewritten. 
*What this shows:* on a keep-model, warm the cache, then remove a single thinking block from a 13-message conversation and watch how much warm `cache_read` survives, depending on the removed block's position:\n\n| variant | block removed at | `cache_read` | cache lost vs baseline |\n|---|---|---|---|\n| baseline | — | 82,065 | — |\n| strip last | msg 11 of 13 | 81,625 | 440 |\n| strip first | msg 1 of 13 | 73,961 | 8,104 |\n\nSame total prompt size either way — the loss is purely cheap `cache_read` turning into expensive `cache_creation`. Touch the **last** block and you lose almost nothing (440); touch the **first** and you re-key 8,104 tokens. The earlier the edit, the bigger the bill (~18× here).\n\n**⚠️ Mistake — assuming HTTP 200 means your cache survived.** A 200 means \"valid request,\" *not* \"cache preserved.\" A proxy that removes a thinking block is accepted (200) but can silently convert tens of thousands of cheap cache reads into expensive cold writes. (Editing a block's content or signature is outright rejected — 400, integrity — but *removing* is tolerated and quietly costly.)\n\n**✅ Fix** — Never mutate the prefix in a proxy. If you must touch messages, do surgical byte-fragment replacement on the *tail* only and confirm `cache_read` is unchanged on the next request.\n\n**What to do (end of Act II):** Switching models and carrying thinking are the two transitions that blow up a bill. 
Both reward the same discipline: pick a model per conversation and leave the request body — including its thinking blocks and its interlocked thinking/effort/context-management parameters — exactly as Claude Code assembled it.", "url": "https://wpnews.pro/news/claude-code-costs-act-ii-where-the-big-hidden-costs-are", "canonical_source": "https://dev.to/sumedhbala/claude-code-costs-act-ii-where-the-big-hidden-costs-are-4gf1", "published_at": "2026-06-26 05:59:40+00:00", "updated_at": "2026-06-26 06:03:41.147297+00:00", "lang": "en", "topics": ["large-language-models", "ai-products", "ai-infrastructure", "developer-tools"], "entities": ["Anthropic", "Claude Code", "Sonnet", "Haiku", "Opus"], "alternates": {"html": "https://wpnews.pro/news/claude-code-costs-act-ii-where-the-big-hidden-costs-are", "markdown": "https://wpnews.pro/news/claude-code-costs-act-ii-where-the-big-hidden-costs-are.md", "text": "https://wpnews.pro/news/claude-code-costs-act-ii-where-the-big-hidden-costs-are.txt", "jsonld": "https://wpnews.pro/news/claude-code-costs-act-ii-where-the-big-hidden-costs-are.jsonld"}}