Claude Code Costs, Act I: How the billing actually works

An engineer reverse-engineered Claude Code's billing model by pointing the client at a local proxy that logs each API request. The analysis reveals that the client can under-report cache-write costs by using a 1.25× rate instead of the published 2× rate, and that every turn is a complete HTTP request with no privileged server-side session. The guide provides four mental models for predicting and reducing costs, with all claims backed by provenance tags indicating measurement, documentation, or both.

How billing actually works, the ecosystem of options to spend less, and the mistakes that quietly cost you money: grounded in measurement, not folklore.

This guide teaches the cost model of [Claude Code](https://claude.com/claude-code) from first principles. It is written for engineers who want to predict their bill, lower it deliberately, and recognize the anti-patterns before they hit production. Every non-obvious claim carries a provenance tag, one of measured (observed on the wire), docs (from Anthropic's published documentation), or doc-confirmed (measured and matching the docs), along with the table or experiment that supports it.

The best way to use this guide: keep a Claude session open on the side as you read. When something isn't clear, paste it in and ask Claude to explain it or to dig deeper. Better still, don't take the numbers on faith: ask Claude to reproduce the experiments on your own setup, walk you through what the results mean, and show you the raw request/response logs behind each table. The provenance tags exist so you can verify everything here yourself; treat this guide as a starting point for that conversation, not the last word.

… `usage` block and tell the difference between a warm session and one that is silently rebuilding its cache every turn.

The guide is built around four mental models.
Learn them and the rest is derivable:

The guide is in four acts mapping to what you're trying to do: Act I, how billing works; Act II, where the big hidden costs are (model switching and thinking blocks); Act III, the ecosystem of options for spending less; Act IV, the consolidated mistakes catalogue and a one-page cheat sheet.

None of this relies on internal access. Claude Code honors the `ANTHROPIC_BASE_URL` environment variable, so it can be pointed at a plain-HTTP local reverse proxy that logs each `/v1/messages` request body and forwards it to `https://api.anthropic.com`. There is no TLS interception: the OAuth `Authorization: Bearer` token and `anthropic-beta` headers pass through untouched; the proxy rewrites only the JSON `model` field when an experiment calls for it. The proxy parses the streamed SSE response for the `usage` block and counts `cache_control` markers. A replay harness re-sends real captured requests verbatim, with their genuine headers, under single-variable variations, so each finding isolates one cause.

The honest constraint throughout: the API exposes no field that announces "thinking was dropped" or "your cache was busted." Every conclusion here is inferred from HTTP status codes, token-count deltas between near-identical requests, and inspection of the response. Where a claim is inferred rather than directly reported, it says so.

Measurements were taken against Claude Code 2.1.150 (OAuth auth, mid-2026) on Sonnet 4.6, Opus 4.7/4.8, and Haiku 4.5. Fable 5 and the Mythos family were gated (HTTP 404) on the test host, so claims about them are documentation-only. Prices and exact behaviors are version-specific, so re-verify them for your own models and client version.

Billing-rate caveat, stated once and used throughout. Claude Code's self-reported `total_cost_usd` (what `/cost` and the status line show) can under-report cache-write cost: it prices the 1-hour-TTL writes it sends at the 5-minute 1.25× rate.
Anthropic's published rate for a 1-hour-TTL write is 2× base input, which is the rate this guide uses everywhere. So expect your displayed session cost to read lower than the figures here. Treat the Anthropic Console as authoritative for actual billing. This gap is itself one of the mistakes in Act IV.

The entire cost model falls out of one architectural fact. Get this fact and the three cost buckets, and you can already predict most of your bill.

Claude Code is a client of Anthropic's `/v1/messages` HTTP API. It holds no privileged channel to the model and no special server-side session. Every turn is one complete HTTP request that carries the entire context the model will see: the tool definitions, the system prompt, and every message in the conversation so far. The server generates a response and then forgets everything: no session, no memory, no "conversation" object. The next turn re-sends the whole thing again, one turn longer.

That single fact (the request is the only state) drives every cost lever in this guide:

Picture a brilliant contractor with no long-term memory. Every time you consult them, you must hand over a complete dossier: the tools they're allowed to use, their standing instructions, and the entire history of the project so far. They read it, give you excellent advice, and then forget the entire engagement the instant you leave. Next time, you bring the same dossier plus today's new page. That is exactly the relationship between Claude Code and the API. Everything expensive about Claude Code follows from "the dossier gets re-read, in full, every single time."

Two requests, same session design, isolating whether the server remembers a fact across turns:

| request sent | model's answer |
|---|---|
| "my secret is 42" + acknowledgement + "what's my secret?" (the fact is in the payload) | "42" |
| only "what's my secret?" (the prior turn omitted from the payload) | "I don't have a secret number stored in memory…" |

The model knows only what the current request carries. When the prior turn is omitted from the request body, the secret is gone, because there is no session to recall it from. The "memory" you experience in a chat is an illusion maintained entirely by the client re-sending history.

It isn't just visible messages. A captured Opus continuation shows message 17 as `thinking (sig=yes, len=0)`, `text`, `tool_use (Grep)`, followed by message 18 as a `tool_result`. The thinking block is in the request body: the client is sending the model's own prior reasoning back to it. What those blocks contain, how they're billed, and what happens to them on a model switch is the whole of Act II's second half. For now: they are part of the dossier, and parts of the dossier cost money.

A subtlety worth internalizing early: resending thinking is not strictly mandatory in this configuration. Replaying the captured continuation with thinking blocks (a) kept, (b) stripped from the last assistant turn, and (c) stripped everywhere all returned HTTP 200 (measured). The server doesn't error and doesn't substitute a remembered copy, because it has none. Resending thinking is how the client gives the model reasoning continuity; it is not how the server maintains state. The server maintains no state.

Every request is assembled in this order:

`tools → system → messages`

Tool definitions sit at byte 0, the system prompt next, and the conversation last. This isn't cosmetic. It is the single most important layout decision for cost, and it follows one rule:

Stable content first, volatile content last.

You'll see why this ordering is load-bearing the moment we get to caching: the cache is a prefix match, so the things that change least must come first, or they'll keep getting invalidated by the things that change most.

One more piece of anatomy, because it decides cache behavior later.
The `messages` tier is an ordered list of messages (conversation turns, each with a role and some content), and each message's content is itself an ordered list of typed content blocks: a `text` block, a `tool_use` block (the model calling a tool), a `tool_result` block (your answer to it), a `thinking` block, an image. One message can hold many blocks: an assistant turn that fires ten parallel tools is a single message but twenty-plus blocks. Hold onto the message-vs-block distinction: the cache's backward re-link counts blocks, not messages, which is exactly what a big tool burst trips (see "Agentic tool bursts overflow the 20-block lookback").

Every request's `usage` block breaks input into three buckets, plus output. Learn what each one costs relative to base input, because that ratio is the whole game:

| Bucket (`usage.*`) | Meaning | Price vs. base input |
|---|---|---|
| `cache_read_input_tokens` | served from an existing cache entry | 0.1× |
| `cache_creation_input_tokens` | written to the cache this request | 2× (1-hour TTL) |
| `input_tokens` | processed uncached, not cached | 1× |
| `output_tokens` | generated tokens | the model's output rate |

Two facts to burn in:

Re-verify at claude.com/pricing. Cache read = 0.1× input; 1-hour cache write = 2× input; output = 5× input.

| Model | Input (1×) | Cache read (0.1×) | Cache write, 1h (2×) | Output (5×) |
|---|---|---|---|---|
| Fable 5 | $10.00 | $1.00 | $20.00 | $50.00 |
| Opus 4.8 | $5.00 | $0.50 | $10.00 | $25.00 |
| Sonnet 4.6 | $3.00 | $0.30 | $6.00 | $15.00 |
| Haiku 4.5 | $1.00 | $0.10 | $2.00 | $5.00 |

The ratios are clean and worth memorizing: Sonnet = 0.6× Opus, Haiku = 0.2× Opus, Sonnet = 3× Haiku. They'll matter when we compare routing options.

Caching is a trade: you pay a one-time expensive write (2× the input price) so that later requests get cheap reads (0.1×). Whether that trade wins depends on how many times you'll re-read the cached prefix.
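As a rough illustration of the bucket table and the caching trade above, the sketch below prices one request from its `usage` block and compares the cached and uncached cost of processing the same prefix repeatedly. It assumes the dollar figures in the price table are USD per million tokens and that every cache write uses the 1-hour TTL. The function and constant names are made up for this example and are not part of any Anthropic SDK.

```python
# USD per million base-input tokens, copied from the price table above.
# Assumption: the table's dollar figures are per million tokens.
BASE_INPUT_USD_PER_MTOK = {
    "Fable 5": 10.00,
    "Opus 4.8": 5.00,
    "Sonnet 4.6": 3.00,
    "Haiku 4.5": 1.00,
}

# Multipliers relative to base input, from the tables above.
CACHE_READ = 0.1
CACHE_WRITE_1H = 2.0
OUTPUT = 5.0


def request_cost_usd(usage: dict, model: str) -> float:
    """Estimate the cost of one request from its `usage` block."""
    weighted_tokens = (
        usage.get("input_tokens", 0)
        + CACHE_READ * usage.get("cache_read_input_tokens", 0)
        + CACHE_WRITE_1H * usage.get("cache_creation_input_tokens", 0)
        + OUTPUT * usage.get("output_tokens", 0)
    )
    return weighted_tokens * BASE_INPUT_USD_PER_MTOK[model] / 1_000_000


def prefix_cost(n_requests: int, cached: bool) -> float:
    """Cost of processing one prefix n times, in multiples of base input."""
    if n_requests < 1:
        return 0.0
    if not cached:
        return float(n_requests)
    # One 1-hour write on the first request, then a read on every later one.
    return CACHE_WRITE_1H + CACHE_READ * (n_requests - 1)


if __name__ == "__main__":
    # A hypothetical warm turn: mostly cache reads, a small write, some output.
    warm_turn = {
        "input_tokens": 1_000,
        "cache_read_input_tokens": 100_000,
        "cache_creation_input_tokens": 2_000,
        "output_tokens": 500,
    }
    print(f"warm turn on Sonnet 4.6: ${request_cost_usd(warm_turn, 'Sonnet 4.6'):.4f}")
    for n in (1, 2, 3, 10):
        print(n, round(prefix_cost(n, True), 2), prefix_cost(n, False))
```

Feeding it the `usage` blocks from your own proxy logs gives a figure to compare against the Anthropic Console, which remains the authoritative number.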
Process the same prefix over N requests, two ways:

| Strategy | Cost over N requests |
|---|---|
| No caching | N × 1× |
| Caching (1-hour TTL) | 2× + (N−1) × 0.1× |

Caching comes out ahead once 2 + 0.1(N−1) < N, i.e. from the 3rd request onward (at N = 2 the cached path still costs 2.1× against 2× uncached). Worked out:

| N | Cached | Uncached | Verdict |
|---|---|---|---|
| 3 | 2 + 0.1 + 0.1 = 2.2× | 3× | ✅ cache wins |
| 10 | 2.9× | 10× | ✅ cache wins comfortably |

Why "3rd request" and not "2nd"? The TTL sets the write price, and this is exactly where Claude Code differs from the raw API. By default Anthropic uses a 5-minute cache TTL, where a write costs only 1.25×, so caching breaks even one request sooner, on the 2nd. But the Claude Code client overrides that default and requests the 1-hour TTL on every breakpoint (measured later in this act), where a write costs 2×, pushing break-even to the 3rd request. This guide uses 2× / 1-hour throughout because that's what Claude Code actually sends.

⚠️ Mistake: paying for a cache you never read back. A wasted write (content cached but never read again) costs 2×, double what you'd have paid had you never cached it. Caching is not free insurance; it's a bet that the prefix will be processed at least three times (one write plus at least two reads). A short, one-shot interaction that ends after one or two turns can be cheaper without caching.

✅ Fix: let Claude Code's defaults stand for interactive sessions (they will be re-read many times). Only worry about this in custom harnesses that cache aggressively but terminate early.

One more corollary you'll lean on constantly: every read also refreshes the TTL. A continuously-reused prefix never expires.

What to do (Act I so far): internalize that cost = re-processing history + output you generate. The cache is the only tool against the first term, and it's a strict prefix match, which is the next mental model.

This is the model that, once you have it, makes cache behavior obvious instead of mysterious, and it starts one level down, in how the model reads your prompt at all.
The model itself, a transformer (the neural-network architecture every modern LLM, Claude included, is built on), reads your prompt token by token, left to right. A token is roughly a word or a word-piece. For every token the model computes three vectors (a query, a key, and a value), and each token attends to all the tokens before it: its result is a blend of those earlier tokens' values, weighted by how well its query matches their keys.

In plain terms: to handle each word, the model glances back over everything before it and weighs which earlier words matter, the way you look back to figure out what "it" refers to in "I poured water into the glass until it was full." It does that for every token, against every earlier token at once.

Two facts fall straight out of this, and together they explain the entire cache:

The prompt cache stores exactly those computed key/value vectors (the "KV cache") for the prefix tokens, keyed to the exact tokens that produced them. On a hit, the server loads that saved attention state instead of recomputing it and starts real work only at the first uncached token.

Every pricing rule in this guide follows from that one move. Tokens generated this turn are billed as `output_tokens` and are absent from `cache_creation`. On the next turn the output comes back as part of the request, lands in `cache_creation`, and is read warm thereafter. So a generated span is paid once as output, once more as a single cache write next turn, then cheap reads. There's no cached output-KV to reuse, only the re-encoded text. Measured: a 440-token turn-1 output had `cache_creation` of 0 that turn, then appeared as ~450 of turn-2's `cache_creation`, then rode along in turn-3's warm read. That's why output is its own, uncacheable-at-generation cost line.

Prompt caching is a strict prefix match. A cache entry is keyed on the exact tokens from position 0 up to a cut point. Change one token at position N and every cached state at position ≥ N is invalid, because each of those later key/value vectors was computed attending to the token you changed.
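The prefix rule above can be made concrete with a toy model. The sketch below gives every position a key that depends on all tokens up to and including it, then measures how much of a cached sequence a new request can reuse. It is an illustration only: the running SHA-256 hash stands in for "state computed from everything before it" and is an assumption of this example, not a description of how Anthropic's servers key their cache.

```python
import hashlib


def prefix_keys(tokens):
    """Return one key per position; key i depends on tokens[0..i]."""
    running = hashlib.sha256()
    keys = []
    for token in tokens:
        # The separator keeps ["ab", "c"] distinct from ["a", "bc"].
        running.update(token.encode("utf-8") + b"\x00")
        keys.append(running.copy().hexdigest())
    return keys


def reusable_prefix_len(cached_tokens, request_tokens):
    """Count leading positions whose keys match between cache and request."""
    reusable = 0
    for cached_key, request_key in zip(
        prefix_keys(cached_tokens), prefix_keys(request_tokens)
    ):
        if cached_key != request_key:
            break
        reusable += 1
    return reusable


if __name__ == "__main__":
    cached = ["tools", "system", "msg1", "msg2"]
    appended = cached + ["msg3"]                         # new turn at the end
    edited = ["tools", "system", "msg1-edited", "msg2"]  # change at position 2
    print(reusable_prefix_len(cached, appended))
    print(reusable_prefix_len(cached, edited))
```

Appending at the end leaves every earlier key intact, while an edit at position 2 changes the key at that position and at every later one, even though `msg2` itself is unchanged.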
The cache is therefore content-addressed, not position-addressed: the API re-hashes the leading tokens each request and looks the hash up. Identical leading bytes → hit. This is why the render order `tools → system → messages` is load-bearing: put the bytes that never change at the front, and the model reloads the prefix's attention state instead of recomputing it.

The cut point itself is a cache breakpoint: a `cache_control` marker on a block, meaning "cache everything from the start up to here." In Claude Code the client places these for you; the rest of this section is what they do.

Here's the elegant part. The last breakpoint slides forward to the newest turn on each request (Claude Code does this automatically; on the raw API you do it yourself). Then three things happen together:

In one line:

`write cost per turn ≈ new tokens since last boundary × 2× rate`

The exception is the whole story of cache-busting: a byte change inside the prefix misses at that point, and the next write must span from the change forward. Those tokens are now billed at the 2× write rate, 20 times the 0.1× read they replaced.

The slide also has a reach limit: a breakpoint only re-links to a cache entry within ~20 content blocks of it. So a single turn that appends more than ~20 blocks (a large agentic tool burst) breaks the chain and forces a cold rewrite even when nothing else changed. That failure mode, and the proxy-side fix, are in "Agentic tool bursts overflow the 20-block lookback" under "What busts the cache".
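The one-line rule and its cache-busting exception can be sketched as a small cost function. This is an idealized model under stated assumptions: it uses the guide's 0.1× read and 2× 1-hour write multipliers, it assumes the cache can be re-read right up to the changed token (in practice the reusable part may end earlier, at the last cache entry before the change), and it does not model the ~20-block lookback limit. The function name and parameters are hypothetical.

```python
CACHE_READ = 0.1      # multiple of base input, per the bucket table
CACHE_WRITE_1H = 2.0  # multiple of base input, 1-hour TTL


def turn_input_cost(prefix_tokens, new_tokens, changed_at=None):
    """Input-side cost of one turn, in base-input token equivalents.

    prefix_tokens: tokens already cached by earlier turns.
    new_tokens:    tokens appended since the last breakpoint.
    changed_at:    token offset of a change inside the prefix,
                   or None if the prefix is untouched.
    """
    if changed_at is None:
        read, write = prefix_tokens, new_tokens
    else:
        if not 0 <= changed_at <= prefix_tokens:
            raise ValueError("changed_at must lie inside the prefix")
        # Everything before the change is still a hit; everything from the
        # change forward has to be written again.
        read = changed_at
        write = (prefix_tokens - changed_at) + new_tokens
    return CACHE_READ * read + CACHE_WRITE_1H * write


if __name__ == "__main__":
    warm = turn_input_cost(100_000, 500)
    busted = turn_input_cost(100_000, 500, changed_at=0)
    print(f"warm turn:   {warm:,.0f} token-equivalents")
    print(f"busted turn: {busted:,.0f} token-equivalents")
    print(f"ratio:       {busted / warm:.1f}x")
```

With a 100,000-token prefix and a 500-token turn, the warm case is dominated by cheap reads, while a change at offset 0 turns the whole prefix into a write.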
Memorize which changes survive (✅) and which invalidate (❌) each tier:

| Change | Tools | System | Messages |
|---|---|---|---|
| Tool definitions (add/remove/reorder) | ❌ | ❌ | ❌ |
| Model switch | ❌ | ❌ | ❌ |
| `speed` / web-search / citations toggle | ✅ | ❌ | ❌ |
| System prompt content | ✅ | ❌ | ❌ |
| `tool_choice` / images / thinking toggle | ✅ | ✅ | ❌ |
| Message content | ✅ | ✅ | ❌ |

Source: Anthropic, [Prompt caching](https://platform.claude.com/docs/en/build-with-claude/prompt-caching) (cache-tiers / what-invalidates section).

The two rows that force a full rebuild, touching every tier, are the expensive ones: tool-definition changes and model switches. Because both sit at or before byte 0 of what's cached, they re-key everything, and at the 2× write rate that rebuild costs 20 times what reading the same tokens from cache at 0.1× would have.

Claude Code places the cache breakpoints for you (three of them, 1-hour TTL, proven in "Claude Code's actual caching"), so you never write `cache_control` yourself. Your cache lever isn't placing breakpoints; it's not disturbing the prefix the client already cached. The ways a real session loses it:

`@import` `--append-system-prompt`, output styles, and the like sit in the