Claude Code Costs, Act IV — The mistakes catalogue & cheat sheet A developer cataloged 11 costly mistakes in using Anthropic's prompt caching with Claude Code, including routing per turn instead of per conversation, inserting proxies that re-serialize JSON, and leaving tool surfaces unstable. The mistakes range from cache-breaking volatile tokens to trusting displayed session costs over actual billing, with fixes provided for each. This act consolidates everything into two reference artifacts: a catalogue of the mistakes each with its symptom, cost, and fix for revision, and a one-page cheat sheet. Ordered roughly by how much money they cost and how often they happen. 1. Routing per turn instead of per conversation. Symptom: a "cost-saving" router that bounces models mid-conversation. Cost: a cold prefix write on the first switch + a catch-up write on every re-entry; for Sonnet, a net loss +2.1% measured . Fix: route sticky — per conversation or per sub-agent. Where you switch matters more than whether. 2. Inserting a proxy that re-serializes JSON. Symptom: a logging/compression/routing proxy that parses and re-emits the request. Cost: even a logically-identical re-encode changes the bytes separators, escaping, key order, number formatting and drops cache hit rate to ~0 for all traffic. Fix: forward original bytes verbatim; mutate only by surgical byte-fragment replacement on the messages tail. 3. Leaving the tool/MCP surface unstable. Symptom: tool count grows across early turns async MCP connectors , or dynamic/runtime tool discovery is enabled. Cost: byte-0 churn → full cold rebuild 2× write every affected turn. Fix: --strict-mcp-config and a frozen startup tool set; a fixed dispatcher tool instead of runtime tool mutation. 4. Volatile tokens early in the prompt. Symptom: a timestamp, UUID, or request-ID near the top of the system prompt; unsorted json.dumps . Cost: every request has a unique prefix → ~0% hit rate, silently. Fix: move all volatile content after the last breakpoint; sort keys deterministically. 5. Stripping or editing thinking blocks in a proxy. Symptom: a proxy that "cleans up" or removes thinking blocks. Cost: editing → 400 integrity ; removing → 200 but a position-dependent re-key removing an early block lost 8,104 cache tokens vs 440 for a late one . Fix: never mutate thinking blocks; carry the request body as Claude Code assembled it. 6. Trusting the displayed session cost as your invoice. Symptom: using /cost or the status line as the source of truth. Cost: it prices 1-hour writes at the 5-minute 1.25× rate and reads ~10% low $1.77 vs $1.97 on a real session . Fix: use the Anthropic Console for absolute billing; use the displayed figure only for in-session relative comparison. 7. Caching a prefix you'll read fewer than 3 times. Symptom: aggressive caching in a short-lived custom harness. Cost: a wasted write is 2× — double what no caching would have cost. Fix: cache only when you'll re-read the prefix ≥3 times the break-even ; let interactive sessions cache by default. 8. Running an aggressive input compressor over a cached prefix. Symptom: LLMLingua/LLM-based compression applied to the conversational prefix. Cost: rewriting cached bytes flips reads 0.1× to writes 2× ; accuracy degrades past ~20×. Fix: use input compressors only where there's no cache to lose one-shot, RAG assembly or only on the fresh tail; prefer a cache-aware tool. 9. Agentic bursts overflowing the 20-block lookback. Symptom: a single turn emits ~10+ parallel tool calls ≈21+ tool-call/result blocks . Cost: the next turn can't re-link the cache and cold-writes the gap. Fix: in Claude Code, bound the burst; behind a proxy, a stateless breakpoint grid stride ~18, 4 markers rescues up to ~74 blocks/turn. 10. A cache breakpoint below the minimum prefix. Symptom: marking a sub-1,024-token prefix for caching. Cost: the marker silently does nothing cache creation: 0 ; you think caching is on. Fix: confirm non-zero cache creation then non-zero cache read on the following request. 11. Assuming HTTP 200 means the cache survived. Symptom: judging a proxy change by "it still works." Cost: 200 means "valid request," not "cache preserved" — you can silently convert cheap reads into expensive writes. Fix: validate every prefix-touching change by watching cache read on the next request, not just the status code. 12. Switching Haiku → Sonnet mid-conversation. Symptom: escalating an in-flight Haiku conversation to Sonnet. Cost: stacks three things — full cache miss model-scoped , Sonnet now keeps-and-bills thinking Haiku had been ignoring, and any prefix-leanness benefit evaporates. Fix: decide the tier at conversation start; if you must escalate, accept it as a fresh cold start, not a cheap continuation. The four mental models signature ; you carry it, can't open it, can't edit it. The numbers total cost usd under-reports prices writes at 1.25× . The render order — tools → system → messages . Stable first, volatile last. The two safest cost levers never touch the prefix : The reflex to build: before any change that touches tools, system, or message history , ask "does this change a byte inside the cached prefix?" If yes, you're paying a 2× rewrite — only do it on purpose. /v1/messages request carrying tools + system + the whole conversation; the server stores nothing and the client re-sends everything including thinking each turn. The request is the only state. signature carries the full reasoning, decrypted server-side; you can't read it, can't edit it 400 , but must carry it for continuity. Billed as output at generation, input on replay; display is visibility-only. keep:"all" everywhere, so the strip never bites in real usage — and a proxy that list changed or async connector load churn byte 0 → full rebuild. Keep the tool set stable; use a fixed dispatcher. Stabilize the tool surface first. Costs are computed from measured token counts at Anthropic list prices per 1M, as of 2026-06-15 : Opus 4.8 $5/$25, Sonnet 4.6 $3/$15, Haiku 4.5 $1/$5 in/out; cache read 0.1×, cache write 2× 1-hour TTL . The tokenizer factor 0.775 is measured for Haiku and assumed for Sonnet. Sonnet routing figures are modeled, not run. Claude Code's displayed total cost usd uses a 1.25× write multiplier and reads lower than these documented-rate figures. Prices and behaviors are version-specific Claude Code 2.1.150, mid-2026 — re-verify for your own setup. Anthropic documentation platform.claude.com/docs/en/build-with-claude/extended-thinking platform.claude.com/docs/en/build-with-claude/adaptive-thinking platform.claude.com/docs/en/build-with-claude/prompt-caching claude.com/pricing Cost-reduction tooling lm-sys/RouteLLM · vLLM Semantic Router vllm-project/semantic-router chopratejas/headroom · LLMLingua microsoft/LLMLingua · RTK rtk-ai/rtk · lean-ctx yvgude/lean-ctx JuliusBrussee/caveman github/github-mcp-server v1.0.5; dynamic-toolsets removed in v1.1.0, PR 2512 — for tech-debt, not cache · docker/mcp-gateway v0.43.0 / 4833d8c ; dynamic-tools default-on