cd /news/large-language-models/claude-code-costs-act-iv-the-mistake… · home topics large-language-models article
[ARTICLE · art-40409] src=dev.to ↗ pub= topic=large-language-models verified=true sentiment=· neutral

Claude Code Costs, Act IV — The mistakes catalogue & cheat sheet

A developer cataloged 11 costly mistakes in using Anthropic's prompt caching with Claude Code, including routing per turn instead of per conversation, inserting proxies that re-serialize JSON, and leaving tool surfaces unstable. The mistakes range from cache-breaking volatile tokens to trusting displayed session costs over actual billing, with fixes provided for each.

read5 min views1 publishedJun 26, 2026

This act consolidates everything into two reference artifacts: a catalogue of the mistakes (each with its symptom, cost, and fix) for revision, and a one-page cheat sheet.

Ordered roughly by how much money they cost and how often they happen.

1. Routing per turn instead of per conversation. Symptom: a "cost-saving" router that bounces models mid-conversation. Cost: a cold prefix write on the first switch + a catch-up write on every re-entry; for Sonnet, a net loss (+2.1% measured). Fix: route sticky — per conversation or per sub-agent. Where you switch matters more than whether.

2. Inserting a proxy that re-serializes JSON. Symptom: a logging/compression/routing proxy that parses and re-emits the request. Cost: even a logically-identical re-encode changes the bytes (separators, escaping, key order, number formatting) and drops cache hit rate to ~0 for all traffic. Fix: forward original bytes verbatim; mutate only by surgical byte-fragment replacement on the messages

tail.

3. Leaving the tool/MCP surface unstable. Symptom: tool count grows across early turns (async MCP connectors), or dynamic/runtime tool discovery is enabled. Cost: byte-0 churn → full cold rebuild (2× write) every affected turn. Fix: --strict-mcp-config

and a frozen startup tool set; a fixed dispatcher tool instead of runtime tool mutation.

4. Volatile tokens early in the prompt. Symptom: a timestamp, UUID, or request-ID near the top of the system prompt; unsorted json.dumps

. Cost: every request has a unique prefix → ~0% hit rate, silently. Fix: move all volatile content after the last breakpoint; sort keys deterministically.

5. Stripping or editing thinking blocks in a proxy. Symptom: a proxy that "cleans up" or removes thinking blocks. Cost: editing → 400 (integrity); removing → 200 but a position-dependent re-key (removing an early block lost 8,104 cache tokens vs 440 for a late one). Fix: never mutate thinking blocks; carry the request body as Claude Code assembled it.

6. Trusting the displayed session cost as your invoice. Symptom: using /cost

or the status line as the source of truth. Cost: it prices 1-hour writes at the 5-minute 1.25× rate and reads ~10% low ($1.77 vs $1.97 on a real session). Fix: use the Anthropic Console for absolute billing; use the displayed figure only for in-session relative comparison.

7. Caching a prefix you'll read fewer than 3 times. Symptom: aggressive caching in a short-lived custom harness. Cost: a wasted write is 2× — double what no caching would have cost. Fix: cache only when you'll re-read the prefix ≥3 times (the break-even); let interactive sessions cache by default.

8. Running an aggressive input compressor over a cached prefix. Symptom: LLMLingua/LLM-based compression applied to the conversational prefix. Cost: rewriting cached bytes flips reads (0.1×) to writes (2×); accuracy degrades past ~20×. Fix: use input compressors only where there's no cache to lose (one-shot, RAG assembly) or only on the fresh tail; prefer a cache-aware tool.

9. Agentic bursts overflowing the 20-block lookback. Symptom: a single turn emits ~10+ parallel tool calls (≈21+ tool-call/result blocks). Cost: the next turn can't re-link the cache and cold-writes the gap. Fix: in Claude Code, bound the burst; behind a proxy, a stateless breakpoint grid (stride ~18, 4 markers) rescues up to ~74 blocks/turn.

10. A cache breakpoint below the minimum prefix. Symptom: marking a sub-1,024-token prefix for caching. Cost: the marker silently does nothing (cache_creation: 0

); you think caching is on. Fix: confirm non-zero cache_creation then non-zero cache_read

on the following request.

11. Assuming HTTP 200 means the cache survived. Symptom: judging a proxy change by "it still works." Cost: 200 means "valid request," not "cache preserved" — you can silently convert cheap reads into expensive writes. Fix: validate every prefix-touching change by watching cache_read

on the next request, not just the status code.

12. Switching Haiku → Sonnet mid-conversation. Symptom: escalating an in-flight Haiku conversation to Sonnet. Cost: stacks three things — full cache miss (model-scoped), Sonnet now keeps-and-bills thinking Haiku had been ignoring, and any prefix-leanness benefit evaporates. Fix: decide the tier at conversation start; if you must escalate, accept it as a fresh cold start, not a cheap continuation.

The four mental models

signature

; you carry it, can't open it, can't edit it.The numbers

total_cost_usd

under-reports (prices writes at 1.25×).The render ordertools → system → messages

. Stable first, volatile last.

The two safest cost levers (never touch the prefix): The reflex to build: before any change that touches tools, system, or message history, ask "does this change a byte inside the cached prefix?" If yes, you're paying a 2× rewrite — only do it on purpose.

/v1/messages

request carrying tools + system + the whole conversation; the server stores nothing and the client re-sends everything (including thinking) each turn. The request is the only state.signature

carries the full reasoning, decrypted server-side; you can't read it, can't edit it (400), but must carry it for continuity. Billed as output at generation, input on replay; display

is visibility-only.keep:"all" everywhere, so the strip never bites in real usage — and a proxy that list_changed

or async connector load) churn byte 0 → full rebuild. Keep the tool set stable; use a fixed dispatcher. Stabilize the tool surface first.Costs are computed from measured token counts at Anthropic list prices (per 1M, as of 2026-06-15): Opus 4.8 $5/$25, Sonnet 4.6 $3/$15, Haiku 4.5 $1/$5 in/out; cache read 0.1×, cache write 2× (1-hour TTL). The tokenizer factor 0.775 is measured for Haiku and assumed for Sonnet. Sonnet routing figures are modeled, not run. Claude Code's displayed total_cost_usd uses a 1.25× write multiplier and reads lower than these documented-rate figures. Prices and behaviors are version-specific (Claude Code 2.1.150, mid-2026) — re-verify for your own setup.

Anthropic documentation

platform.claude.com/docs/en/build-with-claude/extended-thinking

platform.claude.com/docs/en/build-with-claude/adaptive-thinking

platform.claude.com/docs/en/build-with-claude/prompt-caching

claude.com/pricing

Cost-reduction tooling

lm-sys/RouteLLM

· vLLM Semantic Router vllm-project/semantic-router

chopratejas/headroom

· LLMLingua microsoft/LLMLingua

· RTK rtk-ai/rtk

· lean-ctx yvgude/lean-ctx JuliusBrussee/caveman

`github/github-mcp-server`

(v1.0.5; dynamic-toolsets removed in v1.1.0, PR #2512 — for tech-debt, not cache) · `docker/mcp-gateway`

(v0.43.0 / 4833d8c

; `dynamic-tools`

default-on)
── more in #large-language-models 4 stories · sorted by recency
── more on @anthropic 3 stories trending now
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/claude-code-costs-ac…] indexed:0 read:5min 2026-06-26 ·