# Claude Code Costs, Act IV — The mistakes catalogue & cheat sheet

> Source: <https://dev.to/sumedhbala/claude-code-costs-act-iv-the-mistakes-catalogue-cheat-sheet-34bo>
> Published: 2026-06-26 06:00:29+00:00

This act consolidates everything into two reference artifacts: a catalogue of the mistakes (each with its symptom, cost, and fix) for revision, and a one-page cheat sheet.

Ordered roughly by how much money they cost and how often they happen.

**1. Routing per turn instead of per conversation.** *Symptom:* a "cost-saving" router that bounces models mid-conversation. *Cost:* a cold prefix write on the first switch + a catch-up write on every re-entry; for Sonnet, a net *loss* (+2.1% measured). *Fix:* route sticky — per conversation or per sub-agent. Where you switch matters more than whether.

**2. Inserting a proxy that re-serializes JSON.** *Symptom:* a logging/compression/routing proxy that parses and re-emits the request. *Cost:* even a logically-identical re-encode changes the bytes (separators, escaping, key order, number formatting) and drops cache hit rate to ~0 for *all* traffic. *Fix:* forward original bytes verbatim; mutate only by surgical byte-fragment replacement on the `messages`

tail.

**3. Leaving the tool/MCP surface unstable.** *Symptom:* tool count grows across early turns (async MCP connectors), or dynamic/runtime tool discovery is enabled. *Cost:* byte-0 churn → full cold rebuild (2× write) every affected turn. *Fix:* `--strict-mcp-config`

and a frozen startup tool set; a fixed dispatcher tool instead of runtime tool mutation.

**4. Volatile tokens early in the prompt.** *Symptom:* a timestamp, UUID, or request-ID near the top of the system prompt; unsorted `json.dumps`

. *Cost:* every request has a unique prefix → ~0% hit rate, silently. *Fix:* move all volatile content *after* the last breakpoint; sort keys deterministically.

**5. Stripping or editing thinking blocks in a proxy.** *Symptom:* a proxy that "cleans up" or removes thinking blocks. *Cost:* editing → 400 (integrity); removing → 200 but a position-dependent re-key (removing an *early* block lost 8,104 cache tokens vs 440 for a late one). *Fix:* never mutate thinking blocks; carry the request body as Claude Code assembled it.

**6. Trusting the displayed session cost as your invoice.** *Symptom:* using `/cost`

or the status line as the source of truth. *Cost:* it prices 1-hour writes at the 5-minute 1.25× rate and reads ~10% low ($1.77 vs $1.97 on a real session). *Fix:* use the Anthropic Console for absolute billing; use the displayed figure only for in-session relative comparison.

**7. Caching a prefix you'll read fewer than 3 times.** *Symptom:* aggressive caching in a short-lived custom harness. *Cost:* a wasted write is 2× — double what no caching would have cost. *Fix:* cache only when you'll re-read the prefix ≥3 times (the break-even); let interactive sessions cache by default.

**8. Running an aggressive input compressor over a cached prefix.** *Symptom:* LLMLingua/LLM-based compression applied to the conversational prefix. *Cost:* rewriting cached bytes flips reads (0.1×) to writes (2×); accuracy degrades past ~20×. *Fix:* use input compressors only where there's no cache to lose (one-shot, RAG assembly) or only on the fresh tail; prefer a cache-aware tool.

**9. Agentic bursts overflowing the 20-block lookback.** *Symptom:* a single turn emits ~10+ parallel tool calls (≈21+ tool-call/result blocks). *Cost:* the next turn can't re-link the cache and cold-writes the gap. *Fix:* in Claude Code, bound the burst; behind a proxy, a stateless breakpoint grid (stride ~18, 4 markers) rescues up to ~74 blocks/turn.

**10. A cache breakpoint below the minimum prefix.** *Symptom:* marking a sub-1,024-token prefix for caching. *Cost:* the marker silently does nothing (`cache_creation: 0`

); you think caching is on. *Fix:* confirm non-zero `cache_creation`

then non-zero `cache_read`

on the following request.

**11. Assuming HTTP 200 means the cache survived.** *Symptom:* judging a proxy change by "it still works." *Cost:* 200 means "valid request," not "cache preserved" — you can silently convert cheap reads into expensive writes. *Fix:* validate every prefix-touching change by watching `cache_read`

on the *next* request, not just the status code.

**12. Switching Haiku → Sonnet mid-conversation.** *Symptom:* escalating an in-flight Haiku conversation to Sonnet. *Cost:* stacks three things — full cache miss (model-scoped), Sonnet now keeps-and-bills thinking Haiku had been ignoring, and any prefix-leanness benefit evaporates. *Fix:* decide the tier at conversation start; if you must escalate, accept it as a fresh cold start, not a cheap continuation.

**The four mental models**

`signature`

; you carry it, can't open it, can't edit it.**The numbers**

`total_cost_usd`

under-reports (prices writes at 1.25×).**The render order** — `tools → system → messages`

. Stable first, volatile last.

**The two safest cost levers** (never touch the prefix):

**The reflex to build:** before any change that touches tools, system, or message *history*, ask *"does this change a byte inside the cached prefix?"* If yes, you're paying a 2× rewrite — only do it on purpose.

`/v1/messages`

request carrying tools + system + the whole conversation; the server stores nothing and the client re-sends everything (including thinking) each turn. The request is the only state.`signature`

carries the full reasoning, decrypted server-side; you can't read it, can't edit it (400), but must carry it for continuity. Billed as output at generation, input on replay; `display`

is visibility-only.`keep:"all"`

everywhere, so the strip never bites in real usage — and a proxy that `list_changed`

or async connector load) churn byte 0 → full rebuild. Keep the tool set stable; use a fixed dispatcher. Stabilize the tool surface first.*Costs are computed from measured token counts at Anthropic list prices (per 1M, as of 2026-06-15): Opus 4.8 $5/$25, Sonnet 4.6 $3/$15, Haiku 4.5 $1/$5 in/out; cache read 0.1×, cache write 2× (1-hour TTL). The tokenizer factor 0.775 is measured for Haiku and assumed for Sonnet. Sonnet routing figures are modeled, not run. Claude Code's displayed total_cost_usd uses a 1.25× write multiplier and reads lower than these documented-rate figures. Prices and behaviors are version-specific (Claude Code 2.1.150, mid-2026) — re-verify for your own setup.*

**Anthropic documentation**

`platform.claude.com/docs/en/build-with-claude/extended-thinking`

`platform.claude.com/docs/en/build-with-claude/adaptive-thinking`

`platform.claude.com/docs/en/build-with-claude/prompt-caching`

`claude.com/pricing`

**Cost-reduction tooling**

`lm-sys/RouteLLM`

· vLLM Semantic Router `vllm-project/semantic-router`

`chopratejas/headroom`

· LLMLingua `microsoft/LLMLingua`

· RTK `rtk-ai/rtk`

· lean-ctx `yvgude/lean-ctx`

`JuliusBrussee/caveman`

`github/github-mcp-server`

(v1.0.5; dynamic-toolsets removed in v1.1.0, PR #2512 — for tech-debt, not cache) · `docker/mcp-gateway`

(v0.43.0 / `4833d8c`

; `dynamic-tools`

default-on)
