This act consolidates everything into two reference artifacts: a catalogue of the mistakes (each with its symptom, cost, and fix) for revision, and a one-page cheat sheet.
Ordered roughly by how much money they cost and how often they happen.
1. Routing per turn instead of per conversation. Symptom: a "cost-saving" router that bounces models mid-conversation. Cost: a cold prefix write on the first switch + a catch-up write on every re-entry; for Sonnet, a net loss (+2.1% measured). Fix: route sticky — per conversation or per sub-agent. Where you switch matters more than whether.
2. Inserting a proxy that re-serializes JSON. Symptom: a logging/compression/routing proxy that parses and re-emits the request. Cost: even a logically-identical re-encode changes the bytes (separators, escaping, key order, number formatting) and drops cache hit rate to ~0 for all traffic. Fix: forward original bytes verbatim; mutate only by surgical byte-fragment replacement on the messages
tail.
3. Leaving the tool/MCP surface unstable. Symptom: tool count grows across early turns (async MCP connectors), or dynamic/runtime tool discovery is enabled. Cost: byte-0 churn → full cold rebuild (2× write) every affected turn. Fix: --strict-mcp-config
and a frozen startup tool set; a fixed dispatcher tool instead of runtime tool mutation.
4. Volatile tokens early in the prompt. Symptom: a timestamp, UUID, or request-ID near the top of the system prompt; unsorted json.dumps
. Cost: every request has a unique prefix → ~0% hit rate, silently. Fix: move all volatile content after the last breakpoint; sort keys deterministically.
5. Stripping or editing thinking blocks in a proxy. Symptom: a proxy that "cleans up" or removes thinking blocks. Cost: editing → 400 (integrity); removing → 200 but a position-dependent re-key (removing an early block lost 8,104 cache tokens vs 440 for a late one). Fix: never mutate thinking blocks; carry the request body as Claude Code assembled it.
6. Trusting the displayed session cost as your invoice. Symptom: using /cost
or the status line as the source of truth. Cost: it prices 1-hour writes at the 5-minute 1.25× rate and reads ~10% low ($1.77 vs $1.97 on a real session). Fix: use the Anthropic Console for absolute billing; use the displayed figure only for in-session relative comparison.
7. Caching a prefix you'll read fewer than 3 times. Symptom: aggressive caching in a short-lived custom harness. Cost: a wasted write is 2× — double what no caching would have cost. Fix: cache only when you'll re-read the prefix ≥3 times (the break-even); let interactive sessions cache by default.
8. Running an aggressive input compressor over a cached prefix. Symptom: LLMLingua/LLM-based compression applied to the conversational prefix. Cost: rewriting cached bytes flips reads (0.1×) to writes (2×); accuracy degrades past ~20×. Fix: use input compressors only where there's no cache to lose (one-shot, RAG assembly) or only on the fresh tail; prefer a cache-aware tool.
9. Agentic bursts overflowing the 20-block lookback. Symptom: a single turn emits ~10+ parallel tool calls (≈21+ tool-call/result blocks). Cost: the next turn can't re-link the cache and cold-writes the gap. Fix: in Claude Code, bound the burst; behind a proxy, a stateless breakpoint grid (stride ~18, 4 markers) rescues up to ~74 blocks/turn.
10. A cache breakpoint below the minimum prefix. Symptom: marking a sub-1,024-token prefix for caching. Cost: the marker silently does nothing (cache_creation: 0
); you think caching is on. Fix: confirm non-zero cache_creation
then non-zero cache_read
on the following request.
11. Assuming HTTP 200 means the cache survived. Symptom: judging a proxy change by "it still works." Cost: 200 means "valid request," not "cache preserved" — you can silently convert cheap reads into expensive writes. Fix: validate every prefix-touching change by watching cache_read
on the next request, not just the status code.
12. Switching Haiku → Sonnet mid-conversation. Symptom: escalating an in-flight Haiku conversation to Sonnet. Cost: stacks three things — full cache miss (model-scoped), Sonnet now keeps-and-bills thinking Haiku had been ignoring, and any prefix-leanness benefit evaporates. Fix: decide the tier at conversation start; if you must escalate, accept it as a fresh cold start, not a cheap continuation.
The four mental models
signature
; you carry it, can't open it, can't edit it.The numbers
total_cost_usd
under-reports (prices writes at 1.25×).The render order — tools → system → messages
. Stable first, volatile last.
The two safest cost levers (never touch the prefix): The reflex to build: before any change that touches tools, system, or message history, ask "does this change a byte inside the cached prefix?" If yes, you're paying a 2× rewrite — only do it on purpose.
/v1/messages
request carrying tools + system + the whole conversation; the server stores nothing and the client re-sends everything (including thinking) each turn. The request is the only state.signature
carries the full reasoning, decrypted server-side; you can't read it, can't edit it (400), but must carry it for continuity. Billed as output at generation, input on replay; display
is visibility-only.keep:"all"
everywhere, so the strip never bites in real usage — and a proxy that list_changed
or async connector load) churn byte 0 → full rebuild. Keep the tool set stable; use a fixed dispatcher. Stabilize the tool surface first.Costs are computed from measured token counts at Anthropic list prices (per 1M, as of 2026-06-15): Opus 4.8 $5/$25, Sonnet 4.6 $3/$15, Haiku 4.5 $1/$5 in/out; cache read 0.1×, cache write 2× (1-hour TTL). The tokenizer factor 0.775 is measured for Haiku and assumed for Sonnet. Sonnet routing figures are modeled, not run. Claude Code's displayed total_cost_usd uses a 1.25× write multiplier and reads lower than these documented-rate figures. Prices and behaviors are version-specific (Claude Code 2.1.150, mid-2026) — re-verify for your own setup.
Anthropic documentation
platform.claude.com/docs/en/build-with-claude/extended-thinking
platform.claude.com/docs/en/build-with-claude/adaptive-thinking
platform.claude.com/docs/en/build-with-claude/prompt-caching
claude.com/pricing
Cost-reduction tooling
lm-sys/RouteLLM
· vLLM Semantic Router vllm-project/semantic-router
chopratejas/headroom
· LLMLingua microsoft/LLMLingua
· RTK rtk-ai/rtk
· lean-ctx yvgude/lean-ctx
JuliusBrussee/caveman
`github/github-mcp-server`
(v1.0.5; dynamic-toolsets removed in v1.1.0, PR #2512 — for tech-debt, not cache) · `docker/mcp-gateway`
(v0.43.0 / 4833d8c
; `dynamic-tools`
default-on)