{"slug": "prompt-caching-in-production-the-4-patterns-that-cut-my-anthropic-bill-and-when", "title": "Prompt caching in production: the 4 patterns that cut my Anthropic bill (and when not to bother)", "summary": "A developer achieved an 80% cost reduction on their Anthropic API bill for the Career-OS application in a single afternoon by properly implementing prompt caching in the Claude SDK. The developer identified four production caching patterns that deliver the highest leverage, including caching static system prompts and large reference documents, while noting that caching fails when prompts are called less than once every five minutes. The developer also found that caching entire documents is often cheaper and more accurate than RAG-based retrieval for stable corpora that fit within Claude's 200K token context window.", "body_md": "The first month I ran Career-OS in production, the Anthropic bill was bigger than my coffee budget. After I wired prompt caching properly into the scorer, the drafter, and the digest, it dropped below it. Same calls. Same model. Same outputs. Roughly an 80% cost reduction in one afternoon.\n\nPrompt caching is the single highest-leverage knob in the Claude SDK. It's also the one I see misconfigured most often in client code, usually because people read the docs, slap `cache_control` on something, and assume they're caching when they're not.\n\nHere are the 4 patterns I ship in production, with the cost math, and the 4 cases where caching genuinely does not help so you don't waste a day on it.\n\nThe mechanics, in three lines, because you need to know this to use it right:\n\nA content block marked with `cache_control` (`\"cache_control\": { \"type\": \"ephemeral\" }`) is stored on Anthropic's side after the first call. Subsequent calls with an identical cached block hit the cache instead of re-processing.\n\nIf your workload calls the same prompt twice within 5 minutes, caching pays off. 
If you call it once an hour with no warmup, you're paying the write penalty for nothing.\n\nThe pattern everyone reaches for first, and the one that gives the biggest win in 90% of cases.\n\n```typescript\n// app/api/agent/route.ts\nimport Anthropic from \"@anthropic-ai/sdk\";\n\nconst claude = new Anthropic();\n\nexport async function POST(req: Request) {\n  const { question } = await req.json();\n\n  const reply = await claude.messages.create({\n    model: \"claude-sonnet-4-6\",\n    max_tokens: 1024,\n    system: [\n      {\n        type: \"text\",\n        text: SYSTEM_PROMPT,                    // 2,400 tokens of context\n        cache_control: { type: \"ephemeral\" },   // ← the magic\n      },\n    ],\n    messages: [{ role: \"user\", content: question }],\n  });\n\n  return Response.json({ answer: reply.content });\n}\n```\n\nThe math, for a 2,400-token system prompt called 100 times in 5 minutes (the realistic shape of a busy support endpoint), counting the system prompt alone and assuming $3 per million input tokens, cache writes billed at 1.25x that rate, and cache reads at 0.1x: uncached, that is 240,000 input tokens, about $0.72. Cached, it is one write (about $0.009) plus 99 reads (about $0.071), about $0.08 in total.\n\nThe break-even is between the 1st and 2nd call. After call 2 you're already ahead. After call 100 you've cut the input cost of that prompt by roughly 89%.\n\n**Cache hits are silent.** The API returns `cache_creation_input_tokens` and `cache_read_input_tokens` in the usage block. Log them. If you're not seeing reads, you're not caching:\n\n```\nconsole.log({\n  cache_write: reply.usage.cache_creation_input_tokens,\n  cache_read:  reply.usage.cache_read_input_tokens,\n  uncached:    reply.usage.input_tokens,\n});\n```\n\nA single dashboard tile showing `cache_read / (cache_read + uncached)` tells you whether your caching is working. Mine sits at 94% for the Career-OS scorer during morning crawl runs.\n\nThe pattern that actually changes which architectures are economically viable.\n\nSay you have a 30,000-token product manual, customer policy document, or codebase. Without caching, every customer question costs you ~$0.09 in input tokens alone. 
With caching, your *first* question after the cache goes cold costs you ~$0.11, and every subsequent question while it stays warm costs ~$0.01.\n\n```python\n# document_qa.py\nimport anthropic\n\nclient = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment\n\nreply = client.messages.create(\n    model=\"claude-sonnet-4-6\",\n    max_tokens=600,\n    system=[\n        {\n            \"type\": \"text\",\n            \"text\": LIGHT_INSTRUCTIONS,         # 200 tokens, no marker; cached as part of the prefix\n        },\n        {\n            \"type\": \"text\",\n            \"text\": SHOP_POLICY_DOCUMENT,       # 30,000 tokens, cache breakpoint\n            \"cache_control\": {\"type\": \"ephemeral\"},\n        },\n    ],\n    messages=[{\"role\": \"user\", \"content\": user_question}],\n)\n```\n\nWhat this kills: most of the use cases people built RAG for. If your \"retrieval over a fixed corpus\" use case fits inside Claude's 200K context, caching the full document is often cheaper and *usually more accurate* than embedding-based retrieval. No chunking. No top-k tuning. No vector DB operational burden.\n\nThe catch: the corpus has to be relatively stable. If your \"document\" is yesterday's database dump, you're paying the cache write fee every single day. Use cache for things that change weekly, not hourly.\n\nTool definitions are tokens. They count. 
And they're identical across every call to the same agent.\n\n```python\nTOOLS = [\n    {\"name\": \"search_orders\", \"description\": \"...\", \"input_schema\": {...}},\n    {\"name\": \"issue_refund\",  \"description\": \"...\", \"input_schema\": {...}},\n    {\"name\": \"lookup_user\",   \"description\": \"...\", \"input_schema\": {...}},\n    # … 12 tools in total, ~3,500 tokens of schema\n]\n\nreply = client.messages.create(\n    model=\"claude-sonnet-4-6\",\n    max_tokens=1024,  # required by the Messages API\n    tools=TOOLS,\n    system=[\n        {\n            \"type\": \"text\",\n            \"text\": SYSTEM_PROMPT,\n            \"cache_control\": {\"type\": \"ephemeral\"},\n        },\n    ],\n    messages=[...],\n)\n```\n\nWhen you cache the system block, **tool definitions get cached too** if they're declared in the same call. They become part of the cached prefix. You don't need a separate `cache_control` on the tools array: the prefix is assembled as tools, then system, then messages, so a breakpoint on the system block covers the tool definitions ahead of it.\n\nThis is a 3,500-token win you get for free when you're already caching the system block. Most of the time it's already happening and you don't realize it. Worth confirming with the `cache_creation_input_tokens` log line.\n\nThe pattern that makes long-running agentic loops affordable.\n\nMulti-turn agents (the ones that loop through `assistant → tool_use → tool_result → assistant → tool_use → …`) re-send the entire conversation history on every call. 
By turn 8, you're sending 12,000+ tokens of history, most of which is unchanged from turn 7.\n\nCache the prefix.\n\n```python\nimport copy\n\ndef agent_loop(initial_message: str, max_turns: int = 10) -> str:\n    messages = [{\"role\": \"user\", \"content\": initial_message}]\n\n    for _ in range(max_turns):\n        # Cache everything up to and including the latest user message\n        cached_messages = mark_last_message_cached(messages)\n\n        reply = client.messages.create(\n            model=\"claude-sonnet-4-6\",\n            max_tokens=1024,  # required by the Messages API\n            tools=TOOLS,\n            system=[{\n                \"type\": \"text\", \"text\": SYSTEM_PROMPT,\n                \"cache_control\": {\"type\": \"ephemeral\"}\n            }],\n            messages=cached_messages,\n        )\n\n        if reply.stop_reason == \"end_turn\":\n            return reply.content[0].text\n\n        messages.append({\"role\": \"assistant\", \"content\": reply.content})\n        messages.append({\"role\": \"user\", \"content\": run_tools(reply)})\n\n    raise RuntimeError(\"agent did not finish within max_turns\")\n\ndef mark_last_message_cached(messages: list) -> list:\n    \"\"\"Return a copy with cache_control on the last user message so the whole prefix caches.\"\"\"\n    out = list(messages)\n    if out:\n        # Deep copy so the marker never leaks into `messages`; markers left on\n        # older turns would pile up past the API's limit of 4 cache breakpoints.\n        last = copy.deepcopy(out[-1])\n        if isinstance(last[\"content\"], str):\n            last[\"content\"] = [{\"type\": \"text\", \"text\": last[\"content\"]}]\n        last[\"content\"][-1][\"cache_control\"] = {\"type\": \"ephemeral\"}\n        out[-1] = last\n    return out\n```\n\nEach new turn extends the cached prefix by the previous turn's content. By turn 10, ~95% of your input tokens hit cache reads. An agent loop that would cost $0.40 to run uncached costs $0.05 with this pattern.\n\nThis is where I see clients waste afternoons. Be honest about whether your workload fits.\n\n**1. Your prompts vary too much.** If each call has a different system prompt (you're concatenating user-specific data into it, or A/B-testing prompt variants), there's no shared cache prefix to hit. 
Either restructure to push the variation into the messages block (keeping the system stable), or accept that caching isn't your lever.\n\n**2. Your volume is low.** If you call the model 5 times an hour spread evenly, the 5-minute TTL means you almost never hit a warm cache. The 1-hour TTL helps, but its writes are billed at 2x the base input price instead of 1.25x. At extremely low volumes the math sometimes works out to \"uncached is cheaper.\"\n\n**3. Your prompts are short.** Below ~1,024 tokens of cacheable content (the Anthropic minimum, which is higher on some models), caching just doesn't activate. The request runs as a normal uncached call: no cache is created and no write premium is charged. Quietly. Check the usage block, where both cache fields will read 0.\n\n**4. Your content is per-user and short-lived.** If the cached content is specific to one user and they only make one or two calls, you're paying the write penalty without ever hitting the cache. Aggregation across users or sessions doesn't apply.\n\nThe three things to wire up *before* you ship cached calls:\n\nLog `cache_creation_input_tokens` and `cache_read_input_tokens` for every call.\n\nFor Career-OS, the four patterns above collapsed the morning crawl-and-score run from \"noticeable on the bill\" to \"rounding error.\" Setup time: one afternoon. Ongoing maintenance: the three log lines + one dashboard tile.\n\nFor an inbound support agent handling 20,000 queries a month: easily $200–$400/month saved versus uncached, every month, forever, with the same quality of output.\n\nFor a documentation-QA endpoint over a stable corpus: the difference between \"too expensive to ship to all users\" and \"an obvious feature.\" I've watched this single decision unblock entire roadmap items.\n\nIf you have a Claude-powered feature in production today and you do not have a dashboard tile showing cache hit rate, that's the bug. 
Cache misses are silent and your bill is paying for them.\n\nThis is a 1–3 day scoped audit + fix that I take on: [the shape is on the hire-me page](https://dev.to/hire-me).\n\nFor the full context where these patterns ship, see the [Career-OS architecture walkthrough](https://dev.to/blog/career-os-architecture).\n\nFor the upstream patterns (where to bolt the Claude call onto your stack in the first place), see the [5 places to bolt AI onto Laravel](https://dev.to/blog/5-places-to-bolt-ai-onto-laravel) and the [PrestaShop 5-file pattern](https://dev.to/blog/claude-agent-prestashop-5-files).\n\nAnd before any of this ships to production, the [eval harness post](https://dev.to/blog/evaluating-claude-features-before-production) is the discipline that catches the regressions caching alone can't.\n\n*Originally published on bak-dev.com. Find more build-in-public posts at bak-dev.com/blog.*", "url": "https://wpnews.pro/news/prompt-caching-in-production-the-4-patterns-that-cut-my-anthropic-bill-and-when", "canonical_source": "https://dev.to/akram_bakhouche/prompt-caching-in-production-the-4-patterns-that-cut-my-anthropic-bill-and-when-not-to-bother-1h5e", "published_at": "2026-05-28 16:14:58+00:00", "updated_at": "2026-05-28 16:25:49.261490+00:00", "lang": "en", "topics": ["large-language-models", "ai-products", "ai-tools", "ai-infrastructure", "mlops"], "entities": ["Anthropic", "Claude", "Career-OS"], "alternates": {"html": "https://wpnews.pro/news/prompt-caching-in-production-the-4-patterns-that-cut-my-anthropic-bill-and-when", "markdown": "https://wpnews.pro/news/prompt-caching-in-production-the-4-patterns-that-cut-my-anthropic-bill-and-when.md", "text": "https://wpnews.pro/news/prompt-caching-in-production-the-4-patterns-that-cut-my-anthropic-bill-and-when.txt", "jsonld": "https://wpnews.pro/news/prompt-caching-in-production-the-4-patterns-that-cut-my-anthropic-bill-and-when.jsonld"}}