{"slug": "prompt-caching-cut-my-claude-api-bill-by-85-here-s-the-exact-setup", "title": "Prompt caching cut my Claude API bill by 85%. Here's the exact setup.", "summary": "An engineer at Anthropic achieved an 85% reduction in Claude API costs by enabling prompt caching on a long system prompt. The caching feature, which stores repeated prompt prefixes for five minutes, dropped daily costs from $47 to $6.80 for an agent processing 4,000 requests per day. The setup requires adding a cache_control block to the system prompt, with high-ROI scenarios including large system prompts, tool definitions, few-shot examples, and document analysis.", "body_md": "Last month I ran a side-by-side test on an AI agent that processes about 4,000 requests a day. The agent has a long system prompt (roughly 2,800 tokens of rules, tool definitions, and examples) that gets sent with every single call. Before prompt caching: $47/day. After enabling caching on that system prompt block: $6.80/day.\n\nThat's not a rounding error. That's an 85% cost reduction with a single configuration change and zero changes to the agent's behavior.\n\nHere's exactly how prompt caching works and how to set it up without the gotchas.\n\nAnthropic's prompt caching works at the prefix level. When you send a request, the API checks whether a prefix of your messages exactly matches a previously cached prefix. If it does, those cached tokens are served from a KV store instead of being re-processed through the full model, and you pay a dramatically lower per-token rate for them.\n\nThe pricing structure (as of mid-2026 on Claude 3.5 Sonnet):\n\n- Regular input tokens: $3.00/M\n- Cache writes: $3.75/M (a 25% premium over regular input)\n- Cache reads: $0.30/M (90% less than regular input)\n\nA cache entry lives for **5 minutes**, and the TTL resets on each hit. For any agent that gets called more often than every 5 minutes (which is most production agents), this is almost always a win.\n\nThe key is the `cache_control` block. You add it as a \"breakpoint\" at the end of any message block you want cached. 
The API caches everything **up to and including** that breakpoint.\n\n```python\nimport anthropic\n\nclient = anthropic.Anthropic()\n\n# Your long system prompt - tool definitions, rules, examples, etc.\nSYSTEM_PROMPT = \"\"\"\nYou are a support agent for Acme Corp...\n[2,800 tokens of rules, tool definitions, persona, examples]\n\"\"\"\n\n# Hypothetical example input; in production this comes from the end user\nuser_message = \"How do I reset my password?\"\n\nresponse = client.messages.create(\n    model=\"claude-sonnet-4-5\",\n    max_tokens=1024,\n    system=[\n        {\n            \"type\": \"text\",\n            \"text\": SYSTEM_PROMPT,\n            \"cache_control\": {\"type\": \"ephemeral\"}  # <-- this is the entire setup\n        }\n    ],\n    messages=[\n        {\"role\": \"user\", \"content\": user_message}\n    ]\n)\n\n# Check what actually happened\nusage = response.usage\nprint(f\"Input tokens: {usage.input_tokens}\")\nprint(f\"Cache write tokens: {usage.cache_creation_input_tokens}\")\nprint(f\"Cache read tokens: {usage.cache_read_input_tokens}\")\n```\n\nThe `cache_creation_input_tokens` field tells you a cache was written (you pay the 25% premium). On subsequent calls within 5 minutes, `cache_read_input_tokens` will be populated instead, and you pay $0.30/M instead of $3.00/M.\n\n**High-ROI scenarios:**\n\n**Large system prompts repeated on every call.** If your system prompt is 1,000+ tokens and you're calling the API more than once every 5 minutes, caching it is almost always net positive.\n\n**Tool definitions.** Tool schemas count as input tokens, and they can be surprisingly large. A set of 10 reasonably described tools might run 800-1,200 tokens. Cache the tools block.\n\n**Few-shot examples in the system prompt.** This is the big one. People add 5-10 worked examples to their system prompts to improve output quality. Those examples might be 2,000-4,000 tokens. 
Cache them.\n\n**Document analysis at scale.** If you're analyzing the same document with many different questions (think: extracting 20 different fields from a contract), cache the document text as a user message and issue all 20 queries against the same cache.\n\n**Low or negative ROI scenarios:**\n\n- Request cadence slower than one call per 5 minutes: the cache expires before it is read, so every call pays the write premium.\n- Prompts whose prefix changes on every request: nothing ever matches, so you never get a cache hit.\n\nYou can have **up to 4 cache breakpoints per request**. This lets you cache different parts of the prompt independently:\n\n```python\nresponse = client.messages.create(\n    model=\"claude-sonnet-4-5\",\n    max_tokens=1024,\n    system=[\n        {\n            \"type\": \"text\",\n            \"text\": BASE_RULES,           # Always the same\n            \"cache_control\": {\"type\": \"ephemeral\"}\n        },\n        {\n            \"type\": \"text\",\n            \"text\": TOOL_DEFINITIONS,     # Changes rarely\n            \"cache_control\": {\"type\": \"ephemeral\"}\n        },\n        {\n            \"type\": \"text\",\n            \"text\": dynamic_context       # Changes per request, NOT cached\n        }\n    ],\n    messages=[...]\n)\n```\n\nThe prefix caching rule is strict: the API caches everything up to the last marked breakpoint in sequence. If your dynamic context goes between two cached blocks, the second cache hit won't work, because the prefix has to be identical. Always put dynamic content at the end.\n\n**Whitespace and character-level identity matter.**\n\nThe cache key is the exact token sequence of the prefix. If your system prompt is generated dynamically (say, you interpolate a user's name or account tier into it), each variation produces a different token sequence and you get zero cache hits even though 95% of the content is identical.\n\nThe fix: move all dynamic content to the end, after your last cache breakpoint. Put only truly static content (rules, tool definitions, examples) in the cached block.\n\n```python\n# Bad: dynamic content inside the cached block breaks caching\nsystem = f\"\"\"\nYou are an agent for {company_name}.  
# <-- this makes every request unique\n[2,800 tokens of static rules]\n\"\"\"\n\n# Good: static block cached, dynamic content appended outside the cache\nSTATIC_BLOCK = \"\"\"\n[2,800 tokens of static rules]\n\"\"\"\nsystem = [\n    {\"type\": \"text\", \"text\": STATIC_BLOCK, \"cache_control\": {\"type\": \"ephemeral\"}},\n    {\"type\": \"text\", \"text\": f\"Current context: working for {company_name}.\"}\n]\n```\n\nBefore enabling caching, run this math:\n\n```\nLet:\n  T = tokens in your cached block\n  W = cache write cost = T * $3.75/M\n  P = write premium over regular input = T * ($3.75 - $3.00) / M = T * $0.75/M\n  S = savings per cache read = T * ($3.00 - $0.30) / M = T * $2.70/M\n\nBreak-even reads = P / S = $0.75 / $2.70 ≈ 0.28 reads per cache write\n```\n\nYou would pay $3.00/M for those tokens even without caching, so only the $0.75/M write premium has to be recovered, and a single cache read more than covers it. Caching is net positive as soon as each write is followed by at least one read within the 5-minute window, which for evenly spaced traffic means more than 12 requests/hour. At 4,000 requests/day, you're averaging roughly 14 requests per 5-minute window, so nearly every call is a cache read.\n\nAlways instrument your cache usage. The response usage object tells you exactly what happened:\n\n```python\nusage = response.usage\ntotal_input = usage.input_tokens\n# These fields can be missing or None, so fall back to 0\ncache_writes = getattr(usage, 'cache_creation_input_tokens', 0) or 0\ncache_reads = getattr(usage, 'cache_read_input_tokens', 0) or 0\n\n# A healthy caching ratio: most calls should be reads, not writes\nprint(f\"Cache write: {cache_writes} tokens (paid at $3.75/M)\")\nprint(f\"Cache read:  {cache_reads} tokens (paid at $0.30/M)\")\nprint(f\"Regular:     {total_input} tokens (paid at $3.00/M)\")\n```\n\nIf you're seeing mostly `cache_creation_input_tokens` and few `cache_read_input_tokens`, your request cadence is slower than 5 minutes or your prompt isn't actually static. Fix the content, not the caching setup.\n\nPrompt caching is one of those rare API features where the implementation cost is 30 minutes and the payoff is immediate and ongoing. 
It doesn't change what your agent does; it just changes what you pay for the same work.\n\nIf your agent makes more than ~20 calls/hour with a system prompt over ~800 tokens (and a total cached prefix above the model's minimum cacheable length, which is 1,024 tokens on Sonnet models), you should be caching. The `cache_control` block is a one-liner. The usage fields tell you instantly whether it's working.\n\nIf you're building reliable AI agents at production scale, the free **Reliable Agent Field Guide** covers reliability patterns, cost controls, and testing strategies: [penloomstudio.com/field-guide.html](https://penloomstudio.com/field-guide.html)", "url": "https://wpnews.pro/news/prompt-caching-cut-my-claude-api-bill-by-85-here-s-the-exact-setup", "canonical_source": "https://dev.to/penloom_studio_829b7817d3/prompt-caching-cut-my-claude-api-bill-by-85-heres-the-exact-setup-3nd9", "published_at": "2026-07-01 02:18:57+00:00", "updated_at": "2026-07-01 02:49:04.043098+00:00", "lang": "en", "topics": ["large-language-models", "ai-products", "developer-tools", "ai-infrastructure"], "entities": ["Anthropic", "Claude", "Claude 3.5 Sonnet", "Claude Sonnet 4-5"], "alternates": {"html": "https://wpnews.pro/news/prompt-caching-cut-my-claude-api-bill-by-85-here-s-the-exact-setup", "markdown": "https://wpnews.pro/news/prompt-caching-cut-my-claude-api-bill-by-85-here-s-the-exact-setup.md", "text": "https://wpnews.pro/news/prompt-caching-cut-my-claude-api-bill-by-85-here-s-the-exact-setup.txt", "jsonld": "https://wpnews.pro/news/prompt-caching-cut-my-claude-api-bill-by-85-here-s-the-exact-setup.jsonld"}}