Prompt caching cut my Claude API bill by 85%. Here's the exact setup.

An engineer achieved an 85% reduction in Claude API costs by enabling prompt caching on a long system prompt. The caching feature, which stores repeated prompt prefixes for five minutes, dropped daily costs from $47 to $6.80 for an agent processing 4,000 requests per day. The setup requires adding a cache_control block to the system prompt, with high-ROI scenarios including large system prompts, tool definitions, few-shot examples, and document analysis.

Last month I ran a side-by-side test on an AI agent that processes about 4,000 requests a day. The agent has a long system prompt (roughly 2,800 tokens of rules, tool definitions, and examples) that gets sent with every single call.

Before prompt caching: $47/day. After enabling caching on that system prompt block: $6.80/day.

That's not a rounding error. That's an 85% cost reduction with a single configuration change and zero changes to the agent's behavior. Here's exactly how prompt caching works and how to set it up without the gotchas.

Anthropic's prompt caching works at the prefix level. When you send a request, the API checks whether a prefix of your prompt exactly matches a previously cached prefix. If it does, those cached tokens are served from a KV store instead of being re-processed through the full model, and you pay a dramatically lower per-token rate for them.

The pricing structure as of mid-2026 on Claude 3.5 Sonnet: regular input tokens cost $3.00/M, cache writes cost $3.75/M (a 25% premium), and cache reads cost $0.30/M.

The cache lasts 5 minutes between requests, with the TTL resetting on each hit. For any agent that gets called more often than every 5 minutes (which is most production agents), this is almost always a win.

The key is the `cache_control` block. You add it as a "breakpoint" at the end of any message block you want cached. The API caches everything up to and including that breakpoint.

```python
import anthropic

client = anthropic.Anthropic()

# Your long system prompt - tool definitions, rules, examples, etc.
SYSTEM_PROMPT = """
You are a support agent for Acme Corp...
[2,800 tokens of rules, tool definitions, persona, examples]
"""

user_message = "How do I reset my password?"  # placeholder input

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # <-- this is the entire setup
        }
    ],
    messages=[{"role": "user", "content": user_message}],
)

# Check what actually happened
usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
```

The `cache_creation_input_tokens` field tells you a cache was written (you pay the 25% premium). On subsequent calls within 5 minutes, `cache_read_input_tokens` will be populated instead, and you pay $0.30/M instead of $3.00/M.

High-ROI scenarios:

Large system prompts repeated on every call. If your system prompt is 1,000+ tokens and you're calling the API more than once every 5 minutes, caching it is almost always net positive.

Tool definitions. Tool schemas count as input tokens, and they can be surprisingly large. A set of 10 reasonably described tools might run 800-1,200 tokens. Cache the tools block.

Few-shot examples in the system prompt. This is the big one. People add 5-10 worked examples to their system prompts to improve output quality. Those examples might be 2,000-4,000 tokens. Cache them.

Document analysis at scale. If you're analyzing the same document with many different questions (think: extracting 20 different fields from a contract), cache the document text as a user message and issue all 20 queries against the same cache.

Low or negative ROI scenarios: anything called less often than once per 5-minute window, where you pay the write premium and rarely read the cache back, and prompts whose prefix changes on every request.

You can have up to 4 cache breakpoints per request.
This lets you cache different parts of the prompt independently:

```python
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": BASE_RULES,  # Always the same
            "cache_control": {"type": "ephemeral"},
        },
        {
            "type": "text",
            "text": TOOL_DEFINITIONS,  # Changes rarely
            "cache_control": {"type": "ephemeral"},
        },
        {
            "type": "text",
            "text": dynamic_context,  # Changes per request, NOT cached
        },
    ],
    messages=[...],
)
```

The prefix caching rule is strict: the API caches everything up to the last marked breakpoint in sequence. If your dynamic context goes between two cached blocks, the second cache hit won't work, because the prefix has to be identical. Always put dynamic content at the end.

Whitespace and character-level identity matter. The cache key is the exact token sequence of the prefix. If your system prompt is generated dynamically (say, you interpolate a user's name or account tier into it), each variation produces a different token sequence and you get zero cache hits even though 95% of the content is identical.

The fix: move all dynamic content to the end, after your last cache breakpoint. Put only truly static content (rules, tool definitions, examples) in the cached block.

```python
company_name = "Acme Corp"  # example value

# Bad: dynamic content inside the cached block breaks caching.
# Interpolating {company_name} makes every request's prefix unique.
system = f"""
You are an agent for {company_name}.
[2,800 tokens of static rules]
"""

# Good: static block cached, dynamic content appended outside the cache
STATIC_BLOCK = """
[2,800 tokens of static rules]
"""

system = [
    {"type": "text", "text": STATIC_BLOCK, "cache_control": {"type": "ephemeral"}},
    {"type": "text", "text": f"Current context: working for {company_name}."},
]
```

Before enabling caching, run this math:

```
Let:
  T = tokens in your cached block
  R = requests per hour
  W = cache write cost   = T × $3.75/M
  S = savings per read   = T × ($3.00 - $0.30)/M = T × $2.70/M

Break-even reads = W / S = $3.75 / $2.70 ≈ 1.4 reads per cache window
```

If you get more than 1.4 requests in a 5-minute window (that's about 17 requests/hour), caching is net positive. This is a conservative bar, since it charges the full write cost against caching rather than only the $0.75/M premium over an uncached call. At 4,000 requests/day, you're averaging roughly 14 requests per 5-minute window, about ten times the break-even rate.

Always instrument your cache usage. The response usage object tells you exactly what happened:

```python
usage = response.usage
total_input = usage.input_tokens
# Fall back to 0 if the field is missing or None
cache_writes = getattr(usage, "cache_creation_input_tokens", 0) or 0
cache_reads = getattr(usage, "cache_read_input_tokens", 0) or 0

# A healthy caching ratio: most calls should be reads, not writes
print(f"Cache write: {cache_writes} tokens (paid at $3.75/M)")
print(f"Cache read: {cache_reads} tokens (paid at $0.30/M)")
print(f"Regular: {total_input} tokens (paid at $3.00/M)")
```

If you're seeing mostly `cache_creation_input_tokens` and few `cache_read_input_tokens`, your request cadence is slower than 5 minutes or your prompt isn't actually static. Fix the content, not the caching setup.

Prompt caching is one of those rare API features where the implementation cost is 30 minutes and the payoff is immediate and ongoing. It doesn't change what your agent does. It just changes what you pay for the same work.

If your agent makes more than ~20 calls/hour with a system prompt over ~800 tokens, you should be caching. (Anthropic also enforces a model-dependent minimum cacheable prompt length, so confirm that your prefix clears it.) The `cache_control` block is a one-liner. The usage fields tell you instantly whether it's working.
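To put numbers on the break-even math above, here is a minimal sketch of a prefix-only cost model. The function names (`uncached_cost`, `cached_cost`, `break_even_reads`) are hypothetical helpers, not part of any SDK, and the prices are simply the per-million rates quoted in this post. It also assumes a worst case of one cache write per 5-minute window. `break_even_reads()` reproduces the W / S ratio, while `break_even_reads(False)` counts only the write premium over an uncached call.

```python
"""Back-of-the-envelope prompt-caching cost model (prefix tokens only)."""

# USD per million tokens, taken from the rates quoted in this post.
BASE_PER_M = 3.00   # regular input
WRITE_PER_M = 3.75  # cache write (25% premium)
READ_PER_M = 0.30   # cache read


def uncached_cost(prefix_tokens: int, requests: int) -> float:
    """Cost of resending the prefix at the regular rate on every request."""
    return prefix_tokens * requests * BASE_PER_M / 1_000_000


def cached_cost(prefix_tokens: int, requests: int, writes: int) -> float:
    """Cost when `writes` requests create the cache and the rest read it."""
    if not 0 <= writes <= requests:
        raise ValueError("writes must be between 0 and requests")
    reads = requests - writes
    return prefix_tokens * (writes * WRITE_PER_M + reads * READ_PER_M) / 1_000_000


def break_even_reads(full_write: bool = True) -> float:
    """Reads per write needed for caching to pay off.

    full_write=True is the W / S ratio from the post; full_write=False
    charges only the write premium over an uncached call.
    """
    cost = WRITE_PER_M if full_write else WRITE_PER_M - BASE_PER_M
    return cost / (BASE_PER_M - READ_PER_M)


if __name__ == "__main__":
    # Assumption: worst case of one cache write per 5-minute window, all day.
    writes_per_day = 24 * 60 // 5  # 288 windows
    print(f"uncached: ${uncached_cost(2_800, 4_000):.2f}/day")
    print(f"cached:   ${cached_cost(2_800, 4_000, writes_per_day):.2f}/day")
    print(f"break-even reads (W / S):        {break_even_reads():.2f}")
    print(f"break-even reads (premium only): {break_even_reads(False):.2f}")
```

For a 2,800-token prefix at 4,000 requests/day, this gives about $33.60/day uncached versus about $6.14/day cached. These are prefix-only figures, so they will not reproduce the $47 and $6.80 totals, which presumably also include per-request input and output tokens.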
If you're building reliable AI agents at production scale, the free Reliable Agent Field Guide covers reliability patterns, cost controls, and testing strategies: https://penloomstudio.com/field-guide.html