Source: dev.to

Prompt caching cut my Claude API bill by 85%. Here's the exact setup.

An engineer reports an 85% reduction in Claude API costs after enabling prompt caching on a long system prompt. The caching feature, which stores repeated prompt prefixes for five minutes, dropped daily costs from $47 to $6.80 for an agent processing about 4,000 requests per day. The setup requires adding a cache_control block to the system prompt; high-ROI scenarios include large system prompts, tool definitions, few-shot examples, and document analysis.

5 min read · Published Jul 1, 2026

Last month I ran a side-by-side test on an AI agent that processes about 4,000 requests a day. The agent has a long system prompt (roughly 2,800 tokens of rules, tool definitions, and examples) that gets sent with every single call. Before prompt caching: $47/day. After enabling caching on that system prompt block: $6.80/day.

That's not a rounding error. That's an 85% cost reduction with a single configuration change and zero changes to the agent's behavior.

Here's exactly how prompt caching works and how to set it up without the gotchas.

Anthropic's prompt caching works at the prefix level. When you send a request, the API checks whether a prefix of your messages exactly matches a previously-cached prefix. If it does, those cached tokens are served from a KV store instead of re-processed through the full model — and you pay a dramatically lower per-token rate for them.

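The pricing structure (as of mid-2026 on Claude 3.5 Sonnet), per million input tokens:

Regular input: $3.00/M

Cache write: $3.75/M (a 25% premium over regular input)

Cache read: $0.30/M (90% below regular input)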

The cache lasts 5 minutes between requests (with the TTL resetting on each hit). For any agent that gets called more often than every 5 minutes — which is most production agents — this is almost always a win.
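To make the TTL behavior concrete, here's a toy model of a prefix cache whose 5-minute window restarts on every hit. It's a sketch for intuition only: the class name and the string hashing are my own assumptions, and the real service matches on token sequences rather than hashed strings.

```python
import hashlib


class ToyPrefixCache:
    """Toy model of a prefix cache with a sliding TTL (illustration only)."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self.expires_at: dict[str, float] = {}  # prefix hash -> expiry time

    def lookup(self, prefix: str, now: float) -> str:
        """Return 'read' on a live hit, 'write' on a miss or an expired entry."""
        key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        hit = self.expires_at.get(key, 0.0) > now
        # Both a hit and a fresh write (re)start the 5-minute window.
        self.expires_at[key] = now + self.ttl
        return "read" if hit else "write"


cache = ToyPrefixCache()
# Requests at t = 0s, 120s, 400s, 800s with the same prefix.
events = [cache.lookup("static system prompt", t) for t in (0, 120, 400, 800)]
print(events)  # ['write', 'read', 'read', 'write']
```

The request at 400 seconds is still a hit only because the hit at 120 seconds pushed the expiry out to 420 seconds. The 400-second gap before the last request is what forces a new write.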

The key is the cache_control block. You add it as a "breakpoint" at the end of any message block you want cached. The API caches everything up to and including that breakpoint.

import anthropic

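client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
user_message = "Where is my order?"  # placeholder; substitute the real user input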

SYSTEM_PROMPT = """
You are a support agent for Acme Corp...
[2,800 tokens of rules, tool definitions, persona, examples]
"""

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}  # <-- this is the entire setup
        }
    ],
    messages=[
        {"role": "user", "content": user_message}
    ]
)

usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")

The cache_creation_input_tokens field tells you a cache was written (you pay the 25% premium). On subsequent calls within 5 minutes, cache_read_input_tokens will be populated instead, and you pay $0.30/M instead of $3.00/M.
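Those three usage fields are enough to price the input side of any request. Here's a minimal sketch using the rates quoted in this article. The function name and the 50-token user message are assumptions for illustration.

```python
# USD per million tokens, as quoted in this article.
RATE_INPUT = 3.00
RATE_CACHE_WRITE = 3.75
RATE_CACHE_READ = 0.30


def input_cost_usd(input_tokens: int, cache_write_tokens: int, cache_read_tokens: int) -> float:
    """Input-side cost of one request, given the three usage counters."""
    total = (
        input_tokens * RATE_INPUT
        + cache_write_tokens * RATE_CACHE_WRITE
        + cache_read_tokens * RATE_CACHE_READ
    )
    return total / 1_000_000


# Hypothetical request: 2,800-token cached system prompt plus a 50-token user message.
first_call = input_cost_usd(50, 2800, 0)   # cache is written
later_call = input_cost_usd(50, 0, 2800)   # cache is read
print(f"first call: ${first_call:.5f}")    # first call: $0.01065
print(f"later call: ${later_call:.5f}")    # later call: $0.00099
```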

High-ROI scenarios:

Large system prompts repeated on every call. If your system prompt is 1,000+ tokens and you're calling the API more than once every 5 minutes, caching it is almost always net positive.

Tool definitions. Tool schemas count as input tokens, and they can be surprisingly large. A set of 10 reasonably-described tools might run 800-1,200 tokens. Cache the tools block.

Few-shot examples in the system prompt. This is the big one. People add 5-10 worked examples to their system prompts to improve output quality. Those examples might be 2,000-4,000 tokens. Cache them.

Document analysis at scale. If you're analyzing the same document with many different questions (think: extracting 20 different fields from a contract), cache the document text as a user message and issue all 20 queries against the same cache.
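For the tool and document cases, the breakpoint goes on the last tool definition and on the document block respectively. The helpers below only build the request payloads and are a sketch under assumptions: the function names and the sample tool are hypothetical, and I'm relying on my reading of the API that a breakpoint on the final tool covers the whole tools array.

```python
def cached_tools(tools: list[dict]) -> list[dict]:
    """Return a copy of the tool list with a cache breakpoint on the last tool."""
    if not tools:
        return []
    marked = [dict(tool) for tool in tools]  # shallow copies; the input list is not mutated
    marked[-1]["cache_control"] = {"type": "ephemeral"}
    return marked


def document_question(document_text: str, question: str) -> list[dict]:
    """Build messages where the document is a cached prefix and only the question varies."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": document_text, "cache_control": {"type": "ephemeral"}},
                {"type": "text", "text": question},
            ],
        }
    ]


# Hypothetical tool and document, for illustration only.
TOOLS = [
    {
        "name": "lookup_order",
        "description": "Fetch an order by its ID.",
        "input_schema": {"type": "object", "properties": {"order_id": {"type": "string"}}},
    }
]
contract = "[full contract text]"
requests = [document_question(contract, q) for q in ("Who are the parties?", "What is the term?")]
print(cached_tools(TOOLS)[-1]["cache_control"])  # {'type': 'ephemeral'}
```

Passing these as tools=cached_tools(TOOLS) and messages=document_question(contract, q) to client.messages.create should let every question after the first read the document from cache, as long as the calls land inside the 5-minute window.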

Low or negative ROI scenarios:

You can have up to 4 cache breakpoints per request. This lets you cache different parts of the prompt independently:

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": BASE_RULES,           # Always the same
            "cache_control": {"type": "ephemeral"}
        },
        {
            "type": "text",
            "text": TOOL_DEFINITIONS,     # Changes rarely
            "cache_control": {"type": "ephemeral"}
        },
        {
            "type": "text",
            "text": dynamic_context       # Changes per request — NOT cached
        }
    ],
    messages=[...]
)

The prefix caching rule is strict: the API caches everything up to the last marked breakpoint in sequence. If your dynamic context goes between two cached blocks, the second cache hit won't work — the prefix has to be identical. Always put dynamic content at the end.
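One way to make the ordering rule hard to break is to build the system list through a helper that always appends dynamic text last. This is a sketch, not part of the SDK: build_system and MAX_BREAKPOINTS are names I made up, and the limit of 4 is the figure stated above.

```python
from typing import Optional

MAX_BREAKPOINTS = 4  # per-request limit stated above


def build_system(static_blocks: list[str], dynamic_text: Optional[str] = None) -> list[dict]:
    """Cached static blocks first, one uncached dynamic block last."""
    if len(static_blocks) > MAX_BREAKPOINTS:
        raise ValueError(f"at most {MAX_BREAKPOINTS} cache breakpoints per request")
    blocks = [
        {"type": "text", "text": text, "cache_control": {"type": "ephemeral"}}
        for text in static_blocks
    ]
    if dynamic_text:
        # No cache_control here: this block changes per request and must stay at the end.
        blocks.append({"type": "text", "text": dynamic_text})
    return blocks


system = build_system(["[base rules]", "[tool definitions]"], "Current user tier: pro")
print([("cache_control" in block) for block in system])  # [True, True, False]
```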

Whitespace and character-level identity matter.

The cache key is the exact token sequence of the prefix. If your system prompt is generated dynamically — say, you interpolate a user's name or account tier into it — each variation produces a different token sequence and you get zero cache hits even though 95% of the content is identical.

The fix: move all dynamic content to the end, after your last cache breakpoint. Put only truly static content (rules, tool definitions, examples) in the cached block.

# Bad: interpolating a per-request value makes every prefix unique
system = f"""
You are an agent for {company_name}.
[2,800 tokens of static rules]
"""

# Good: static text in the cached block, dynamic text after the breakpoint
STATIC_BLOCK = """
[2,800 tokens of static rules]
"""
system = [
    {"type": "text", "text": STATIC_BLOCK, "cache_control": {"type": "ephemeral"}},
    {"type": "text", "text": f"Current context: working for {company_name}."}
]
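A cheap guard against accidental prefix drift is to fingerprint the static block and log it with each request. The sketch below hashes characters, not tokens, so it's only a proxy for the real cache key. The function name and sample strings are assumptions.

```python
import hashlib


def prefix_fingerprint(static_text: str) -> str:
    """Short, stable fingerprint of the text that is meant to be cached."""
    return hashlib.sha256(static_text.encode("utf-8")).hexdigest()[:12]


STATIC_BLOCK = "Rules: be concise.\n"

baseline = prefix_fingerprint(STATIC_BLOCK)
same_again = prefix_fingerprint(STATIC_BLOCK)
trailing_space = prefix_fingerprint(STATIC_BLOCK + " ")
interpolated = prefix_fingerprint("You are an agent for Acme.\n" + STATIC_BLOCK)

print(baseline == same_again)      # True
print(baseline == trailing_space)  # False
print(baseline == interpolated)    # False
```

If the fingerprint changes between requests that should share a cache, the prompt isn't as static as it looks.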

Before enabling caching, run this math:

Let:
  T = tokens in your cached block
  R = requests per hour
  W = cache write cost = T * $3.75/M
  S = savings per read = T * ($3.00 - $0.30) / M = T * $2.70/M

Break-even reads = W / S = $3.75 / $2.70 ≈ 1.4 reads per cache window

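This is a conservative bound, because it charges the whole write against caching even though the uncached request would have cost $3.00/M anyway. Counting only the $0.75/M write premium, break-even is $0.75 / $2.70 ≈ 0.28 reads, so a single hit inside the window already pays for the write.

Even on the conservative figure, more than 1.4 reads in a 5-minute window (about 17 requests/hour) makes caching net positive. At 4,000 requests/day, that's roughly 14 requests per 5-minute window on average, so with steady traffic nearly every call is a cache read.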
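The same arithmetic in code, in case you want to plug in your own rates or traffic. It computes break-even both ways: charging the full write cost, and charging only the premium over regular input. The function names are mine, and the traffic figure assumes requests are spread evenly across the day.

```python
# USD per million tokens, as quoted in this article.
RATE_INPUT = 3.00
RATE_CACHE_WRITE = 3.75
RATE_CACHE_READ = 0.30


def break_even_reads(conservative: bool = False) -> float:
    """Cache reads needed to pay back one cache write."""
    saving_per_read = RATE_INPUT - RATE_CACHE_READ
    # Conservative: charge the whole write. Otherwise: only the premium over regular input.
    write_cost = RATE_CACHE_WRITE if conservative else RATE_CACHE_WRITE - RATE_INPUT
    return write_cost / saving_per_read


def requests_per_window(requests_per_day: float, window_minutes: float = 5.0) -> float:
    """Average requests per cache window, assuming evenly spread traffic."""
    return requests_per_day * window_minutes / (24 * 60)


print(round(break_even_reads(conservative=True), 2))  # 1.39
print(round(break_even_reads(), 2))                   # 0.28
print(round(requests_per_window(4000), 1))            # 13.9
```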

Always instrument your cache usage. The response usage object tells you exactly what happened:

usage = response.usage
uncached_input = usage.input_tokens  # tokens after the last breakpoint, billed at the regular rate
cache_writes = getattr(usage, 'cache_creation_input_tokens', 0) or 0  # field may be None
cache_reads = getattr(usage, 'cache_read_input_tokens', 0) or 0

print(f"Cache write: {cache_writes} tokens (paid at $3.75/M)")
print(f"Cache read:  {cache_reads} tokens (paid at $0.30/M)")
print(f"Regular:     {uncached_input} tokens (paid at $3.00/M)")

If you're seeing mostly cache_creation_input_tokens and few cache_read_input_tokens, your request cadence is slower than 5 minutes or your prompt isn't actually static. Fix the content, not the caching setup.
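A single response only tells you about one call, so it helps to track the ratio over a batch. Here's a rough sketch that takes any objects carrying the two cache counters. The SimpleNamespace records stand in for real response.usage objects, and the numbers are made up.

```python
from types import SimpleNamespace


def cache_hit_rate(usages) -> float:
    """Share of cached-prefix tokens that were read rather than written."""
    reads = sum((getattr(u, "cache_read_input_tokens", 0) or 0) for u in usages)
    writes = sum((getattr(u, "cache_creation_input_tokens", 0) or 0) for u in usages)
    cached = reads + writes
    return reads / cached if cached else 0.0


# Stand-ins for response.usage: one cache write followed by two reads.
sample = [
    SimpleNamespace(input_tokens=50, cache_creation_input_tokens=2800, cache_read_input_tokens=0),
    SimpleNamespace(input_tokens=50, cache_creation_input_tokens=0, cache_read_input_tokens=2800),
    SimpleNamespace(input_tokens=50, cache_creation_input_tokens=0, cache_read_input_tokens=2800),
]
print(f"{cache_hit_rate(sample):.0%}")  # 67%
```

A rate that stays low over hours of traffic points at one of the two causes above: gaps longer than 5 minutes, or a prefix that keeps changing.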

Prompt caching is one of those rare API features where the implementation cost is 30 minutes and the payoff is immediate and ongoing. It doesn't change what your agent does — it just changes what you pay for the same work.

If your agent makes more than ~20 calls/hour with a system prompt over ~800 tokens, you should be caching. The cache_control block is a one-liner. The usage fields tell you instantly whether it's working.

If you're building reliable AI agents at production scale, the free Reliable Agent Field Guide covers reliability patterns, cost controls, and testing strategies: penloomstudio.com/field-guide.html
