Why your Anthropic prompt caching probably isn't working (and the npm package I built to fix it)

Anthropic's prompt caching often fails silently due to misplaced breakpoints, prefix drift between calls, short TTL expiration (reduced from 1 hour to 5 minutes), and lack of measurement. The author, a solo developer, created an npm package called `prompt-cache-optimizer` that wraps the Anthropic SDK to track cache hit rates, log dollars saved, and warn when performance drops below a configurable threshold. The package provides a `cacheInfo` field on every response and aggregate stats to help developers verify caching is actually working.

I'm a solo developer with about five years of experience, mostly outside AI. The last few months I've been getting serious about it: reading docs, building small things with Claude, learning how it differs from the web APIs I'm used to.

While I was setting up Anthropic prompt caching for a project, I got stuck on a question I couldn't easily answer: how do I know it's actually working?

The docs explained the `cache_control` API and the 90% discount on cached tokens. But the only way to verify a call had hit the cache was to manually parse `cache_read_input_tokens` from the response `usage` on every request. Nobody seems to do this.

That gap turned into my first published npm package, `prompt-cache-optimizer`. This post is what I learned about the four ways prompt caching silently fails, and what the package does to catch them.

## What prompt caching is supposed to do

When you call `messages.create` with a long, stable prefix (system prompt, tool definitions, retrieved documents), Anthropic lets you mark a `cache_control` breakpoint. On the first call, that prefix gets written to the cache at ~1.25x the normal input rate. On any subsequent call within the cache TTL, the cached tokens are read back at 10% of the input rate. That's a 90% discount on whatever portion of your prompt is stable.

For a chatbot that re-sends a 10K-token system prompt every turn, this is roughly the difference between a $5K monthly bill and a $500 one.

The math is incredible. The execution is finicky.

## The four ways prompt caching silently fails

### Misplaced breakpoints

A `cache_control` marker caches everything in the request up to and including the block it is attached to. Put the breakpoint in the wrong place and you cache the wrong things. Worse, the call still succeeds: Anthropic happily processes it, you get a normal response, and you just paid full price.

### Prefix drift across calls

The cache only hits if the cacheable prefix is byte-identical to what was cached.
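
One rough way to see what "byte-identical" means in practice is to serialize the cacheable part of a request and compare it between calls. The sketch below is only an illustration and is not part of `prompt-cache-optimizer`: `prefixKey` and `stableTools` are hypothetical helpers, the tool definitions are made up, and `JSON.stringify` is a stand-in for the bytes the API actually compares.

```js
// Hypothetical helper: serialize the parts of a request that sit in the cacheable prefix.
// Anthropic builds the prefix in the order tools -> system -> messages; this sketch
// only looks at tools and system, and JSON.stringify is a proxy for the real bytes.
function prefixKey(request) {
  return JSON.stringify({ tools: request.tools ?? [], system: request.system ?? "" });
}

// Hypothetical helper: sort tools by name so array order cannot drift between calls.
function stableTools(tools) {
  return [...tools].sort((a, b) => a.name.localeCompare(b.name));
}

// Made-up tool definitions and system prompt, for illustration only.
const search = { name: "search", description: "Search the docs", input_schema: { type: "object" } };
const lookup = { name: "lookup", description: "Look up an order", input_schema: { type: "object" } };
const system = "You are a support assistant.";

const callA = { system, tools: [search, lookup] };
const callB = { system, tools: [lookup, search] }; // same tools, different order
const callC = { system: `${system} Now: ${new Date().toISOString()}`, tools: [search, lookup] }; // timestamp in the prefix

console.log(prefixKey(callA) === prefixKey(callB)); // false: reordering changes the prefix
console.log(prefixKey(callA) === prefixKey(callC)); // false: a timestamp changes the prefix

// Normalizing the tool order makes the two requests serialize identically.
const stableA = { system, tools: stableTools(callA.tools) };
const stableB = { system, tools: stableTools(callB.tools) };
console.log(prefixKey(stableA) === prefixKey(stableB)); // true
```

Logging a key like this (or a hash of it) next to each request is a cheap way to spot the call where the prefix changed.
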
If you reorder your tools array between calls, or shuffle retrieved documents, or insert a timestamp anywhere in your system prompt, the prefix is different, the cache misses, and you pay full price. Worse, you also pay the 1.25x write cost to cache the new, now-different prefix, which expires in 5 minutes if nothing else hits it. So you're paying more than you would without caching at all.

### TTL expiration

Anthropic recently dropped the default cache TTL from 1 hour to 5 minutes. A lot of setups that "had caching working" started silently regressing: calls that came in 6 minutes apart instead of 4 minutes started missing the cache. Nobody got an error. The bill just went up.

### No measurement

The only way to verify any of the above is to parse `cache_read_input_tokens` and `cache_creation_input_tokens` from every single response, compute a hit rate, and compare against an expected baseline. Nobody does this. Most teams "set up caching" once, watch the first response come back with high cached tokens, and assume it works forever.

## The wrapper I built

I shipped a small TypeScript package called `prompt-cache-optimizer` that fixes the measurement problem and warns about the other three. It's a drop-in wrapper for `@anthropic-ai/sdk`. Use it exactly like the SDK:

```js
import { CachedAnthropic, placeBreakpoints } from "prompt-cache-optimizer";

// Same constructor options as the SDK, plus a hit-rate warning threshold.
const client = new CachedAnthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
  warnIfHitRateBelow: 0.6,
});

// longSystemPrompt and conversation are your own prompt string and message array.
const { system, messages } = placeBreakpoints({
  system: longSystemPrompt,
  messages: conversation,
  strategy: "after-system",
});

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system,
  messages,
});

console.log(response.cacheInfo);
// {
//   hit: true,
//   cachedTokens: 8420,
//   uncachedTokens: 312,
//   cacheWriteTokens: 0,
//   dollarsSaved: 0.024,
//   dollarsSpent: 0.001
// }
```

Every response gets a `cacheInfo` field with the parsed numbers.
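
For comparison, the manual check described under "No measurement" (parse the usage fields, compute a hit rate, compare against a baseline) can be sketched in plain JavaScript. This is not the package's implementation. `summarizeUsage`, `createHitRateTracker`, and the `inputPricePerMTok` parameter are hypothetical names, the usage numbers are made up, "hit" is assumed to mean any cached tokens were read, and the multipliers are simply the 1.25x write and 10% read rates quoted above.

```js
// Assumed multipliers, taken from the rates quoted in this post:
// cache writes cost ~1.25x the base input rate, cache reads cost 0.1x.
const WRITE_MULTIPLIER = 1.25;
const READ_MULTIPLIER = 0.1;

// Turn one response's `usage` object into cache numbers (input tokens only).
// `inputPricePerMTok` is a caller-supplied assumption: USD per million input tokens.
function summarizeUsage(usage, inputPricePerMTok) {
  const cachedTokens = usage.cache_read_input_tokens ?? 0;
  const cacheWriteTokens = usage.cache_creation_input_tokens ?? 0;
  const uncachedTokens = usage.input_tokens ?? 0;
  const perToken = inputPricePerMTok / 1_000_000;

  const inputDollarsSpent =
    (uncachedTokens + cacheWriteTokens * WRITE_MULTIPLIER + cachedTokens * READ_MULTIPLIER) * perToken;
  const withoutCaching = (uncachedTokens + cacheWriteTokens + cachedTokens) * perToken;

  return {
    hit: cachedTokens > 0, // assumption: any cache read counts as a hit
    cachedTokens,
    uncachedTokens,
    cacheWriteTokens,
    inputDollarsSpent,
    inputDollarsSaved: withoutCaching - inputDollarsSpent, // negative on a write-only call
  };
}

// Hypothetical tracker: keeps a running hit rate and flags it against a baseline.
function createHitRateTracker(expectedHitRate) {
  let calls = 0;
  let hits = 0;
  return {
    record(summary) {
      calls += 1;
      if (summary.hit) hits += 1;
      const hitRate = hits / calls;
      return { calls, hits, hitRate, belowBaseline: hitRate < expectedHitRate };
    },
  };
}

// Example with made-up usage numbers: one cache write followed by one cache read.
const tracker = createHitRateTracker(0.6);
const first = summarizeUsage(
  { input_tokens: 312, cache_creation_input_tokens: 8420, cache_read_input_tokens: 0 },
  3
);
const second = summarizeUsage(
  { input_tokens: 312, cache_creation_input_tokens: 0, cache_read_input_tokens: 8420 },
  3
);
const afterFirst = tracker.record(first); // the first call only writes, so it counts as a miss
const afterSecond = tracker.record(second); // one hit out of two calls so far
console.log(first, second, afterFirst, afterSecond);
```

A write-only call shows a negative saving here, which is the same signal as paying the write premium without ever getting a read back.
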
The client also tracks aggregate stats:

```js
console.log(client.stats);
// {
//   totalCalls: 142,
//   cacheHits: 124,
//   hitRate: 0.873,
//   totalCachedTokens: 1_240_000,
//   dollarsSaved: 3.72,
//   dollarsSpent: 1.41,
// }
```

And when something looks wrong, it emits passive warnings instead of throwing:

- `cache-write-without-read` → your cacheable prefix changed call-over-call (the silent failure mode)
- `low-hit-rate` → rolling cache hit rate dropped below your threshold
- `no-cache-control-found` → you forgot to mark anything cacheable
- `unknown-model` → pricing unknown, dollar accounting skipped

Route them anywhere you like:

```js
new CachedAnthropic({
  apiKey,
  // Forward every warning event to your own logger.
  onWarning: (event) => logger.warn(event),
});
```

## Real numbers

The included example processes 5 questions reusing a large system prompt. Here's what the actual run showed:

Five calls. The first writes to the cache (cost: a tiny bit more than uncached). Calls 2-5 each hit the cache.

- 80% hit rate (4 hits, 1 miss; the first call always misses since that's when the cache gets written)
- $0.017 saved on $0.020 spent
- Same workload without caching would have cost $0.037, a 46% reduction

At higher call volumes the proportions get even better. A chatbot answering 1000 questions/day with a 10K-token system prompt can hit 70%+ cost reductions.

## How big the install is

The package is ~50KB unpacked, has zero runtime dependencies, and treats `@anthropic-ai/sdk` as a peer dependency. It does not phone home, store payloads, or require an account.

## Roadmap

v0.1 is intentionally focused on measurement and explicit helpers.
Coming up:

- v0.2: auto-placement of `cache_control` breakpoints based on observed prompt stability (no more manual `placeBreakpoints`)
- v0.3: safe message/tool reordering to maximize the stable prefix
- v0.4: OpenAI and Gemini prompt caching support
- v1.0: persistent stats adapter, middleware mode

## Try it

`npm install prompt-cache-optimizer @anthropic-ai/sdk`

- npm: https://www.npmjs.com/package/prompt-cache-optimizer
- GitHub: https://github.com/leonhail-nell/prompt-cache-optimizer

If you find it useful, a GitHub star is the single biggest signal that helps other developers find it. If it saves you real money on your Anthropic bill, I'd love to hear about it: file an issue or DM me.