Why your Claude API bill is 3x what it should be (and how to fix it)

A friend's startup was spending $4,200/month on the Claude API, but an audit revealed that $2,900 of that was waste caused by three common mistakes. The fixes included enabling prompt caching (saving $2,400/month), switching from the expensive Opus model to cheaper Sonnet and Haiku models for most tasks, and using the Batches API for non-urgent work. After implementing these changes, the monthly bill dropped to $1,540 with no loss in product quality.

TL;DR: I audited a friend's startup that was spending $4,200/month on Claude API. Only $1,300 produced business value. The other $2,900 was waste — split across three patterns that hit most teams using LLM APIs in production. Here's how to find them in your own bill, and the code to fix each one. A friend running a B2B doc-summarization product asked me to look at their Claude bill. Q1 was $4,200/month and climbing. We pulled their request logs into a spreadsheet, classified each call by purpose, then estimated what each should have cost. The answer was uncomfortable: Three problems, $2,900/month of waste. Each one is unsexy and easy to miss, but together they were 70% of the bill. This is the silent killer. Claude 4.x supports prompt caching: send a 5-minute or 1-hour TTL cache control block, and Anthropic charges you ~10x less for cached tokens on subsequent requests. Pricing today per million tokens for Sonnet 4.6 : The catch: you have to opt in per-request, and most code doesn't. Before/after: Before — every call pays for the full system prompt client.messages.create model="claude-sonnet-4-6", system="You are an expert at... 2000 words of rules + examples ", messages= {"role": "user", "content": user input} , max tokens=1024, After — system prompt cached for 5 minutes client.messages.create model="claude-sonnet-4-6", system= { "type": "text", "text": "You are an expert at... 2000 words of rules + examples ", "cache control": {"type": "ephemeral"}, }, , messages= {"role": "user", "content": user input} , max tokens=1024, One-line change. 90% discount on every subsequent call within the cache TTL. For my friend: 20K tokens of system prompt × 8 requests/min × 50% cache hit ratio = ~$80/day saved. That alone was $2,400/month — most of the $1,810 leak. OpenAI SDK calling Claude via compatible proxies has equivalent semantics: client.chat.completions.create model="claude-sonnet-4-6", messages= ... , prompt cache key="user-session-12345", Stable across calls = cache hit Action: open your last week of API logs. If you have any repeated system content across requests, you're leaking. The mental shortcut "Claude = quality, just always use Opus" is expensive. Opus is 4x the cost of Sonnet for inputs, 5x for outputs. For a lot of work, Sonnet or even Haiku is indistinguishable. I ran 5 tasks across the lineup 1000 samples, scored by judge model + human spot-check : The pattern: Opus wins clearly only on complex multi-step reasoning. For most tasks Sonnet is within margin of error at 1/4 the cost. Haiku trades 2-5% accuracy for 1/13 the cost — fine when you have downstream validation. My friend was running every doc through Opus by default. Switching to Sonnet for analysis + Haiku for tagging dropped that bucket from $680 to $140. No quality complaints. Action: pick the 3 most expensive endpoints in your bill, A/B-test them on the next cheapest model for a week, score outputs blind. If your work doesn't need a response in the next 30 seconds, the Anthropic Message Batches API charges half price with a 24-hour SLA. Same models, same quality, half the bill. Good fits: Bad fits: batch = client.messages.batches.create requests= { "custom id": f"doc-{doc.id}", "params": { "model": "claude-sonnet-4-6", "max tokens": 1024, "messages": {"role": "user", "content": doc.text} , }, } for doc in docs , Poll until done or just check tomorrow while batch.processing status = "ended": time.sleep 60 batch = client.messages.batches.retrieve batch.id My friend had a nightly job re-summarizing all docs from the previous 24h. Moving it from asyncio.gather to batches cut that bucket from $410 to $205, no user-visible impact. Action: any cron job, weekly report, or async task hitting your LLM API — most can be batched. After three changes cache hint, model rebalance, batch the async work , my friend's monthly bill went $4,200 → $1,540. Same product, same quality, no rewrites — just turning on features the API already supports. If your bill feels high, do the same audit: system prompts. <10 unique but 10,000 calls = no cachingI built a little proxy called MidRelay that handles the first two automatically: it injects a per-key cache hint into every request even SDK code that doesn't know about cache control gets the discount , and it exposes both OpenAI and Anthropic surfaces from the same key so you can route model-by-model without rewriting. It also happens to be 60-80% cheaper than calling Anthropic / OpenAI directly. Same models, same wire protocol — your existing SDK just changes the base url . $5 of free credit to test it: drop a comment, I'll DM a code. First 100 readers, no signup gate. But honestly — the techniques above work on any provider. Even if you never touch MidRelay, just turning on cache control and downshifting one over-spec'd Opus call will cut your bill more than any "AI cost optimization" SaaS will. Check your logs tonight.