# Why your Claude API bill is 3x what it should be (and how to fix it)

> Source: <https://dev.to/midrelay/why-your-claude-api-bill-is-3x-what-it-should-be-and-how-to-fix-it-4lfo>
> Published: 2026-05-22 19:24:53+00:00

TL;DR: I audited a friend's startup that was spending $4,200/month on Claude API. Only $1,300 produced business value. The other $2,900 was waste — split across three patterns that hit most teams using LLM APIs in production. Here's how to find them in your own bill, and the code to fix each one.
A friend running a B2B doc-summarization product asked me to look at their Claude bill. Q1 was $4,200/month and climbing. We pulled their request logs into a spreadsheet, classified each call by purpose, then estimated what each should have cost. The answer was uncomfortable:
Three problems, $2,900/month of waste. Each one is unsexy and easy to miss, but together they were 70% of the bill.
This is the silent killer. Claude 4.x supports prompt caching: send a 5-minute or 1-hour TTL cache_control
block, and Anthropic charges you ~10x less for cached tokens on subsequent requests. Pricing today (per million tokens for Sonnet 4.6):
The catch: you have to opt in per-request, and most code doesn't. Before/after:
# Before — every call pays for the full system prompt
client.messages.create(
model="claude-sonnet-4-6",
system="You are an expert at...[2000 words of rules + examples]",
messages=[{"role": "user", "content": user_input}],
max_tokens=1024,
)
# After — system prompt cached for 5 minutes
client.messages.create(
model="claude-sonnet-4-6",
system=[
{
"type": "text",
"text": "You are an expert at...[2000 words of rules + examples]",
"cache_control": {"type": "ephemeral"},
},
],
messages=[{"role": "user", "content": user_input}],
max_tokens=1024,
)
One-line change. 90% discount on every subsequent call within the cache TTL.
For my friend: 20K tokens of system prompt × 8 requests/min × 50% cache hit ratio = ~$80/day saved. That alone was $2,400/month — most of the $1,810 leak.
OpenAI SDK calling Claude (via compatible proxies) has equivalent semantics:
client.chat.completions.create(
model="claude-sonnet-4-6",
messages=[...],
prompt_cache_key="user-session-12345", # Stable across calls = cache hit
)
Action: open your last week of API logs. If you have any repeated system
content across requests, you're leaking.
The mental shortcut "Claude = quality, just always use Opus" is expensive. Opus is 4x the cost of Sonnet for inputs, 5x for outputs. For a lot of work, Sonnet or even Haiku is indistinguishable.
I ran 5 tasks across the lineup (1000 samples, scored by judge model + human spot-check):
The pattern: Opus wins clearly only on complex multi-step reasoning. For most tasks Sonnet is within margin of error at 1/4 the cost. Haiku trades 2-5% accuracy for 1/13 the cost — fine when you have downstream validation.
My friend was running every doc through Opus by default. Switching to Sonnet for analysis + Haiku for tagging dropped that bucket from $680 to $140. No quality complaints.
Action: pick the 3 most expensive endpoints in your bill, A/B-test them on the next cheapest model for a week, score outputs blind.
If your work doesn't need a response in the next 30 seconds, the Anthropic Message Batches API charges half price with a 24-hour SLA. Same models, same quality, half the bill.
Good fits:
Bad fits:
batch = client.messages.batches.create(
requests=[
{
"custom_id": f"doc-{doc.id}",
"params": {
"model": "claude-sonnet-4-6",
"max_tokens": 1024,
"messages": [{"role": "user", "content": doc.text}],
},
}
for doc in docs
],
)
# Poll until done (or just check tomorrow)
while batch.processing_status != "ended":
time.sleep(60)
batch = client.messages.batches.retrieve(batch.id)
My friend had a nightly job re-summarizing all docs from the previous 24h. Moving it from asyncio.gather
to batches cut that bucket from $410 to $205, no user-visible impact.
Action: any cron job, weekly report, or async task hitting your LLM API — most can be batched.
After three changes (cache hint, model rebalance, batch the async work), my friend's monthly bill went $4,200 → $1,540. Same product, same quality, no rewrites — just turning on features the API already supports.
If your bill feels high, do the same audit:
system
prompts. <10 unique but >10,000 calls = no cachingI built a little proxy called MidRelay that handles the first two automatically: it injects a per-key cache hint into every request (even SDK code that doesn't know about cache_control
gets the discount), and it exposes both OpenAI and Anthropic surfaces from the same key so you can route model-by-model without rewriting.
It also happens to be 60-80% cheaper than calling Anthropic / OpenAI directly. (Same models, same wire protocol — your existing SDK just changes the base_url
.)
$5 of free credit to test it: drop a comment, I'll DM a code. First 100 readers, no signup gate.
But honestly — the techniques above work on any provider. Even if you never touch MidRelay, just turning on cache_control
and downshifting one over-spec'd Opus call will cut your bill more than any "AI cost optimization" SaaS will.
Check your logs tonight.
