{"slug": "how-i-cut-my-anthropic-api-bill-by-50-with-a-local-python-tool", "title": "How I Cut My Anthropic API Bill by 50% With a Local Python Tool", "summary": "A developer built a local Python CLI tool called `ai-cost-optimizer` that reduced their Anthropic API bill by 50% by implementing a semantic cache, prompt compressor, and model router. The tool intercepts API requests to store and retrieve cached responses via vector embeddings, compress bloated prompts by discarding irrelevant sentences, and automatically route queries to the cheapest suitable model (e.g., Haiku over Opus). After two weeks of development use, the tool saved an estimated $0.000300 from cache hits alone across 23 API calls totaling $0.16.", "body_md": "My Anthropic bill doubled two months in a row. Not because I was building something bigger, but because I kept asking the same questions, sending bloated prompts, and defaulting to Sonnet for tasks that Haiku could handle. I built a tool to fix it. Here's how it works.\n\nAI API costs compound fast for three reasons. First, if you're iterating on a project, you ask similar questions repeatedly (\"how does X work,\" \"what's wrong with this code\") and pay full price every time. Second, prompts accumulate context: documentation snippets, error traces, boilerplate instructions that add hundreds of tokens but contribute nothing to the answer. Third, most people just use whatever model they defaulted to first. Claude Opus at $15/1M input tokens for a query that Haiku could answer for $1/1M is a 15x cost multiplier on every single call.\n\nI built `ai-cost-optimizer`, a local CLI that sits between your terminal and the Anthropic API. It runs a semantic cache, a prompt compressor, and a model router on every request before anything hits the network. No cloud, no subscription, no data leaving your machine. Just a Python package you install once.\n\nThe cache stores every response as a vector embedding. 
On each new request, it computes the embedding for your prompt and checks cosine similarity against everything stored. If similarity is above the threshold (default: 0.80), it returns the cached answer: no API call, zero cost.\n\n``` bash\n$ aiproxy ask \"What is the capital of France?\"\n\n  Model         claude-haiku-4-5-20251001\n  Cached        No\n  Input tokens  18\n  Output tokens 9\n  Cost          $0.000063\n  Cache saved   $0.000000 (cached for next time)\n  Total saved   $0.000000\n\n$ aiproxy ask \"Capital of France?\"\n\n  Model         claude-haiku-4-5-20251001\n  Cached        Yes\n  Input tokens  0\n  Output tokens 0\n  Cost          $0.000000\n  Cache saved   $0.000063\n  Total saved   $0.000063\n```\n\n\"Capital of France?\" is semantically identical to the first query. Cache hit. The API never sees it.\n\nThe cache uses `sentence-transformers/all-MiniLM-L6-v2` for embeddings (80 MB, runs entirely in-process) and `usearch` for fast ANN lookup. Cold load is ~1.5 seconds on first run; subsequent queries are sub-100ms.\n\nLong prompts are expensive not because they're long, but because most of that length is filler. The compressor uses BM25 to score each sentence by relevance to the query, keeps the top-scoring sentences, and discards the rest.\n\nIn plain terms: it reads your prompt, figures out which sentences actually relate to what you're asking, and throws out the ones that don't. No summarization, no LLM: pure lexical scoring, deterministic, fast.\n\nReal example from a documentation query:\n\n```\nOriginal prompt:  370 tokens  ($0.001110 at Sonnet pricing)\nCompressed:        61 tokens  ($0.000183 at Sonnet pricing)\nTokens saved:     309 tokens  (83% reduction)\n```\n\nThe threshold for compression is configurable (`MAX_PROMPT_TOKENS=500` in `.env`). Prompts under that limit are sent as-is.\n\nThe router classifies each prompt and picks the cheapest model that can handle it. 
The logic is rule-based: token count, keyword signals for complexity, and a few heuristics for code vs. prose vs. reasoning tasks.\n\n| Query | Routed to | Input cost per 1M tokens |\n|---|---|---|\n| \"What is 2+2?\" | Haiku | $1.00 |\n| \"Explain binary search trees\" | Sonnet | $3.00 |\n| \"Review this system architecture\" | Opus | $15.00 |\n\nYou can override the router by passing `--model` explicitly. But if you don't, it defaults to the cheapest model that fits the task, and in practice that means Haiku handles the majority of short factual queries.\n\nAfter two weeks of normal development use (asking questions about code, debugging errors, generating short snippets):\n\n``` bash\n$ aiproxy stats\n\n  Cache entries       14\n  Cache hits           3\n  Estimated savings   $0.000300\n  Total API calls     23\n  Total cost          $0.160000\n```\n\nThe compression savings don't show in `stats` yet; that's a known gap I'm fixing next.\n\n```\ngit clone https://github.com/desaikat/ai-cost-optimizer.git\ncd ai-cost-optimizer\npython -m venv .venv\nsource .venv/bin/activate  # Windows: .venv\\Scripts\\activate\npip install -e .\n\ncp .env.example .env\n# Add your ANTHROPIC_API_KEY to .env\n\naiproxy ask \"What is the difference between a list and a tuple in Python?\"\n```\n\nThere's also a Streamlit dashboard (`aiproxy-dashboard`) that shows cumulative spend, cache hit rate, model distribution, and compression savings over time.\n\n`.exe`\n\nRepo: [github.com/desaikat/ai-cost-optimizer](https://github.com/desaikat/ai-cost-optimizer)\n\nTwo things I'm actively unsure about and would value input on:\n\nOpen issues and PRs welcome.", "url": "https://wpnews.pro/news/how-i-cut-my-anthropic-api-bill-by-50-with-a-local-python-tool", "canonical_source": "https://dev.to/saikat_de_4c1cb4fd6050ecb/how-i-cut-my-anthropic-api-bill-by-50-with-a-local-python-tool-pol", "published_at": "2026-05-25 22:13:57+00:00", "updated_at": "2026-05-25 23:03:35.361133+00:00", 
"lang": "en", "topics": ["ai-tools", "ai-products", "large-language-models", "artificial-intelligence", "generative-ai"], "entities": ["Anthropic", "Claude Opus", "Haiku", "Sonnet", "ai-cost-optimizer"], "alternates": {"html": "https://wpnews.pro/news/how-i-cut-my-anthropic-api-bill-by-50-with-a-local-python-tool", "markdown": "https://wpnews.pro/news/how-i-cut-my-anthropic-api-bill-by-50-with-a-local-python-tool.md", "text": "https://wpnews.pro/news/how-i-cut-my-anthropic-api-bill-by-50-with-a-local-python-tool.txt", "jsonld": "https://wpnews.pro/news/how-i-cut-my-anthropic-api-bill-by-50-with-a-local-python-tool.jsonld"}}