# We measured what 10 tools × 1,000 calls/day actually costs in AI agents. Here's the data.

Posted to r/ClaudeAI · r/LocalLLaMA · Hacker News

When you build an AI agent, you give it tools. Search the web. Read a file. Call an API. Query a database.

Each tool needs a description: a JSON block that tells the model what the tool does and what parameters it takes. Here's what a single tool looks like:

    {
      "name": "search_web",
      "description": "Search the web for recent information",
      "parameters": {
        "type": "object",
        "properties": {
          "query": {
            "type": "string",
            "description": "The search query"
          },
          "max_results": {
            "type": "integer",
            "description": "Maximum number of results to return"
          }
        },
        "required": ["query"]
      }
    }

That single definition is about 80 tokens. If your agent has 10 tools, you're sending ~800–1,200 tokens of tool definitions on every API call. Not once. Every call.

## The actual numbers

We ran 1,000 simulated agent sessions across four agent sizes. Pricing is the Claude Sonnet 4 input rate of $3 / 1M tokens. The table uses the upper end of the range above (~120 tokens per tool); monthly costs assume 30 days and yearly costs assume 365.

| Tools | Tokens / call | Tokens / day at 1k calls/day | Cost / month | Cost / year |
|---|---|---|---|---|
| 5 | ~600 | 600k tok/day | $54 | $657 |
| 10 | ~1,200 | 1.2M tok/day | $108 | $1,314 |
| 20 | ~2,400 | 2.4M tok/day | $216 | $2,628 |
| 50 | ~6,000 | 6M tok/day | $540 | $6,570 |

At 10k calls/day (not unusual for a production agent), multiply those numbers by 10.
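The cost table can be reproduced in a few lines. This is a minimal sketch, assuming the $3 / 1M input price quoted above, a 30-day month, and a 365-day year. The day counts are assumptions inferred from the table, and `tool_schema_cost` is a hypothetical helper name, not part of Promptolian.

```python
# Input price quoted in the article: $3 per 1M input tokens.
PRICE_PER_MTOK = 3.00
DAYS_PER_MONTH = 30   # assumption: reproduces the table's monthly column
DAYS_PER_YEAR = 365   # assumption: reproduces the table's yearly column


def tool_schema_cost(tokens_per_call: int, calls_per_day: int) -> dict:
    """Daily token volume and monthly/yearly cost of resending tool schemas."""
    tokens_per_day = tokens_per_call * calls_per_day
    cost_per_day = tokens_per_day / 1_000_000 * PRICE_PER_MTOK
    return {
        "tokens_per_day": tokens_per_day,
        "per_month": round(cost_per_day * DAYS_PER_MONTH, 2),
        "per_year": round(cost_per_day * DAYS_PER_YEAR, 2),
    }


if __name__ == "__main__":
    # Tokens per call for 5, 10, 20 and 50 tools, as listed in the table.
    for tools, tokens in [(5, 600), (10, 1_200), (20, 2_400), (50, 6_000)]:
        row = tool_schema_cost(tokens, calls_per_day=1_000)
        print(
            f"{tools:>2} tools: {row['tokens_per_day']:,} tok/day, "
            f"${row['per_month']:,.0f}/month, ${row['per_year']:,.0f}/year"
        )
```

Swapping in your own tokens-per-call figure and call volume is the quickest way to check whether the numbers here resemble your workload.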
## Why this doesn't go away

The obvious answer is: Anthropic has prompt caching. Use that.

Prompt caching helps, but:

- Cached input tokens are still billed, at 10% of the normal price. Not free.
- Cache TTL is 5 minutes by default. If your calls are more than 5 minutes apart, you pay full price.
- The cache invalidates on any change. If you add a tool, update a description, or rotate an API key in a tool, it's full price again.

So even with caching, you're paying for tool tokens. And most agents don't have caching set up at all.

## What we built

Promptolian is a compression layer that sits between your code and any LLM API. You call it once at startup; everything else stays unchanged. It intercepts every API call, compresses what it can, and forwards the request. No proxy, no routing change, no new infrastructure.

It has three independent compression layers:

### Layer 1: Prompt compression

Replaces verbose patterns with compact equivalents before the text reaches the model. "You are an expert Python developer. Please write a function..." becomes "§EXP py developer. ACT write FN...". Runs locally in under 1 ms. ~20% savings on typical prompts.

### Layer 2: Context engine

As a conversation grows, old turns get expensive. Promptolian summarises older messages and keeps only the most relevant recent turns, using a layout that works with how LLMs weight context. Up to 52.9% savings on long sessions.

### Layer 3: Tool schema compiler

This is the one that surprised us. It works in two phases:

#### Call 1: compact DSL

Instead of the full JSON, the model receives a function-signature format:

    search_web(query: str, max_results: int = 10)     # Search the web for recent information
    read_file(path: str, encoding: str = utf-8)       # Read a local file
    call_api(url: str, method: GET|POST, body: str)   # HTTP request

Same information. About 40 tokens instead of 120. ~69% smaller.

#### Call 2 onward: cached by the proxy

From call 2, you omit the `tools` parameter entirely.
The proxy re-injects the stored schemas automatically, with Anthropic's `cache_control` flag set. Anthropic detects the cache hit and charges 10% of the normal tool token cost.

    # Call 1: send tools as normal
    response = client.messages.create(
        model="claude-sonnet-4-6",
        tools=[...],  # full JSON, stored by proxy
        messages=[...],
        extra_headers={"X-Session": "my-session"},
        max_tokens=1000,
    )

    # Call 2+: omit tools entirely
    response = client.messages.create(
        model="claude-sonnet-4-6",
        # no tools= needed: proxy re-injects with cache control
        messages=[...],
        extra_headers={"X-Session": "my-session"},
        max_tokens=1000,
    )

Response headers tell you exactly what was saved:

    X-Promptolian-Cache-Hit: true
    X-Promptolian-Tokens-Saved: 1080

~90% savings on tool tokens from call 2 onward. That's real: it comes from Anthropic's own prompt cache pricing, triggered automatically by the proxy.

All three layers are deterministic: no LLM calls, no data sent anywhere, sub-millisecond latency. The tool is open source and self-hostable.

## Benchmark results across 20 prompt types

We ran our prompt compression layer against 20 real-world prompts (system prompts, user instructions, domain-specific text). CR is the compression ratio, i.e. the share of the prompt removed:

| Tier | Median CR | Mean CR | Range |
|---|---|---|---|
| Standard | 20.2% | 23.6% | 10–50% |
| Pro | 21.9% | 24.3% | 10–50% |
| Developer | 21.9% | 24.3% | 10–50% |

Verbose prompts (filler words, hedging language) compress 30–36%. Technical system prompts compress less (10–15%) because they're already dense. Short prompts can hit 40–50%, but the absolute saving is smaller.

100% fact preservation across all 41 runs: numbers, file paths, and named entities came through unchanged every time.

## Combined savings: a real example

Agent setup: 10 tools, 2,000 calls/day, average 800-token system prompt, 5-turn sessions.
Without Promptolian:

- Tool schemas: 1,200 tok × 2,000 = 2.4M tok/day
- System prompt: 800 tok × 2,000 = 1.6M tok/day

Total: 4M tok/day = ~$360/month

With Promptolian (session average):

- Tool schemas: ~84 tok × 2,000 = 168k tok/day (93% saved)
- System prompt: ~620 tok × 2,000 = 1.24M tok/day (22% saved)

Total: 1.41M tok/day = ~$127/month

Monthly saving: ~$233. Annual: ~$2,800. On a $19/month tool.

## How to try it

    # Install
    pip install promptolian

Then, in Python:

    # One line to compress every Anthropic call
    from promptolian import patch_anthropic
    patch_anthropic()

    # Your existing code, unchanged
    import anthropic
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-6",
        system="You are an expert Python developer...",  # compressed automatically
        messages=[...],
        max_tokens=1000,
    )

    # Check savings
    from promptolian import get_stats
    print(get_stats().summary())
    # → 47 calls · 18,432 tok saved · 22.1% CR

For Claude Code users:

    promptolian mcp install   # adds to ~/.claude/settings.json
    # restart Claude Code and you're done

Tool schema compression via the API:

    curl -X POST https://api.promptolian.com/compress-tools \
      -H "Content-Type: application/json" \
      -d '{"tools": [...], "session_id": "my-session-1"}'

## Known limitations & what's next

Being honest about where the edges are, and where we're heading.

The proxy savings depend on session continuity. Anthropic's prompt cache expires after 5 minutes. If your agent sessions have long gaps between calls, the cache goes cold and you pay full price on the next one. For always-on production agents this is fine. For bursty or human-in-the-loop workflows, realistic savings are 30–50% on average across a session rather than 90% on every call after the first.

Prompt compression gets weaker as text gets denser. The rule-based compressor spots verbose patterns: filler phrases, hedging language, redundant qualifiers. Already-tight technical prompts compress 10–15% rather than 30%+. A future semantic tier using a cheap model for rewriting is on the roadmap.
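To make the density effect concrete, here is a toy rule-based compressor. It is a sketch under stated assumptions, not Promptolian's implementation: the rules in `FILLER_RULES` are hypothetical examples, and word counts stand in crudely for tokens. It only illustrates why a verbose prompt shrinks noticeably while an already-tight one barely changes.

```python
import re

# Hypothetical filler rules for illustration only; NOT Promptolian's rule set.
FILLER_RULES = [
    (re.compile(r"\b(please|kindly)\s+", re.IGNORECASE), ""),
    (re.compile(r"\bI would like you to\s+", re.IGNORECASE), ""),
    (re.compile(r"\bin order to\b", re.IGNORECASE), "to"),
    (re.compile(r"\b(very|really|basically|just)\s+", re.IGNORECASE), ""),
]


def compress(text: str) -> str:
    """Apply each rule in order, then collapse any leftover double spaces."""
    for pattern, replacement in FILLER_RULES:
        text = pattern.sub(replacement, text)
    return re.sub(r"[ \t]{2,}", " ", text).strip()


def compression_ratio(original: str, compressed: str) -> float:
    """Share of whitespace-separated words removed (a crude proxy for tokens)."""
    before = len(original.split())
    after = len(compressed.split())
    return 1 - after / before if before else 0.0


verbose = (
    "I would like you to please read the file at /var/log/app.log and "
    "basically just summarise the 3 most recent errors in order to help me debug."
)
dense = "Read /var/log/app.log; summarise the 3 most recent errors."

if __name__ == "__main__":
    for label, prompt in [("verbose", verbose), ("dense", dense)]:
        out = compress(prompt)
        print(f"{label}: {compression_ratio(prompt, out):.0%} removed -> {out}")
```

Note that the file path and the number survive untouched in both prompts, because the rules only match filler phrases. A real rule set would need far more care around words like "just" that sometimes carry meaning.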
`pip install promptolian` gives you the local engine, not the full hosted stack. The package includes the compressor, SDK wrappers, MCP server, and proxy. The production API (Stripe billing, multi-tenant auth, usage dashboards) is separate. The local proxy is the right starting point for most use cases.

We're shipping fixes and new features regularly. The full list of open issues and recent changes is on GitHub: [github.com/Maurizio-L/promptolian-public](https://github.com/Maurizio-L/promptolian-public)

## Open questions we'd love feedback on

- What's your typical tool count per agent?
- Do you use prompt caching today? Does it actually hit in practice?
- Would you pay for usage-based pricing (per token saved) vs flat monthly?

The full benchmark methodology and raw data are at [promptolian.com/benchmarks](https://promptolian.com/benchmarks).

Source: [github.com/Maurizio-L/promptolian-public](https://github.com/Maurizio-L/promptolian-public)

Built by Maurizio Lospi (maurizio.lospi@gmail.com). Feedback welcome, especially if your numbers look different from mine.