The 5 Cost Traps That Will Quietly Bleed Your AI API Gateway Dry (And How to Fix Them)

An engineer at LiteLLM reveals five cost traps in AI API gateways that can silently inflate bills, including retry logic that multiplies API calls across fallback chains and misconfigured fallbacks that escalate traffic to expensive models. The post provides configuration fixes such as capping total retries and structuring fallbacks by cost tier to prevent runaway costs.

In my last post, we talked about key cache invalidation, the silent production killer that turns your gateway into a 502 factory. Today I want to talk about something equally dangerous but far more insidious: cost traps.

These aren't bugs. They're not crashes. Your gateway runs fine. Your users are happy. Then finance sends you a Slack message: "Why did our OpenAI bill jump 4x last month?"

I've been running LiteLLM Proxy in production for multiple teams across three companies. Here are the five cost traps I've personally been burned by, each with the config that would have saved me thousands of dollars.

num_retries=3 Actually Means 15

LiteLLM's retry logic is smart. Too smart. When a request fails, it retries. When the retried request hits a fallback model and that fails, it retries again. If you've configured a fallback chain of 5 models with 3 retries each, a single user request can trigger up to 15 retries on top of the 5 initial attempts, and you pay for every single call, including the ones that errored out after consuming tokens.

The default num_retries in LiteLLM is 3. Most teams set it and forget it. But retries multiply across your fallback chain.
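
That multiplication can be sketched in a few lines. This is a rough upper bound, assuming each model in the chain gets one initial attempt plus `num_retries` retries before the gateway falls back to the next model, as described above. The function names are hypothetical helpers for illustration, not part of LiteLLM.

```python
def worst_case_upstream_calls(models_in_chain: int, num_retries: int) -> int:
    """Upper bound on upstream calls triggered by one user request.

    Assumption: every model is tried once and then retried `num_retries`
    times before the gateway moves on to the next model in the chain.
    """
    if models_in_chain < 1 or num_retries < 0:
        raise ValueError("need at least one model and a non-negative retry count")
    attempts_per_model = 1 + num_retries  # initial attempt + retries
    return models_in_chain * attempts_per_model


def worst_case_input_tokens(
    models_in_chain: int, num_retries: int, tokens_per_attempt: int
) -> int:
    """Input tokens billed if every attempt consumes tokens before failing."""
    return worst_case_upstream_calls(models_in_chain, num_retries) * tokens_per_attempt


if __name__ == "__main__":
    # Three models in the chain, three retries each, 2K input tokens per attempt.
    print(worst_case_upstream_calls(3, 3))       # 12
    print(worst_case_input_tokens(3, 3, 2_000))  # 24000
```

Plugging in your own chain length and retry count before shipping a config is a quick way to see what one bad request can cost.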
Here's the math:

    Request → Model A fails
      → Retry A 1 → Retry A 2 → Retry A 3
      → Fallback to Model B → Fails
      → Retry B 1 → Retry B 2 → Retry B 3
      → Fallback to Model C → Fails
      → Retry C 1 → Retry C 2 → Retry C 3

    Total upstream calls: 9 retries + 3 initial = 12 billable calls for 1 user request

If Model C is GPT-4o and each attempt consumes 2K input tokens before timing out, that's 24K tokens billed on a single failed request.

Cap total attempts across the entire chain, not just per-model:

    # litellm_config.yaml
    litellm_settings:
      num_retries: 2      # per-model retries
      max_fallbacks: 2    # hard cap on fallback chain depth
      retry_after: 5      # respect 429 Retry-After headers
      allowed_fails: 3    # circuit breaker: after 3 fails, stop entirely

    model_list:
      - model_name: gpt-4o
        litellm_params:
          model: gpt-4o
          max_retries: 2  # override: fewer retries on expensive models
      - model_name: gpt-4o-fallback
        litellm_params:
          model: gpt-4o-mini  # cheap fallback, not another expensive model

The key insight: your fallback should be cheaper than your primary, not equally expensive. If GPT-4o fails, fall back to GPT-4o-mini, not to Claude Opus.

This is the trap that cost me $2,300 in a single weekend. A well-meaning engineer configured a fallback chain that looked like this:

    # THE EXPENSIVE WAY: do not do this
    router_settings:
      fallbacks:
        - "gpt-4o-mini": ["gpt-4o"]
        - "gpt-4o": ["claude-3-5-sonnet"]
        - "claude-3-5-sonnet": ["claude-3-opus"]

The logic seemed sound: "If the cheap model fails, try the better one." But here's what actually happened: GPT-4o-mini was rate-limited during a traffic spike (429s everywhere), so every single request fell through to GPT-4o and then to Claude 3.5 Sonnet. For 6 hours, we were running 100% of our traffic on the most expensive models in the chain.

Rate limits are per-model, not per-gateway. When you hit OpenAI's TPM limit on gpt-4o-mini, LiteLLM dutifully falls back.
But if the traffic spike is caused by overall volume (not a model-specific outage), the fallback model gets the same volume that caused the 429 in the first place. You're not solving the problem; you're just paying 10x more to have it on a different model.

Structure fallbacks by cost tier, not by capability tier:

    # THE SMART WAY: fall back within a price tier, not up
    router_settings:
      fallbacks:
        # Tier 1: cheap models fall back to other cheap models
        - "gpt-4o-mini": ["gemini-1.5-flash", "claude-3-haiku"]
        # Tier 2: mid-tier models fall back to other mid-tier models
        - "gpt-4o": ["claude-3-5-sonnet", "gemini-1.5-pro"]
        # NEVER fall up from cheap to expensive.
        # If all cheap models fail, return an error, don't escalate.
      # Add a cooldown so the same model isn't retried immediately
      cooldown_time: 60

Also add alerting. If your fallback rate exceeds 5% of total traffic, something is structurally wrong:

    # Prometheus metrics in your LiteLLM custom callback
    # (metric names and the alert_webhook object are illustrative)
    from litellm.integrations.custom_logger import CustomLogger
    from prometheus_client import Counter

    class FallbackAlertLogger(CustomLogger):
        def __init__(self, alert_webhook):
            super().__init__()
            self.alert_webhook = alert_webhook  # any object with a .send(str) method
            self.total_counter = Counter("llm_calls_total", "All upstream calls")
            self.fallback_counter = Counter("llm_fallback_calls_total", "Fallback calls")

        def log_pre_api_call(self, model, messages, kwargs):
            self.total_counter.inc()  # count every call so the ratio below is defined
            if (kwargs.get("metadata") or {}).get("fallback_idx", 0) > 0:
                # This is a fallback call, not the primary
                self.fallback_counter.inc()
            # Alert if fallback rate > 5%
            if self.fallback_counter._value.get() / self.total_counter._value.get() > 0.05:
                self.alert_webhook.send("⚠️ Fallback rate > 5%: check rate limits on primary models")

Most teams don't enable LiteLLM's built-in caching because "our prompts are dynamic." But in practice, a huge percentage of your traffic is near-identical: system prompts are the same, the first 500 tokens of user messages are often boilerplate, and many users ask the exact same questions.

I audited one team's traffic and found that 34% of their requests were exact duplicates of requests made in the last hour. They were paying OpenAI ~$400/day for identical completions.

LiteLLM has Redis caching built in.
But it's disabled by default, and the documentation buries it under "Advanced Settings." Most engineers set up the proxy, test it, ship it, and never circle back.

Enable Redis caching with a sensible TTL. This is a 30-second config change that can cut your bill by 30-50%:

    litellm_settings:
      cache: true
      cache_params:
        type: "redis"
        host: "your-redis-host"
        port: 6379
        namespace: "litellm_cache"
        # Cache settings
        ttl: 3600  # 1 hour for exact matches
        # For semantic caching (similar but not identical prompts):
        semantic_cache: true
        similarity_threshold: 0.8
        # Cache based on messages content, not just the full request
        cache_key_include_models: true  # don't share cache across models

    model_list:
      - model_name: gpt-4o
        litellm_params:
          model: gpt-4o
          cache: true  # enable per-model

For even bigger savings, use prompt caching with providers that support it (Claude, GPT-4o). LiteLLM supports this natively:

    import litellm

    # Enable prompt caching for Claude
    response = litellm.completion(
        model="claude-3-5-sonnet",
        messages=[{"role": "user", "content": [{"type": "text", "text": "