When Your AI API Budget Blew Up: Multi-Provider Routing

A developer built a multi-provider AI routing layer after a junior engineer accidentally caused a $3,200 monthly bill by leaving a loop running in production. The solution uses a configurable Python class that selects providers based on cost and performance, handles fallbacks, and tracks statistics to avoid single points of failure and reduce costs.

I remember the exact moment my heart sank. It was a Tuesday morning, and I opened the billing dashboard for our AI API provider to find a $3,200 charge staring back at me. Our previous month had been $400. A junior dev had accidentally left a loop running in production that was hammering the endpoint with redundant prompts. That pain was real, but it forced me to solve a deeper issue: we were relying on a single AI provider, and our costs and reliability were completely out of our control.

Like many teams, we'd started with one provider because it was the easiest. The API was straightforward, the documentation was decent. But as we scaled from a simple chatbot to more complex automations (parsing emails, summarizing documents, generating code reviews), the single point of failure became unbearable. Rate limits started biting us during peak hours. Costs exploded because we had no way to route cheaper queries to a different model. And if that provider had an outage (which happened twice in three months), our product was dead in the water.

My first instinct was to just duplicate the calls: try provider A, if it fails, try provider B. I slapped together a quick Python script with try/except blocks and the requests library. It worked… for about two days.

```python
# Naive fallback - don't do this
# (provider_a_call and provider_b_call are thin wrappers around each provider's API)
def query_ai(prompt):
    try:
        return provider_a_call(prompt)
    except Exception:
        try:
            return provider_b_call(prompt)
        except Exception:
            raise RuntimeError("All providers failed")
```

Problems: each failed call added seconds of latency before the fallback even started, I had no way to prioritize cheaper providers, and I wasn't tracking which calls actually succeeded or failed. Plus, the nested try/except blocks quickly turned into a spaghetti mess as we added a third provider.

Then I tried a more sophisticated queue-based approach with Celery and task retries. That made things even worse: the retries overloaded downstream APIs, we hit stricter rate limits, and we paid for compute we didn't need.

After a lot of trial and error, I settled on a different pattern: a routing layer that sits between your application code and your AI providers. It's not fancy. It's essentially a Python class that uses a configurable strategy to pick which provider to call, tracks performance, and handles fallbacks gracefully. Here's the core idea in about 50 lines:

```python
import time
from typing import Callable, Dict, Optional


class AIRouter:
    def __init__(self, providers: Dict[str, Callable], config: Optional[dict] = None):
        self.providers = providers
        defaults = {
            'cost_per_token': {
                'provider_a': 0.03,
                'provider_b': 0.01,
                'provider_c': 0.008,
            },
            'max_retries': 2,
            'timeout': 10,
            'preferred_order': ['provider_c', 'provider_b', 'provider_a'],
        }
        # Merge rather than replace, so a partial config keeps the remaining defaults
        self.config = {**defaults, **(config or {})}
        self.stats = {
            name: {'calls': 0, 'errors': 0, 'total_time': 0.0}
            for name in providers
        }

    def query(self, prompt: str, context: Optional[dict] = None) -> str:
        # Use context to optionally override order (e.g., based on user tier)
        order = self.config['preferred_order']
        if context and 'force_provider' in context:
            order = [context['force_provider']]

        last_error = None
        for provider_name in order:
            if provider_name not in self.providers:
                continue
            provider_fn = self.providers[provider_name]
            for attempt in range(self.config['max_retries']):
                try:
                    start = time.time()
                    result = provider_fn(prompt, timeout=self.config['timeout'])
                    elapsed = time.time() - start
                    self._record_success(provider_name, elapsed)
                    return result
                except Exception as e:
                    self._record_error(provider_name)
                    last_error = e
                    # Small, linearly growing backoff before the next attempt
                    time.sleep(0.5 * (attempt + 1))

        raise RuntimeError(f"All providers failed. Last error: {last_error}")

    def _record_success(self, name, elapsed):
        self.stats[name]['calls'] += 1
        self.stats[name]['total_time'] += elapsed

    def _record_error(self, name):
        self.stats[name]['errors'] += 1
```

This class isn't production-ready (no logging, no async, no circuit breakers), but it's the skeleton you can build on. The key insight is decoupling the "which provider" logic from the "how to call" logic.

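
To make that split concrete, the "which provider" half can be as small as a function that returns an ordered list of names. The sketch below is illustrative only: `cheapest_first` and `by_prompt_length` are hypothetical helpers, not part of the router above, and the 2,000-character threshold is an arbitrary assumption.

```python
from typing import Dict, List


def cheapest_first(cost_per_token: Dict[str, float], available: List[str]) -> List[str]:
    """Order the available providers from lowest to highest configured cost."""
    # Providers missing from the cost table sort last instead of raising.
    return sorted(available, key=lambda name: cost_per_token.get(name, float("inf")))


def by_prompt_length(prompt: str, short_order: List[str], long_order: List[str],
                     threshold_chars: int = 2000) -> List[str]:
    """Pick one of two fixed orders depending on prompt length (threshold is an assumption)."""
    return short_order if len(prompt) <= threshold_chars else long_order


costs = {"provider_a": 0.03, "provider_b": 0.01, "provider_c": 0.008}
order = cheapest_first(costs, ["provider_a", "provider_b", "provider_c"])
print(order)  # ['provider_c', 'provider_b', 'provider_a']
```

Either result could be assigned to `config['preferred_order']` before calling `query`, which leaves the retry and fallback loop untouched.
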
Once you have that, you can add all sorts of strategies: cheapest-first, fastest-first, based on prompt length, or based on user subscription level. I also added a simple cost-tracking module that estimates tokens and logs each request. That alone saved our team: we could see which endpoints were costing us the most and adjust the routing order accordingly.

To use this, you'd define provider functions that wrap API calls. For example:

```python
import openai
import anthropic

def call_openai(prompt: str, timeout=10):
    # Legacy interface from openai<1.0; newer SDK versions expose the same
    # call as client.chat.completions.create on an openai.OpenAI() client.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        timeout=timeout,
    )
    return response.choices[0].message.content

def call_anthropic(prompt: str, timeout=10):
    client = anthropic.Anthropic()
    message = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
        timeout=timeout,
    )
    return message.content[0].text

# We also add a local model for cheap tasks
from transformers import pipeline
gen = pipeline('text2text-generation', model='google/flan-t5-small')

def call_local(prompt: str, timeout=10):
    # timeout is accepted to match the router's call signature;
    # the pipeline call itself does not use it
    return gen(prompt)[0]['generated_text']

# Then wire it up
router = AIRouter(
    providers={
        'openai': call_openai,
        'anthropic': call_anthropic,
        'local': call_local,
    },
    config={
        'preferred_order': ['local', 'openai', 'anthropic'],
        'max_retries': 2,
        'timeout': 10,
        'cost_per_token': {
            'local': 0.0,
            'openai': 0.002,       # gpt-3.5-turbo
            'anthropic': 0.00025,  # claude-haiku
        },
    },
)

# Use it in your app
result = router.query("Summarize this email: ...")
```

Now, when we get a simple request like "summarize an email", the router tries local first (free), and only falls back to paid APIs if it fails or times out. This cut our AI bill by 60% in the first month.

This pattern adds complexity. If you have a single, stable use case with predictable load and acceptable costs, don't bother. Also, if you need strict consistency (e.g., always the same model version for reproducibility), routing is a bad idea.

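
Circling back to the cost-tracking module mentioned above: here is a minimal sketch of the idea, under some loud assumptions. The names `estimate_tokens` and `CostTracker` are hypothetical, the four-characters-per-token estimate is only a rough rule of thumb for English text (a real tokenizer would be more accurate), and the rates are assumed to be USD per 1,000 tokens.

```python
from collections import defaultdict
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    """Rough estimate (assumption): about 4 characters per token, minimum 1."""
    return max(1, len(text) // 4)


class CostTracker:
    """Accumulates estimated tokens and spend per provider (hypothetical helper)."""

    def __init__(self, cost_per_1k_tokens: Dict[str, float]):
        self.cost_per_1k_tokens = cost_per_1k_tokens
        self.totals = defaultdict(lambda: {"requests": 0, "tokens": 0, "usd": 0.0})

    def record(self, provider: str, prompt: str, response: str) -> float:
        """Log one request and return its estimated cost in USD."""
        tokens = estimate_tokens(prompt) + estimate_tokens(response)
        # Providers without a configured rate are counted as free.
        usd = tokens / 1000 * self.cost_per_1k_tokens.get(provider, 0.0)
        entry = self.totals[provider]
        entry["requests"] += 1
        entry["tokens"] += tokens
        entry["usd"] += usd
        return usd

    def most_expensive(self) -> List[str]:
        """Provider names sorted by estimated spend, highest first."""
        return sorted(self.totals, key=lambda name: self.totals[name]["usd"], reverse=True)


tracker = CostTracker({"openai": 0.002, "anthropic": 0.00025, "local": 0.0})
tracker.record("openai", "Summarize this email: ...", "The sender asks for a meeting.")
print(tracker.most_expensive())
```

Calling `record` wherever a provider returns successfully would be enough to see which provider dominates the bill, which is the signal for reordering the routing list.
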
If I were starting over, I'd build a simple config-driven router from day one rather than going through the ad-hoc fallback mess. I'd also add rate-limit awareness: my current router doesn't proactively slow down when a provider is throttling; it just fails and moves on. A proper circuit breaker pattern would be better. And I'd definitely not leave a loop running in production. But maybe that's just me.

The whole experience taught me that the real art isn't in picking the "best" AI model; it's in building systems that gracefully handle the messiness of real-world APIs.

So, what does your setup look like? Are you using a single provider or something more distributed?
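
For reference, a circuit breaker along the lines mentioned above might look roughly like this. It is a minimal sketch, not something the router currently does: the `CircuitBreaker` class, its thresholds, and the injectable `clock` parameter are all illustrative assumptions.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Skip a provider for `cooldown` seconds after `threshold` consecutive failures."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock  # injectable so behaviour can be checked without sleeping
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a call may be attempted right now."""
        if self.opened_at is None:
            return True  # closed: calls flow normally
        # Open: block until the cooldown has elapsed, then allow a trial call.
        # A failure on that trial re-opens the breaker; a success closes it.
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()


breaker = CircuitBreaker(threshold=2, cooldown=30.0)
breaker.record_failure()
breaker.record_failure()
print(breaker.allow())  # False: two consecutive failures opened the breaker
```

With one breaker per provider, the routing loop could check `allow()` before each call and skip straight to the next provider while the breaker is open, instead of paying for a timeout on every request.
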