When Your AI API Budget Blew Up: Multi-Provider Routing


I remember the exact moment my heart sank. It was a Tuesday morning, and I opened the billing dashboard for our AI API provider to find a $3,200 charge staring back at me. Our previous month had been $400. A junior dev had accidentally left a loop running in production that was hammering the endpoint with redundant prompts.

That pain was real, but it forced me to solve a deeper issue: we were relying on a single AI provider, and our costs and reliability were completely out of our control.

Like many teams, we'd started with one provider because it was the easiest. The API was straightforward, the documentation was decent. But as we scaled from a simple chatbot to more complex automations—parsing emails, summarizing documents, generating code reviews—the single point of failure became unbearable.

Rate limits started biting us during peak hours. Costs exploded because we had no way to route cheaper queries to a different model. And if that provider had an outage (which happened twice in three months), our product was dead in the water.

My first instinct was to just duplicate the calls: try provider A, and if it fails, try provider B. I slapped together a quick Python script with try/except blocks and the requests library. It worked… for about two days.

def query_ai(prompt):
    try:
        return provider_a_call(prompt)
    except Exception:
        try:
            return provider_b_call(prompt)
        except Exception:
            raise RuntimeError("All providers failed")

Problems: each exception added seconds of latency, I had no way to prioritize cheaper providers, and I wasn't tracking which calls actually succeeded or failed. Plus, the code quickly turned into a spaghetti mess as we added a third provider.

Then I tried a more sophisticated queue-based approach with Celery and task retries. That made things even worse: the retries were overloading downstream APIs, tripping stricter rate limits, and costing us compute we didn't need.

After a lot of trial and error, I settled on a different pattern: a routing layer that sits between your application code and your AI providers. It's not fancy—it's essentially a Python class that uses a configurable strategy to pick which provider to call, tracks performance, and handles fallbacks gracefully.

Here's the core idea in about 50 lines:

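import time
from typing import Callable, Dict, Optional

class AIRouter:
    def __init__(self, providers: Dict[str, Callable], config: Optional[dict] = None):
        self.providers = providers
        # Defaults. Keys passed in `config` override these one by one, so a
        # partial config does not drop 'max_retries' or 'timeout'.
        defaults = {
            'cost_per_token': {
                'provider_a': 0.03,
                'provider_b': 0.01,
                'provider_c': 0.008,
            },
            'max_retries': 2,
            'timeout': 10,
            'preferred_order': ['provider_c', 'provider_b', 'provider_a']
        }
        self.config = {**defaults, **(config or {})}
        # 'calls' counts successful calls only; failures are counted in 'errors'.
        self.stats = {name: {'calls': 0, 'errors': 0, 'total_time': 0.0} for name in providers}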

    def query(self, prompt: str, context: dict = None) -> str:
        order = self.config['preferred_order']
        if context and 'force_provider' in context:
            order = [context['force_provider']]

        last_error = None
        for provider_name in order:
            if provider_name not in self.providers:
                continue
            provider_fn = self.providers[provider_name]
            for attempt in range(self.config['max_retries']):
                try:
                    start = time.time()
                    result = provider_fn(prompt, timeout=self.config['timeout'])
                    elapsed = time.time() - start
                    self._record_success(provider_name, elapsed)
                    return result
                except Exception as e:
                    self._record_error(provider_name)
                    last_error = e
                    time.sleep(0.5 * (attempt + 1))
        raise RuntimeError(f"All providers failed. Last error: {last_error}")

    def _record_success(self, name, elapsed):
        self.stats[name]['calls'] += 1
        self.stats[name]['total_time'] += elapsed

    def _record_error(self, name):
        self.stats[name]['errors'] += 1

This class isn't production-ready (no logging, no async, no circuit breakers), but it's the skeleton you can build on. The key insight is decoupling the "which provider" logic from the "how to call" logic. Once you have that, you can add all sorts of strategies: cheapest-first, fastest-first, based on prompt length, or based on user subscription level.
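As a rough sketch of what two of those strategies could look like, here are ordering functions that build a provider order from a cost table and from stats shaped like the router's self.stats. The function names cheapest_first and fastest_first are mine, not part of the router above, and the sample numbers are made up for illustration.

```python
from typing import Dict, List

def cheapest_first(cost_per_token: Dict[str, float]) -> List[str]:
    """Order provider names from lowest to highest configured cost."""
    return sorted(cost_per_token, key=lambda name: cost_per_token[name])

def fastest_first(stats: Dict[str, dict]) -> List[str]:
    """Order provider names by average latency of successful calls.

    Providers with no successful calls yet sort last, since we have no data on them.
    """
    def avg_latency(name: str) -> float:
        calls = stats[name]['calls']
        return stats[name]['total_time'] / calls if calls else float('inf')
    return sorted(stats, key=avg_latency)

# Illustrative inputs only (hypothetical numbers).
costs = {'provider_a': 0.03, 'provider_b': 0.01, 'provider_c': 0.008}
stats = {
    'provider_a': {'calls': 10, 'errors': 0, 'total_time': 4.0},
    'provider_b': {'calls': 5, 'errors': 1, 'total_time': 6.0},
    'provider_c': {'calls': 0, 'errors': 2, 'total_time': 0.0},
}

print(cheapest_first(costs))  # ['provider_c', 'provider_b', 'provider_a']
print(fastest_first(stats))   # ['provider_a', 'provider_b', 'provider_c']
```

Either result can be dropped into the 'preferred_order' config key, so the strategy stays separate from the calling code.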

I also added a simple cost-tracking module that estimates tokens and logs each request. That alone saved our team—we could see which endpoints were costing us the most and adjust the routing order accordingly.
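I won't paste the real module here, but a minimal sketch of the idea might look like the following. Everything in it is an assumption for illustration: the CostTracker name, the endpoint labels, and especially the 4-characters-per-token heuristic, which is only a rough estimate and not how any provider actually bills.

```python
import logging
from collections import defaultdict
from typing import Dict, List, Tuple

logger = logging.getLogger("ai_costs")

def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)

class CostTracker:
    """Accumulates estimated spend per (endpoint, provider) pair."""

    def __init__(self, cost_per_token: Dict[str, float]):
        self.cost_per_token = cost_per_token
        self.spend: Dict[Tuple[str, str], float] = defaultdict(float)

    def record(self, endpoint: str, provider: str, prompt: str, response: str) -> float:
        """Estimate the cost of one request, log it, and add it to the running total."""
        tokens = estimate_tokens(prompt) + estimate_tokens(response)
        cost = tokens * self.cost_per_token.get(provider, 0.0)
        self.spend[(endpoint, provider)] += cost
        logger.info("endpoint=%s provider=%s tokens=%d est_cost=%.6f",
                    endpoint, provider, tokens, cost)
        return cost

    def top_spenders(self, n: int = 3) -> List[Tuple[Tuple[str, str], float]]:
        """Return the n most expensive (endpoint, provider) pairs, highest first."""
        return sorted(self.spend.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Hypothetical usage: endpoint names are placeholders.
tracker = CostTracker({'provider_a': 0.03, 'provider_c': 0.008})
tracker.record('summarize_email', 'provider_c', 'x' * 400, 'y' * 80)
tracker.record('code_review', 'provider_a', 'x' * 4000, 'y' * 800)
print(tracker.top_spenders(1))
```

Sorting by accumulated spend is what makes the expensive endpoints visible, which is the signal you need before reordering providers.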

To use this, you'd define provider functions that wrap API calls. For example:

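from openai import OpenAI
import anthropic

# Uses the openai>=1.0 client; the older openai.ChatCompletion interface was removed in 1.0.
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def call_openai(prompt: str, timeout=10):
    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        timeout=timeout
    )
    return response.choices[0].message.content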

def call_anthropic(prompt: str, timeout=10):
    client = anthropic.Anthropic()
    message = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
        timeout=timeout
    )
    return message.content[0].text

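from transformers import pipeline
# Load the local model once at import time, not on every call.
gen = pipeline('text2text-generation', model='google/flan-t5-small')
def call_local(prompt: str, timeout=10):
    # timeout is accepted to match the router's calling convention,
    # but this simple wrapper does not enforce it.
    return gen(prompt)[0]['generated_text']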

router = AIRouter(
    providers={
        'openai': call_openai,
        'anthropic': call_anthropic,
        'local': call_local
    },
    config={
        'preferred_order': ['local', 'openai', 'anthropic'],
        'cost_per_token': {
            'local': 0.0,
            'openai': 0.002,  # gpt-3.5-turbo
            'anthropic': 0.00025  # claude-haiku
        },
        # query() reads these two keys, so set them explicitly.
        'max_retries': 2,
        'timeout': 10
    }
)

result = router.query("Summarize this email: ...")

Now, when we get a simple request like "summarize an email", the router tries local first (free), and only falls back to paid APIs if it fails or times out. This cut our AI bill by 60% in the first month.
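One caveat on "times out": a plain local pipeline call has no timeout of its own, so the router can only move on early if the wrapper enforces one. Here is one possible sketch using a thread pool from the standard library. The with_timeout helper is a name I made up, and note the limitation in the comments: the caller stops waiting, but the slow call keeps running in its worker thread until it finishes.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared pool so a timed-out call does not block the caller on shutdown.
_executor = ThreadPoolExecutor(max_workers=4)

def with_timeout(fn, timeout):
    """Wrap fn(prompt) so the caller stops waiting after `timeout` seconds.

    This does not kill the underlying work; the thread finishes in the background.
    """
    def wrapper(prompt, timeout=timeout):
        future = _executor.submit(fn, prompt)
        try:
            return future.result(timeout=timeout)
        except FutureTimeout:
            future.cancel()  # only has an effect if the task has not started yet
            raise TimeoutError(f"{fn.__name__} exceeded {timeout}s")
    return wrapper

# Stand-in for a slow local model (hypothetical, for demonstration only).
def slow_local(prompt):
    time.sleep(0.3)
    return "late answer"

guarded = with_timeout(slow_local, timeout=0.05)
try:
    guarded("Summarize this email: ...")
except TimeoutError as exc:
    print(f"falling back: {exc}")
```

The wrapper keeps the (prompt, timeout=...) signature, so it can be registered as a provider in place of the unguarded function.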

This pattern adds complexity. If you have a single, stable use case with predictable load and acceptable costs, don't bother. Also, if you need strict consistency (e.g., always the same model version for reproducibility), routing is a bad idea.

If I were starting over, I'd begin with a simple config-driven router from day one rather than the ad-hoc fallback mess. I'd also add rate-limit awareness: my current router doesn't proactively slow down when a provider is throttling; it just fails and moves on. A proper circuit breaker pattern would be better.
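For reference, a bare-bones circuit breaker might look something like this. It's a sketch under my own assumptions, not what the router currently does: the class name, the failure threshold, and the cooldown are all placeholders you'd tune per provider.

```python
import time

class CircuitBreaker:
    """Skip a provider for `cooldown` seconds after repeated consecutive failures."""

    def __init__(self, failure_threshold=3, cooldown=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock          # injectable so the breaker can be tested without sleeping
        self.failures = 0           # consecutive failures since the last success
        self.opened_at = None       # time the breaker opened, or None if closed

    def allow(self) -> bool:
        """Return True if a call to this provider should be attempted."""
        if self.opened_at is None:
            return True
        # After the cooldown, let a trial call through ("half-open").
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self) -> None:
        """A success closes the breaker and resets the failure count."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        """Open (or re-open) the breaker once the threshold is reached."""
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()

# Hypothetical usage: one breaker per provider name.
breakers = {name: CircuitBreaker() for name in ('provider_a', 'provider_b')}
for _ in range(3):
    breakers['provider_a'].record_failure()
print(breakers['provider_a'].allow())  # False
print(breakers['provider_b'].allow())  # True
```

In the router, this would mean checking allow() before trying a provider and calling record_success() or record_failure() where the stats are already updated, so a throttled provider is skipped outright and doesn't burn retries.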

And I'd definitely not leave a loop running in production. But maybe that's just me.

The whole experience taught me that the real art isn't in picking the "best" AI model—it's in building systems that gracefully handle the messiness of real-world APIs.

So, what's your setup look like? Are you using a single provider or something more distributed?

source & further reading

Original article: dev.to
