Last month my AI API bill hit a number that made me close my laptop and go for a walk.
I wasn't doing anything crazy — just running a mid-size AI SaaS product with a few thousand daily requests across GPT and Claude. But between the two providers, my monthly spend had crept up to around $800, and the billing dashboards from each provider told completely different stories.
The thing is: I didn't need to rewrite my application. I didn't need to optimize prompts. I didn't need to switch models. All I did was change the base_url
in my OpenAI client, and my bill dropped.
Here's exactly what I did.
My stack was pretty standard:
Each provider had its own API key, its own billing dashboard, its own usage limits, and its own pricing page that seemed to change every other month.
The real pain wasn't the integration code — that's a one-time cost. The pain was the ongoing overhead: logging into two separate dashboards to check spend, guessing which model was cheaper for a given task, not knowing if I was overpaying, and getting surprised by a bill because one provider's usage reporting lagged by 24 hours.
I needed one place to manage everything. But I didn't want to rewrite my application.
The insight is simple: most LLM providers either natively support the OpenAI API format or can be accessed through a gateway that normalizes everything to it. If your application already uses the OpenAI SDK, you can swap the base_url
and keep everything else the same.
Before — two different SDKs, two different response formats, two separate bills:
from openai import OpenAI
from anthropic import Anthropic
gpt_client = OpenAI(api_key="sk-...")
gpt_response = gpt_client.chat.completions.create(
model="gpt-5.5",
messages=[{"role": "user", "content": "Analyze this customer feedback..."}],
)
claude_client = Anthropic(api_key="sk-ant-...")
claude_response = claude_client.messages.create(
model="claude-opus-4-7-20250514",
max_tokens=1024,
messages=[{"role": "user", "content": "Summarize this document..."}],
)
After — one SDK, one API key, one billing dashboard:
from openai import OpenAI
client = OpenAI(
base_url="https://api.tokenbay.com/v1",
api_key="***",
)
gpt_response = client.chat.completions.create(
model="gpt-5.5",
messages=[{"role": "user", "content": "Analyze this customer feedback..."}],
)
claude_response = client.chat.completions.create(
model="claude-opus-4.7",
messages=[{"role": "user", "content": "Summarize this document..."}],
)
The application code change took about 3 minutes — literally just the base_url
and api_key
.
Let me break down what changed after the switch:
| Item | Before (direct) | After (gateway, 15% off) |
|---|---|---|
| GPT-5.5 input | $5.00/M tokens | $4.25/M tokens |
| GPT-5.5 output | $30.00/M tokens | $25.50/M tokens |
| Claude Opus 4.7 input | $5.00/M tokens | $4.25/M tokens |
| Claude Opus 4.7 output | $25.00/M tokens | $21.25/M tokens |
That's a flat 15% off across both providers just from using the gateway. But the bigger savings came from visibility.
Once I could see all my usage in one dashboard, I noticed my classification tasks (tagging, sentiment) were hitting GPT-5.5 at $4.25/M input tokens. Switching those to a cheaper model — DeepSeek-V4-Flash at $0.119/M input — dropped that cost by over 35x. Classification accounted for about 30% of my volume, so that one change made a real dent.
The point isn't the specific numbers. It's that I couldn't see the opportunity until all my usage was in one place.
In production, I don't hardcode model names. Everything lives in environment variables:
import os
from openai import OpenAI
client = OpenAI(
base_url=os.getenv("LLM_BASE_URL"),
api_key=***"LLM_API_KEY"),
)
def classify(text: str) -> str:
response = client.chat.completions.create(
model=os.getenv("LLM_CLASSIFICATION_MODEL"),
messages=[{"role": "user", "content": f"Classify: {text}"}],
)
return response.choices[0].message.content
LLM_BASE_URL=https://api.tokenbay.com/v1
LLM_API_KEY=***
LLM_PRIMARY_MODEL=gpt-5.5
LLM_CLASSIFICATION_MODEL=deepseek-v4-flash
LLM_SUMMARIZATION_MODEL=claude-opus-4.7
This has a nice side effect: if I want to test whether Claude is better than GPT for classification, I change one line in .env
instead of rewriting integration code.
Added latency. Your request now goes through one extra hop, adding ~50-150ms on average. For most applications that's invisible to users. For latency-critical stuff (real-time voice, gaming), direct provider integration might still be better.
Provider-specific features. If you rely on beta features that only exist on one provider's native API, a gateway won't expose those. For me, the only provider-specific feature I used was Claude's extended thinking, and the gateway supports it fine. Your mileage may vary.
Another dependency. You're adding a layer to your stack. Check the gateway's status page and uptime history before committing.
Trust. You're routing prompts through a third party. Read their privacy policy. Understand what data they log. If you handle sensitive data (healthcare, finance, legal), this deserves extra scrutiny.
This approach makes sense if:
It's probably not worth it if:
base_url
in your dev environmentNo rewriting, no refactoring, no commitment. If it doesn't save you money, switch back and you're out 3 minutes.