How I Cut Our AI API Bill by 95%: What Actually Worked

When I first looked at our AI infrastructure spend six months ago, I nearly choked on my coffee. We were burning $11,000 a month on LLM calls for a product serving maybe 4,000 active users. The math was brutal — we were subsidizing every interaction, and our unit economics were completely broken.

The worst part? I knew it was bad, but I didn't realise how much was being left on the table. After three months of focused optimization, we're running the same workload for under $400/month. That's not a typo. Here's the playbook, written from the trenches.

If you're a CTO or engineering lead shipping AI features right now, this is for you. No fluff, no hand-waving — just the architecture decisions that moved the needle on our P&L.

I'm guilty of this. We started with GPT-4o for everything because it was the path of least resistance. The docs are good, the SDK works out of the box, and when you're moving fast on a prototype, you don't want to think about model selection.

The problem is that "don't think about model selection" becomes a permanent state when nobody on the team questions it. Six months in, we were still sending simple chat replies, summarization, and translation requests through the most expensive model in the stack, and even classification went to a model that cost far more than the task needed. That's pure waste.

Here's what changed my mind: I built a simple mapping table that matched task complexity to model cost. Just sitting down and writing it out made the absurdity obvious.

Task Type	What We Were Using	What We Switched To	Savings
Simple chat	GPT-4o at $10.00/M output	DeepSeek V4 Flash at $0.25/M	97.5%
Classification	GPT-4o-mini at $0.60/M	Qwen3-8B at $0.01/M	98.3%
Code generation	GPT-4o at $10.00/M	DeepSeek Coder at $0.25/M	97.5%
Summarization	GPT-4o at $10.00/M	Qwen3-32B at $0.28/M	97.2%
Translation	GPT-4o at $10.00/M	Qwen-MT-Turbo at $0.30/M	97%

Look at that classification row. We were paying $0.60/M for routing user inputs into one of six buckets when Qwen3-8B handles it at $0.01/M. That's a 60× multiplier on zero added complexity.

Here's the basic implementation we ended up standardizing across our services:

from openai import OpenAI

client = OpenAI(
    base_url="https://global-apis.com/v1",
    api_key="YOUR_GLOBAL_API_KEY"
)

MODEL_MAP = {
    "chat": "deepseek-v4-flash",        # $0.25/M output
    "code": "deepseek-coder",           # $0.25/M output
    "classification": "Qwen/Qwen3-8B",  # $0.01/M output
    "summarization": "Qwen/Qwen3-32B",  # $0.28/M output
    "translation": "Qwen-MT-Turbo",     # $0.30/M output
    "reasoning": "deepseek-reasoner",   # $2.50/M output — only for hard stuff
}

def route_request(user_input: str) -> str:
    task = classify_complexity(user_input)
    return MODEL_MAP[task]

response = client.chat.completions.create(
    model=route_request(user_input),
    messages=[{"role": "user", "content": user_input}]
)
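The snippet above calls `classify_complexity` without showing it. As a minimal sketch only, assuming a keyword heuristic is good enough for a first pass (the keyword lists and the "chat" default are illustrative assumptions, not our production classifier), it might look like this:

```python
# Hypothetical keyword rules; each key must also exist in MODEL_MAP.
TASK_KEYWORDS = {
    "translation": ("translate", "translation"),
    "summarization": ("summarize", "summarise", "summary", "tl;dr"),
    "code": ("function", "bug", "stack trace", "refactor", "python", "sql"),
    "reasoning": ("prove", "step by step", "derive", "trade-off"),
}


def classify_complexity(user_input: str) -> str:
    """Return a task label for routing; falls back to "chat" when nothing matches."""
    text = user_input.lower()
    for task, keywords in TASK_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return task
    return "chat"
```

Once the labels are stable, the keyword check could be swapped for a call to the cheap classification model without changing `route_request`.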

The big lesson here: model selection isn't a one-time decision, it's a per-request routing problem. And the routing logic is trivial — usually a few hundred tokens of classifier output.

After we deployed basic model selection, we still had a problem. Some requests needed the good models. Some didn't. We were paying for the good model on every request because we didn't have a confidence threshold to fall back on.

So we built a tiered routing layer. Try cheap first, escalate only when needed. This is the pattern that took us from "already pretty good" to "absurdly cheap."

def smart_generate(prompt: str, max_budget_tier: int = 3) -> dict:
    """
    Tier 1: Ultra-budget model handles easy queries
    Tier 2: Standard model handles moderate complexity
    Tier 3: Premium model reserved for hard reasoning

    max_budget_tier caps how far a request is allowed to escalate.
    "cost" is the nominal cost of the tier that answered; cheaper
    attempts that were rejected along the way are billed as well.
    """
    # Tier 1: always try the cheapest model first.
    tier1_resp = call_model("Qwen/Qwen3-8B", prompt)
    if quality_score(tier1_resp) >= 0.8 or max_budget_tier <= 1:
        return {"response": tier1_resp, "tier": 1, "cost": 0.00001}

    # Tier 2: escalate only when the cheap answer scored too low.
    tier2_resp = call_model("deepseek-v4-flash", prompt)
    if quality_score(tier2_resp) >= 0.9 or max_budget_tier <= 2:
        return {"response": tier2_resp, "tier": 2, "cost": 0.00025}

    # Tier 3: premium model, reached only when the budget cap allows it.
    tier3_resp = call_model("deepseek-reasoner", prompt)
    return {"response": tier3_resp, "tier": 3, "cost": 0.0025}

The real-world result on our customer support chatbot: monthly bill dropped from $420 to $28. That's a reduction of roughly 93% from tiered routing alone, on top of the savings we already had from smart model selection.

The reason this works at scale is that quality requirements are bimodal. Most queries are either trivially easy (greetings, simple lookups, FAQ-type questions) or genuinely hard (multi-step reasoning, edge cases). The middle ground is smaller than you'd expect.

Your quality scoring function is the heart of this system. We use a combination of:
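The scorer itself is not reproduced here. Purely as an illustration, assuming cheap model-free checks are an acceptable starting point (the markers, weights, and the name `heuristic_quality_score` are hypothetical, not the signals we run in production), a scorer could be sketched like this:

```python
# Hypothetical phrases that suggest the model hedged or refused.
REFUSAL_MARKERS = ("i'm not sure", "i cannot", "i can't", "as an ai")


def heuristic_quality_score(answer: str, min_chars: int = 20) -> float:
    """Score an answer in [0, 1] using only string checks (no model call)."""
    text = answer.strip()
    if not text:
        return 0.0
    score = 1.0
    if len(text) < min_chars:
        score -= 0.4  # suspiciously short answer
    if any(marker in text.lower() for marker in REFUSAL_MARKERS):
        score -= 0.4  # hedging or refusal language
    if text[-1] not in ".!?)`\"'":
        score -= 0.2  # probably truncated mid-sentence
    return max(score, 0.0)
```

This version takes the answer text, so with a `call_model` that returns the full completion object you would pass `response.choices[0].message.content`. Whatever signals you choose, calibrate the thresholds against a labeled sample of real traffic before trusting them.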

This one's almost embarrassing because it's so obvious in retrospect. We had no caching layer for months. Every request hit the API even when the exact same question had been answered 50 times that day.

FAQ pages, documentation lookups, "how do I reset my password" type queries: these are massively cacheable. Our hit rate now sits between 50% and 80% depending on the surface.

import hashlib
import json
import time

# In-memory cache: key -> {"response": ..., "timestamp": ...}
_cache: dict = {}

def cached_chat(model: str, messages: list, ttl: int = 3600) -> dict:
    """
    Cache identical requests for `ttl` seconds.
    Saves 20-50% on most workloads at zero quality cost.
    """
    key = hashlib.md5(
        json.dumps({"model": model, "messages": messages}, sort_keys=True).encode()
    ).hexdigest()

    if key in _cache:
        entry = _cache[key]
        if time.time() - entry["timestamp"] < ttl:
            return entry["response"]  # Cache hit — $0 cost

    response = client.chat.completions.create(
        model=model,
        messages=messages
    )

    _cache[key] = {
        "response": response,
        "timestamp": time.time()
    }
    return response

For production we moved this from an in-memory dict to Redis with a 24-hour TTL on most entries. The implementation got a bit more complex around serialization, but the pattern is identical.
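A minimal sketch of that shape, under a few assumptions: `store` is any object exposing redis-py style `get(key)` and `setex(key, seconds, value)`, only the answer text is cached (which sidesteps serializing the full response object), and `cache_key`, `cached_text`, and `generate` are illustrative names rather than our production code:

```python
import hashlib
import json
from typing import Callable


def cache_key(model: str, messages: list) -> str:
    """Build a stable key from the model name and the message list."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return "llm:" + hashlib.sha256(payload.encode()).hexdigest()


def cached_text(store, model: str, messages: list,
                generate: Callable[[], str], ttl: int = 86400) -> str:
    """Return the cached answer text, or call generate() and store it for ttl seconds."""
    key = cache_key(model, messages)
    hit = store.get(key)
    if hit is not None:
        # redis-py returns bytes unless the client sets decode_responses=True.
        return hit.decode() if isinstance(hit, bytes) else hit
    text = generate()
    store.setex(key, ttl, text)
    return text
```

Here `generate` would wrap the actual API call and return `response.choices[0].message.content`. Letting the store expire keys means no manual timestamp check is needed.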

One caveat: don't cache personalized responses or anything where the prompt includes user-specific data without normalizing it first. We strip PII from cache keys to avoid serving User A's response to User B.
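As a rough sketch of what normalizing a prompt before hashing could involve (the regular expressions and placeholder tokens are assumptions for illustration and will not catch every kind of PII):

```python
import re

# Hypothetical patterns; extend them for the identifiers your product handles.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
LONG_NUMBER_RE = re.compile(r"\d{6,}")  # order IDs, phone numbers, and similar


def normalize_for_cache(prompt: str) -> str:
    """Mask user-specific values and collapse formatting noise before hashing."""
    text = EMAIL_RE.sub("<email>", prompt)
    text = LONG_NUMBER_RE.sub("<number>", text)
    # Collapse whitespace and case so trivially different prompts share a key.
    return re.sub(r"\s+", " ", text).strip().lower()
```

This is only safe when the answer does not depend on the masked value. If the response would mention the user's order or account, skip the cache for that request instead.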

This is where it gets interesting at scale. Every token you don't send is money saved, and most prompts are way longer than they need to be.

We had a system prompt for our RAG pipeline that clocked in around 2,000 tokens. It was thorough, well-organized, and completely bloated. Compressing it to 400 tokens saved us $0.024 per request on DeepSeek V4 Flash.

$0.024 sounds trivial. Multiply by 10,000 requests per day and you're at $240/day. That's $87,600/year saved on a single prompt.
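The arithmetic is easy to rerun with your own numbers. In this small sketch, `prompt_savings` is an illustrative helper, and the $15 per million input tokens is simply the effective rate implied by saving $0.024 on 1,600 tokens, so substitute the input price you actually pay:

```python
def prompt_savings(tokens_before: int, tokens_after: int,
                   price_per_million: float, requests_per_day: int) -> dict:
    """Estimate savings from trimming a prompt that is sent on every request."""
    saved_tokens = tokens_before - tokens_after
    per_request = saved_tokens * price_per_million / 1_000_000
    per_day = per_request * requests_per_day
    return {
        "per_request": per_request,
        "per_day": per_day,
        "per_year": per_day * 365,
    }


# 2,000 -> 400 tokens at an assumed $15/M input rate, 10,000 requests per day.
example = prompt_savings(2000, 400, price_per_million=15.0, requests_per_day=10_000)
```

The savings scale linearly with the input price, so at a cheaper rate the same 1,600-token cut is worth proportionally less per request.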

The compression itself is cheap — you use the budget model to summarize context before you send it to the expensive model:

def compress_prompt(text: str, target_ratio: float = 0.5) -> str:
    """
    Compress long prompts using a cheap model.
    target_ratio=0.5 means compress to 50% of original length.
    """
    if len(text) < 500:
        return text  # Not worth compressing

    target_chars = int(len(text) * target_ratio)
    response = call_model(
        "Qwen/Qwen3-8B",
        f"Summarize this content in approximately {target_chars} characters, "
        f"preserving all key instructions and constraints:\n\n{text}"
    )
    # call_model returns the full completion object, so extract the text.
    return response.choices[0].message.content

system_prompt = load_full_prompt()
compressed = compress_prompt(system_prompt, target_ratio=0.2)

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "system", "content": compressed},
        {"role": "user", "content": user_input}
    ]
)

The trick is to preserve the semantic content while cutting filler. LLMs are remarkably good at this when you ask them to.

A few prompt compression tactics we use beyond model-based summarization:

At scale, even a 15% reduction in average prompt length compounds significantly across millions of requests.

This one's simple. If you have 10 questions to answer, don't make 10 API calls. Make one.

The naive approach:

results = []
for question in questions:
    # One API call per question: the request overhead is paid N times.
    response = client.chat.completions.create(
        model="deepseek-v4-flash",
        messages=[{"role": "user", "content": question}]
    )
    results.append(response)

The batched approach:

batch_prompt = "\n\n".join([f"Question {i+1}: {q}" for i, q in enumerate(questions)])

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "system", "content": "Answer each numbered question in order. Format: '1. [answer]\n2. [answer]'"},
        {"role": "user", "content": batch_prompt}
    ]
)

answers = parse_numbered_response(response.choices[0].message.content)
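`parse_numbered_response` is not shown above. One possible sketch, assuming the model follows the "1. answer" format and that answers may span several lines (the regex is an assumption about the output format, not a guarantee):

```python
import re


def parse_numbered_response(text: str) -> list[str]:
    """Split a numbered answer list into one string per answer, in numeric order."""
    # A number at the start of a line, followed by "." or ")" and whitespace.
    parts = re.split(r"^\s*(\d+)[.)]\s+", text, flags=re.MULTILINE)
    # With one capture group, re.split yields [preamble, num, body, num, body, ...].
    answers = {}
    for number, body in zip(parts[1::2], parts[2::2]):
        answers[int(number)] = body.strip()
    return [answers[i] for i in sorted(answers)]
```

It is worth checking that `len(answers) == len(questions)` before using the result, and falling back to individual calls when the model drops or merges an answer.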

The savings come from sharing the system prompt across all questions. You're paying for input tokens once instead of N times. With typical overhead of 100-300 tokens per request for system prompts and message formatting, batching 10 requests saves you 900-2700 input tokens per batch.

The catch is latency — a batched call takes longer than a single call. So this only works for asynchronous workflows: bulk classification, batch summarization, overnight report generation, etc. Don't batch your user-facing chat responses.

Let me step back from the tactical stuff for a second. The reason model selection, tiered routing, and prompt compression are all possible is that we have access to multiple models through a unified API. This is the strategic move that enables everything else.

If you're locked into a single provider, you can't negotiate, you can't route around outages, and you can't take advantage of price drops when new models launch. Last quarter, three major providers cut prices on their flagship models within six weeks of each other. The teams locked into single vendors missed all three windows.

We use Global API as our primary routing layer. One endpoint, one SDK, every model we need. This gave us three things that mattered:

client = OpenAI(
    base_url="https://global-apis.com/v1",
    api_key="YOUR_GLOBAL_API_KEY"
)

def call_model(model_name: str, prompt: str):
    return client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}]
    )

That last block of code looks trivial, but it's the foundation. One client object, one auth flow, every model in the market. When someone tells me they've achieved vendor independence, I ask them how many code paths they had to rewrite to switch providers last time. If the answer isn't "zero," they aren't actually independent.
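Routing around an outage follows from the same idea. This is a hedged sketch, assuming a `call` function with the same `(model_name, prompt)` shape as `call_model`; the name `call_with_fallback` and the broad `except` are illustrative simplifications:

```python
from typing import Callable, Sequence


def call_with_fallback(models: Sequence[str], prompt: str,
                       call: Callable[[str, str], object]):
    """Try each model in order; return (model, result) from the first that succeeds."""
    last_error = None
    for model in models:
        try:
            return model, call(model, prompt)
        except Exception as exc:  # in real code, catch the SDK's specific error types
            last_error = exc
    raise RuntimeError(f"all models failed: {list(models)}") from last_error
```

A call such as `call_with_fallback(["deepseek-v4-flash", "Qwen/Qwen3-32B"], prompt, call_model)` would then try the second model only if the first one raises.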

Let me put this together with actual numbers from our production system. We serve about 4,000 active users generating roughly 180,000 LLM requests per month.

Before optimization:

After optimization:

source & further reading

Original article: dev.to