Source: dev.to · Topic: artificial-intelligence

I slashed AI API costs by 80% with a cache that actually works

A developer built a semantic cache using sentence-transformers to reduce AI API costs by 80%. The cache stores embeddings of prompts and reuses responses for semantically similar queries, cutting a $400 monthly bill to under $80. The approach uses cosine similarity with a threshold of 0.92 to balance accuracy and cache hit rate.

Published Jun 29, 2026 · 3 min read

A few months ago I built a side project that generates personalized product descriptions using an AI API. Within two weeks my API bill hit $400 and I knew something had to change.

I started with the obvious approach: a simple dictionary cache keyed on the exact prompt string.

cache = {}
def get_ai_response(prompt):
    if prompt in cache:
        return cache[prompt]
    response = api_call(prompt)  # expensive
    cache[prompt] = response
    return response

The problem? My users rarely typed the exact same request twice. "Red running shoes for women" and "women's red running shoes" mean the same thing, but an exact-match cache stores them under different keys. In effect, I was paying twice for about 95% of my calls.

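First I attempted prompt normalization: lowercasing, sorting words, removing punctuation. It helped a little, but real-world queries vary too much. Sorting takes care of pure reordering ("Nike sneakers size 9" and "size 9 Nike sneakers" end up as the same key), but "Red running shoes for women" and "women's red running shoes" still looked different after normalization because of the extra "for" and the possessive.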
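A minimal sketch of that normalization step, assuming it means lowercasing, stripping ASCII punctuation, and sorting the words. The `normalize` helper is hypothetical and not the original project's code:

```python
import string

def normalize(prompt: str) -> str:
    """Lowercase, strip ASCII punctuation, and sort the words of a prompt."""
    cleaned = prompt.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(cleaned.split()))

# Case, punctuation, and word order are absorbed by normalization...
same_key = normalize("Red Running Shoes!") == normalize("shoes, running red")

# ...but an extra word or a possessive still produces a different key.
a = normalize("Red running shoes for women")
b = normalize("women's red running shoes")
print(same_key, a == b)
print(repr(a), repr(b))
```

Any difference in vocabulary, however small, survives normalization, which is why this approach only helped a little.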

Then I tried TF-IDF vectorization with cosine similarity. Better, but a bag-of-words model can only match on shared terms, so it misses meaning: "cheap laptop" and "budget notebook" have no words in common and score as unrelated.
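The limitation can be seen with a simplified stand-in for TF-IDF: plain term counts compared by cosine similarity. This is a sketch under that assumption (the `bow_cosine` helper is hypothetical); real TF-IDF reweights the terms, but two texts with no shared vocabulary score zero either way:

```python
import math
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between raw term-count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0

synonyms = bow_cosine("cheap laptop", "budget notebook")
reordered = bow_cosine("cheap laptop deals", "deals laptop cheap")
print(synonyms, reordered)
```

Reordered text scores as a perfect match, while synonyms score as completely unrelated. Embeddings are meant to close exactly that gap.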

The core idea is simple: convert every incoming prompt into a vector (embedding), store it together with the API response, and for each new query first check if we already have a semantically similar prompt in the cache. If cosine similarity exceeds a threshold, reuse the stored response instead of calling the API.

I built this using sentence-transformers (all-MiniLM-L6-v2) for embeddings.

Here's the heart of it:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

class SemanticCache:
    """In-memory cache that matches prompts by embedding similarity."""

    def __init__(self, similarity_threshold=0.92):
        self.embeddings = []  # one vector per cached prompt
        self.responses = []   # responses[i] belongs to embeddings[i]
        self.threshold = similarity_threshold

    def _cosine_similarity(self, a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def get(self, prompt):
        # Linear scan: return the first stored response that is similar enough.
        emb = model.encode(prompt, convert_to_tensor=False)
        for i, stored_emb in enumerate(self.embeddings):
            if self._cosine_similarity(emb, stored_emb) >= self.threshold:
                return self.responses[i]
        return None  # cache miss

    def set(self, prompt, response):
        emb = model.encode(prompt, convert_to_tensor=False)
        self.embeddings.append(emb)
        self.responses.append(response)

And the integration with my AI API (example with Interwest AI – just replace with your provider):

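import requests

# The constructor argument is named similarity_threshold, not threshold.
cache = SemanticCache(similarity_threshold=0.92)

def get_description(product_query):
    cached = cache.get(product_query)
    if cached is not None:
        return cached
    response = requests.post(
        "https://ai.interwestinfo.com/v1/generate",
        json={"prompt": product_query, "max_tokens": 100},
        timeout=30,
    )
    response.raise_for_status()  # don't cache error payloads
    text = response.json()["text"]
    cache.set(product_query, text)
    return text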

After deploying this cache I tracked API calls over a week. Out of 10,000 unique user prompts, 8,200 were served from cache after the first day. My bill dropped from $400 to under $80. Latency improved dramatically too – cache hits took ~30ms, while API calls averaged 1.2s.
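Those numbers are consistent with each other, as a quick back-of-the-envelope check shows. The sketch below assumes that cost is proportional to the number of API calls and that the $400 baseline covers a comparable period. Both are simplifications, not something measured here:

```python
total_prompts = 10_000
cache_hits = 8_200
baseline_bill = 400        # dollars, before caching
hit_latency_ms = 30
miss_latency_ms = 1_200    # 1.2 s per API call

misses = total_prompts - cache_hits
hit_rate = cache_hits / total_prompts

# Assumption: every API call costs the same, so the bill scales with misses.
projected_bill = baseline_bill * misses / total_prompts

# Average latency, weighted by how often each path is taken.
avg_latency_ms = (
    cache_hits * hit_latency_ms + misses * miss_latency_ms
) / total_prompts

print(f"hit rate: {hit_rate:.0%}")
print(f"projected bill: ${projected_bill:.2f}")
print(f"average latency: {avg_latency_ms:.1f} ms")
```

An 82% hit rate projects to a $72 bill, which matches "under $80", and brings the average response time to roughly a quarter of a second.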

The threshold is tricky. At 0.95 I missed too many valid matches. At 0.85 I started getting occasional nonsense responses (e.g., "red backpack" returning a description meant for "blue jacket"). 0.92 worked for my domain, but you'll need to tune it.
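One way to tune it is to hand-label a sample of prompt pairs as "should share a response" or not, score each pair with the embedding model, and count the outcomes at each candidate threshold. The sketch below assumes the similarity scores have already been computed. The `sweep_thresholds` helper and the scores are hypothetical, made up purely for illustration:

```python
def sweep_thresholds(scored_pairs, thresholds):
    """Count correct hits, wrong hits, and missed matches per threshold.

    scored_pairs: list of (cosine_similarity, should_match) tuples.
    """
    report = {}
    for t in thresholds:
        correct = sum(1 for s, ok in scored_pairs if s >= t and ok)
        wrong = sum(1 for s, ok in scored_pairs if s >= t and not ok)
        missed = sum(1 for s, ok in scored_pairs if s < t and ok)
        report[t] = {"correct_hits": correct, "wrong_hits": wrong, "missed": missed}
    return report

# Hypothetical, hand-labeled scores for illustration only (not measured data).
pairs = [
    (0.97, True),
    (0.94, True),
    (0.93, True),
    (0.88, False),
    (0.86, True),
    (0.80, False),
]
report = sweep_thresholds(pairs, [0.85, 0.92, 0.95])
for threshold, counts in report.items():
    print(threshold, counts)
```

A wrong hit returns the wrong response to a user, while a missed match only costs one extra API call, so it is usually safer to pick the lowest threshold that keeps wrong hits at zero on your own labeled sample.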

Memory grows. For a large-scale app you'd want a vector database (Pinecone, Qdrant, pgvector). I'm currently migrating to pgvector for persistence and faster search.
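Before reaching for a vector database, an in-process cache can be stretched a little further by capping its size and replacing the Python loop with a single matrix product. The sketch below is one possible shape for that, not the original project's code. The `MatrixCache` class, its `max_entries` parameter, and the FIFO eviction policy are all assumptions, and it takes precomputed embeddings instead of calling the model itself:

```python
import numpy as np

class MatrixCache:
    """Bounded in-memory cache with one vectorized similarity search."""

    def __init__(self, dim, threshold=0.92, max_entries=10_000):
        self.vectors = np.empty((0, dim), dtype=np.float32)  # unit-length rows
        self.responses = []
        self.threshold = threshold
        self.max_entries = max_entries

    @staticmethod
    def _unit(v):
        v = np.asarray(v, dtype=np.float32)
        return v / np.linalg.norm(v)

    def get(self, emb):
        if not self.responses:
            return None
        # For unit vectors, the dot product equals cosine similarity.
        sims = self.vectors @ self._unit(emb)
        best = int(np.argmax(sims))  # best match, not just the first one
        return self.responses[best] if sims[best] >= self.threshold else None

    def set(self, emb, response):
        if len(self.responses) >= self.max_entries:
            # Evict the oldest entry (FIFO) to cap memory use.
            self.vectors = self.vectors[1:]
            self.responses.pop(0)
        self.vectors = np.vstack([self.vectors, self._unit(emb)])
        self.responses.append(response)

# Toy 3-dimensional vectors stand in for real embeddings.
cache = MatrixCache(dim=3, threshold=0.92, max_entries=2)
cache.set([1.0, 0.0, 0.0], "A")
cache.set([0.0, 1.0, 0.0], "B")
hit = cache.get([0.99, 0.05, 0.0])    # close to the first vector
miss = cache.get([0.7, 0.7, 0.0])     # between both, below the threshold
cache.set([0.0, 0.0, 1.0], "C")       # cache is full, so "A" is evicted
evicted = cache.get([1.0, 0.0, 0.0])
print(hit, miss, evicted)
```

This still scans every entry and loses its contents on restart, so it delays the move to a vector database but does not replace it.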

Embedding computation isn't free. Generating the embedding takes ~5ms. That's still much cheaper than an API call, but it's not zero. Batch processing can help.

A semantic cache is only useful for queries that are genuinely similar. If your users constantly ask completely different things, the hit rate will be low. My use case had repetitive patterns (product categories, attributes), so it worked well.

Semantic caching won't solve every scaling problem, but for AI apps that serve repetitive or similar prompts – chatbots, code generators, content creators – it's a massive win. The technique itself is more valuable than any specific tool I used.

What's your experience been with managing AI API costs? Have you tried semantic caching or other approaches? I'd love to hear what worked (or didn't) in your projects.
