I slashed AI API costs by 80% with a cache that actually works


Source: dev.to · Published: 2026-06-29T10:00Z · Topic: artificial-intelligence

A developer built a semantic cache using sentence-transformers to reduce AI API costs by 80%. The cache stores embeddings of prompts and reuses responses for semantically similar queries, cutting a $400 monthly bill to under $80. The approach uses cosine similarity with a threshold of 0.92 to balance accuracy and cache hit rate.

3 min read · Published Jun 29, 2026

A few months ago I built a side project that generates personalized product descriptions using an AI API. Within two weeks my API bill hit $400 and I knew something had to change.

I started with the obvious approach: a simple dictionary cache keyed on the exact prompt string.

cache = {}
def get_ai_response(prompt):
    if prompt in cache:
        return cache[prompt]
    response = api_call(prompt)  # expensive
    cache[prompt] = response
    return response

The problem? My users rarely typed the exact same request twice. "Red running shoes for women" and "women's red running shoes" are semantically identical, but an exact-match cache stores them under different keys. In effect, I was paying twice for about 95% of calls.

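First I attempted prompt normalization: lowercasing, sorting words, removing punctuation. It helped a little, since pure word-order differences collapse to the same key, but real-world queries vary too much. "Red running shoes for women" and "women's red running shoes" still looked different after normalization, because the extra "for" and the possessive survive it.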
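To make the limitation concrete, here is a minimal sketch of the three normalization steps named above. The helper name `normalize_prompt` is hypothetical and not taken from the original project; it only illustrates why small wording differences still produce distinct cache keys.

```python
import string

def normalize_prompt(prompt: str) -> str:
    """Lowercase, strip ASCII punctuation, and sort the words (hypothetical helper)."""
    no_punct = prompt.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(no_punct.split()))

a = normalize_prompt("Red running shoes for women")
b = normalize_prompt("women's red running shoes")

print(a)       # for red running shoes women
print(b)       # red running shoes womens
print(a == b)  # False: still two different cache keys
```

Stop-word removal and stemming would close this particular gap, but each new rule only covers the variations you thought of in advance.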

Then I tried TF-IDF vectorization with cosine similarity. Better, but the bag-of-words approach missed semantic meaning. "Cheap laptop" and "budget notebook" would pass as unrelated.
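The failure mode is easy to reproduce with a small sketch. As a simplifying assumption, it uses raw word counts rather than real TF-IDF weights: IDF weighting rescales each word's contribution but cannot create overlap between prompts that share no words, so the result for the example pair is the same. The function name `bow_cosine` is illustrative.

```python
import math
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between the raw word-count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    # Only words present in both strings contribute to the dot product.
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

no_overlap = bow_cosine("cheap laptop", "budget notebook")
one_shared = bow_cosine("cheap laptop", "cheap notebook")

print(no_overlap)  # 0.0, although the two prompts mean the same thing
print(one_shared)  # roughly 0.5, driven only by the shared word "cheap"
```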

The core idea is simple: convert every incoming prompt into a vector (embedding), store it together with the API response, and for each new query first check if we already have a semantically similar prompt in the cache. If cosine similarity exceeds a threshold, reuse the stored response instead of calling the API.

I built this using sentence-transformers (all-MiniLM-L6-v2) for embeddings.

Here's the heart of it:

from sentence_transformers import SentenceTransformer
import numpy as np

# Small general-purpose embedding model, loaded once at import time
model = SentenceTransformer('all-MiniLM-L6-v2')

class SemanticCache:
    def __init__(self, similarity_threshold=0.92):
        self.embeddings = []  # one vector per cached prompt
        self.responses = []   # responses, index-aligned with self.embeddings
        self.threshold = similarity_threshold

    def _cosine_similarity(self, a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def get(self, prompt):
        emb = model.encode(prompt, convert_to_tensor=False)
        # Linear scan: return the first stored response that clears the threshold
        for i, stored_emb in enumerate(self.embeddings):
            if self._cosine_similarity(emb, stored_emb) >= self.threshold:
                return self.responses[i]
        return None  # cache miss

    def set(self, prompt, response):
        emb = model.encode(prompt, convert_to_tensor=False)
        self.embeddings.append(emb)
        self.responses.append(response)
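One thing to be aware of: the `get` method above returns the first stored entry that clears the threshold, which is not necessarily the closest one. A possible variation, sketched below with NumPy only, stacks the stored vectors into a matrix so a lookup becomes a single matrix-vector product that picks the best match. The function `best_match` and the toy 3-dimensional vectors are illustrative assumptions, not part of the original code.

```python
import numpy as np

def best_match(query, stored, threshold=0.92):
    """Return (index, similarity) of the closest stored vector, or None if below threshold."""
    if len(stored) == 0:
        return None
    stored = np.asarray(stored, dtype=float)
    query = np.asarray(query, dtype=float)
    # Scale to unit length so the dot product equals cosine similarity.
    stored_unit = stored / np.linalg.norm(stored, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    sims = stored_unit @ query_unit
    idx = int(np.argmax(sims))
    return (idx, float(sims[idx])) if sims[idx] >= threshold else None

# Toy vectors standing in for real embeddings (assumption for illustration).
stored = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]

hit = best_match([0.9, 0.12, 0.0], stored)
miss = best_match([0.0, 0.0, 1.0], stored)
print(hit[0])  # 1: the closest entry, although entry 0 also clears 0.92
print(miss)    # None
```

In a real cache you would normalize each vector once when it is stored rather than on every lookup; the sketch also assumes no zero-length vectors.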

And the integration with my AI API (example with Interwest AI – just replace with your provider):

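import requests

# The constructor argument is named similarity_threshold, not threshold
cache = SemanticCache(similarity_threshold=0.92)

def get_description(product_query):
    cached = cache.get(product_query)
    if cached is not None:  # cache hit: skip the API call entirely
        return cached
    response = requests.post(
        "https://ai.interwestinfo.com/v1/generate",
        json={"prompt": product_query, "max_tokens": 100},
        timeout=30,  # avoid hanging forever on a stalled request
    )
    response.raise_for_status()  # surface HTTP errors instead of caching them
    text = response.json()["text"]
    cache.set(product_query, text)
    return text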

After deploying this cache I tracked API calls over a week. Out of 10,000 unique user prompts, 8,200 were served from cache after the first day. My bill dropped from $400 to under $80. Latency improved dramatically too – cache hits took ~30ms, while API calls averaged 1.2s.
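The numbers above are consistent with each other, which a quick back-of-the-envelope check shows. The sketch assumes that cost scales linearly with the number of API calls and that the measured hit rate holds across the whole billing period; both are simplifications.

```python
total_prompts = 10_000
cache_hits = 8_200
hit_rate = cache_hits / total_prompts

original_bill = 400.0
# Assumption: the bill is proportional to the number of calls that reach the API.
projected_bill = original_bill * (1 - hit_rate)

hit_ms, miss_ms = 30, 1200  # ~30 ms per cache hit, ~1.2 s per API call
avg_latency_ms = hit_rate * hit_ms + (1 - hit_rate) * miss_ms

print(f"hit rate: {hit_rate:.0%}")                  # hit rate: 82%
print(f"projected bill: ${projected_bill:.2f}")     # projected bill: $72.00
print(f"average latency: {avg_latency_ms:.0f} ms")  # average latency: 241 ms
```

An 82% hit rate leaves 18% of calls billable, which is where the "under $80" figure comes from.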

The threshold is tricky. At 0.95 I missed too many valid matches. At 0.85 I started getting occasional nonsense responses (e.g., "red backpack" returning a description meant for "blue jacket"). 0.92 worked for my domain, but you'll need to tune it.
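One way to tune the threshold is to hand-label a sample of prompt pairs and count both kinds of error at each candidate value. The sketch below shows the idea; the similarity scores and labels are made up purely for illustration, so collect real pairs from your own traffic before drawing conclusions.

```python
# Hypothetical hand-labeled pairs: (cosine similarity, is reusing the response correct?)
labeled_pairs = [
    (0.97, True), (0.94, True), (0.93, True), (0.90, True),
    (0.88, False), (0.86, False), (0.80, False),
]

def evaluate(threshold, pairs):
    """Return (valid matches missed, wrong responses served) at a given threshold."""
    missed = sum(1 for sim, ok in pairs if ok and sim < threshold)
    wrong = sum(1 for sim, ok in pairs if not ok and sim >= threshold)
    return missed, wrong

for t in (0.85, 0.92, 0.95):
    missed, wrong = evaluate(t, labeled_pairs)
    print(f"threshold={t:.2f} missed={missed} wrong={wrong}")
```

A missed match only costs one extra API call, while a wrong response reaches the user, so it usually makes sense to lean toward the stricter side.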

Memory grows. For a large-scale app you'd want a vector database (Pinecone, Qdrant, pgvector). I'm currently migrating to pgvector for persistence and faster search.
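For a rough sense of scale, all-MiniLM-L6-v2 produces 384-dimensional vectors. The estimate below assumes they are kept as float32 and counts only the raw vector data, not the cached responses or Python object overhead.

```python
EMBEDDING_DIM = 384   # output size of all-MiniLM-L6-v2
BYTES_PER_FLOAT = 4   # assumption: vectors are stored as float32

def embedding_memory_mb(num_entries: int) -> float:
    """Approximate raw size of the stored vectors in megabytes (10**6 bytes)."""
    return num_entries * EMBEDDING_DIM * BYTES_PER_FLOAT / 1_000_000

for n in (10_000, 1_000_000):
    print(f"{n:>9,} entries: ~{embedding_memory_mb(n):,.1f} MB")
```

The vectors themselves stay manageable for a while; the bigger issue with the in-memory list is that every lookup scans all entries and nothing survives a restart.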

Embedding computation isn't free. Generating the embedding takes ~5ms. That's still much cheaper than an API call, but it's not zero. Batch processing can help.
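One place the embedding cost is paid twice: on a cache miss, calling `get` and then `set` encodes the same prompt two times. A possible variation encodes once and reuses the vector. In the sketch below, `embed` and `generate` are injected callables standing in for `model.encode` and the API call; the class `EncodeOnceCache`, the method `get_or_generate`, and the toy embedding function are all hypothetical.

```python
import math

class EncodeOnceCache:
    def __init__(self, embed, threshold=0.92):
        self.embed = embed          # callable: prompt -> vector
        self.threshold = threshold
        self.entries = []           # list of (embedding, response) pairs

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def get_or_generate(self, prompt, generate):
        emb = self.embed(prompt)  # computed exactly once per request
        for stored_emb, response in self.entries:
            if self._cosine(emb, stored_emb) >= self.threshold:
                return response
        response = generate(prompt)            # cache miss: call the API
        self.entries.append((emb, response))   # reuse the vector from above
        return response

# Toy stand-ins so the sketch runs without a model or a network call.
calls = {"embed": 0, "api": 0}

def toy_embed(text):
    calls["embed"] += 1
    words = text.lower().split()
    return [words.count("red"), words.count("shoes"), words.count("jacket")]

def toy_generate(prompt):
    calls["api"] += 1
    return f"description for: {prompt}"

demo_cache = EncodeOnceCache(toy_embed)
first = demo_cache.get_or_generate("red shoes", toy_generate)
second = demo_cache.get_or_generate("shoes red", toy_generate)
print(calls)  # {'embed': 2, 'api': 1}
```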

A semantic cache is only useful for queries that are genuinely similar. If your users constantly ask completely different things, the hit rate will be low. My use case had repetitive patterns (product categories, attributes), so it worked well.

Semantic caching won't solve every scaling problem, but for AI apps that serve repetitive or similar prompts – chat bots, code generators, content creators – it's a massive win. The technique itself is more valuable than any specific tool I used.

What's your experience been with managing AI API costs? Have you tried semantic caching or other approaches? I'd love to hear what worked (or didn't) in your projects.
