I slashed AI API costs by 80% with a cache that actually works

A developer built a semantic cache using sentence-transformers to reduce AI API costs by 80%. The cache stores embeddings of prompts and reuses responses for semantically similar queries, cutting a $400 monthly bill to under $80. The approach uses cosine similarity with a threshold of 0.92 to balance accuracy and cache hit rate.

A few months ago I built a side project that generates personalized product descriptions using an AI API. Within two weeks my API bill hit $400 and I knew something had to change.

I started with the obvious approach: a simple dictionary cache keyed on the exact prompt string.

```python
cache = {}

def get_ai_response(prompt):
    if prompt in cache:
        return cache[prompt]
    response = api_call(prompt)  # expensive
    cache[prompt] = response
    return response
```

The problem? My users rarely typed the exact same request twice. "Red running shoes for women" vs. "women's red running shoes" are semantically identical but cached as different keys. I was paying for 95% of calls twice.

First I attempted prompt normalization: lowercasing, sorting words, removing punctuation. It helped a little, but real-world queries vary too much. "Nike sneakers size 9" and "size 9 Nike sneakers" still looked different after normalization.

Then I tried TF-IDF vectorization with cosine similarity. Better, but the bag-of-words approach missed semantic meaning. "Cheap laptop" and "budget notebook" would pass as unrelated.

The core idea is simple: convert every incoming prompt into a vector (an embedding), store it together with the API response, and for each new query first check whether the cache already holds a semantically similar prompt. If the cosine similarity reaches a threshold, reuse the stored response instead of calling the API.

I built this using sentence-transformers (`all-MiniLM-L6-v2`) for embeddings.

Here's the heart of it:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

class SemanticCache:
    def __init__(self, similarity_threshold=0.92):
        self.embeddings = []  # one embedding per cached prompt
        self.responses = []   # responses, index-aligned with embeddings
        self.threshold = similarity_threshold

    def _cosine_similarity(self, a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def get(self, prompt):
        emb = model.encode(prompt, convert_to_tensor=False)
        # Linear scan: returns the first stored entry that clears the threshold.
        for i, stored_emb in enumerate(self.embeddings):
            if self._cosine_similarity(emb, stored_emb) >= self.threshold:
                return self.responses[i]
        return None

    def set(self, prompt, response):
        emb = model.encode(prompt, convert_to_tensor=False)
        self.embeddings.append(emb)
        self.responses.append(response)
```

And the integration with my AI API (example with Interwest AI – just replace with your provider):

```python
import requests

cache = SemanticCache(similarity_threshold=0.92)

def get_description(product_query):
    cached = cache.get(product_query)
    if cached is not None:
        return cached
    # Replace with your actual API call. I used Interwest AI's endpoint.
    response = requests.post(
        "https://ai.interwestinfo.com/v1/generate",
        json={"prompt": product_query, "max_tokens": 100},
    ).json()
    cache.set(product_query, response["text"])
    return response["text"]
```

After deploying this cache I tracked API calls over a week. Out of 10,000 unique user prompts, 8,200 were served from cache after the first day. My bill dropped from $400 to under $80. Latency improved dramatically too – cache hits took ~30ms, while API calls averaged 1.2s.

The threshold is tricky. At 0.95 I missed too many valid matches. At 0.85 I started getting occasional nonsense responses (e.g., "red backpack" returning a description meant for "blue jacket"). 0.92 worked for my domain, but you'll need to tune it.

Memory grows. Every cached prompt keeps its embedding and response in process memory, and lookup is a linear scan. For a large-scale app you'd want a vector database (Pinecone, Qdrant, pgvector). I'm currently migrating to pgvector for persistence and faster search.

Embedding computation isn't free. Generating the embedding takes ~5ms. That's still much cheaper than an API call, but it's not zero. Batch processing can help.

Semantic cache is only useful for queries that are genuinely similar. If your users constantly ask completely different things, the hit rate will be low. My use case had repetitive patterns (product categories, attributes), so it worked perfectly.

Semantic caching won't solve every scaling problem, but for AI apps that serve repetitive or similar prompts – chat bots, code generators, content creators – it's a massive win.
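Before reaching for a vector database, the per-entry loop can be replaced with a single matrix product. This is a minimal sketch, not what I ran in production: `MatrixSemanticCache` is a hypothetical name, and it assumes the caller passes in the embedding (for example the output of `model.encode(prompt)`) rather than the raw prompt. Rows are stored as unit vectors, so a dot product equals cosine similarity, and it returns the best match rather than the first one over the threshold.

```python
import numpy as np


class MatrixSemanticCache:
    """Hypothetical variant: scores a query against all stored embeddings at once."""

    def __init__(self, similarity_threshold=0.92):
        self.threshold = similarity_threshold
        self.matrix = None   # shape (n_entries, dim); each row is a unit vector
        self.responses = []  # index-aligned with the rows of self.matrix

    @staticmethod
    def _normalize(vec):
        vec = np.asarray(vec, dtype=np.float32)
        norm = np.linalg.norm(vec)
        return vec / norm if norm > 0 else vec

    def get(self, emb):
        if self.matrix is None:
            return None
        # Unit rows times a unit query gives one cosine similarity per entry.
        scores = self.matrix @ self._normalize(emb)
        best = int(np.argmax(scores))
        return self.responses[best] if scores[best] >= self.threshold else None

    def set(self, emb, response):
        row = self._normalize(emb)[None, :]
        # np.vstack copies the whole matrix on every insert; fine for a sketch.
        self.matrix = row if self.matrix is None else np.vstack([self.matrix, row])
        self.responses.append(response)
```

Picking the highest-scoring entry also makes a loose threshold a little safer, since a weak early match can no longer shadow a stronger one stored later.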
The technique itself is more valuable than any specific tool I used. What's your experience been with managing AI API costs? Have you tried semantic caching or other approaches? I'd love to hear what worked or didn't in your projects.