How We Reduced Our LLM API Costs by 60%: What Actually Worked

A developer team reduced their LLM API costs by 60% through systematic optimization, starting with instrumenting every call to log token counts per request type. The biggest savings came from semantic caching, which caches responses based on semantic similarity rather than exact string matches, avoiding redundant API calls for similar user queries.

At some point in most of our production AI projects, someone looks at the monthly API bill and asks whether we can do something about it. The answer is always yes, but the specific answers vary a lot depending on what you are actually spending the money on.

This post covers the techniques that moved the needle for us, in rough order of impact. Some of these are obvious in retrospect. A few took longer than they should have to figure out.

Before optimising anything, you need to know what is driving your costs. LLM API pricing is based on tokens (https://platform.openai.com/tokenizer): input tokens and output tokens, usually priced differently, with output tokens costing more.

In most production systems we have built, the cost breakdown looks something like this: a large fraction of input tokens are repetitive context (the same system prompt, the same retrieved documents, the same few-shot examples) sent with every request. Output tokens are often smaller than people expect, because most real-world tasks involve classification, extraction, or short-form generation rather than long prose.

The implication is that the biggest gains usually come from reducing redundant input tokens, not from compressing outputs or switching models for their own sake.

We instrument every LLM call in production to log token counts per request type. Without this, you are guessing. Here is the middleware we use on Django projects:

```python
import time
import logging
from dataclasses import dataclass, field
from typing import Optional

logger = logging.getLogger("llm.usage")


@dataclass
class LLMCallRecord:
    model: str
    call_type: str  # e.g. "rag_query", "classification", "extraction"
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cached: bool = False
    metadata: dict = field(default_factory=dict)

    @property
    def estimated_cost_usd(self) -> float:
        # Update rates as pricing changes
        rates = {
            "gpt-4o": {"input": 0.0000025, "output": 0.00001},
            "gpt-4o-mini": {"input": 0.00000015, "output": 0.0000006},
            "claude-sonnet-4-6": {"input": 0.000003, "output": 0.000015},
            "claude-haiku-4-5-20251001": {"input": 0.0000008, "output": 0.000004},
        }
        rate = rates.get(self.model, {"input": 0.000003, "output": 0.000015})
        return (
            self.input_tokens * rate["input"]
            + self.output_tokens * rate["output"]
        )


def log_llm_call(record: LLMCallRecord):
    logger.info(
        "llm_call",
        extra={
            "model": record.model,
            "call_type": record.call_type,
            "input_tokens": record.input_tokens,
            "output_tokens": record.output_tokens,
            "latency_ms": record.latency_ms,
            "cached": record.cached,
            "estimated_cost_usd": record.estimated_cost_usd,
            **record.metadata,
        },
    )
```

Once you have a week of data, you will know exactly which call types account for the most spend. Every optimisation effort since has started with this data, not with instinct.

The single biggest reduction came from semantic caching: caching LLM responses not by exact string match, but by semantic similarity. Users ask the same questions in different ways. Without semantic caching, each phrasing triggers a fresh API call.

The principle: embed the incoming query, search your cache store for similar queries above a similarity threshold, and return the cached response if found. Only call the LLM on genuinely novel requests.

```python
import hashlib
import json
from typing import Optional

import numpy as np
from django.core.cache import cache
from openai import OpenAI

client = OpenAI()


def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding


def cosine_similarity(a: list[float], b: list[float]) -> float:
    a_arr, b_arr = np.array(a), np.array(b)
    return float(
        np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr))
    )


class SemanticCache:
    """
    Cache LLM responses by semantic similarity of the query.
    Stores (embedding, response) pairs in Django's cache backend.
    """

    CACHE_KEY_INDEX = "semantic_cache:index"
    SIMILARITY_THRESHOLD = 0.95
    MAX_CACHE_SIZE = 1000

    def get(self, query: str) -> Optional[str]:
        query_embedding = get_embedding(query)
        index = cache.get(self.CACHE_KEY_INDEX, [])
        for entry in index:
            similarity = cosine_similarity(query_embedding, entry["embedding"])
            if similarity >= self.SIMILARITY_THRESHOLD:
                cached_response = cache.get(entry["cache_key"])
                if cached_response:
                    return cached_response
        return None

    def set(self, query: str, response: str, ttl: int = 3600):
        query_embedding = get_embedding(query)
        cache_key = f"semantic_cache:{hashlib.md5(query.encode()).hexdigest()}"
        cache.set(cache_key, response, ttl)
        index = cache.get(self.CACHE_KEY_INDEX, [])
        index.append({"embedding": query_embedding, "cache_key": cache_key})
        # Keep the index bounded
        if len(index) > self.MAX_CACHE_SIZE:
            index = index[-self.MAX_CACHE_SIZE:]
        cache.set(self.CACHE_KEY_INDEX, index, ttl * 2)


semantic_cache = SemanticCache()
```

In practice, for customer-facing query interfaces, cache hit rates above 30% are common after the first few weeks of traffic. The embedding calls for cache lookup cost a fraction of a full LLM completion.

One thing to watch: the similarity threshold matters a lot. A threshold of 0.95 is conservative and safe for factual queries.
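
The cache class leaves the call site implicit. One way to wire it in is a small get-or-compute wrapper. This is a sketch under assumptions, not our production code: `answer_with_cache`, `ResponseCache`, and the `call_llm` callable are hypothetical names, and the cache can be any object with the same `get(query)` and `set(query, response)` shape as `SemanticCache`.

```python
from typing import Callable, Optional, Protocol


class ResponseCache(Protocol):
    """Hypothetical interface: the get/set shape of a semantic cache."""

    def get(self, query: str) -> Optional[str]: ...

    def set(self, query: str, response: str) -> None: ...


def answer_with_cache(
    query: str,
    cache: ResponseCache,
    call_llm: Callable[[str], str],
) -> tuple[str, bool]:
    """Return (response, was_cached); call the LLM only on a cache miss."""
    cached = cache.get(query)
    if cached is not None:
        return cached, True

    # Cache miss: pay for one completion, then store it for later lookups.
    response = call_llm(query)
    cache.set(query, response)
    return response, False
```

The `was_cached` flag maps onto the `cached` field of the usage record, so the hit rate shows up in the same logs as the token counts.
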
For creative or generative tasks, caching is usually not appropriate at all: you do not want users getting each other's generated content.

System prompts grow over time. You add instructions to handle edge cases. You add examples. You add clarifications about what the model should not do. Before long, a system prompt that started at 200 tokens is 1,500 tokens, and you are paying for every token on every call.

We do two things here. First, we audit system prompts quarterly for redundancy. Prompts often contain instructions that are now unnecessary because the model handles them correctly by default, or because the use case evolved.

Second, for RAG pipelines, we compress retrieved context aggressively. The naive approach retrieves full document chunks. In practice, much of the retrieved text is irrelevant to the specific query. We add a compression step:

```python
from openai import OpenAI

client = OpenAI()


def compress_context(query: str, retrieved_chunks: list[str]) -> str:
    """
    Given a query and retrieved document chunks, extract only the
    sentences or passages directly relevant to answering the query.
    Uses a cheap, fast model; cost is much lower than sending full chunks.
    """
    combined = "\n\n---\n\n".join(retrieved_chunks)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # cheap model for this step
        messages=[
            {
                "role": "system",
                "content": (
                    "Extract only the sentences or short passages from the provided text "
                    "that are directly relevant to answering the query. "
                    "Remove everything else. Preserve the meaning of what you keep. "
                    "Do not add anything that is not in the source text."
                ),
            },
            {
                "role": "user",
                "content": f"Query: {query}\n\nText:\n{combined}",
            },
        ],
        max_tokens=800,
    )
    return response.choices[0].message.content


# Usage in your RAG pipeline:
# compressed = compress_context(user_query, retrieved_chunks)
# final_response = expensive_model_call(user_query, compressed)
```

This adds a small cost for the compression step, but the reduction in context sent to the main model more than covers it: typically a 3–4x reduction in RAG context length.

Not every LLM call needs the most capable model you have access to. We maintain a simple routing layer that assigns each call type to the cheapest model that handles it reliably. The categories we use are simple, moderate, and complex tasks; simple tasks go to gpt-4o-mini or claude-haiku.

```python
from enum import Enum
from dataclasses import dataclass

from openai import OpenAI

client = OpenAI()


class TaskComplexity(Enum):
    SIMPLE = "simple"      # classification, extraction, yes/no
    MODERATE = "moderate"  # summarisation, short generation
    COMPLEX = "complex"    # reasoning, planning, nuanced generation


@dataclass
class ModelConfig:
    model: str
    max_tokens: int


ROUTING_TABLE: dict[TaskComplexity, ModelConfig] = {
    TaskComplexity.SIMPLE: ModelConfig(model="gpt-4o-mini", max_tokens=256),
    TaskComplexity.MODERATE: ModelConfig(model="gpt-4o-mini", max_tokens=1024),
    TaskComplexity.COMPLEX: ModelConfig(model="gpt-4o", max_tokens=2048),
}


def get_model_config(complexity: TaskComplexity) -> ModelConfig:
    return ROUTING_TABLE[complexity]


# Example: classifying support tickets
def classify_support_ticket(ticket_text: str) -> str:
    config = get_model_config(TaskComplexity.SIMPLE)
    response = client.chat.completions.create(
        model=config.model,
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify the support ticket into exactly one of: "
                    "billing, technical, account, feature_request, other. "
                    "Reply with only the category name."
                ),
            },
            {"role": "user", "content": ticket_text},
        ],
        max_tokens=config.max_tokens,
    )
    return response.choices[0].message.content.strip()
```

The key discipline here is to benchmark each task type with both model tiers before committing to the cheaper option. "It seems fine" is not a good enough standard. We run each call type through 50–100 real production examples, grade the outputs, and only route to the cheaper model if quality is within an acceptable margin.

A few things we tried did not pay off.

Aggressive output length constraints. We tried setting low max_tokens on generation tasks to reduce output cost. It saved a small amount but made the outputs worse, because models truncate in ways that break coherence. Output token cost is usually not the problem; do not sacrifice quality here.

Batching requests. The batch API reduces cost by ~50% on some providers but introduces latency of minutes to hours. For anything user-facing, the tradeoff is not worth it. It works for offline processing jobs where latency does not matter.

Switching providers entirely based on benchmark performance. We spent time evaluating alternative providers that were cheaper per token. For some tasks they were fine. For others, the quality drop was meaningful and affected the product. The benchmarks do not tell you how a model performs on your specific task with your specific data; only testing on your own workload does.

Sixty percent cost reduction sounds dramatic. In practice it came from three things applied together: semantic caching (biggest impact), smarter model routing (second biggest), and prompt/context compression (smaller but meaningful). None of these required rearchitecting anything fundamental.

The prerequisite for all of it was instrumentation. You cannot optimise what you cannot measure. Log every call, log token counts by call type, and look at the data before deciding where to focus. The calls that feel expensive are often not the ones that are actually expensive.

The other thing worth saying: do not optimise prematurely.
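
Both checks, where the spend goes and whether it is large enough to matter, can come from the same usage log. A minimal sketch, assuming the log has been exported as (call type, estimated cost in USD) pairs; `spend_by_call_type` is a hypothetical helper and the sample figures are made up for illustration, not our production numbers.

```python
from collections import defaultdict
from typing import Iterable


def spend_by_call_type(
    records: Iterable[tuple[str, float]],
) -> list[tuple[str, float, float]]:
    """Return (call_type, total_cost_usd, share_of_total) rows, largest spend first."""
    totals: dict[str, float] = defaultdict(float)
    for call_type, cost_usd in records:
        totals[call_type] += cost_usd

    grand_total = sum(totals.values())
    rows = [
        (call_type, cost, cost / grand_total if grand_total else 0.0)
        for call_type, cost in totals.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up sample records, for illustration only.
sample = [
    ("rag_query", 0.75),
    ("classification", 0.25),
    ("rag_query", 0.5),
    ("extraction", 0.5),
]
for call_type, cost, share in spend_by_call_type(sample):
    print(f"{call_type:<16} ${cost:.2f}  {share:.1%}")
```

Summing the cost column over a month gives the overall figure to weigh against the engineering time an optimisation would take.
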
If your LLM spend is $300 a month and growing slowly, the engineering time to implement semantic caching is not worth it yet. Do it when the numbers justify it, and make sure the instrumentation is already in place so you know when that point arrives.

Lycore builds production AI systems (https://www.lycore.com/ai-development-services/) for businesses: RAG pipelines, AI agents, LLM integrations, and custom AI applications built for scale and reliability. Get in touch (https://www.lycore.com/contact-us/) if you want to talk through your use case.