Source: dev.to · Topic: large-language-models

How We Reduced Our LLM API Costs by 60%: What Actually Worked

A developer team reduced their LLM API costs by 60% through systematic optimization, starting with instrumenting every call to log token counts per request type. The biggest savings came from semantic caching, which caches responses based on semantic similarity rather than exact string matches, avoiding redundant API calls for similar user queries.

8 min read · Published Jun 29, 2026

At some point in most of our production AI projects, someone looks at the monthly API bill and asks whether we can do something about it. The answer is always yes, but the specific answers vary a lot depending on what you are actually spending the money on.

This post covers the techniques that moved the needle for us, in rough order of impact. Some of these are obvious in retrospect. A few took longer than they should have to figure out.

Before optimising anything, you need to know what is driving your costs. LLM API pricing is based on tokens: input tokens and output tokens, usually priced differently, with output tokens costing more.

In most production systems we have built, the cost breakdown looks something like this: a large fraction of input tokens is repetitive context (the same system prompt, the same retrieved documents, the same few-shot examples) sent with every request. Output token counts are often smaller than people expect, because most real-world tasks involve classification, extraction, or short-form generation rather than long prose.

The implication is that the biggest gains usually come from reducing redundant input tokens, not from compressing outputs or switching models for their own sake.
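To make that breakdown concrete, here is a rough sketch of the arithmetic for a single RAG-style request. The token counts are illustrative assumptions, not measurements from our systems; the per-token rates are the gpt-4o figures used in the logging code in this post.

```python
# Illustrative per-request cost breakdown. Token counts are assumed values
# chosen for the example; rates are USD per token for gpt-4o.
INPUT_RATE = 0.0000025
OUTPUT_RATE = 0.00001

request = {
    "system_prompt": 1500,      # repeated on every call
    "retrieved_context": 3000,  # often repeated across similar queries
    "few_shot_examples": 600,   # repeated on every call
    "user_query": 50,           # the only genuinely new input
}
output_tokens = 150  # a short-form answer


def cost_breakdown(input_parts: dict[str, int], output_tokens: int) -> dict[str, float]:
    """Return the USD cost of each input part and of the output."""
    costs = {name: tokens * INPUT_RATE for name, tokens in input_parts.items()}
    costs["output"] = output_tokens * OUTPUT_RATE
    return costs


costs = cost_breakdown(request, output_tokens)
total = sum(costs.values())

# Print the parts from most to least expensive, with their share of the total.
for name, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} ${cost:.6f}  {cost / total:6.1%}")
```

With these assumed numbers, output accounts for roughly a tenth of the request cost even though each output token costs four times as much as an input token.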

We instrument every LLM call in production to log token counts per request type. Without this, you are guessing. Here is the logging helper we use on Django projects:

import time
import logging
from dataclasses import dataclass, field
from typing import Optional

logger = logging.getLogger("llm.usage")

@dataclass
class LLMCallRecord:
    model: str
    call_type: str  # e.g. "rag_query", "classification", "extraction"
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cached: bool = False
    metadata: dict = field(default_factory=dict)

    @property
    def estimated_cost_usd(self) -> float:
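        # USD per token (list price per million tokens divided by 1,000,000).
        # Provider prices change; check these against the current price list.
        rates = {
            "gpt-4o": {"input": 0.0000025, "output": 0.00001},
            "gpt-4o-mini": {"input": 0.00000015, "output": 0.0000006},
            "claude-sonnet-4-6": {"input": 0.000003, "output": 0.000015},
            "claude-haiku-4-5-20251001": {"input": 0.0000008, "output": 0.000004},
        }
        # Unknown models fall back to the highest rate in the table, so
        # missing entries overestimate spend rather than hide it.
        rate = rates.get(self.model, {"input": 0.000003, "output": 0.000015})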
        return (self.input_tokens * rate["input"]) + (self.output_tokens * rate["output"])

def log_llm_call(record: LLMCallRecord):
    logger.info(
        "llm_call",
        extra={
            "model": record.model,
            "call_type": record.call_type,
            "input_tokens": record.input_tokens,
            "output_tokens": record.output_tokens,
            "latency_ms": record.latency_ms,
            "cached": record.cached,
            "estimated_cost_usd": record.estimated_cost_usd,
            **record.metadata,
        },
    )

Once you have a week of data, you will know exactly which call types account for the most spend. Every optimisation effort since has started with this data, not with instinct.
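As a sketch of what that analysis can look like, the function below totals spend per call type. It assumes the logged records have already been loaded as dictionaries with the `call_type` and `estimated_cost_usd` fields emitted by `log_llm_call`; how you pull them out of your log store is not shown, and the sample records are made up for illustration.

```python
from collections import defaultdict


def spend_by_call_type(records: list[dict]) -> list[tuple[str, float, int]]:
    """Aggregate logged calls into (call_type, total_cost_usd, call_count),
    sorted with the most expensive call type first."""
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for record in records:
        totals[record["call_type"]] += record["estimated_cost_usd"]
        counts[record["call_type"]] += 1
    return sorted(
        ((name, totals[name], counts[name]) for name in totals),
        key=lambda row: row[1],
        reverse=True,
    )


# Hand-written sample records, for illustration only.
sample = [
    {"call_type": "rag_query", "estimated_cost_usd": 0.0140},
    {"call_type": "rag_query", "estimated_cost_usd": 0.0125},
    {"call_type": "classification", "estimated_cost_usd": 0.0002},
    {"call_type": "extraction", "estimated_cost_usd": 0.0031},
]

for call_type, cost, calls in spend_by_call_type(sample):
    print(f"{call_type:15s} ${cost:.4f} over {calls} calls")
```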

The single biggest reduction came from semantic caching: caching LLM responses not by exact string match, but by semantic similarity. Users ask the same questions in different ways. Without semantic caching, each phrasing triggers a fresh API call.

The principle: embed the incoming query, search your cache store for similar queries above a similarity threshold, and return the cached response if found. Only call the LLM on genuinely novel requests.

import hashlib
import json
from typing import Optional
import numpy as np
from django.core.cache import cache
from openai import OpenAI

client = OpenAI()

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a_arr, b_arr = np.array(a), np.array(b)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))

class SemanticCache:
    """
    Cache LLM responses by semantic similarity of the query.
    Stores (embedding, response) pairs in Django's cache backend.
    """

    CACHE_KEY_INDEX = "semantic_cache:index"
    SIMILARITY_THRESHOLD = 0.95
    MAX_CACHE_SIZE = 1000

    def get(self, query: str) -> Optional[str]:
        query_embedding = get_embedding(query)
        index = cache.get(self.CACHE_KEY_INDEX, [])

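        for entry in index:
            similarity = cosine_similarity(query_embedding, entry["embedding"])
            if similarity >= self.SIMILARITY_THRESHOLD:
                cached_response = cache.get(entry["cache_key"])
                # The response may have expired while its index entry survives,
                # so a miss here just means "keep looking".
                if cached_response is not None:
                    return cached_response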

        return None

    def set(self, query: str, response: str, ttl: int = 3600):
        query_embedding = get_embedding(query)
        cache_key = f"semantic_cache:{hashlib.md5(query.encode()).hexdigest()}"

        cache.set(cache_key, response, ttl)

        index = cache.get(self.CACHE_KEY_INDEX, [])
        index.append({"embedding": query_embedding, "cache_key": cache_key})

        if len(index) > self.MAX_CACHE_SIZE:
            index = index[-self.MAX_CACHE_SIZE:]

        cache.set(self.CACHE_KEY_INDEX, index, ttl * 2)

semantic_cache = SemanticCache()

In practice, for customer-facing query interfaces, cache hit rates above 30% are common after the first few weeks of traffic. The embedding calls for cache lookup cost a fraction of a full LLM completion.
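A minimal sketch of how the cache might sit in front of a completion call. `call_llm` is a hypothetical stand-in for whatever function makes the real API request, and the cache is passed in as a parameter; it only needs the `get` and `set` methods that `SemanticCache` defines.

```python
from typing import Callable, Optional, Protocol


class QueryCache(Protocol):
    """Anything with the same get/set shape as SemanticCache."""

    def get(self, query: str) -> Optional[str]: ...

    def set(self, query: str, response: str, ttl: int = 3600) -> None: ...


def answer_with_cache(
    query: str,
    cache: QueryCache,
    call_llm: Callable[[str], str],  # hypothetical: wraps the real API call
    ttl: int = 3600,
) -> tuple[str, bool]:
    """Return (response, was_cached). The LLM is only called on a cache miss."""
    cached = cache.get(query)
    if cached is not None:
        return cached, True

    response = call_llm(query)
    cache.set(query, response, ttl)
    return response, False
```

Returning the `was_cached` flag alongside the response makes it easy to fill in the `cached` field on the usage record, so hit rates show up in the same logs as spend.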

One thing to watch: the similarity threshold matters a lot. 0.95 is conservative and safe for factual queries. For creative or generative tasks, caching is usually not appropriate at all; you do not want users getting each other's generated content.
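One way to choose a threshold is to sweep candidate values over labelled query pairs. The sketch below assumes you have similarity scores for pairs of real queries, each labelled by hand as safe or unsafe to answer with the same response; the sample pairs are invented for illustration.

```python
def sweep_thresholds(
    pairs: list[tuple[float, bool]],
    thresholds: list[float],
) -> list[tuple[float, float, float]]:
    """For each threshold return (threshold, hit_rate, wrong_hit_rate).

    pairs: (cosine similarity, True if the two queries can share an answer).
    hit_rate: fraction of pairs that would be served from cache.
    wrong_hit_rate: fraction of those hits that should not have been shared.
    """
    results = []
    for threshold in thresholds:
        hits = [ok for similarity, ok in pairs if similarity >= threshold]
        hit_rate = len(hits) / len(pairs)
        wrong = sum(1 for ok in hits if not ok) / len(hits) if hits else 0.0
        results.append((threshold, hit_rate, wrong))
    return results


# Invented labelled pairs, for illustration only.
labelled_pairs = [
    (0.99, True),
    (0.96, True),
    (0.93, True),
    (0.92, False),
    (0.85, False),
]

for threshold, hit_rate, wrong in sweep_thresholds(labelled_pairs, [0.90, 0.95]):
    print(f"threshold={threshold:.2f} hit_rate={hit_rate:.0%} wrong_hits={wrong:.0%}")
```

On this toy data, lowering the threshold from 0.95 to 0.90 doubles the hit rate but makes a quarter of the hits wrong, which is the tradeoff to look for in your own traffic.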

System prompts grow over time. You add instructions to handle edge cases. You add examples. You add clarifications about what the model should not do. Before long, a system prompt that started at 200 tokens is 1,500 tokens, and you are paying for every token on every call.
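A quick worked example of what that growth costs. The call volume is an assumption picked for round numbers, and the rate is the gpt-4o input rate from the logging code.

```python
GPT_4O_INPUT_RATE = 0.0000025  # USD per input token


def monthly_prompt_cost(prompt_tokens: int, calls_per_month: int, input_rate: float) -> float:
    """USD spent per month just on resending the system prompt."""
    return prompt_tokens * calls_per_month * input_rate


calls = 1_000_000  # assumed monthly call volume
before = monthly_prompt_cost(200, calls, GPT_4O_INPUT_RATE)
after = monthly_prompt_cost(1_500, calls, GPT_4O_INPUT_RATE)

print(f"200-token prompt:   ${before:,.2f} per month")
print(f"1,500-token prompt: ${after:,.2f} per month")
print(f"extra spend:        ${after - before:,.2f} per month")
```

At that assumed volume the system prompt alone goes from about $500 to about $3,750 a month.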

We do two things here. First, we audit system prompts quarterly for redundancy. Prompts often contain instructions that are now unnecessary because the model handles them correctly by default, or because the use case evolved.

Second, for RAG pipelines, we compress retrieved context aggressively. The naive approach retrieves full document chunks. In practice, much of the retrieved text is irrelevant to the specific query. We add a compression step:

from openai import OpenAI

client = OpenAI()

def compress_context(query: str, retrieved_chunks: list[str]) -> str:
    """
    Given a query and retrieved document chunks, extract only the
    sentences or passages directly relevant to answering the query.
    Uses a cheap, fast model; the cost is much lower than sending full chunks.
    """
    combined = "\n\n---\n\n".join(retrieved_chunks)

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # cheap model for this step
        messages=[
            {
                "role": "system",
                "content": (
                    "Extract only the sentences or short passages from the provided text "
                    "that are directly relevant to answering the query. "
                    "Remove everything else. Preserve the meaning of what you keep. "
                    "Do not add anything that is not in the source text."
                ),
            },
            {
                "role": "user",
                "content": f"Query: {query}\n\nText:\n{combined}",
            },
        ],
        max_tokens=800,
    )

    return response.choices[0].message.content

This adds a small cost for the compression step, but the reduction in context sent to the main model more than covers it: typically a 3–4x reduction in RAG context length.
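The tradeoff can be checked with a little arithmetic. This sketch assumes the compression call reads the raw context at the cheap model's input rate and writes the compressed context at its output rate; it ignores the compression prompt itself and the added latency. The token counts are illustrative, and the rates are the gpt-4o and gpt-4o-mini figures used elsewhere in this post.

```python
def compression_net_saving(
    raw_tokens: int,
    compressed_tokens: int,
    main_input_rate: float,
    cheap_input_rate: float,
    cheap_output_rate: float,
) -> float:
    """USD saved per request by compressing context before the main call.
    A negative result means the compression step costs more than it saves."""
    compression_cost = raw_tokens * cheap_input_rate + compressed_tokens * cheap_output_rate
    saved_on_main_call = (raw_tokens - compressed_tokens) * main_input_rate
    return saved_on_main_call - compression_cost


# 4x reduction: 3,200 tokens of retrieved context compressed to 800.
to_gpt_4o = compression_net_saving(3200, 800, 0.0000025, 0.00000015, 0.0000006)
to_mini = compression_net_saving(3200, 800, 0.00000015, 0.00000015, 0.0000006)

print(f"main model gpt-4o:      {to_gpt_4o:+.5f} USD per request")
print(f"main model gpt-4o-mini: {to_mini:+.5f} USD per request")
```

Under these assumptions the step only pays off when the main model is much more expensive than the compressor; if the main call already runs on the cheap model, compressing first loses money.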

Not every LLM call needs the most capable model you have access to. We maintain a simple routing layer that assigns each call type to the cheapest model that handles it reliably.

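The categories we use are simple (classification, extraction, yes/no answers), moderate (summarisation, short generation), and complex (reasoning, planning, nuanced generation). The simpler categories go to gpt-4o-mini or claude-haiku.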

from dataclasses import dataclass
from enum import Enum

from openai import OpenAI

client = OpenAI()  # used by classify_support_ticket below

class TaskComplexity(Enum):
    SIMPLE = "simple"       # classification, extraction, yes/no
    MODERATE = "moderate"   # summarisation, short generation
    COMPLEX = "complex"     # reasoning, planning, nuanced generation

@dataclass
class ModelConfig:
    model: str
    max_tokens: int

ROUTING_TABLE: dict[TaskComplexity, ModelConfig] = {
    TaskComplexity.SIMPLE: ModelConfig(model="gpt-4o-mini", max_tokens=256),
    TaskComplexity.MODERATE: ModelConfig(model="gpt-4o-mini", max_tokens=1024),
    TaskComplexity.COMPLEX: ModelConfig(model="gpt-4o", max_tokens=2048),
}

def get_model_config(complexity: TaskComplexity) -> ModelConfig:
    return ROUTING_TABLE[complexity]

def classify_support_ticket(ticket_text: str) -> str:
    config = get_model_config(TaskComplexity.SIMPLE)

    response = client.chat.completions.create(
        model=config.model,
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify the support ticket into exactly one of: "
                    "billing, technical, account, feature_request, other. "
                    "Reply with only the category name."
                ),
            },
            {"role": "user", "content": ticket_text},
        ],
        max_tokens=config.max_tokens,
    )

    return response.choices[0].message.content.strip()

The key discipline here is to benchmark each task type with both model tiers before committing to the cheaper option. "It seems fine" is not a good enough standard. We run each call type through 50–100 real production examples, grade the outputs, and only route to the cheaper model if quality is within an acceptable margin.
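A minimal sketch of that comparison for tasks with a single correct label, such as ticket classification. `cheap_model` and `strong_model` are hypothetical callables that wrap the real API calls, and `max_drop` is an assumed tolerance you would set per task. Exact-match grading does not suit generation tasks, which need human or rubric-based grading instead.

```python
from typing import Callable

Example = tuple[str, str]  # (input text, expected label)


def accuracy(run_model: Callable[[str], str], examples: list[Example]) -> float:
    """Fraction of examples where the model's label matches the expected one."""
    correct = sum(
        1
        for text, expected in examples
        if run_model(text).strip().lower() == expected.strip().lower()
    )
    return correct / len(examples)


def should_route_to_cheap(
    cheap_model: Callable[[str], str],
    strong_model: Callable[[str], str],
    examples: list[Example],
    max_drop: float = 0.02,  # assumed acceptable accuracy loss
) -> bool:
    """True if the cheap model is within max_drop of the strong model."""
    cheap_score = accuracy(cheap_model, examples)
    strong_score = accuracy(strong_model, examples)
    return cheap_score >= strong_score - max_drop
```

Running both tiers over the same graded examples turns "it seems fine" into a number you can compare against a threshold you agreed on beforehand.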

Aggressive output length constraints. We tried setting low max_tokens on generation tasks to reduce output cost. It saved a small amount but made the outputs worse: responses get cut off at the limit in ways that break coherence. Output token cost is usually not the problem; do not sacrifice quality here.
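If you do experiment with tighter limits, it helps to measure how often responses actually hit them. In the OpenAI chat completions API, a choice whose generation stopped at the token limit reports a `finish_reason` of `"length"`; the sketch below counts those. The helper names are ours, and other providers expose the stop reason under different field names.

```python
def was_truncated(response) -> bool:
    """True if the first choice stopped because it hit the token limit."""
    return response.choices[0].finish_reason == "length"


def truncation_rate(responses: list) -> float:
    """Fraction of responses cut off by max_tokens (0.0 for an empty list)."""
    if not responses:
        return 0.0
    return sum(1 for r in responses if was_truncated(r)) / len(responses)
```

A truncation rate that is more than a rounding error on a generation task is a sign the limit is too low for that call type.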

Batching requests. The batch API reduces cost by ~50% on some providers but introduces latency of minutes to hours. For anything user-facing, the tradeoff is not worth it. It works for offline processing jobs where latency does not matter.
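For the offline case, here is a sketch of preparing a batch input file, assuming the OpenAI Batch API's JSONL format in which each line is one request with a `custom_id`, `method`, `url`, and `body`. The function name, ticket IDs, and prompt are illustrative; uploading the file and creating the batch job are left out.

```python
import json


def build_batch_lines(tickets: dict[str, str], model: str = "gpt-4o-mini") -> list[str]:
    """Turn {ticket_id: ticket_text} into JSONL lines, one request per line."""
    lines = []
    for ticket_id, text in tickets.items():
        request = {
            "custom_id": ticket_id,  # used to match results back to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [
                    {
                        "role": "system",
                        "content": "Classify the support ticket. Reply with only the category name.",
                    },
                    {"role": "user", "content": text},
                ],
                "max_tokens": 256,
            },
        }
        lines.append(json.dumps(request))
    return lines


# Illustrative input; in a real job these would come from your own data store.
batch_lines = build_batch_lines({"ticket-1": "I was charged twice this month."})
print("\n".join(batch_lines))
```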

Switching providers entirely based on benchmark performance. We spent time evaluating alternative providers that were cheaper per token. For some tasks they were fine. For others, the quality drop was meaningful and affected the product. The benchmarks do not tell you how a model performs on your specific task with your specific data; only testing on your own workload does.

A sixty percent cost reduction sounds dramatic. In practice it came from three things applied together: semantic caching (biggest impact), smarter model routing (second biggest), and prompt/context compression (smaller but meaningful). None of these required rearchitecting anything fundamental.

The prerequisite for all of it was instrumentation. You cannot optimise what you cannot measure. Log every call, log token counts by call type, and look at the data before deciding where to focus. The calls that feel expensive are often not the ones that are actually expensive.

The other thing worth saying: do not optimise prematurely. If your LLM spend is $300 a month and growing slowly, the engineering time to implement semantic caching is not worth it yet. Do it when the numbers justify it, and make sure the instrumentation is already in place so you know when that point arrives.

Lycore builds production AI systems for businesses: RAG pipelines, AI agents, LLM integrations, and custom AI applications built for scale and reliability. Get in touch if you want to talk through your use case.
