We Cut Our LLM API Bill 30% With Four Lines of YAML

Source: dev.to · Published: 2026-06-20T19:13Z · Topic: large-language-models

A developer at a company handling thousands of LLM calls per hour cut their API bill by 30% using semantic caching. By embedding prompts and checking cosine similarity against cached responses, they avoided duplicate model calls for semantically identical questions. The fix required only four lines of YAML configuration using LiteLLM's valkey-semantic cache backend with Valkey.

3 min read · published Jun 20, 2026

Our gateway handles a few thousand LLM calls per hour. Mostly internal tools, some customer-facing agents. We noticed something in the logs: a lot of prompts were basically the same question worded differently.

"Summarize this quarterly report" and "give me a summary of the Q2 report" hitting the same model, getting nearly identical responses, costing us twice. Multiply that across a few hundred users and it adds up fast.

Quick back-of-envelope. GPT-4o runs $2.50 per million input tokens, $10 per million output. Claude Sonnet is $3/$15. A typical summarization request with context is maybe 2K input tokens and 500 output. That's roughly $0.01 per call on GPT-4o ($0.005 for input plus $0.005 for output).

Doesn't sound like much until you're doing 50K calls a day and 30-40% of them are semantically identical. That's $150+/day in duplicate spend. $4.5K/month. For responses you already generated.
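The back-of-envelope can be reproduced in a few lines. This sketch uses only the prices, token counts, and traffic figures quoted above; the function names are illustrative.

```python
# Prices in USD per million tokens, as quoted above for GPT-4o.
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00


def cost_per_call(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in USD at the prices above."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


def duplicate_spend_per_day(calls_per_day: int, duplicate_fraction: float, per_call: float) -> float:
    """Daily spend on calls that repeat an earlier, semantically identical request."""
    return calls_per_day * duplicate_fraction * per_call


per_call = cost_per_call(input_tokens=2_000, output_tokens=500)
low = duplicate_spend_per_day(50_000, 0.30, per_call)
high = duplicate_spend_per_day(50_000, 0.40, per_call)

print(f"per call: ${per_call:.4f}")
print(f"duplicate spend: ${low:.0f} to ${high:.0f} per day")
```

Swap in your own token counts and duplicate rate; the duplicate fraction is the number that matters most and the one you have to measure from logs.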

The fix is semantic caching at the gateway layer. Instead of matching prompts by exact string (which almost never hits because users word things differently), you embed the prompt into a vector and check cosine similarity against cached responses. Similar enough prompt? Return the cached response. Skip the model call entirely.
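To make the idea concrete, here is a minimal in-memory sketch of that lookup. It is not LiteLLM's implementation: the `SemanticCache` class is hypothetical, and the three-dimensional vectors in `FAKE_EMBEDDINGS` are hand-written stand-ins for what a real embedding model would return.

```python
import math
from typing import Callable, Optional

Vector = list[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Hypothetical cache: returns a stored response when a prompt is close enough."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.8) -> None:
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[Vector, str]] = []

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((self.embed(prompt), response))

    def get(self, prompt: str) -> Optional[str]:
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine_similarity(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        # Only a match at or above the threshold counts as a hit.
        return best_response if best_score >= self.threshold else None


# Hand-written vectors standing in for real embeddings (assumption).
FAKE_EMBEDDINGS = {
    "Summarize this quarterly report": [0.9, 0.1, 0.0],
    "give me a summary of the Q2 report": [0.85, 0.2, 0.05],
    "What is our refund policy?": [0.0, 0.1, 0.95],
}

cache = SemanticCache(embed=FAKE_EMBEDDINGS.__getitem__)
cache.put("Summarize this quarterly report", "cached summary")

hit = cache.get("give me a summary of the Q2 report")  # reworded duplicate
miss = cache.get("What is our refund policy?")         # unrelated question
```

The linear scan is fine for a toy; a real deployment pushes the nearest-neighbor search into a vector index.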

We'd been running this on Redis with RediSearch. Worked well but RediSearch needs Redis Stack, which isn't standard Redis anymore. When we moved to Valkey (like a lot of teams post-license-change), we needed the same thing on valkey-search.

LiteLLM shipped a valkey-semantic cache backend that does exactly this. Four lines in the config:

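litellm_settings:
  cache: True
  cache_params:
    type: valkey-semantic
    host: os.environ/VALKEY_HOST   # os.environ/ tells LiteLLM to read the value from the environment
    port: os.environ/VALKEY_PORT
    valkey_semantic_cache_embedding_model: openai-embedding   # model used to embed prompts
    similarity_threshold: 0.8      # minimum cosine similarity for a cache hit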

The similarity_threshold controls how close a match needs to be. 0.8 worked well for us. Too low and you get false positives. Too high and you miss obvious duplicates. Tune it for your traffic.
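One way to tune it is to label a sample of prompt pairs by hand and count both kinds of error at each candidate threshold. A rough sketch, where the scores and labels in `LABELED_PAIRS` are made up for illustration and `evaluate_threshold` is a hypothetical helper:

```python
def evaluate_threshold(pairs: list[tuple[float, bool]], threshold: float) -> tuple[int, int]:
    """Return (false_positives, missed_duplicates) for one threshold.

    Each pair is (cosine similarity, whether the two prompts really
    should share a cached response).
    """
    false_positives = sum(1 for score, same in pairs if score >= threshold and not same)
    missed = sum(1 for score, same in pairs if score < threshold and same)
    return false_positives, missed


# Illustrative data only: replace with pairs sampled from your own logs.
LABELED_PAIRS = [
    (0.95, True),
    (0.88, True),
    (0.83, True),
    (0.78, False),
    (0.72, True),
    (0.65, False),
]

for threshold in (0.6, 0.7, 0.8, 0.9):
    false_positives, missed = evaluate_threshold(LABELED_PAIRS, threshold)
    print(f"threshold={threshold}: false positives={false_positives}, missed={missed}")
```

A false positive serves someone the wrong answer, while a miss only costs money, so it is usually safer to err on the high side.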

Every prompt gets embedded (using whatever model you configure), stored in an HNSW vector index on Valkey, and tagged with a scope key so different users or API keys don't cross-contaminate caches. At lookup time it runs a KNN query and returns the cached response if cosine similarity clears the threshold.
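The scoping part is worth seeing in isolation. The sketch below is a brute-force, in-memory stand-in for the vector index, partitioned by scope key; `ScopedVectorCache` is a hypothetical name, and a real HNSW index does an approximate search rather than this exhaustive scan.

```python
import math
from collections import defaultdict
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ScopedVectorCache:
    """Brute-force stand-in for a vector index, partitioned by scope key."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self._entries: dict[str, list[tuple[list[float], str]]] = defaultdict(list)

    def put(self, scope: str, vector: list[float], response: str) -> None:
        self._entries[scope].append((vector, response))

    def get(self, scope: str, vector: list[float]) -> Optional[str]:
        # Only entries stored under the caller's scope are candidates.
        candidates = self._entries.get(scope, [])
        if not candidates:
            return None
        best_vector, best_response = max(
            candidates, key=lambda entry: cosine_similarity(vector, entry[0])
        )
        if cosine_similarity(vector, best_vector) >= self.threshold:
            return best_response
        return None


cache = ScopedVectorCache()
cache.put("team-a", [1.0, 0.0], "answer for team a")

same_scope = cache.get("team-a", [0.9, 0.1])   # close vector, same scope
other_scope = cache.get("team-b", [0.9, 0.1])  # close vector, different scope
```

Filtering by scope before the similarity check is what keeps one tenant's cached answer from leaking into another's response.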

The embedding call itself costs something (text-embedding-3-small is $0.02 per million tokens), but it's two orders of magnitude cheaper than the model call you're skipping. Net savings are significant.
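A quick check of that claim, assuming the whole 2K-token prompt from the earlier estimate gets embedded and every request pays for one embedding whether it hits or misses. The helper name is illustrative.

```python
EMBEDDING_PRICE_PER_M = 0.02  # text-embedding-3-small, USD per million tokens
INPUT_PRICE_PER_M = 2.50      # GPT-4o input
OUTPUT_PRICE_PER_M = 10.00    # GPT-4o output

prompt_tokens, output_tokens = 2_000, 500

embedding_cost = prompt_tokens * EMBEDDING_PRICE_PER_M / 1_000_000
model_cost = (prompt_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# How many times cheaper the embedding is than the call it can replace.
ratio = model_cost / embedding_cost

# Hit rate at which the embedding overhead is exactly paid back.
break_even_hit_rate = embedding_cost / model_cost


def net_savings_per_day(calls_per_day: int, hit_rate: float) -> float:
    """Savings from skipped model calls minus embedding cost on every request."""
    return calls_per_day * (hit_rate * model_cost - embedding_cost)


print(f"model call is {ratio:.0f}x the embedding cost")
print(f"break-even hit rate: {break_even_hit_rate:.2%}")
print(f"net savings at 30% hits, 50K calls/day: ${net_savings_per_day(50_000, 0.30):.0f}")
```

Under these assumptions the cache pays for itself at well under a 1% hit rate, so the embedding overhead is rarely the deciding factor.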

Cache hits come back with an x-litellm-semantic-similarity header so you can track your hit rate and measure actual savings.
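If you collect response headers somewhere, a hit rate falls out of them directly. This sketch assumes the header carries a numeric similarity score and is absent on cache misses; check both against what your gateway actually returns. `summarize_hits` and the sample data are hypothetical.

```python
from typing import Mapping, Optional, Sequence

SIMILARITY_HEADER = "x-litellm-semantic-similarity"


def summarize_hits(responses: Sequence[Mapping[str, str]]) -> tuple[float, Optional[float]]:
    """Return (hit_rate, mean_similarity_of_hits) from per-response header dicts."""
    similarities = []
    for headers in responses:
        # Header names are case-insensitive, so normalize before looking up.
        lowered = {name.lower(): value for name, value in headers.items()}
        value = lowered.get(SIMILARITY_HEADER)
        if value is not None:  # assumption: header present only on cache hits
            similarities.append(float(value))
    hit_rate = len(similarities) / len(responses) if responses else 0.0
    mean = sum(similarities) / len(similarities) if similarities else None
    return hit_rate, mean


# Illustrative header dicts, not real gateway output.
sample = [
    {"content-type": "application/json"},
    {"content-type": "application/json", "X-LiteLLM-Semantic-Similarity": "0.91"},
    {"content-type": "application/json"},
    {"content-type": "application/json", "x-litellm-semantic-similarity": "0.85"},
]

hit_rate, mean_similarity = summarize_hits(sample)
print(f"hit rate: {hit_rate:.0%}")
```

Watching the mean similarity of hits over time also tells you whether most hits are near-exact repeats or sit just above the threshold.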

docker run -d -p 6379:6379 valkey/valkey-bundle:8.1

Set VALKEY_HOST=localhost and VALKEY_PORT=6379, then start LiteLLM with the config above. Send the same question two different ways. The second one returns instantly from cache.

If you're using LiteLLM as a library:

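import os
import litellm
from litellm.caching.caching import Cache

# Point LiteLLM's global cache at the Valkey semantic backend.
litellm.cache = Cache(
    type="valkey-semantic",
    host=os.environ["VALKEY_HOST"],
    port=os.environ["VALKEY_PORT"],
    similarity_threshold=0.8,  # minimum cosine similarity for a cache hit
    valkey_semantic_cache_embedding_model="text-embedding-ada-002",
)

# First call: nothing cached yet, so this goes to the model.
response1 = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "explain kubernetes pods"}],
)

# Same question, different wording: should be served from the cache.
response2 = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "what are pods in k8s"}],
)

# If the second call was a cache hit, it returns the stored response, so the IDs match.
assert response1.id == response2.id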

If you're on AWS, ElastiCache for Valkey supports this on node-based Valkey 8.2+ clusters. Serverless doesn't have vector search yet. Cluster-mode-disabled with read replicas works fine. For TLS, add ssl: true or use a rediss:// URL. IAM auth is supported; just skip the password.

The env vars (VALKEY_HOST, VALKEY_PORT, VALKEY_PASSWORD) fall back to REDIS_HOST/REDIS_PORT/REDIS_PASSWORD, so if you're migrating from Redis you don't even need to update your environment.
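The fallback order can be pictured like this. It is an illustration of the behavior described above, not LiteLLM's actual code; `resolve_setting` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional


def resolve_setting(name: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Look up VALKEY_<name> first, then fall back to REDIS_<name>.

    `name` is one of "HOST", "PORT", or "PASSWORD".
    """
    return env.get(f"VALKEY_{name}") or env.get(f"REDIS_{name}")


# An environment left over from a Redis deployment still resolves.
legacy_env = {"REDIS_HOST": "cache.internal", "REDIS_PORT": "6379"}
host = resolve_setting("HOST", legacy_env)

# When both are set, the Valkey variable wins.
mixed_env = {"VALKEY_HOST": "valkey.internal", "REDIS_HOST": "cache.internal"}
preferred = resolve_setting("HOST", mixed_env)
```

The hostnames here are placeholders; the point is only that the Valkey-named variable takes precedence when both are present.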

Semantic caching is not a silver bullet. It works best for read-heavy, repetitive workloads: internal Q&A bots, document summarization, support tools. It's less useful for creative generation or highly personalized responses where similar prompts should produce different outputs.

Also, if your prompts include large, unique contexts (like full documents), the semantic similarity might not trigger even for functionally identical questions, because the embedding is dominated by the context rather than the question.
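A toy demonstration of that dilution effect. The bag-of-words "embedding" here is a crude stand-in for a real embedding model (which will behave differently in degree), and the contexts are synthetic filler tokens.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Crude stand-in for an embedding: word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


question_a = "summarize this report"
question_b = "summarize this report please"

# Two large contexts with no words in common, standing in for different documents.
context_1 = " ".join(f"alpha{i}" for i in range(200))
context_2 = " ".join(f"beta{i}" for i in range(200))

bare = cosine(bag_of_words(question_a), bag_of_words(question_b))
with_context = cosine(
    bag_of_words(f"{question_a} {context_1}"),
    bag_of_words(f"{question_b} {context_2}"),
)

print(f"questions only: {bare:.3f}")
print(f"questions plus unique contexts: {with_context:.3f}")
```

The questions alone clear a 0.8 threshold comfortably; once each carries 200 tokens of unrelated context, the score collapses toward zero even though the ask is the same.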

Know your traffic patterns. Check the similarity header. Tune the threshold.

Full setup: docs.litellm.ai/docs/proxy/caching | Blog post: docs.litellm.ai/blog/valkey_semantic_caching

Read original on dev.to → dev.to/paultwist/we-cut-our-llm-api-bill-30-with…
