# Cost Optimization for LLM Systems: Where the Money Actually Goes

Teams spending over $10,000 a year on LLM usage can cut the bill with token budgets, a deliberate choice between API and local inference, and fallback strategies. Token budgets can be per-session, per-task, or adaptive. Local inference becomes cost-effective at moderate to high usage, with hardware break-even ranging from months to years.

Spend tokens where they actually matter.

LLM costs scale linearly with usage. A system processing 10,000 requests a day at $0.01 per request costs $100 daily, or $36,500 a year. Even at that scale, the annual bill is well over $10,000.

Cost optimization isn’t about cutting corners. It’s about spending tokens where they matter. Every token you waste is a token you could have spent on a better answer.

## Token budgeting

The simplest way to control costs is to set limits: per session, per task, or per day.

### Strategy 1: Per-Session Budgets

A per-session budget is a counter that refuses any allocation that would exceed the cap:

```python
class SessionBudget:
    def __init__(self, budget_tokens: int = 10000):
        self.budget = budget_tokens
        self.used = 0

    def allocate(self, tokens: int) -> bool:
        # Reserve tokens only if the request fits in what is left.
        if self.used + tokens <= self.budget:
            self.used += tokens
            return True
        return False

    def remaining(self) -> int:
        return self.budget - self.used
```

### Strategy 2: Per-Task Budgets

Per-task budgets are more useful. Different tasks need different amounts of context:

```yaml
task_budgets:
  classify:
    max_tokens: 100
    model: qwen2.5-1.5b
  summarize:
    max_tokens: 500
    model: qwen2.5-7b
  code_review:
    max_tokens: 2000
    model: qwen2.5-coder-7b
  reason:
    max_tokens: 4000
    model: qwen2.5-32b
```

### Strategy 3: Adaptive Budgets

Adaptive budgets adjust based on what actually happens.
If classification tasks consistently use 80 tokens, budget from that observed usage instead of a fixed guess:

```python
class AdaptiveBudget:
    def __init__(self):
        self.task_history = {}  # task type -> running average of tokens used

    def allocate(self, task_type: str) -> int:
        if task_type in self.task_history:
            # Allow 50% headroom above the running average.
            return int(self.task_history[task_type] * 1.5)
        return 1000  # default for task types with no history yet

    def record(self, task_type: str, tokens_used: int):
        if task_type not in self.task_history:
            self.task_history[task_type] = tokens_used
        else:
            # Exponential moving average: 90% old estimate, 10% new observation.
            self.task_history[task_type] = (
                0.9 * self.task_history[task_type] + 0.1 * tokens_used
            )
```

With the 1.5 multiplier, a task that averages 80 tokens gets a budget of 120. The exponential moving average puts a weight of 0.9 on the running estimate and 0.1 on the newest observation, so a single spike barely moves the budget and older observations fade gradually. Lower the 0.9 weight if your workloads are volatile and the budget needs to react faster.

## API vs local inference

Local inference is cheaper at scale. The break-even depends on your hardware and API rates.

| Model | API $/M tokens (input / output) | Local cost/hour | Break-even usage |
|---|---|---|---|
| GPT-4o | $2.50 / $10.00 | - | N/A |
| Claude Sonnet 4 | $3.00 / $15.00 | - | N/A |
| Qwen2.5-72B | $0.50 / $2.00 | ~$0.50 | ~4 hours/day |
| Qwen2.5-32B | $0.30 / $1.20 | ~$0.20 | ~2 hours/day |
| Qwen2.5-7B | $0.10 / $0.40 | ~$0.05 | ~1 hour/day |

The hardware math:

| Hardware | Upfront cost | Monthly electricity | Break-even vs API |
|---|---|---|---|
| RTX 3090 (used) | $600 | $15 | ~4 months |
| RTX 4090 | $1,500 | $20 | ~6 months |
| RTX 5080 | $1,000 | $18 | ~5 months |
| DGX Spark | $2,000 | $30 | ~8 months |

At moderate usage, an hour or more per day, local inference pays for itself. At high usage, the savings are dramatic.

The catch is upfront capital. An RTX 5080 is $1,000. An API bill you can pause. Hardware you can’t.

## Fallback strategies

When your preferred model is too expensive or too slow, fall back to something cheaper.
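The key is knowing when quality is “good enough.”

### Strategy 1: Quality-Based Fallback

Quality-based fallback tries the cheapest model first and escalates toward the preferred model until the output meets a threshold:

```python
class QualityFallback:
    def __init__(self, quality_threshold: float = 0.8):
        self.threshold = quality_threshold
        self.models = [
            {"model": "claude-sonnet-4", "cost": 0.015},
            {"model": "qwen2.5-72b", "cost": 0.002},
            {"model": "qwen2.5-32b", "cost": 0.001},
            {"model": "qwen2.5-7b", "cost": 0.0004},
        ]

    def call_model(self, model: str, prompt: str) -> str:
        # Placeholder: wire this to your inference client.
        raise NotImplementedError

    def evaluate_quality(self, response: str) -> float:
        # Placeholder: return a score in [0, 1].
        raise NotImplementedError

    def route(self, prompt: str) -> str:
        result = ""
        # Cheapest model first; escalate only when quality falls short.
        for model_config in sorted(self.models, key=lambda m: m["cost"]):
            result = self.call_model(model_config["model"], prompt)
            if self.evaluate_quality(result) >= self.threshold:
                return result
        # Nothing met the threshold: keep the most expensive model's answer
        # rather than paying for another call.
        return result
```

Every failed attempt is still billed, so the cascade only saves money when the cheap models pass most of the time.

The problem is evaluation itself. How do you measure quality without calling another model? Some systems use a small classifier. Others use heuristic checks: length, structure, keyword presence. None of these is perfect.

### Strategy 2: Latency-Based Fallback

Latency-based fallback is simpler. Route to the most capable model that fits your time budget. Picking the fastest model outright would make the budget pointless, since the smallest model would always win:

```python
class LatencyFallback:
    def __init__(self, max_latency: float = 5.0):
        self.max_latency = max_latency
        # Assumed to be ordered from least to most capable; "latency" is the
        # expected response time, assumed to be in seconds.
        self.models = [
            {"model": "qwen2.5-1.5b", "latency": 0.5},
            {"model": "qwen2.5-7b", "latency": 2.0},
            {"model": "qwen2.5-32b", "latency": 10.0},
            {"model": "claude-sonnet-4", "latency": 5.0},
        ]

    def call_model(self, model: str, prompt: str) -> str:
        # Placeholder: wire this to your inference client.
        raise NotImplementedError

    def route(self, prompt: str) -> str:
        # Walk from most to least capable and take the first model that fits.
        for model_config in reversed(self.models):
            if model_config["latency"] <= self.max_latency:
                return self.call_model(model_config["model"], prompt)
        # Nothing fits the budget: use the fastest model available.
        fastest = min(self.models, key=lambda m: m["latency"])
        return self.call_model(fastest["model"], prompt)
```

## Caching

Caching is the most underrated cost optimization. Identical prompts happen more often than you think: classification requests, FAQ-style queries, repeated tool calls.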
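Before building a cache, it can help to check how often identical prompts actually occur in your own traffic. The sketch below assumes the request log is available as a plain list of prompt strings. The `duplicate_rate` helper and its light case and whitespace normalization are illustrative choices, not part of any library.

```python
from collections import Counter


def duplicate_rate(prompts: list[str]) -> float:
    """Fraction of requests that repeat an earlier prompt.

    This is the best-case hit rate of an exact-match cache with unlimited
    capacity, assuming case and surrounding whitespace do not change the
    meaning of a prompt.
    """
    if not prompts:
        return 0.0
    # Normalize lightly so trivial variations count as the same prompt.
    counts = Counter(p.strip().lower() for p in prompts)
    # Every occurrence after the first of each distinct prompt is a potential hit.
    repeats = sum(n - 1 for n in counts.values())
    return repeats / len(prompts)


log = [
    "Classify: refund request",
    "classify: refund request ",
    "What are your opening hours?",
    "Classify: refund request",
]
print(f"{duplicate_rate(log):.0%} of requests could be served from cache")
```

If the rate is low, exact caching will not pay for its complexity, and semantic caching or no caching at all may be the better choice.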
### Strategy 1: Prompt Caching

Exact prompt caching is simple:

```python
import hashlib


class PromptCache:
    def __init__(self, max_size: int = 1000):
        self.cache = {}
        self.max_size = max_size

    def get(self, prompt: str) -> str | None:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        return self.cache.get(key)

    def set(self, prompt: str, response: str):
        key = hashlib.sha256(prompt.encode()).hexdigest()
        # Evict the oldest entry (dicts keep insertion order) when full,
        # but not when we are only overwriting an existing key.
        if key not in self.cache and len(self.cache) >= self.max_size:
            self.cache.pop(next(iter(self.cache)))
        self.cache[key] = response
```

### Strategy 2: Semantic Caching

Semantic caching is more useful. It catches prompts that are worded differently but mean the same thing:

```python
import numpy as np
from sentence_transformers import SentenceTransformer


class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.95):
        self.model = SentenceTransformer("all-MiniLM-L6-v2")
        self.cache = {}  # prompt -> (embedding, response)
        self.threshold = similarity_threshold

    @staticmethod
    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def get(self, prompt: str) -> str | None:
        prompt_embedding = self.model.encode([prompt])[0]
        # Linear scan; embeddings are computed once in set(), not per lookup.
        for cached_embedding, cached_response in self.cache.values():
            similarity = self.cosine_similarity(prompt_embedding, cached_embedding)
            if similarity >= self.threshold:
                return cached_response
        return None

    def set(self, prompt: str, response: str):
        self.cache[prompt] = (self.model.encode([prompt])[0], response)
```

The threshold matters. 0.95 is strict: only near-identical prompts match. 0.85 is more forgiving but risks returning wrong answers. Measure both your hit rate and how often a hit returns the wrong answer, then adjust.

Response caching for common queries is worth it too.
If users ask “what’s the weather” or “what time is it” repeatedly, cache the pattern, not just the exact prompt:

```python
class ResponseCache:
    def __init__(self):
        # Values are routing hints (which source to consult), not literal answers.
        self.common_queries = {
            "what is the weather": "Check weather API",
            "what is the time": "Check system time",
            "what time is it": "Check system time",
            "who is the president": "Check current president",
        }

    def get(self, query: str) -> str | None:
        # Normalize case and the "what's" contraction before matching.
        query_lower = query.lower().replace("’", "'").replace("what's", "what is")
        for common_query, response in self.common_queries.items():
            if common_query in query_lower:
                return response
        return None
```

This isn’t sophisticated, but it works. Common queries are common for a reason.

## When optimization helps

Optimization matters when you’re processing high volumes, running mixed workloads, or paying API costs that add up.

It doesn’t matter when you’re prototyping, using a single model, or processing low volumes. The complexity of budgeting, fallback, and caching isn’t worth it for a system that makes 100 requests a day.

Get the basic flow working first. Add optimization when the bill comes in.

## Tradeoffs

| Strategy | Cost | Quality | Complexity |
|---|---|---|---|
| No optimization | Highest | Consistent | Lowest |
| Token budgeting | Moderate | Variable | Medium |
| Fallback models | Low-Medium | Variable | Medium |
| Caching | Lowest | High for cache hits | Medium |
| Hybrid | Optimized | Optimized | Highest |

Production systems usually run hybrid. Budget per session, fall back on quality or latency, cache what you can. The complexity is real, but so are the savings.
## Related

- [Model Routing Strategies](https://www.glukhov.org/llm-architecture/model-routing/model-routing-strategies/): capability-based, cost-aware, latency-aware routing
- [LLM Guardrails in Practice](https://www.glukhov.org/llm-architecture/guardrails/llm-guardrails-in-practice/): input validation, output filtering, safety
- [Multi-Model System Design](https://www.glukhov.org/llm-architecture/model-routing/multi-model-system-design/): architecture for multiple models
- [LLM Architecture](https://www.glukhov.org/llm-architecture/): system design pillar covering routing, cost, guardrails, and orchestration