How to Put an LLM in Your Product Without Wrecking Your Costs or Your Latency

A developer explains how to manage cost and latency when integrating large language models into products. Key strategies include constraining output tokens, caching requests, routing to cheaper models, and streaming responses to improve perceived performance. The post emphasizes that output length is the primary driver of both cost and latency.

Adding an AI feature looks deceptively easy. You sign up for an API key, paste in a prompt, and within an hour you've got a working demo that makes the whole team lean over your shoulder. Then you ship it, traffic arrives, and two things happen at once: your latency graph develops a long, ugly tail, and your monthly bill arrives with a number that makes finance schedule a meeting.

The gap between "impressive demo" and "production feature" is almost entirely about cost and latency engineering. The model is the easy part. Here's how to cross that gap.

Most LLM APIs bill by tokens (roughly ¾ of a word each), and they bill in both directions: the tokens you send (input) and the tokens the model generates (output). Output tokens are usually several times more expensive than input tokens, which has a non-obvious consequence: token for token, a verbose prompt is cheaper than a verbose answer.

This reframes optimization. People obsess over trimming their prompts while letting the model ramble for 800 tokens when 80 would do. If you want to cut cost, the highest-leverage move is almost always constraining the output: ask for JSON, ask for a single sentence, set a max-tokens ceiling, and tell the model explicitly to be terse.

Latency follows the same logic. Generation is sequential (the model produces one token at a time), so output length is the single biggest driver of how long a request takes. A 50-token answer is fast almost regardless of model. A 2,000-token answer is slow even on the fastest infrastructure.

The cheapest, fastest LLM call is the one you never make.
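
One way to skip a call entirely is to cache answers to repeated prompts. The following is a minimal sketch, not a production design: `PromptCache`, `normalize`, and `fake_model` are hypothetical names made up for this example, the store is a plain in-memory dict, and the normalization (lowercasing and collapsing whitespace) is an assumption that would be wrong for case-sensitive prompts.

```python
import hashlib
from typing import Callable, Dict


def normalize(prompt: str) -> str:
    """Lowercase and collapse whitespace so trivially different prompts share a key."""
    return " ".join(prompt.lower().split())


class PromptCache:
    """In-memory exact-match cache keyed on a hash of the normalized prompt."""

    def __init__(self, call_model: Callable[[str], str]) -> None:
        self._call_model = call_model  # hypothetical: wraps your real API call
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def complete(self, prompt: str) -> str:
        key = hashlib.sha256(normalize(prompt).encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: pay for the model call once, then remember the answer.
        self.misses += 1
        answer = self._call_model(prompt)
        self._store[key] = answer
        return answer


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call, used only in this example."""
    return f"answer to: {normalize(prompt)}"


cache = PromptCache(fake_model)
first = cache.complete("What is your refund policy?")
second = cache.complete("  what is your   REFUND policy? ")
print(cache.hits, cache.misses)  # 1 1
```

In a real service the dict would likely be replaced by a shared store with expiry, so that stale answers age out.
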
Two techniques eliminate a startling share of traffic.

Caching identical and near-identical requests. Many real-world prompts repeat: the same FAQ-style question, the same document summarized twice, the same classification of similar inputs. A cache keyed on the normalized prompt turns a repeat request into a sub-millisecond lookup. For exact repeats, a simple key-value cache works. For similar requests, a semantic cache (where you embed the query and return a cached answer if a previous query is close enough in vector space) can absorb far more traffic, at the cost of some tuning.

Routing to the right tier. You do not need your most capable model for every task. Classifying a support ticket into one of five buckets is a job for a small, cheap, fast model. Drafting a nuanced customer email is worth the premium one. A simple router (even a keyword or length heuristic before anything fancy) that sends easy work to a cheap model and hard work to an expensive one can cut spend dramatically without anyone noticing a quality drop.

Sometimes you genuinely need a long, high-quality response, and it's genuinely going to take a few seconds. You can't always make it faster, but you can make it feel fast, which is often what actually matters to the user.

Stream the response. Instead of waiting for the full answer and dumping it at once, stream tokens as they're generated. The user starts reading after a few hundred milliseconds, and the perceived wait collapses even though total generation time is unchanged. This is the single highest-impact UX change for any chat-style feature, and most SDKs support it with a one-line change.

Show honest progress for non-streamed work. If you're doing something multi-step (retrieve, then reason, then format), tell the user what's happening ("Searching your documents…", "Drafting an answer…"). A visible, truthful status beats a spinner that gives no information about whether anything is working.

Your average latency is a comforting lie.
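
To see how an average can hide a tail, here is a small sketch. The latency values are made up for illustration, not measurements, and `percentile` is a simple nearest-rank helper written for this example.

```python
import statistics
from typing import List

# Illustrative, made-up latencies in seconds:
# 90 quick requests, 8 slow ones, and 2 that ran into a 30-second timeout.
latencies = [0.8] * 90 + [3.5] * 8 + [30.0] * 2


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct% of n
    return ordered[rank - 1]


mean = statistics.mean(latencies)
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)

print(f"mean={mean:.2f} p50={p50:.2f} p95={p95:.2f} p99={p99:.2f}")
# mean=1.60 p50=0.80 p95=3.50 p99=30.00
```

With these numbers the mean is double the typical request yet says nothing about the requests that take 30 seconds, which is why percentiles are usually the more useful thing to chart.
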
LLM endpoints have heavy tails: most requests are fine, but a meaningful slice take 3–5× longer, and a few time out entirely. If your product blocks on those, a small fraction of slow requests can dominate the experience. Defend against the tail explicitly.

The defenses aren't AI-specific patterns; they're the same resilience engineering you'd apply to any external dependency. The mistake is treating the LLM as magic instead of as what it is: a slow, occasionally flaky network call to someone else's servers.

You can't optimize what you don't track. From day one, log three numbers per request: input tokens, output tokens, and end-to-end latency. Tag them by feature and by model. Within a week you'll have a cost-and-latency breakdown by feature, and it will almost certainly surprise you: there's usually one endpoint quietly responsible for most of the bill, and it's rarely the one you'd guess.

A useful derived metric is cost per successful user outcome, not cost per API call. A feature that calls the model twice but actually solves the user's problem is cheaper, in every way that matters, than one that calls it once and gets ignored.

The teams that ship AI features sustainably stop thinking of the model as the product and start thinking of it as an expensive, high-variance dependency they're responsible for managing. The prompt gets you the demo. Caching, routing, streaming, and tail control get you a feature you can afford to keep running.

None of it is exotic. It's the same discipline that makes any external service production-ready, applied to a service that happens to charge by the word and answer at the speed of thought, one token at a time.
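
The per-request logging described above (input tokens, output tokens, latency, tagged by feature and model) can be sketched roughly as follows. Everything specific here is an assumption for illustration: the `RequestLog` and `summarize` names, the model tiers, the prices in `PRICES`, and the sample records are all invented, so substitute your provider's real rates and your own data.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical prices in USD per 1M tokens as (input, output); not real rates.
PRICES = {"small": (0.50, 1.50), "large": (5.00, 15.00)}


@dataclass
class RequestLog:
    feature: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_s: float

    @property
    def cost_usd(self) -> float:
        """Cost of this request under the hypothetical PRICES table."""
        in_price, out_price = PRICES[self.model]
        return (self.input_tokens * in_price + self.output_tokens * out_price) / 1_000_000


def summarize(logs: List[RequestLog]) -> Dict[str, Dict[str, float]]:
    """Request count, total cost, and worst latency per feature."""
    summary: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"requests": 0, "cost_usd": 0.0, "max_latency_s": 0.0}
    )
    for log in logs:
        row = summary[log.feature]
        row["requests"] += 1
        row["cost_usd"] += log.cost_usd
        row["max_latency_s"] = max(row["max_latency_s"], log.latency_s)
    return dict(summary)


# Made-up sample records, for illustration only.
logs = [
    RequestLog("ticket_triage", "small", 400, 20, 0.4),
    RequestLog("ticket_triage", "small", 380, 25, 0.5),
    RequestLog("email_draft", "large", 1200, 800, 6.2),
]
report = summarize(logs)
for feature, row in report.items():
    print(feature, row)
```

Dividing a feature's total cost by its count of successful outcomes, rather than by its request count, would give the cost-per-outcome metric, provided you also record whether each interaction actually helped the user.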