Training gets the headlines. Inference gets the bill. If you run LLMs in production, inference is almost certainly your biggest AI line item — a meter running 24/7 on every request. The gap between naive and optimized serving is routinely 5-10x in cost and 3-5x in latency.
During token generation, LLM inference is memory-bandwidth bound. An H100 has ~3.35 TB/s of memory bandwidth but ~989 TFLOPS of FP16 compute, so during autoregressive decoding you're often using only ~10-20% of that compute even with batching (and well under 1% at batch size 1), waiting on weights and KV-cache to stream from memory. Every optimization attacks the same root cause: move less data, use it better.
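The memory-bound claim can be checked with a back-of-envelope roofline estimate. The sketch below is illustrative only: it uses the H100 figures quoted above, assumes a dense model with FP16 weights, counts 2 FLOPs per parameter per generated token, and ignores KV-cache traffic (which would lower utilization further). The function name, the 7B model size, and the batch sizes are assumptions chosen for the example.

```python
# Back-of-envelope roofline estimate for one decode step.
# Hardware numbers are the H100 figures quoted in the text;
# model size and batch sizes below are illustrative assumptions.

PEAK_BANDWIDTH = 3.35e12  # bytes/s of memory bandwidth
PEAK_COMPUTE = 989e12     # FLOP/s of FP16 compute


def decode_utilization(n_params: float, batch_size: int,
                       bytes_per_param: float = 2.0) -> float:
    """Fraction of peak compute used in one decode step (weights only)."""
    # Each sequence in the batch needs one multiply-add per weight.
    flops = 2.0 * n_params * batch_size
    # Weights are streamed from memory once per step and shared by the batch.
    bytes_moved = n_params * bytes_per_param

    time_compute = flops / PEAK_COMPUTE
    time_memory = bytes_moved / PEAK_BANDWIDTH

    # The step takes as long as the slower of the two resources.
    step_time = max(time_compute, time_memory)
    return time_compute / step_time


if __name__ == "__main__":
    for batch in (1, 8, 32, 64, 512):
        util = decode_utilization(n_params=7e9, batch_size=batch)
        print(f"batch={batch:4d}  compute utilization ~ {util:.1%}")
```

Under these assumptions batch size 1 comes out near 0.3%, and batch sizes 32 to 64 land at roughly 11% to 22%, the same ballpark as the ~10-20% figure above. It also shows why batching and quantization help: a larger batch reuses each byte of weights for more work, and fewer bytes per parameter shrink the memory term directly.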
Use a real serving framework (vLLM, SGLang, TensorRT-LLM) rather than hand-rolling. Measure your actual prompt/response shapes first: long shared prefixes favour prefix caching, high concurrency favours batching, long outputs favour KV-cache and quantization work. Track cost-per-1k-tokens, throughput, and tail latency, the numbers the business actually feels.
Inference optimization is where AI economics are won or lost. The techniques are well understood and together routinely cut serving cost 5-10x, often the deciding factor in whether an AI feature ships at all.
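The three metrics named above (cost per 1k tokens, throughput, tail latency) could be computed from a request log roughly as follows. This is a minimal sketch, not tied to any serving framework: `RequestRecord`, `serving_metrics`, and the GPU price are hypothetical, and it assumes cost is attributed to output tokens only and that tail latency means a nearest-rank p99.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """One served request (hypothetical log schema)."""
    latency_s: float     # end-to-end latency in seconds
    output_tokens: int   # tokens generated for this request


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with >= pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def serving_metrics(records, window_s, gpu_cost_per_hour, n_gpus=1):
    """Summarize a non-empty list of records collected over window_s seconds."""
    total_tokens = sum(r.output_tokens for r in records)
    # GPUs are billed for the whole window, busy or idle.
    spend = gpu_cost_per_hour * n_gpus * window_s / 3600
    latencies = [r.latency_s for r in records]
    return {
        "throughput_tok_per_s": total_tokens / window_s,
        "cost_per_1k_tokens": spend / total_tokens * 1000,
        "p50_latency_s": percentile(latencies, 50),
        "p99_latency_s": percentile(latencies, 99),
    }


if __name__ == "__main__":
    # Synthetic log: 100 requests, 100 output tokens each, over one hour.
    log = [RequestRecord(latency_s=float(i), output_tokens=100) for i in range(1, 101)]
    print(serving_metrics(log, window_s=3600, gpu_cost_per_hour=2.0))
```

Billing the full window rather than only busy time is deliberate: an idle GPU still costs money, so low utilization shows up directly in cost per 1k tokens.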
Full version on the VSBD blog.