{"slug": "inference-cost-at-scale-with-napkin-math", "title": "Inference cost at scale with napkin math", "summary": "A technical analysis calculates the dollar cost per user for serving large language models at scale using napkin math, breaking down GPU resources, matrix multiplication costs, and attention mechanisms to estimate tokens per second and user capacity per GPU.", "body_md": "# Inference cost at scale with napkin math\n\nIf you serve AI models as a part of your product stack, you've likely wondered what kind of scale your GPU cluster tops out at.\n\nWith some rudimentary knowledge about your hardware and model architecture,\nwe can work out the dollar cost-per-user on the back of a napkin [1](#fn-1).\n\nIf you're comfortable reasoning about GPUs and/or LLMs, use this legend to skip to sections of relevance:\n\n- [Resources on a single GPU](#resources-on-a-single-gpu)\n- [Cost of a Matrix Multiplication](#cost-of-a-matrix-multiplication)\n- [An Overview of Language Models](#an-overview-of-language-models)\n- [Attention in Greater Detail](#attention-in-greater-detail)\n- [Reducing Compute with KV-Cache](#reducing-compute-with-kv-cache)\n- [How much does a token cost?](#how-much-does-a-token-cost)\n- [How many users can you serve realistically?](#how-many-users-can-you-serve-realistically)\n- [Optimizing for hundreds of users on a GPU](#optimizing-for-hundreds-of-users-on-a-gpu)\n- [Tokens Per Second](#tokens-per-second)\n- [Dollar cost per user](#dollar-cost-per-user)\n\n## Resources on a single GPU\n\nOn any GPU's spec-sheet you can find these metrics:\n\n**Peak throughput:** Number of floating-point operations per second.
Usually in TeraFLOPs (1 TFLOP/s = \\(10^{12}\\) ops/sec).\n\n**Memory bandwidth:** Amount of data that can be moved per second from global memory (VRAM) to registers (SRAM). Usually in TB/sec.\n\nWe'll assume FP-8 quantization to compute throughput, though it's easy to adjust the math for FP-16 as well.\n\n## Cost of a Matrix Multiplication\n\nIf you bothered to click on this article\nyou know that AI models do *many* matrix multiplications on *massive* matrices.\nThat we start by finding the cost of a matmul should be no surprise then.\n\nAssume two matrices: \\(A_{N \\times d} \\) and \\(B_{d \\times M}\\). Let their product be the matrix \\( O_{N \\times M} \\). From high school algebra, we know that each element of \\(O\\) can be computed as:\n\n\\( O^{i,k} = \\sum_{j=1}^{d} A^{i,j} \\cdot B^{j,k} \\)\n\nIn this, we find our first insight into the \"cost\" of a matrix multiplication. For each \\( O^{i,k}\\), we need to start with an initial value of 0 and:\n\n- Load \\(A^{i,j}\\) from memory.\n- Load \\(B^{j,k}\\) from memory.\n- Multiply them.\n- Add the product to the cumulative sum.\n\nAnd this is done a total of \\(d\\) times *per item.*\nSo, the cost of a `(N,d)*(d,M)` matrix product is \\( 2NMd \\) memory accesses and \\(2NMd\\) floating-point operations.\n\nWith an optimization called tiling,\nthe memory access goes down to about \\( d(N+M) \\).\nThe details aren't necessary to proceed, but [Alvin's blog post](https://alvinwan.com/how-to-tile-matrix-multiplication/)\nhas them for those curious.\n\n## An Overview of Language Models\n\nAt their core, LLMs are simple:\nthey receive a sequence of `N` words and generate the `N+1`th.\nEach word is represented as a `d`-dimensional vector.\nUsing repeated applications of a function called \"attention\" (explained later), they predict the next word.\n\nA single forward pass looks roughly like this:\n\n```\ny = input() # y = a (N x d) matrix\nfor each layer in the network:\n  y = attention(y)\n \n# Convert the final layer's output to word-probs.\n# W_vocab = matrix of
size d x vocab_len,\n# and vocab_len is the number of all words\n# in the model's vocabulary.\ntoken_probs = softmax(y * W_vocab)\nnext_tok    = argmax(token_probs)\n# next_tok is the id of the most likely token; its embedding is a (1 x d) vector\n```\n\nThis is also why LLMs are called auto-regressive.\nThey can keep doing multiple forward passes over their own output until a `<stop>` token is generated.\n\nThis is a simplified overview in which I'm skipping RoPE, the MLP layers in between, token sampling at the end, and much more.\n\nAs a fun exercise: you should try to price them in and\nverify that our math still works by a [Fermi estimation](https://en.wikipedia.org/wiki/Fermi_problem).\n\n## Attention in Greater Detail\n\nI'm going to assume that you have some familiarity with attention, and provide only a refresher here.\n\nAs mentioned, the input is a matrix \\(X \\in \\mathbb{R}^{N \\times d}\\), and \\(X_i\\) is a single \\(d\\) dimensional vector. For every \"layer\" in the network, the model stores matrices \\( W_Q, W_K, W_V \\in \\mathbb{R}^{d \\times d} \\), and computes \"attention\" as follows:\n\n\\( Q = X W_Q \\), \\( K = X W_K\\) and \\( V = X W_V\\)\n\n\\( \\mathrm{Attention}(Q,K,V) = \\mathrm{softmax}(Q K^T/\\sqrt{d}) V\\)\n\nOr, in python:\n\n``` python\nimport numpy as np\nfrom scipy.special import softmax\n\ndef attention(X, W_q, W_k, W_v):\n    d_model = X.shape[-1]  # embedding width d\n    Q, K, V = X @ W_q, X @ W_k, X @ W_v\n    Q_KT = Q @ K.transpose(1, 0)  # (N, d) @ (d, N) -> (N, N)\n    return softmax(Q_KT / np.sqrt(d_model), axis=-1) @ V\n```\n\nWhere `@` is the matrix product.\n\nIn reality, multiple LLM conversations are processed in parallel. So inference is batched: we process \\(B\\) chats concurrently. This means our input is \\( X \\in \\mathbb{R}^{B \\times N \\times d}\\).\n\nWork the math out on paper to verify it tracks.\n\nIn our Python code, just the transpose arguments change:\n\n```\n- Q_KT = Q @ K.transpose(1, 0)\n+ Q_KT = Q @ K.transpose(0, 2, 1)\n```\n\nOnly, there's one trouble with our implementation of attention: it **reads too much data from memory.**\nLet's look at a single matmul, the \\(K = X W_K\\).\nCompanies that serve models will allow you to chat with them for up to 200k or so tokens.\nFor a single `X @ W_k` matmul, it looks like this:\n\n```\nX   = tensor(B, N, d) # \"B\" chats, each with a maximum of \"N\" 'tokens'.\nW_k = tensor(d, d)    # weights have no batch dimension\nO   = tensor(B, N, d) # result of X @ W_k\n```\n\nNotice that the output is another \\(\\mathbb{R}^{B \\times N \\times d}\\) tensor.\n\nAs established in the matmul cost section, to compute each \\(O^b \\in \\mathbb{R}^{N \\times d}\\), we need \\(d(N+d)\\) memory reads and \\(2Nd^2\\) compute operations.\n\nFor a batch size of \\( B \\) (number of concurrent conversations), we get:\n\n- **Floating-point operations:** \\(2BNd^{2}\\).\n- **Memory accesses:** \\(Bd(N+d)\\).\n\nAssume N to be roughly 200k, and d to be `8192` (most common outside frontier labs).\nThis means that to generate one token for a single user, we need **26 trillion floating-point ops** and **1.7 billion memory accesses.** This is *with* the tiled matmuls. That's **way** more compute ops than memory reads. In fact, we're doing four orders of magnitude more compute than memory accesses. The next batch of input will have to wait tens of thousands of cycles for the GPU to finish with the current batch.\n\nOn diagramming the above matmuls out on paper, you'll notice a key detail: we're wasting far too many resources re-computing the matmul products for tokens that **were already processed in a previous iteration**.\n\nRecall that LLMs are auto-regressive.
They:\n\n1. Take a list of tokens \\(X\\), do a bunch of matmuls.\n2. Repeatedly do `attention(X, weights)` (once for each of the L layers), and generate a new token \\(x\\).\n3. Append \\(x\\) to \\(X\\) (the chat thus far).\n4. Put the output of step 3 back into step 1, until a \"STOP\" token is generated.\n\nTo avoid re-processing the entire chat history *again* for every new word,\ninference engines will cache the \\(K,V\\) pairs for reuse.\n\n## Reducing Compute with KV-Cache\n\nThe intermediate output on every chat, namely \\(K\\) and \\(V\\),\nis cached at every layer, and stored in a region of VRAM called the\n**KV Cache.**\nInference engines like vLLM allow programmers to decide what\npercentage of VRAM should be pre-allocated for this.\n\nOf course, it's not as easy as I made it sound. There's a lot of cleverness applied to make optimal use of the memory vLLM is handed, the details of which you can find in [this presentation](https://youtu.be/5ZlavKF_98U) by the original authors.\n\nFor our napkin math,\nthe existence of KV-cache allows one simplification:\n**for every forward pass, we get to process only the most recently generated word, rather than the entire history**.\ni.e., instead of processing a \\(X \\in \\mathbb{R}^{N \\times d}\\),\nwe get \\(X \\in \\mathbb{R}^{1 \\times d}\\) (the most recent token).\n\nThe math for `X @ W_k` now becomes:\n\n```\nX   = tensor(B, 1, d)\nW_k = tensor(d, d)\nO   = tensor(B, 1, d)\n```\n\nFor a single conversation (\\(B = 1\\)) with \\(d = 8192\\), we get:\n\n- ~67.1 million memory accesses\n- ~134 million ops\n\nMeaning that for every memory access made, we need only perform *two* operations rather than 10 thousand.\nThe weights are read once and reused across the whole batch, so for the entire batch we're doing **2*B operations per memory access.**\nThis is fantastic!
Now, let's pull out the spec-sheet for the fastest GPU available and figure out how many tokens we can generate per second (and for how many users).\n\n## How much does a token cost?\n\nLet's take the NVIDIA B200 as our leading example for the remainder of this post. From a web search, you'll find that it has the following specs:\n\n- Memory bandwidth: 8 TB/s (Or \\(8*10^{12} \\) bytes accessed per second).\n- Compute throughput: 4500 TFLOP/s (Or \\(4500 * 10^{12}\\) bytes crunched per second, at one byte per FP-8 value). [2](#fn-2)\n\nSee that?\nA Blackwell-class GPU can crunch bytes **562 times faster** than it can load them.\nPut differently, to get the most out of such a chip, we should be doing **562 computations for every byte loaded.**\nAny more, and we have memory bandwidth sitting idle (e.g., without a KV-cache).\nAny less, and we have compute cores sitting idle.\n\nCurrently, we're doing **2*B** compute-ops per byte read.\nSo, how many users should we serve to fully exhaust a B200's compute and bandwidth budget?\n\n\\( 2B = 562 \\implies B = 281 \\)\n\nWith a single NVIDIA B200 GPU, we should be serving **281 users concurrently** to get the most out of our investment.\nOf course, this is a theoretical ceiling.\nIn reality, VRAM is limited. We'll have to squeeze the model weights in there along with the huge KV-cache.\n\n## How many users can you serve realistically?\n\nWe'll assume a 32B dense model, as they've gotten quite good for production use and a B200 can comfortably serve them.
This could be a Gemma, Qwen, DeepSeek, whatever.\n\nNote that we're assuming a pure transformer architecture,\neven though a lot of open-weight models use \"tricks\" to reduce KV-cache pressure on long contexts\n(see: [Gated Delta-Nets](https://sebastianraschka.com/llms-from-scratch/ch04/08_deltanet/), and the [Gemma 3 technical report](https://arxiv.org/pdf/2503.19786)).\nYou can chat with your favorite LLM to figure out how this affects your inference math.\n\nBack to our problem: we have a 32B model.\nAt FP-8 (one byte per parameter), this is 32GB (`32*10^9` bytes) in VRAM.\nLet's assume a context window of \\(N\\)=200k tokens.\nThe input is \\(N \\times d\\)–dimensional at every layer.\n\nFor each layer, we need to store \\(2Nd\\) bytes for a pair of K and V matrices.\nA model of our size will typically have `d=8192` and `L=64`, giving us:\n\n```\nKV cache size = 2 *    N    *  L *  d\n              = 2 * 200_000 * 64 * 8192\n              = 210 GB (!!)\n```\n\nThat's more VRAM than our GPU has!\n\nHere, I'll invoke another optimization that models of this size use: [Grouped-Query-Attention](https://arxiv.org/pdf/2305.13245). If attention is new to you, you may save this for future reading and rely on my claim that it cuts down the KV cache size by about 8x.\n\nBut if you're familiar with Multi-Head-Attention then GQA is simple: it shares the same KV-head across multiple query heads. So for 64 query heads, we'll use a total of only 8 KV-heads; i.e., Q-heads 0-7 share the first KV-head, Q-heads 8-15 the next one, and so on.\n\nWith GQA our KV-cache is now at ~**26GB** *per chat sequence (or per user)*.\n\nWe're already using **32GB** for weights,\nso how many concurrent chat contexts can we store in the KV-cache in the remaining 160GB (out of 192GB of VRAM)?\nThat's 160/26 ≈ 6.\n\nSo about six chats running in parallel. That seems… low.\n\n## Optimizing for hundreds of users on a GPU\n\nMost contexts will never reach the 200k length.
Depending on your product, the median LLM-conversation can be anywhere between 4k and 40k tokens.\n\nTo account for variable-length conversations,\nwe can split the KV-cache into chunks,\nand incrementally allocate those chunks to different users as their token use grows.\nConversation threads that are abandoned/cold can be flushed out of the cache.\nThis is what vLLM does with [PagedAttention](https://hamzaelshafie.bearblog.dev/paged-attention-from-first-principles-a-view-inside-vllm/).\nDepending on the median user activity,\nyou can serve anywhere between **40-60 users per Blackwell chip**.\n\nRemember that the nature of your product matters too. In most ChatGPT-style apps the user spends more time reading than prompting. For a median chat session, a user will likely have 80% idle time. Here, the GPU has a duty cycle of 20% (!).\n\nRealistically, one chip can serve ~300-800 users comfortably depending on the style of your app. For non-chat apps, measuring duty-cycles is not optional.\n\n## Tokens Per Second\n\nEarlier, we saw that we can comfortably support 6 users at 100% duty cycle. But would their experience be snappy?\n\nAgain, this is a direct consequence of our memory-to-compute ratio.\nFor a single forward pass we move all the model weights + KV-cache from\nVRAM to registers *once*.\nThen, we do 2*B operations for every byte loaded.\n\nSo the total time spent is:\n\n```\ntime spent moving data\n    = memory in GB  / bandwidth in GB/s\n    = 190 GB / (8*10**3) GB/s\n    = 0.02375 seconds\n    = 23.75 ms\n\ntime spent computing\n    = bytes loaded * 2 * B / peak throughput\n    = (190*10**9 * 2 * 6) / (4500*10**12) FLOP/s\n    = 0.5 ms\n```\n\nSince both happen in parallel, the compute cores are idle 98% of the time.\n\nEvery 24ms, we generate B=6 tokens.
In 1s (=1000ms), we run about 42 forward passes and generate roughly 250 tokens across 6 users, or about 40 tokens per user per second.\n\nAssuming the LLM output is meant for reading (unlike, say, building SQL queries in the background), 40 tokens per second is beyond most people's reading speed.\n\n## Dollar cost per user\n\nThis largely depends on whether you own or rent your hardware.\nAt $40,000 per B200, your lifetime cost per user is `40_000/num_users`.\n\nIn the 100% duty cycle case (worst for cost), that's about $6,700 per user. Realistically, serving 300 users per GPU, you'll spend a lifetime cost of about $133 per user, plus the datacenter/upkeep bill.\n\nIf you rent the GPU, the cost is more straightforward.\nAt an hourly rate of $4 [3](#fn-3), your hourly cost per user is `4/num_users`.\nFor `num_users=300` you get an hourly rate of about $0.013 per user,\nor `$9.36` per month (at 720 hours per month).\n\n## Ballpark accuracy of our estimate\n\nFor a 32B model on a B200, this is a rather conservative estimate. I've left some headroom for workflows with high duty cycles, like an agent that loops over tool-calls and runs queries.\n\nAs an AI company, you'll have more than one GPU (I pray). For model-sizes that span multiple GPUs, our math is still directionally valid, but the use of napkins is ill-advised.\n\n## Backmatter\n\nOr two napkins, realistically. This is how I worked it out over coffee with a friend, which is how this post came to be.\n\nThis is the *dense* compute throughput of the B200.
For sparse matrices, it goes up to 9000 TFLOP/s.\n\nAt the time of writing, you'll only find this price on bulk rentals with time commitments.", "url": "https://wpnews.pro/news/inference-cost-at-scale-with-napkin-math", "canonical_source": "https://injuly.in/blog/napkin-inference-cost/index.html", "published_at": "2026-06-16 18:57:29+00:00", "updated_at": "2026-06-20 21:36:10.827233+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-chips", "ai-products"], "entities": ["GPU", "LLM", "FP-8", "FP-16", "VRAM", "SRAM", "KV-Cache", "RoPE"], "alternates": {"html": "https://wpnews.pro/news/inference-cost-at-scale-with-napkin-math", "markdown": "https://wpnews.pro/news/inference-cost-at-scale-with-napkin-math.md", "text": "https://wpnews.pro/news/inference-cost-at-scale-with-napkin-math.txt", "jsonld": "https://wpnews.pro/news/inference-cost-at-scale-with-napkin-math.jsonld"}}