Prompt Processing vs Generation: Why Your Box Is Fast at One and Slow at the Other

Local LLM inference splits into two phases—prompt processing (compute-bound) and generation (memory-bandwidth-bound)—explaining why hardware with identical token generation speeds can have vastly different time-to-first-token. Memory bandwidth determines generation speed, while raw compute determines prompt processing speed, making each hardware class (Mac, Strix Halo, DGX Spark) optimal for different workloads.

Here's a result that confuses almost everyone comparing local-LLM hardware: two machines can generate tokens at nearly identical speed, yet one takes three times longer to start replying to a long prompt. Same models, same quant, wildly different feel. People conclude the benchmarks are broken. They're not. They're measuring two different things.

Running a local LLM happens in two phases, and they have opposite bottlenecks. Once you understand the split, the entire local-hardware market stops being confusing: you'll know which spec actually matters for your workload, and why a Mac, a Strix Halo box, and a DGX Spark each win and lose at different things.

The two phases

Every request is processed in two distinct stages:

Prompt processing (a.k.a. "prefill" / "reading"). Before the model writes anything, it has to read your entire input (system prompt, document, chat history) and build the keys and values for it. This is the wait before the first word appears: time to first token (TTFT).

Generation (a.k.a. "decode" / "writing"). Then the model produces the reply one token at a time, each step depending on the last. This is the tokens-per-second you watch stream out.

They feel similar from the outside (both are "the model working"), but under the hood they stress completely different parts of your hardware.

Why generation is limited by memory bandwidth (not compute)

This is the single most important idea for buying local-LLM hardware, so here's the intuition. To generate one token, the model must pass that token through every layer, which means reading the relevant model weights out of memory. Then to generate the next token, it reads them all over again. Decode is one token at a time, so the weights get streamed from memory on every single step. That makes generation memory-bandwidth-bound: the bottleneck isn't how fast your chip can do math, it's how fast it can move weights out of memory.
The foundational analysis of transformer inference (Pope et al., "Efficiently Scaling Transformer Inference," https://arxiv.org/abs/2211.05102?ref=vettedconsumer.com, 2022) makes this precise; the practical upshot is a back-of-envelope rule:

max tokens/sec ≈ memory bandwidth ÷ bytes read per token

Plug in a 70B model quantized to 4-bit (~40 GB of weights to stream per token):

| Memory bandwidth | Example hardware class | ~Ceiling on a 70B Q4 |
|---|---|---|
| ~800 GB/s | Mac Studio Ultra / high-end GPU | ~20 tok/s |
| ~256 GB/s | Strix Halo / DGX Spark unified memory | ~6 tok/s |
| ~1000 GB/s | RTX 4090-class GDDR | ~25 tok/s |

Notice what's missing from that table: raw compute (TFLOPS). For single-user generation it barely matters. You could double the chip's math throughput and the tokens-per-second would hardly move, because the chip is sitting idle waiting on memory. This is why memory bandwidth is the headline spec for local generation, and why Apple Silicon and unified-memory boxes, which pair big memory with high bandwidth, punch so far above their raw-compute weight. It's also the deep reason Mixture-of-Experts (https://vettedconsumer.com/mixture-of-experts-moe-explained-why-active-parameters-decide-what-runs-on-your-machine/) models generate so fast: only the active parameters get read per token, so "bytes read per token" shrinks dramatically.

Why prompt processing is the opposite: compute-bound

Prefill flips the equation. Instead of one token at a time, the model processes all your prompt tokens in parallel, as one big matrix-times-matrix multiply. That keeps the math units saturated, so prompt processing is compute-bound: now the FLOPS and tensor cores you ignored for generation are exactly what determine your time to first token. This is why a chip with strong tensor compute (like the GPU inside a DGX Spark) can produce a much faster first token on a long prompt, while a high-bandwidth-but-modest-compute box (a Mac, a Strix Halo) has to grind through it.
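Both rules of thumb so far fit in a few lines of Python. The sketch below is an illustration under stated assumptions, not a benchmark: the decode function is the bandwidth rule from the table, while the prefill function assumes roughly 2 FLOPs per parameter per prompt token (a common approximation that is not from the sources cited here and that ignores attention cost). The sustained-TFLOPS values are hypothetical, chosen only to show the scaling.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on single-user generation speed: bandwidth / bytes read per token."""
    return bandwidth_gb_s / gb_read_per_token


def prefill_seconds(active_params: float, prompt_tokens: int, sustained_tflops: float) -> float:
    """Rough time to first token.

    Assumes ~2 FLOPs per parameter per prompt token and ignores attention,
    so it understates the cost of very long prompts.
    """
    flops = 2.0 * active_params * prompt_tokens
    return flops / (sustained_tflops * 1e12)


WEIGHTS_GB = 40.0  # 70B model at 4-bit, the figure used in the table

# Decode: only bandwidth appears in the formula.
for bandwidth in (800.0, 256.0, 1000.0):
    ceiling = decode_ceiling_tok_s(bandwidth, WEIGHTS_GB)
    print(f"{bandwidth:.0f} GB/s: ceiling {ceiling:.1f} tok/s")

# Prefill: only compute appears. These TFLOPS values are hypothetical.
for tflops in (20.0, 60.0):
    for prompt_tokens in (512, 8192, 32768):
        seconds = prefill_seconds(70e9, prompt_tokens, tflops)
        print(f"{tflops:.0f} TFLOPS, {prompt_tokens} prompt tokens: ~{seconds:.1f} s")
```

The decode lines reproduce the table (20.0, 6.4 and 25.0 tok/s). The prefill estimate grows linearly with prompt length and shrinks with compute, which is the asymmetry the rest of this article relies on.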
And it gets worse with context length: prefill cost grows with prompt size, so the gap widens the longer your input.

The distinction is so fundamental that researchers showed prefill and decode interfere when mixed (Agrawal et al., SARATHI, https://arxiv.org/abs/2308.16369?ref=vettedconsumer.com, 2023), and datacenters now literally run the two phases on different machines with different hardware (Patel et al., Splitwise, https://arxiv.org/abs/2311.18677?ref=vettedconsumer.com, 2023). Two phases, two bottlenecks, two ideal chips.

How to read a local-LLM benchmark

Once you know there are two phases, benchmark numbers stop being noise. Nearly every serious local-LLM benchmark (the llama-bench output people post on r/LocalLLaMA, for instance) reports two figures: pp (prompt processing, often written "pp512") and tg (token generation, "tg128"). The pp number is your prefill/compute speed; the tg number is your decode/bandwidth speed.

A box can post a huge pp and a modest tg (compute-rich, bandwidth-limited) or the reverse (a Mac: middling pp, healthy tg). So when someone says a machine "does 40 tok/s," always ask which number: a single figure hides exactly the trade-off that decides whether it fits your workload. The honest comparisons report both, at a stated context length, because pp also degrades as the prompt grows.

What owners measure

This isn't a lab abstraction; it's the lived experience of anyone who owns more than one box. In a detailed "Strix Halo vs DGX Spark" owner write-up (https://redlib.catsarch.com/r/LocalLLaMA/comments/1odk11r/strix_halo_vs_dgx_spark_initial_impressions_long/?ref=vettedconsumer.com), u/Eugr, who runs both, reports exactly the split the theory predicts:

"The token generation is nearly identical to Strix Halo both in llama.cpp and vLLM." (u/Eugr, who owns both). In other words, decode is bandwidth-bound, and the two boxes have similar bandwidth.

"Strix Halo performance in prompt processing degrades much faster with context." (u/Eugr). In other words, prefill is compute-bound, and the Spark's stronger compute pulls ahead.

Generation: a tie (bandwidth parity). Prompt processing: a gap that widens with context (compute difference). That's the whole framework, confirmed by someone with both machines on the desk. It matches independent lab testing too, where the Spark's GPU runs several times faster on time-to-first-token while token generation stays close.

Which phase dominates your wait?

For any single request, the two phases split your total wait, and the ratio depends entirely on the shape of the job. A short prompt with a long reply (a quick question, a long essay) is decode-dominated: nearly all the time is bandwidth-bound generation, so memory bandwidth sets the experience. A long prompt with a short reply (summarizing a document, answering over a big RAG context, classifying or extracting) is prefill-dominated: most of the wall-clock goes into compute-bound prompt processing before a short answer pops out. Agentic workloads are the punishing case: they pile up long contexts (heavy prefill) and emit lots of tokens (heavy decode), so they expose a weak box on both axes at once.

Before you buy, picture your actual prompts: are they mostly reading or mostly writing? That one question maps you straight onto the spec, compute or bandwidth, that decides how the machine will feel.

What this means for buying hardware

Stop asking "which box is fastest?" and start asking "fastest at which phase?", because your workload decides which one you should pay for:

| Your workload | Dominant phase | Buy for… |
|---|---|---|
| Chat, short prompts, long replies | Generation (decode) | Memory bandwidth (Mac Ultra, high-bandwidth box) |
| Long documents, big codebases, RAG | Prompt processing (prefill) | Compute (strong tensor cores / GPU) |
| Agents (long context + lots of tokens) | Both | Bandwidth and compute (the expensive case) |
| Many concurrent users | Both, batched | Compute + a serving engine (vLLM) |

This single distinction resolves the most common local-hardware arguments. "Is the Mac Studio good for local AI?" It's brilliant for generation, weaker on prompt processing. "Why is my Strix Halo box slow on a 50k-token codebase?" Prefill is compute-bound and that's its softer spot. "Why does a DGX Spark feel snappy to start but not faster overall?" Compute wins prefill, bandwidth ties decode. None of these machines is simply "faster"; they're faster at different halves of the job.

The plot twist: batching changes the rules

Everything above assumes one user. The moment you serve several requests at once, generation stops being purely bandwidth-bound. When the model reads its weights out of memory to produce a token, it can apply them to every sequence in the batch at the same time, so that one expensive memory read is shared across all of them. Batching amortizes the bandwidth cost and starts using the compute that was sitting idle in single-user decode. Total throughput (tokens/sec across all users) then climbs with compute, even though any one user's speed doesn't change much.

This is why a serving engine with continuous batching like vLLM (https://arxiv.org/abs/2309.06180?ref=vettedconsumer.com) can saturate a GPU that looked "bandwidth-starved" for a single chat, and why datacenter inference economics look nothing like your desktop's.
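The batching effect can be sketched as a simple two-limit model: each decode step costs whichever is slower, streaming the weights once (shared by the whole batch) or doing the math for every sequence in the batch. This is a deliberately crude illustration. It assumes roughly 2 FLOPs per parameter per token, ignores KV-cache reads (which flatters large batches), and uses hypothetical hardware numbers rather than any measured machine.

```python
def batched_decode(batch_size: int, bandwidth_gb_s: float, weights_gb: float,
                   active_params: float, sustained_tflops: float) -> tuple[float, float]:
    """Return (per_user_tok_s, total_tok_s) for one decode step shared by a batch."""
    # One pass over the weights serves every sequence in the batch.
    memory_s = weights_gb / bandwidth_gb_s
    # Compute cost grows with the number of sequences (assumed ~2 FLOPs/param/token).
    compute_s = 2.0 * active_params * batch_size / (sustained_tflops * 1e12)
    step_s = max(memory_s, compute_s)
    per_user = 1.0 / step_s
    return per_user, per_user * batch_size


# Hypothetical box: 1000 GB/s, 100 sustained TFLOPS, 70B model with 40 GB of weights.
for batch in (1, 4, 16, 64):
    per_user, total = batched_decode(batch, 1000.0, 40.0, 70e9, 100.0)
    print(f"batch {batch:>2}: {per_user:5.1f} tok/s per user, {total:6.1f} tok/s total")
```

Under these assumptions the per-user speed stays at the bandwidth ceiling for small batches while total throughput scales with batch size; once the compute term exceeds the memory term, per-user speed starts to fall and compute sets the total.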
For local, single-user use, though, the bandwidth rule holds: you are the batch of one, and the chip spends most of its time waiting on memory.

How this stacks with quantization and MoE

The two phases interact with the other levers in this series. Quantization shrinks "bytes read per token," so a smaller quant directly raises your generation ceiling. Roughly, halving the bits can nearly double tok/s on the same bandwidth (see our quantization guide, https://vettedconsumer.com/gguf-vs-gptq-vs-awq-the-plain-english-guide-to-llm-quantization-and-which-one-to-pick/). MoE shrinks it a different way: only the active experts are read per token, so a 30B-A3B model generates at roughly 3B-model speed while still needing enough memory to hold all 30B. Neither trick does much for prefill, because prompt processing still has to push every input token through the active path. That is exactly why a heavily-quantized MoE can stream replies fast yet still make you wait on a giant prompt.

The cheat sheet

| Symptom | Cause | Lever |
|---|---|---|
| Long wait before the first word | Prefill (compute-bound) | More compute; shorter prompts; prompt caching |
| Reply streams out slowly | Decode (bandwidth-bound) | More memory bandwidth; smaller/quantized model; an MoE |
| Fine short, painful on long inputs | Prefill scales with prompt size | Compute, or trim/chunk the context |

The one-line version: generation speed is bought with memory bandwidth; prompt-processing speed is bought with compute. Know which half your workload leans on, and you'll never be surprised by a benchmark again.
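To put rough numbers on the quantization and MoE levers, here is a minimal sketch that computes the memory a model occupies and the bytes it reads per token, then applies the bandwidth rule. It assumes exactly the stated bits per weight with no overhead; real 4-bit formats store extra scale data, which is why a 70B model lands nearer the ~40 GB used earlier than the 35 GB computed here. The 256 GB/s figure is the unified-memory class from the table, and the model shapes are illustrative.

```python
def weights_gb(params: float, bits_per_weight: float) -> float:
    """Size in GB of `params` weights stored at `bits_per_weight` bits each."""
    return params * bits_per_weight / 8 / 1e9


def decode_ceiling(bandwidth_gb_s: float, active_params: float, bits_per_weight: float) -> float:
    """Bandwidth rule: only the active parameters are read for each token."""
    return bandwidth_gb_s / weights_gb(active_params, bits_per_weight)


BANDWIDTH_GB_S = 256.0

# (label, total parameters held in memory, parameters read per token, bits per weight)
cases = [
    ("70B dense, 8-bit", 70e9, 70e9, 8.0),
    ("70B dense, 4-bit", 70e9, 70e9, 4.0),
    ("30B-A3B MoE, 4-bit", 30e9, 3e9, 4.0),
]

for label, total_params, active_params, bits in cases:
    held = weights_gb(total_params, bits)
    read = weights_gb(active_params, bits)
    ceiling = decode_ceiling(BANDWIDTH_GB_S, active_params, bits)
    print(f"{label}: holds {held:.1f} GB, reads {read:.1f} GB/token, ceiling ~{ceiling:.1f} tok/s")
```

Halving the bits doubles the dense model's ceiling, while the MoE reads only its active slice per token yet still has to hold the full model in memory. Neither number says anything about prefill, which this sketch does not model.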
Sources & how we researched this

This explainer synthesizes the primary literature on transformer inference efficiency: Pope et al., "Efficiently Scaling Transformer Inference" (https://arxiv.org/abs/2211.05102?ref=vettedconsumer.com, 2022) for the memory-bandwidth-bound nature of decode; Agrawal et al., SARATHI (https://arxiv.org/abs/2308.16369?ref=vettedconsumer.com, 2023) for the compute-bound-prefill vs memory-bound-decode distinction; and Patel et al., Splitwise (https://arxiv.org/abs/2311.18677?ref=vettedconsumer.com, 2023) for production systems that physically separate the two phases. The owner measurements are from r/LocalLLaMA (https://redlib.catsarch.com/r/LocalLLaMA/comments/1odk11r/strix_halo_vs_dgx_spark_initial_impressions_long/?ref=vettedconsumer.com), linked so you can verify; we have not benchmarked these machines first-hand. The tokens-per-second figures are theoretical bandwidth ceilings (bandwidth ÷ model bytes), rounded; real throughput is lower due to overhead.

Related guides

The KV cache, explained (https://vettedconsumer.com/the-kv-cache-explained-why-long-context-eats-your-vram-and-how-to-fit-more/): why long prompts cost memory

Mixture-of-Experts, explained (https://vettedconsumer.com/mixture-of-experts-moe-explained-why-active-parameters-decide-what-runs-on-your-machine/): why MoE decodes fast

Strix Halo vs DGX Spark, according to owners of both (https://vettedconsumer.com/strix-halo-vs-dgx-spark-running-70b-locally-according-to-people-who-own-both/)