{"slug": "prompt-processing-vs-generation-why-your-box-is-fast-at-one-and-slow-at-the", "title": "Prompt Processing vs Generation: Why Your Box Is Fast at One and Slow at the Other", "summary": "Local LLM inference splits into two phases—prompt processing (compute-bound) and generation (memory-bandwidth-bound)—explaining why hardware with identical token generation speeds can have vastly different time-to-first-token. Memory bandwidth determines generation speed, while raw compute determines prompt processing speed, making each hardware class (Mac, Strix Halo, DGX Spark) optimal for different workloads.", "body_md": "Here's a result that confuses almost everyone comparing local-LLM hardware: two machines can generate tokens at nearly identical speed, yet one takes *three times longer* to start replying to a long prompt. Same models, same quant — wildly different feel. People conclude the benchmarks are broken. They're not. They're measuring two different things.\n\nRunning a local LLM happens in **two phases**, and they have **opposite bottlenecks**. Once you understand the split, the entire local-hardware market stops being confusing — you'll know which spec actually matters for *your* workload, and why a Mac, a Strix Halo box, and a DGX Spark each win and lose at different things.\n\n## The two phases\n\nEvery request is processed in two distinct stages:\n\n**Prompt processing (a.k.a. \"prefill\" / \"reading\").** Before the model writes anything, it has to read your entire input — system prompt, document, chat history — and build the keys and values for it. This is the wait before the first word appears: **time to first token (TTFT)**.\n\n**Generation (a.k.a. \"decode\" / \"writing\").** Then the model produces the reply one token at a time, each step depending on the last. 
This is the **tokens-per-second** you watch stream out.\n\nThey feel similar from the outside — both are \"the model working\" — but under the hood they stress completely different parts of your hardware.\n\n## Why generation is limited by memory bandwidth (not compute)\n\nThis is the single most important idea for buying local-LLM hardware, so here's the intuition. To generate *one* token, the model must pass that token through every layer — which means reading the relevant model weights out of memory. Then to generate the *next* token, it reads them **all over again**. Decode is one token at a time, so the weights get streamed from memory on every single step.\n\nThat makes generation **memory-bandwidth-bound**: the bottleneck isn't how fast your chip can do math, it's how fast it can move weights out of memory. The foundational analysis of transformer inference (Pope et al., [\"Efficiently Scaling Transformer Inference,\"](https://arxiv.org/abs/2211.05102?ref=vettedconsumer.com) 2022) makes this precise; the practical upshot is a back-of-envelope rule:\n\n**max tokens/sec ≈ memory bandwidth ÷ bytes read per token**\n\nPlug in a 70B model quantized to 4-bit (~40 GB of weights to stream per token):\n\n| Memory bandwidth | Example hardware class | ~Ceiling on a 70B Q4 |\n|---|---|---|\n| ~800 GB/s | Mac Studio Ultra / high-end GPU | ~20 tok/s |\n| ~256 GB/s | Strix Halo / DGX Spark unified memory | ~6 tok/s |\n| ~1000 GB/s | RTX 4090-class GDDR | ~25 tok/s |\n\nNotice what's *missing* from that table: raw compute (TFLOPS). For single-user generation it barely matters — you could double the chip's math throughput and the tokens-per-second would hardly move, because the chip is sitting idle waiting on memory. 
This is why **memory bandwidth is the headline spec** for local generation, and why Apple Silicon and unified-memory boxes — which pair big memory with high bandwidth — punch so far above their raw-compute weight.\n\n(It's also the deep reason [Mixture-of-Experts](https://vettedconsumer.com/mixture-of-experts-moe-explained-why-active-parameters-decide-what-runs-on-your-machine/) models generate so fast: only the *active* parameters get read per token, so \"bytes read per token\" shrinks dramatically.)\n\n## Why prompt processing is the opposite: compute-bound\n\nPrefill flips the equation. Instead of one token at a time, the model processes *all* your prompt tokens **in parallel** — a big matrix-times-matrix multiply. That keeps the math units saturated, so prompt processing is **compute-bound**: now the FLOPS and tensor cores you ignored for generation are exactly what determine your time to first token.\n\nThis is why a chip with strong tensor compute (like the GPU inside a DGX Spark) can produce a much faster first token on a long prompt, while a high-bandwidth-but-modest-compute box (a Mac, a Strix Halo) has to grind through it. And it gets worse with context length: prefill cost grows with prompt size, so the gap widens the longer your input. The distinction is so fundamental that researchers showed prefill and decode *interfere* when mixed (Agrawal et al., [SARATHI](https://arxiv.org/abs/2308.16369?ref=vettedconsumer.com), 2023), and datacenters now literally run the two phases on **different machines** with different hardware (Patel et al., [Splitwise](https://arxiv.org/abs/2311.18677?ref=vettedconsumer.com), 2023). Two phases, two bottlenecks, two ideal chips.\n\n## How to read a local-LLM benchmark\n\nOnce you know there are two phases, benchmark numbers stop being noise. 
Nearly every serious local-LLM benchmark — the `llama-bench` output people post on r/LocalLLaMA, for instance — reports **two** figures: **pp** (prompt processing, often written \"pp512\") and **tg** (token generation, \"tg128\"). The pp number is your prefill/compute speed; the tg number is your decode/bandwidth speed. A box can post a huge pp and a modest tg (compute-rich, bandwidth-limited) or the reverse (a Mac: middling pp, healthy tg). So when someone says a machine \"does 40 tok/s,\" always ask *which number* — a single figure hides exactly the trade-off that decides whether it fits your workload. The honest comparisons report both, at a stated context length, because pp also degrades as the prompt grows.\n\n## What owners measure\n\nThis isn't a lab abstraction — it's the lived experience of anyone who owns more than one box. In a detailed [\"Strix Halo vs DGX Spark\" owner write-up](https://redlib.catsarch.com/r/LocalLLaMA/comments/1odk11r/strix_halo_vs_dgx_spark_initial_impressions_long/?ref=vettedconsumer.com), u/Eugr — who runs both — reports exactly the split the theory predicts:\n\n\"The token generation is nearly identical to Strix Halo both in llama.cpp and vLLM.\" — u/Eugr (owns both) — i.e. decode is bandwidth-bound, and the two boxes have similar bandwidth\n\n\"Strix Halo performance in prompt processing degrades much faster with context.\" — u/Eugr — i.e. prefill is compute-bound, and the Spark's stronger compute pulls ahead\n\nGeneration: a tie (bandwidth parity). Prompt processing: a gap that widens with context (compute difference). That's the whole framework, confirmed by someone with both machines on the desk — and it matches independent lab testing too, where the Spark's GPU runs several times faster on time-to-first-token while token generation stays close.\n\n## Which phase dominates your wait?\n\nFor any single request, the two phases split your total wait — and the ratio depends entirely on the shape of the job. 
A **short prompt with a long reply** (a quick question, a long essay) is decode-dominated: nearly all the time is bandwidth-bound generation, so memory bandwidth sets the experience. A **long prompt with a short reply** — summarizing a document, answering over a big RAG context, classifying or extracting — is prefill-dominated: most of the wall-clock goes into compute-bound prompt processing before a short answer pops out. Agentic workloads are the punishing case: they pile up long contexts (heavy prefill) *and* emit lots of tokens (heavy decode), so they expose a weak box on both axes at once. Before you buy, picture your actual prompts: are they mostly *reading* or mostly *writing*? That one question maps you straight onto the spec — compute or bandwidth — that decides how the machine will feel.\n\n## What this means for buying hardware\n\nStop asking \"which box is fastest?\" and start asking \"fastest at *which phase*?\" — because your workload decides which one you should pay for:\n\n| Your workload | Dominant phase | Buy for… |\n|---|---|---|\n| Chat, short prompts, long replies | Generation (decode) | Memory bandwidth (Mac Ultra, high-bandwidth box) |\n| Long documents, big codebases, RAG | Prompt processing (prefill) | Compute (strong tensor cores / GPU) |\n| Agents (long context + lots of tokens) | Both | Bandwidth and compute — the expensive case |\n| Many concurrent users | Both, batched | Compute + a serving engine (vLLM) |\n\nThis single distinction resolves the most common local-hardware arguments. \"Is the Mac Studio good for local AI?\" — brilliant for generation, weaker on prompt processing. \"Why is my Strix Halo box slow on a 50k-token codebase?\" — prefill is compute-bound and that's its softer spot. \"Why does a DGX Spark feel snappy to start but not faster overall?\" — compute wins prefill, bandwidth ties decode. 
None of these machines is simply \"faster\"; they're faster at different halves of the job.\n\n## The plot twist: batching changes the rules\n\nEverything above assumes **one user**. The moment you serve several requests at once, generation stops being purely bandwidth-bound. When the model reads its weights out of memory to produce a token, it can apply them to *every* sequence in the batch at the same time — so that one expensive memory read is shared across all of them. Batching amortizes the bandwidth cost and starts using the compute that was sitting idle in single-user decode. Total throughput (tokens/sec across all users) then climbs with compute, even though any one user's speed doesn't change much.\n\nThis is why a serving engine with continuous batching like [vLLM](https://arxiv.org/abs/2309.06180?ref=vettedconsumer.com) can saturate a GPU that looked \"bandwidth-starved\" for a single chat — and why datacenter inference economics look nothing like your desktop's. For local, single-user use, though, the bandwidth rule holds: you are the batch of one, and the chip spends most of its time waiting on memory.\n\n## How this stacks with quantization and MoE\n\nThe two phases interact with the other levers in this series. **Quantization** shrinks \"bytes read per token,\" so a smaller quant directly raises your generation ceiling — roughly, halving the bits can nearly double tok/s on the same bandwidth (see our [quantization guide](https://vettedconsumer.com/gguf-vs-gptq-vs-awq-the-plain-english-guide-to-llm-quantization-and-which-one-to-pick/)). **MoE** shrinks it a different way: only the active experts are read per token, so a 30B-A3B model generates at roughly 3B-model speed while still needing enough memory to *hold* all 30B. 
Neither trick does much for prefill — prompt processing still has to push every input token through the active path — which is exactly why a heavily-quantized MoE can stream replies fast yet still make you wait on a giant prompt.\n\n## The cheat sheet\n\n| Symptom | Cause | Lever |\n|---|---|---|\n| Long wait before the first word | Prefill (compute-bound) | More compute; shorter prompts; prompt caching |\n| Reply streams out slowly | Decode (bandwidth-bound) | More memory bandwidth; smaller/quantized model; an MoE |\n| Fine short, painful on long inputs | Prefill scales with prompt size | Compute, or trim/chunk the context |\n\nThe one-line version: **generation speed is bought with memory bandwidth; prompt-processing speed is bought with compute.** Know which half your workload leans on, and you'll never be surprised by a benchmark again.\n\n## Sources & how we researched this\n\nThis explainer synthesizes the primary literature on transformer inference efficiency — Pope et al., [\"Efficiently Scaling Transformer Inference\"](https://arxiv.org/abs/2211.05102?ref=vettedconsumer.com) (2022) for the memory-bandwidth-bound nature of decode; Agrawal et al., [SARATHI](https://arxiv.org/abs/2308.16369?ref=vettedconsumer.com) (2023) for the compute-bound-prefill vs memory-bound-decode distinction; and Patel et al., [Splitwise](https://arxiv.org/abs/2311.18677?ref=vettedconsumer.com) (2023) for production systems that physically separate the two phases. The owner measurements are from [r/LocalLLaMA](https://redlib.catsarch.com/r/LocalLLaMA/comments/1odk11r/strix_halo_vs_dgx_spark_initial_impressions_long/?ref=vettedconsumer.com), linked so you can verify; we have not benchmarked these machines first-hand. 
The tokens-per-second figures are theoretical bandwidth ceilings (bandwidth ÷ model bytes), rounded; real throughput is lower due to overhead.\n\n## Related guides\n\n- [The KV cache, explained](https://vettedconsumer.com/the-kv-cache-explained-why-long-context-eats-your-vram-and-how-to-fit-more/) (why long prompts cost memory)\n- [Mixture-of-Experts, explained](https://vettedconsumer.com/mixture-of-experts-moe-explained-why-active-parameters-decide-what-runs-on-your-machine/) (why MoE decodes fast)\n- [Strix Halo vs DGX Spark, according to owners of both](https://vettedconsumer.com/strix-halo-vs-dgx-spark-running-70b-locally-according-to-people-who-own-both/)", "url": "https://wpnews.pro/news/prompt-processing-vs-generation-why-your-box-is-fast-at-one-and-slow-at-the", "canonical_source": "https://vettedconsumer.com/prompt-processing-vs-generation-why-your-box-is-fast-at-one-and-slow-at-the-other/", "published_at": "2026-06-17 13:00:00+00:00", "updated_at": "2026-06-17 13:30:28.277696+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-chips", "ai-research"], "entities": ["Apple Silicon", "Strix Halo", "DGX Spark", "RTX 4090", "Mac Studio Ultra"], "alternates": {"html": "https://wpnews.pro/news/prompt-processing-vs-generation-why-your-box-is-fast-at-one-and-slow-at-the", "markdown": "https://wpnews.pro/news/prompt-processing-vs-generation-why-your-box-is-fast-at-one-and-slow-at-the.md", "text": "https://wpnews.pro/news/prompt-processing-vs-generation-why-your-box-is-fast-at-one-and-slow-at-the.txt", "jsonld": "https://wpnews.pro/news/prompt-processing-vs-generation-why-your-box-is-fast-at-one-and-slow-at-the.jsonld"}}