# Prompt Processing vs Generation: Why Your Box Is Fast at One and Slow at the Other

> Source: <https://vettedconsumer.com/prompt-processing-vs-generation-why-your-box-is-fast-at-one-and-slow-at-the-other/>
> Published: 2026-06-17 13:00:00+00:00

Here's a result that confuses almost everyone comparing local-LLM hardware: two machines can generate tokens at nearly identical speed, yet one takes *three times longer* to start replying to a long prompt. Same models, same quant — wildly different feel. People conclude the benchmarks are broken. They're not. They're measuring two different things.

Running a local LLM happens in **two phases**, and they have **opposite bottlenecks**. Once you understand the split, the entire local-hardware market stops being confusing — you'll know which spec actually matters for *your* workload, and why a Mac, a Strix Halo box, and a DGX Spark each win and lose at different things.

## The two phases

Every request is processed in two distinct stages:

**Prompt processing (a.k.a. "prefill" / "reading").** Before the model writes anything, it has to read your entire input (system prompt, document, chat history) and build the keys and values for it. This is the wait before the first word appears: **time to first token (TTFT)**.

**Generation (a.k.a. "decode" / "writing").** Then the model produces the reply one token at a time, each step depending on the last. This is the **tokens-per-second** you watch stream out.

They feel similar from the outside — both are "the model working" — but under the hood they stress completely different parts of your hardware.

## Why generation is limited by memory bandwidth (not compute)

This is the single most important idea for buying local-LLM hardware, so here's the intuition. To generate *one* token, the model must pass that token through every layer — which means reading the relevant model weights out of memory. Then to generate the *next* token, it reads them **all over again**. Decode is one token at a time, so the weights get streamed from memory on every single step.

That makes generation **memory-bandwidth-bound**: the bottleneck isn't how fast your chip can do math, it's how fast it can move weights out of memory. The foundational analysis of transformer inference (Pope et al., ["Efficiently Scaling Transformer Inference,"](https://arxiv.org/abs/2211.05102?ref=vettedconsumer.com) 2022) makes this precise; the practical upshot is a back-of-envelope rule:

**max tokens/sec ≈ memory bandwidth ÷ bytes read per token**

Plug in a 70B model quantized to 4-bit (~40 GB of weights to stream per token):

| Memory bandwidth | Example hardware class | ~Ceiling on a 70B Q4 |
|---|---|---|
| ~800 GB/s | Mac Studio Ultra / high-end GPU | ~20 tok/s |
| ~256 GB/s | Strix Halo / DGX Spark unified memory | ~6 tok/s |
| ~1000 GB/s | RTX 4090-class GDDR | ~25 tok/s |
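
The table's arithmetic can be reproduced in a few lines. This is a minimal sketch of the back-of-envelope rule only: the function name is made up for illustration, the ~40 GB figure is the article's estimate for a 70B Q4 model, and the result is a theoretical ceiling, not a measured speed.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-user decode speed.

    Assumes every generated token streams the full set of weights from
    memory once, and ignores all other overhead (so real speed is lower).
    """
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


WEIGHTS_GB = 40.0  # ~70B parameters at 4-bit, as in the table above

for label, bandwidth in [("~800 GB/s", 800.0),
                         ("~256 GB/s", 256.0),
                         ("~1000 GB/s", 1000.0)]:
    ceiling = decode_ceiling_tok_s(bandwidth, WEIGHTS_GB)
    print(f"{label}: ~{ceiling:.1f} tok/s ceiling")
```

Compute does not appear anywhere in the function, which is the point of the rule: in this simplified model, only bandwidth and model size move the ceiling.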

Notice what's *missing* from that table: raw compute (TFLOPS). For single-user generation it barely matters — you could double the chip's math throughput and the tokens-per-second would hardly move, because the chip is sitting idle waiting on memory. This is why **memory bandwidth is the headline spec** for local generation, and why Apple Silicon and unified-memory boxes — which pair big memory with high bandwidth — punch so far above their raw-compute weight.

(It's also the deep reason [Mixture-of-Experts](https://vettedconsumer.com/mixture-of-experts-moe-explained-why-active-parameters-decide-what-runs-on-your-machine/) models generate so fast: only the *active* parameters get read per token, so "bytes read per token" shrinks dramatically.)

## Why prompt processing is the opposite: compute-bound

Prefill flips the equation. Instead of one token at a time, the model processes *all* your prompt tokens **in parallel** — a big matrix-times-matrix multiply. That keeps the math units saturated, so prompt processing is **compute-bound**: now the FLOPS and tensor cores you ignored for generation are exactly what determine your time to first token.

This is why a chip with strong tensor compute (like the GPU inside a DGX Spark) can produce a much faster first token on a long prompt, while a high-bandwidth-but-modest-compute box (a Mac, a Strix Halo) has to grind through it. And it gets worse with context length: prefill cost grows with prompt size, so the gap widens the longer your input. The distinction is so fundamental that researchers showed prefill and decode *interfere* when mixed (Agrawal et al., [SARATHI](https://arxiv.org/abs/2308.16369?ref=vettedconsumer.com), 2023), and datacenters now literally run the two phases on **different machines** with different hardware (Patel et al., [Splitwise](https://arxiv.org/abs/2311.18677?ref=vettedconsumer.com), 2023). Two phases, two bottlenecks, two ideal chips.
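
To get a feel for why compute sets time to first token, here is a rough sketch. It assumes the common approximation of about 2 FLOPs per parameter per token for a dense model's forward pass, and it ignores the attention cost that grows with context length, so it understates long-prompt prefill. The function name and the two sustained-TFLOPS figures are hypothetical illustrations, not specs or measurements of any machine named in this article.

```python
def prefill_seconds(params_billion: float,
                    prompt_tokens: int,
                    sustained_tflops: float) -> float:
    """Rough lower bound on prefill time for a dense model.

    Assumption: ~2 FLOPs per parameter per prompt token, with the chip
    sustaining `sustained_tflops` for the whole prompt.
    """
    total_flops = 2.0 * params_billion * 1e9 * prompt_tokens
    return total_flops / (sustained_tflops * 1e12)


# Hypothetical boxes: same 70B model, same 8,000-token prompt.
for name, tflops in [("modest-compute box", 50.0),
                     ("strong-compute box", 150.0)]:
    seconds = prefill_seconds(70.0, 8_000, tflops)
    print(f"{name}: ~{seconds:.1f} s before the first token")
```

Under these assumptions, tripling sustained compute cuts the wait to a third, while memory bandwidth never enters the estimate.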

## How to read a local-LLM benchmark

Once you know there are two phases, benchmark numbers stop being noise. Nearly every serious local-LLM benchmark (the `llama-bench` output people post on r/LocalLLaMA, for instance) reports **two** figures: **pp** (prompt processing, often written "pp512" for a 512-token prompt) and **tg** (token generation, "tg128" for 128 generated tokens). The pp number is your prefill/compute speed; the tg number is your decode/bandwidth speed. A box can post a huge pp and a modest tg (compute-rich, bandwidth-limited) or the reverse (a Mac: middling pp, healthy tg). So when someone says a machine "does 40 tok/s," always ask *which number*: a single figure hides exactly the trade-off that decides whether it fits your workload. The honest comparisons report both, at a stated context length, because pp also degrades as the prompt grows.

## What owners measure

This isn't a lab abstraction; it's the lived experience of anyone who owns more than one box. In a detailed ["Strix Halo vs DGX Spark" owner write-up](https://redlib.catsarch.com/r/LocalLLaMA/comments/1odk11r/strix_halo_vs_dgx_spark_initial_impressions_long/?ref=vettedconsumer.com), u/Eugr, who runs both, reports exactly the split the theory predicts:

> "The token generation is nearly identical to Strix Halo both in llama.cpp and vLLM." (u/Eugr, who owns both)

That is, decode is bandwidth-bound, and the two boxes have similar bandwidth.

> "Strix Halo performance in prompt processing degrades much faster with context." (u/Eugr)

That is, prefill is compute-bound, and the Spark's stronger compute pulls ahead.

Generation: a tie (bandwidth parity). Prompt processing: a gap that widens with context (compute difference). That's the whole framework, confirmed by someone with both machines on the desk — and it matches independent lab testing too, where the Spark's GPU runs several times faster on time-to-first-token while token generation stays close.

## Which phase dominates your wait?

For any single request, the two phases split your total wait — and the ratio depends entirely on the shape of the job. A **short prompt with a long reply** (a quick question, a long essay) is decode-dominated: nearly all the time is bandwidth-bound generation, so memory bandwidth sets the experience. A **long prompt with a short reply** — summarizing a document, answering over a big RAG context, classifying or extracting — is prefill-dominated: most of the wall-clock goes into compute-bound prompt processing before a short answer pops out. Agentic workloads are the punishing case: they pile up long contexts (heavy prefill) *and* emit lots of tokens (heavy decode), so they expose a weak box on both axes at once. Before you buy, picture your actual prompts: are they mostly *reading* or mostly *writing*? That one question maps you straight onto the spec — compute or bandwidth — that decides how the machine will feel.
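
The split can be estimated from the two benchmark numbers. The sketch below assumes constant pp and tg rates, which is a simplification since prompt-processing speed tends to drop as context grows. The function name and the pp/tg values are hypothetical, chosen only to illustrate the two job shapes.

```python
def wait_breakdown(prompt_tokens: int, reply_tokens: int,
                   pp_tok_s: float, tg_tok_s: float) -> dict:
    """Split one request's wall-clock time into prefill and decode.

    Assumes constant rates: pp_tok_s for prompt processing and
    tg_tok_s for token generation.
    """
    prefill_s = prompt_tokens / pp_tok_s
    decode_s = reply_tokens / tg_tok_s
    total_s = prefill_s + decode_s
    return {
        "prefill_s": prefill_s,
        "decode_s": decode_s,
        "total_s": total_s,
        "prefill_share": prefill_s / total_s,
    }


PP, TG = 500.0, 10.0  # hypothetical pp and tg figures for one box

jobs = {
    "chat (short prompt, long reply)": (200, 800),
    "RAG (long prompt, short reply)": (20_000, 200),
}
for name, (prompt, reply) in jobs.items():
    result = wait_breakdown(prompt, reply, PP, TG)
    print(f"{name}: {result['total_s']:.1f} s total, "
          f"{result['prefill_share']:.0%} spent in prefill")
```

With these made-up rates the chat job spends under 1% of its time in prefill, while the RAG job spends about two thirds there, so the same box is limited by a different spec in each case.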

## What this means for buying hardware

Stop asking "which box is fastest?" and start asking "fastest at *which phase*?" — because your workload decides which one you should pay for:

| Your workload | Dominant phase | Buy for… |
|---|---|---|
| Chat, short prompts, long replies | Generation (decode) | Memory bandwidth (Mac Ultra, high-bandwidth box) |
| Long documents, big codebases, RAG | Prompt processing (prefill) | Compute (strong tensor cores / GPU) |
| Agents (long context + lots of tokens) | Both | Bandwidth and compute — the expensive case |
| Many concurrent users | Both, batched | Compute + a serving engine (vLLM) |

This single distinction resolves the most common local-hardware arguments. "Is the Mac Studio good for local AI?" — brilliant for generation, weaker on prompt processing. "Why is my Strix Halo box slow on a 50k-token codebase?" — prefill is compute-bound and that's its softer spot. "Why does a DGX Spark feel snappy to start but not faster overall?" — compute wins prefill, bandwidth ties decode. None of these machines is simply "faster"; they're faster at different halves of the job.

## The plot twist: batching changes the rules

Everything above assumes **one user**. The moment you serve several requests at once, generation stops being purely bandwidth-bound. When the model reads its weights out of memory to produce a token, it can apply them to *every* sequence in the batch at the same time — so that one expensive memory read is shared across all of them. Batching amortizes the bandwidth cost and starts using the compute that was sitting idle in single-user decode. Total throughput (tokens/sec across all users) then climbs with compute, even though any one user's speed doesn't change much.

This is why a serving engine with continuous batching like [vLLM](https://arxiv.org/abs/2309.06180?ref=vettedconsumer.com) can saturate a GPU that looked "bandwidth-starved" for a single chat — and why datacenter inference economics look nothing like your desktop's. For local, single-user use, though, the bandwidth rule holds: you are the batch of one, and the chip spends most of its time waiting on memory.
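
A simplified roofline-style sketch shows the crossover. It assumes one weight read is shared by the whole batch, about 2 FLOPs per parameter per generated token, and no KV-cache traffic (which in practice grows with batch size and context). The function name and the hardware numbers are hypothetical, not a model of vLLM or of any specific GPU.

```python
def batched_throughput_tok_s(batch: int,
                             bandwidth_gb_s: float,
                             weights_gb: float,
                             sustained_tflops: float,
                             params_billion: float) -> float:
    """Total decode throughput across a batch, capped by the tighter limit.

    Bandwidth limit: each step reads the weights once and yields `batch` tokens.
    Compute limit: each token costs ~2 FLOPs per parameter (assumption).
    """
    bandwidth_limit = batch * bandwidth_gb_s / weights_gb
    compute_limit = (sustained_tflops * 1e12) / (2.0 * params_billion * 1e9)
    return min(bandwidth_limit, compute_limit)


# Hypothetical GPU: 1000 GB/s, 100 sustained TFLOPS, 70B model in ~40 GB.
for batch in (1, 8, 32, 64):
    total = batched_throughput_tok_s(batch, 1000.0, 40.0, 100.0, 70.0)
    print(f"batch {batch:>2}: ~{total:.0f} tok/s total, "
          f"~{total / batch:.1f} tok/s per user")
```

In this toy model, total throughput scales with batch size until the compute limit takes over, after which adding users only divides the same total among more people.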

## How this stacks with quantization and MoE

The two phases interact with the other levers in this series. **Quantization** shrinks "bytes read per token," so a smaller quant directly raises your generation ceiling — roughly, halving the bits can nearly double tok/s on the same bandwidth (see our [quantization guide](https://vettedconsumer.com/gguf-vs-gptq-vs-awq-the-plain-english-guide-to-llm-quantization-and-which-one-to-pick/)). **MoE** shrinks it a different way: only the active experts are read per token, so a 30B-A3B model generates at roughly 3B-model speed while still needing enough memory to *hold* all 30B. Neither trick does much for prefill — prompt processing still has to push every input token through the active path — which is exactly why a heavily-quantized MoE can stream replies fast yet still make you wait on a giant prompt.
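
Both levers act on the same term of the bandwidth rule, which a short sketch makes concrete. It assumes bytes read per token is simply active parameters times bits per weight. Real quant formats carry scales and metadata, so effective bits per weight sit somewhat above the nominal figure, which is consistent with the ~40 GB used earlier for a 70B Q4. The function names are illustrative.

```python
def gb_read_per_token(active_params_billion: float,
                      bits_per_weight: float) -> float:
    """Approximate GB streamed from memory per generated token."""
    return active_params_billion * bits_per_weight / 8.0


def decode_ceiling(bandwidth_gb_s: float, gb_per_token: float) -> float:
    """Bandwidth rule: tokens/sec ceiling = bandwidth / bytes read per token."""
    return bandwidth_gb_s / gb_per_token


BANDWIDTH = 256.0  # GB/s, the unified-memory class from the table

cases = {
    "dense 30B at 8-bit": gb_read_per_token(30.0, 8.0),
    "dense 30B at 4-bit": gb_read_per_token(30.0, 4.0),
    "MoE 30B-A3B at 4-bit": gb_read_per_token(3.0, 4.0),  # only active params read
}
for name, gb in cases.items():
    print(f"{name}: {gb:.1f} GB/token, "
          f"~{decode_ceiling(BANDWIDTH, gb):.0f} tok/s ceiling")

# The MoE still has to hold every expert in memory, not just the active ones.
print(f"MoE memory to hold: ~{gb_read_per_token(30.0, 4.0):.0f} GB")
```

Halving the bits doubles the ceiling, and the MoE's ceiling follows its 3B active parameters even though its memory footprint follows all 30B.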

## The cheat sheet

| Symptom | Cause | Lever |
|---|---|---|
| Long wait before the first word | Prefill (compute-bound) | More compute; shorter prompts; prompt caching |
| Reply streams out slowly | Decode (bandwidth-bound) | More memory bandwidth; smaller/quantized model; an MoE |
| Fine short, painful on long inputs | Prefill scales with prompt size | Compute, or trim/chunk the context |

The one-line version: **generation speed is bought with memory bandwidth; prompt-processing speed is bought with compute.** Know which half your workload leans on, and you'll never be surprised by a benchmark again.

## Sources & how we researched this

This explainer synthesizes the primary literature on transformer inference efficiency — Pope et al., ["Efficiently Scaling Transformer Inference"](https://arxiv.org/abs/2211.05102?ref=vettedconsumer.com) (2022) for the memory-bandwidth-bound nature of decode; Agrawal et al., [SARATHI](https://arxiv.org/abs/2308.16369?ref=vettedconsumer.com) (2023) for the compute-bound-prefill vs memory-bound-decode distinction; and Patel et al., [Splitwise](https://arxiv.org/abs/2311.18677?ref=vettedconsumer.com) (2023) for production systems that physically separate the two phases. The owner measurements are from [r/LocalLLaMA](https://redlib.catsarch.com/r/LocalLLaMA/comments/1odk11r/strix_halo_vs_dgx_spark_initial_impressions_long/?ref=vettedconsumer.com), linked so you can verify; we have not benchmarked these machines first-hand. The tokens-per-second figures are theoretical bandwidth ceilings (bandwidth ÷ model bytes), rounded; real throughput is lower due to overhead.

## Related guides

- [The KV cache, explained](https://vettedconsumer.com/the-kv-cache-explained-why-long-context-eats-your-vram-and-how-to-fit-more/) (why long prompts cost memory)
- [Mixture-of-Experts, explained](https://vettedconsumer.com/mixture-of-experts-moe-explained-why-active-parameters-decide-what-runs-on-your-machine/) (why MoE decodes fast)
- [Strix Halo vs DGX Spark, according to owners of both](https://vettedconsumer.com/strix-halo-vs-dgx-spark-running-70b-locally-according-to-people-who-own-both/)
