# Bandwidth, Not TFLOPS: What Sets Your Local LLM Speed (and Why the Newest Card Isn't Always Fastest)

> Source: <https://vettedconsumer.com/bandwidth-not-tflops-what-sets-your-local-llm-speed-and-why-the-newest-card-isnt-always-fastest/>
> Published: 2026-06-30 13:00:00+00:00

Here is a result that should not happen if you believe GPU spec sheets. The AMD RX 7900 XTX has 122 TFLOPS of FP16 compute, more than three times an old RTX 3090's 36. Yet owners report the 3090 generating text at about 95 tokens per second on a standard 8B model, and the 7900 XTX at around 39. A Mac with roughly 7 TFLOPS, a fraction of either, keeps pace at 30. The number on the box (TFLOPS) predicted the wrong winner.

We pulled owner-submitted benchmarks for 16 GPUs and Macs to find the spec that does predict local LLM speed. The short answer: memory bandwidth, not compute. The longer answer has a catch that can cost you money, and it is the part most "just buy more bandwidth" advice skips. We have not tested these cards first-hand; every speed below is a real owner's benchmark, linked, and we explain exactly how much to trust each one.

## Why generation speed is a memory problem, not a compute problem

To generate one token, a model has to read every one of its weights out of memory, run them through the math, and produce the next token. Then it does the whole thing again for the next token. A 5 GB quantized 8B model means moving about 5 GB from memory into the chip for every single token.

So the bottleneck is not how fast the chip can multiply (TFLOPS). It is how fast it can *read memory* (bandwidth, measured in GB/s). The compute units mostly sit idle, waiting for weights to arrive. This is what "memory-bandwidth-bound" means, and it is why token generation behaves so differently from the prompt-processing phase, which we break down in [prompt processing vs generation](https://vettedconsumer.com/prompt-processing-vs-generation-why-your-box-is-fast-at-one-and-slow-at-the-other/).

This is not our theory. It is the math our own [calculator](https://vettedconsumer.com/can-i-run-it/) runs to estimate speed: it divides a machine's memory bandwidth by the bytes it has to read per token. The rough rule is **tokens per second is about equal to memory bandwidth divided by the model size in memory**. Faster memory, more tokens. More TFLOPS, no change.
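The rough rule is simple enough to write out. The snippet below is a minimal sketch of that division, not the calculator's actual code; the function name is our own, the model size is the roughly 5 GB figure used above, and the bandwidths are spec-sheet numbers quoted elsewhere in this article.

```python
def estimate_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough decode-speed ceiling: one full read of the weights per token."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Assumption: a ~5 GB quantized 8B model, as in the example above.
MODEL_SIZE_GB = 5.0

for name, bandwidth in [("RTX 3060 12GB", 360), ("RTX 4080 16GB", 717), ("RTX 3090", 936)]:
    ceiling = estimate_tokens_per_second(bandwidth, MODEL_SIZE_GB)
    print(f"{name}: about {ceiling:.0f} tok/s at most")
```

Treat the result as an upper bound. It ignores the KV cache, kernel overhead, and everything else that keeps real hardware below its theoretical read rate, so measured speeds land well under it.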

## The clean test: NVIDIA's mature lineup

The cleanest way to see this is within one mature software stack, where the drivers are not a variable. Here are owner-measured generation speeds for Llama 3.1 8B (Q4_K_M) on NVIDIA's Ampere and Ada cards, from [LocalScore](https://www.localscore.ai/model/1?ref=vettedconsumer.com), a crowd-sourced benchmark where owners run the same standardized test on their own machines. Memory bandwidth and the standard FP16 figure come from [TechPowerUp](https://www.techpowerup.com/?ref=vettedconsumer.com).

| GPU | Memory bandwidth | FP16 TFLOPS | Owner tok/s (8B Q4) |
|---|---|---|---|
| RTX 4060 Ti 16GB | 288 GB/s | 22 | 48 |
| RTX 3060 12GB | 360 GB/s | 13 | 52 |
| RTX 4070 12GB | 504 GB/s | 29 | 76 |
| RTX 4080 16GB | 717 GB/s | 49 | 88 |

Order the cards by bandwidth and you order them by speed. The rank correlation between bandwidth and tokens per second across these six cards is **0.93**, near perfect. Now look at the TFLOPS column and try to find the same pattern. The RTX 3090 has *less* compute than the RTX 4080 (36 vs 49 TFLOPS) but generates *faster*, because it has more memory bandwidth (936 vs 717). The RTX 4090 has nearly double the 4080's TFLOPS and is not close to double the speed, because its bandwidth is only about 40% higher. Compute is along for the ride. Bandwidth is driving.
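For readers who want to check a rank correlation themselves, here is a minimal sketch of Spearman's coefficient in plain Python. It uses only the four rows listed in the table above, so it will not reproduce the six-card 0.93 figure, and the simple formula it relies on assumes no tied values. The helper names are our own.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation, using the no-ties shortcut formula."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Rows from the table above: 4060 Ti 16GB, 3060 12GB, 4070 12GB, 4080 16GB.
bandwidth = [288, 360, 504, 717]
tflops = [22, 13, 29, 49]
tok_s = [48, 52, 76, 88]

print(f"bandwidth vs tok/s: {spearman(bandwidth, tok_s):.2f}")
print(f"TFLOPS vs tok/s:    {spearman(tflops, tok_s):.2f}")
```

On these four rows the bandwidth ordering matches the speed ordering exactly, while the TFLOPS ordering does not, because the 4060 Ti has more compute than the 3060 but generates more slowly.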

This is also why a used 3090, with its 936 GB/s, remains a value pick for local AI: it out-generates every new mid-range card on this list, for less money. The full sortable table is on our [all-hardware chart](https://vettedconsumer.com/all-hardware/) (sort by bandwidth and watch the speed ranking follow).

## Where TFLOPS gets exposed

Step outside one vendor and the compute number falls apart completely. Across all 16 devices we checked, the correlation between TFLOPS and generation speed is **0.09**, statistically nothing. Bandwidth still leads at 0.54, even with everything mixed together. The proof points are the extremes:

- The **7900 XTX** has the most FP16 compute in the entire set (122 TFLOPS) and one of the lowest generation speeds (~39 tok/s).
- A **Mac with about 7 TFLOPS** generates at ~30 tok/s, keeping up with discrete cards that have five to fifteen times its compute, because its unified memory bandwidth (273 GB/s) is in the same league.

If TFLOPS decided generation speed, neither of those could be true. Memory bandwidth explains both.

## The catch: bandwidth is a ceiling, not a promise

Here is the part that can cost you money, and the reason we will not just tell you to "buy the card with the most GB/s." Bandwidth sets the *ceiling* on generation speed. A card only reaches that ceiling if its software stack is mature enough to use the memory efficiently. When we measure how much of their bandwidth each device converts into tokens, the gaps are large, and they line up with how new or how well-supported the platform is.

| GPU | Memory bandwidth | Owner tok/s (8B Q4) | Why it falls short of its bandwidth |
|---|---|---|---|

Read that top row again. The RTX 5090 has the most memory bandwidth of any card in this test, nearly double a 3090's, and owners currently measure it generating an 8B model *slower* than that 3090 (66 vs 95 tok/s). Its bandwidth predicts something like 120 to 130 tok/s. The shortfall is not the silicon; it is that the software to drive brand-new Blackwell cards at full speed was not ready when owners ran the test. It will improve. But the lesson stands: **the newest, highest-bandwidth card is not automatically the fastest on day one.** A mature platform at 936 GB/s can beat an immature one at 1792 GB/s.
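One way to put a number on "how much of its bandwidth a device converts into tokens" is to divide the owner-measured speed by the naive ceiling of bandwidth over model size. The sketch below does that for three devices using figures quoted in this article. The function name and the 5 GB model size are our assumptions, and the naive ceiling is a theoretical maximum, not the realistic expectation quoted above.

```python
def bandwidth_utilization(measured_tok_s: float, bandwidth_gb_s: float,
                          model_size_gb: float = 5.0) -> float:
    """Fraction of the naive bandwidth ceiling that a measured speed reaches."""
    ceiling_tok_s = bandwidth_gb_s / model_size_gb
    return measured_tok_s / ceiling_tok_s


# (memory bandwidth in GB/s, owner-reported tok/s) as quoted in this article.
devices = {
    "RTX 3090": (936, 95),
    "RTX 5090": (1792, 66),
    "Mac, 273 GB/s unified memory": (273, 30),
}

for name, (bandwidth, tok_s) in devices.items():
    share = bandwidth_utilization(tok_s, bandwidth)
    print(f"{name}: {share:.0%} of the naive ceiling")
```

Under these assumptions the 3090 and the Mac each reach roughly half of their ceiling, while the 5090 reaches under a fifth, which is the software gap described above expressed as a ratio.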

This is why the owner numbers matter more than the spec sheet. The spec sheet shows the ceiling; only real owners running the model show what the ceiling is worth today. It is also why we report ranges and flag single submissions rather than one tidy figure: these are crowd-sourced runs on different drivers, builds, and systems, and the same card can vary by 1.5x to 2x between submissions.

## Where the expensive cards earn their price

None of this means a cheap card is all you need. An 8B model that fits in any card's memory is the easiest possible case, and it is exactly the case where bandwidth alone decides and price barely matters. Change the workload and the expensive cards pull decisively ahead, for two reasons the generation number hides.

**Prompt processing is compute-bound, and that is where TFLOPS lives.** Reading a long prompt (prefill) is a different job from generating a reply, and it scales with compute. On the same benchmark, an RTX 4090 chews through a prompt at about 6,600 tokens per second and a 5090 around 6,300, while a Mac manages roughly 300 to 1,100. That is a gap of roughly six to twenty times. If you feed long documents, run RAG, or drive agents that re-read big contexts, prompt speed sets your time-to-first-token, and there the high-TFLOPS cards win outright.
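To see what that gap means in waiting time, divide the prompt length by the prefill speed. The sketch below does this for a hypothetical 8,000-token prompt, using the prompt-processing speeds quoted above. The prompt length is our own illustrative choice, and the estimate assumes prefill speed stays constant across the whole prompt, which real runs only approximate.

```python
def prefill_seconds(prompt_tokens: int, prefill_tok_s: float) -> float:
    """Approximate time to first token: prompt length over prefill speed."""
    if prefill_tok_s <= 0:
        raise ValueError("prefill_tok_s must be positive")
    return prompt_tokens / prefill_tok_s


PROMPT_TOKENS = 8000  # hypothetical long-document prompt

# Prompt-processing speeds (tok/s) as quoted in the paragraph above.
prefill_speeds = [
    ("RTX 4090", 6600),
    ("RTX 5090", 6300),
    ("Mac, high end of range", 1100),
    ("Mac, low end of range", 300),
]

for name, speed in prefill_speeds:
    print(f"{name}: about {prefill_seconds(PROMPT_TOKENS, speed):.1f} s before the first token")
```

With these inputs the discrete cards start answering in a little over a second, while the slowest Mac figure implies a wait of well over twenty seconds for the same prompt.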

**Capacity decides whether the model runs at all.** Bandwidth is a speed question only after the model fits. A 70B model or a long-context KV cache needs the kind of memory that a 32GB RTX 5090, a big unified-memory Mac, or a 128GB Strix Halo box has and a 12GB card does not. The moment a model spills off a small card into system RAM, its speed collapses far below anything in these tables. We cover that sizing in [how much VRAM you need for a 70B](https://vettedconsumer.com/how-much-vram-do-you-actually-need-to-run-a-70b-model-locally/) and the memory cost in [the KV cache explainer](https://vettedconsumer.com/the-kv-cache-explained-why-long-context-eats-your-vram-and-how-to-fit-more/).

## What this means for your next card

| If your goal is... | Buy for... |
|---|---|
| Fastest generation on a model that fits | Memory bandwidth, on a mature stack. A used RTX 3090 (936 GB/s, CUDA) beats most new mid-range cards. Give brand-new architectures and AMD/Intel a few months for the software to catch up. |
| Running big models or long context at all | Capacity first. VRAM or unified memory decides what loads; a 32GB card or a 64GB-plus unified box before raw speed. |
| Long prompts, RAG, or agents | Compute and bandwidth together. Prefill is TFLOPS-bound, so a high-end card's prompt speed is the real win here. |
| Best value per token | The highest bandwidth you can get used, on CUDA, that fits your model. Check current prices on the |

Do not buy a card off a single number, in either direction. Bandwidth predicts generation speed better than anything else on the spec sheet, but only the model you run, on the software that exists today, tells you the real figure. Put your exact GPU or Mac, model, and quant into the [Can I run it? calculator](https://vettedconsumer.com/can-i-run-it/) to see the bandwidth-based estimate, and the [quant picker](https://vettedconsumer.com/quant-picker/) to keep the model small enough that bandwidth, not capacity, stays the only thing that matters.

## Sources and how we researched this

We have not tested these devices first-hand. Generation and prompt speeds are owner-submitted benchmarks from [LocalScore](https://www.localscore.ai/model/1?ref=vettedconsumer.com), a crowd-sourced test where owners run the same standardized Llama 3.1 8B (Q4_K_M) workload on their own hardware; each figure is a real submission, and we verified it is the generation (decode) rate, not the prompt-processing rate, before using it. Memory bandwidth and FP16 figures are from [TechPowerUp](https://www.techpowerup.com/?ref=vettedconsumer.com) and, for Macs, [Apple](https://www.apple.com/newsroom/2024/10/apple-introduces-m4-pro-and-m4-max/?ref=vettedconsumer.com). Three caveats we held ourselves to: the FP16 numbers are the standard dense figures, not the inflated tensor or sparse marketing numbers a spec sheet leads with; these are single crowd-sourced runs that vary 1.5x to 2x by driver, build, and system, so treat them as representative, not exact; and every figure is for one small model that fits entirely in memory, which is the specific case where generation is purely bandwidth-bound. Scale the model, the context, or the batch and compute and capacity reassert themselves.

## Related guides

- [Prompt processing vs generation](https://vettedconsumer.com/prompt-processing-vs-generation-why-your-box-is-fast-at-one-and-slow-at-the-other/), why the same box is fast at one and slow at the other
- [How much VRAM you need for a 70B](https://vettedconsumer.com/how-much-vram-do-you-actually-need-to-run-a-70b-model-locally/)
- [Mixture-of-Experts, explained](https://vettedconsumer.com/mixture-of-experts-moe-explained-why-active-parameters-decide-what-runs-on-your-machine/), how active parameters change the bandwidth math
- [All hardware, one sortable chart](https://vettedconsumer.com/all-hardware/), sort by bandwidth and watch speed follow
