# Strix Halo vs DGX Spark: Running 70B Locally, According to People Who Own Both

> Source: <https://vettedconsumer.com/strix-halo-vs-dgx-spark-running-70b-locally-according-to-people-who-own-both/>
> Published: 2026-06-06 20:11:17+00:00

They look like the same machine on paper: a tiny box with **128 GB of unified memory** built to run large language models locally. So the question buyers keep asking on r/LocalLLaMA isn't "is unified memory good" — it's **which one actually runs a 70B model better, and is NVIDIA's premium worth roughly double the price?** Rather than guess, we pulled together reports from people who own both the **AMD Strix Halo** (Ryzen AI Max+ 395) and the **NVIDIA DGX Spark** — including one AI developer who benchmarked them side by side. Here's what they found.

## The two machines

Both pack 128 GB of unified memory, which is what lets either one hold a 70B-class model (or a big MoE model) that won't fit on a normal 16–24 GB GPU.

- **NVIDIA DGX Spark**: Grace-Blackwell (GB10) silicon, 128 GB unified memory, the full **CUDA** stack, and a 200 Gbps QSFP network interface for clustering two of them. Roughly **$3,000–$4,000**. (See our deeper dive: [the DGX Spark, according to people who own one](https://vettedconsumer.com/the-nvidia-dgx-spark-according-to-the-people-who-own-one/).)
- **AMD Strix Halo**: the Ryzen AI Max+ 395 with Radeon 8060S, 128 GB of LPDDR5X, sold in mini PCs like the [GMKtec EVO-X2](https://www.amazon.com/s?k=GMKtec+EVO-X2&tag=57eqvt-20&ref=vettedconsumer.com), Framework Desktop, and Beelink GTR9 Pro, typically **$1,500–$2,000**.

## Head-to-head, from someone who owns both

The most useful single account comes from u/Eugr, an AI developer who bought a Strix Halo (GMKtec EVO-X2 128GB) and later a DGX Spark and benchmarked them. His [side-by-side impressions](https://www.reddit.com/r/LocalLLaMA/comments/1odk11r/?ref=vettedconsumer.com) are blunt and specific:

"Inference-wise, token generation is nearly identical to Strix Halo both in llama.cpp and vLLM — but prompt processing is 2–5x higher [on the DGX Spark]. Strix Halo prompt-processing performance degrades much faster with context." — u/Eugr, r/LocalLLaMA

That quote is the whole decision. **For raw token generation, they're a wash**: if you just want a 70B model answering at a few tokens/sec, a Strix Halo box does it for roughly half the money. **Where the DGX Spark pulls ahead is prompt processing**, which sets the "time to first token" when you feed it a long document or a big agent context. If your workload is long-context RAG, agentic loops, or fine-tuning experiments, that 2–5x matters a lot. If you're mostly chatting with short prompts, you'll barely notice it.

## Why the benchmarks look that way — the research behind it

Those owner numbers aren't random; they follow directly from a well-established principle in LLM-inference research. Running a model has two phases with *opposite* hardware demands. **Prefill** (processing your prompt) is **compute-bound** — it's highly parallel, so it's limited by raw math throughput. **Decode** (generating each new token) is **memory-bandwidth-bound** — to produce even one token the hardware must stream the entire model's weights through memory, so it's limited by memory speed, not compute ([Agrawal et al., "SARATHI," 2023](https://arxiv.org/abs/2308.16369?ref=vettedconsumer.com); [Towards Data Science](https://towardsdatascience.com/prefill-is-compute-bound-decode-is-memory-bound-why-your-gpu-shouldnt-do-both/?ref=vettedconsumer.com)).
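
To make the prefill/decode split concrete, here is a rough sketch of "arithmetic intensity": how many math operations the hardware gets to do for each byte of weights it reads. The model is deliberately simplified and rests on our own assumptions, not on anything the owners measured: every weight is read once per forward pass, each weight costs about 2 FLOPs per token, weights are stored at 0.5 bytes each (4-bit), and attention and KV-cache traffic are ignored. The 4,096-token prompt length is an arbitrary example value.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_weight: float = 0.5) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Simplified model (an assumption, not a measurement): each weight is read
    once per pass and contributes ~2 FLOPs (a multiply and an add) for every
    token processed in that pass. Attention and KV-cache traffic are ignored.
    """
    flops_per_weight = 2 * tokens_per_pass
    return flops_per_weight / bytes_per_weight


# Decode handles one new token per pass; prefill handles the whole prompt at once.
decode = arithmetic_intensity(tokens_per_pass=1)
prefill = arithmetic_intensity(tokens_per_pass=4096)  # arbitrary example prompt length

print(f"decode:  {decode:.0f} FLOPs per byte of weights")
print(f"prefill: {prefill:.0f} FLOPs per byte of weights")
```

Under these assumptions, decode does only a handful of operations per byte fetched, so it spends its time waiting on memory, while prefill reuses each fetched weight across thousands of tokens and is limited by how fast the chip can do math.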

Map that onto the two boxes and the owner results snap into focus:

- **Token generation (decode) was a tie** because it's memory-bound and the two machines have similar memory bandwidth: roughly **256 GB/s** on Strix Halo's LPDDR5X versus **~273 GB/s** on the DGX Spark. Similar bandwidth → similar tokens/sec. (It's also why a Mac Studio M3 Ultra, at ~800 GB/s, pulls *ahead* on decode for big models.)
- **Prompt processing (prefill) favored the DGX Spark by 2–5x** because it's compute-bound, and NVIDIA's Blackwell tensor cores have far more raw compute than Strix Halo's RDNA 3.5 iGPU. The same principle explains why Strix Halo's prefill "degrades faster with context": a longer prompt is simply more prefill compute, exactly where it's weakest.
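
The decode tie can be sanity-checked with back-of-the-envelope math. If every generated token requires streaming all the weights once, then memory bandwidth divided by model size gives a ceiling on tokens per second. The sketch below uses the bandwidth figures quoted above and assumes a ~40 GB model (a 70B model at 4-bit). It is an upper bound under that simplification, not a benchmark; real throughput will be lower.

```python
def decode_ceiling_tok_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed if each token streams all weights once."""
    return bandwidth_gb_s / model_size_gb


MODEL_GB = 40.0  # assumed footprint of a 70B model at 4-bit quantization

# Approximate memory bandwidth in GB/s, as quoted in this article.
machines = {
    "Strix Halo": 256.0,
    "DGX Spark": 273.0,
    "Mac Studio M3 Ultra": 800.0,
}

ceilings = {
    name: decode_ceiling_tok_per_s(bandwidth, MODEL_GB)
    for name, bandwidth in machines.items()
}

for name, ceiling in ceilings.items():
    print(f"{name}: at most ~{ceiling:.1f} tokens/s")
```

On these numbers the two 128 GB boxes land within about 7% of each other, which matches the "nearly identical" token generation Eugr reported, and both sit in the "few tokens/sec" range for a dense 70B model.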

The practical takeaway: **if your workload is short prompts and lots of generation** (chat), bandwidth rules and the cheaper Strix Halo box keeps pace. **If it's long prompts, RAG, or agents** (heavy prefill), you're paying for compute — the DGX Spark's home turf. The serving engines both owners used, like [vLLM (PagedAttention, Kwon et al., 2023)](https://arxiv.org/abs/2309.06180?ref=vettedconsumer.com), are built around exactly this prefill/decode split.

## What each actually fits (capacity is a tie)

A point that trips up a lot of buyers: at 128 GB, **both machines fit the same models** — so capacity is *not* the differentiator here; speed is. The quick math: a model's memory footprint is roughly parameters × bytes-per-weight. At the common 4-bit (Q4) quantization (~0.5 bytes/param), a 70B model needs ~40 GB and a 120B-class MoE model ~60 GB — both fit comfortably inside 128 GB with room for context. Capacity only becomes the deciding factor at the extremes: a 512 GB Mac Studio can hold a 400B-class model neither of these boxes can touch. So if your target models top out around 70–120B, choose on speed and ecosystem (above); if you need to hold something enormous in a single box, that's a different [unified-memory](https://vettedconsumer.com/tag/unified-memory-ai/) conversation.
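
The "parameters × bytes-per-weight" rule is easy to run yourself. The sketch below covers weights only and assumes a flat 0.5 bytes per parameter; it leaves out the KV cache, runtime overhead, and whatever the operating system needs, so treat the results as a floor rather than a full budget.

```python
UNIFIED_MEMORY_GB = 128  # both machines in this comparison


def weights_gb(params_billions: float, bytes_per_weight: float) -> float:
    """Weight memory in GB: parameters x bytes per weight.

    Billions of parameters times bytes per parameter gives billions of
    bytes, which is GB (decimal), so no further unit conversion is needed.
    """
    return params_billions * bytes_per_weight


footprints = {
    "70B dense": weights_gb(70, 0.5),
    "120B-class MoE": weights_gb(120, 0.5),
}

for name, gb in footprints.items():
    headroom = UNIFIED_MEMORY_GB - gb
    print(f"{name}: ~{gb:.0f} GB of weights, ~{headroom:.0f} GB left for context and the OS")
```

The flat 0.5 bytes gives 35 GB for a 70B model. Real 4-bit formats usually store some extra scaling data per block of weights, which is one plausible reason the 70B figure above is rounded up to ~40 GB.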

## Beyond speed: ecosystem, I/O, and the catch

Two more differences owners stress. First, **CUDA**: on the DGX Spark "the whole ecosystem just works" — vLLM, TRT-LLM, fine-tuning libraries — whereas Strix Halo runs on ROCm/Vulkan, which has improved fast but still throws the occasional compatibility wall. Second, **I/O**: Eugr notes the DGX Spark is "the most minimalist mini-PC I've ever used" — a single M.2 (2242) slot, no USB4, one HDMI — but it has that **200 Gbps networking** for linking two units. Strix Halo boxes are far more expandable (multiple M.2, USB4) but cap out at 10 GbE.

And the honest counterweight, because the DGX Spark isn't magic — a widely-upvoted owner report titled ["Disappointed by DGX Spark"](https://www.reddit.com/r/LocalLLaMA/comments/1oo6226/?ref=vettedconsumer.com) sums up the skeptic case:

"128GB shared RAM still underperforms running Qwen 30B with context on vLLM. For $5k, the 3090 is still king if you value raw speed over design — won't replace my Mac anytime soon." — u/RockstarVP, r/LocalLLaMA

The reality check both camps agree on: neither box is a substitute for a real GPU on raw speed. As u/No-Refrigerator-1672 put it in that thread, "one glance over the specs is enough to understand it won't outperform real GPUs — the niche is incredibly small." These machines win on *capacity* (fitting big models cheaply and quietly), not peak throughput.

## Who should buy which

**Buy a Strix Halo box** (EVO-X2 / Framework Desktop / GTR9 Pro) if you want the cheapest way to run 70B-class and big MoE models locally, you mostly do short-to-medium-context inference, and you value price, quiet, and expandability. As one r/LocalLLaMA owner's cost analysis showed, a DIY equivalent of a 128 GB Strix Halo board ran [~$2,240 — more than the prebuilt](https://www.reddit.com/r/LocalLLaMA/comments/1nozz23/?ref=vettedconsumer.com), so the value is real.

**Buy the DGX Spark** if you live in the NVIDIA/CUDA ecosystem, you need fast prompt processing for long-context or agentic work, you want the option to cluster two units over 200 Gbps, and the ~2x price premium is justified by your time. **Buy neither** if raw tokens/sec is all you care about — a used 3090/4090 (or a Mac Studio with more bandwidth) may serve you better.

## Price, power, and the cloud alternative

The other half of the decision is what you pay to own and run it. A Strix Halo box (EVO-X2, Framework Desktop) runs roughly **$1,500–$2,000**; a DGX Spark is closer to **$3,000–$4,000** — and owners note NVIDIA has nudged that price up since launch. Both sip power next to a multi-GPU rig — low hundreds of watts under load rather than a 1 kW space heater — which genuinely matters for a machine that may run all day. And it's worth naming the third option: if you only need a 70B model *occasionally*, renting a cloud GPU by the hour can beat either purchase. The rough rule is "daily use → buy, bursty use → rent." For steady daily inference, a one-time box wins — and the Strix Halo route gets you there for the least money, while the DGX Spark premium only pays off for the CUDA-bound, prefill-heavy work it's built for.
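
The "daily use → buy, bursty use → rent" rule comes down to a break-even calculation: how many rented GPU-hours add up to the price of the box? The sketch below uses the upper end of each price range quoted above. The cloud rate is a placeholder we made up for illustration, not a quoted price, so substitute the rate you would actually pay. Electricity and resale value are ignored.

```python
def break_even_hours(box_price_usd: float, cloud_usd_per_hour: float) -> float:
    """Hours of cloud rental that cost as much as buying the box outright."""
    return box_price_usd / cloud_usd_per_hour


# HYPOTHETICAL rate, for illustration only. Replace with a real quote.
HYPOTHETICAL_CLOUD_USD_PER_HOUR = 2.00

# Upper end of the price ranges given in this article.
boxes = {"Strix Halo": 2000.0, "DGX Spark": 4000.0}

break_even = {
    name: break_even_hours(price, HYPOTHETICAL_CLOUD_USD_PER_HOUR)
    for name, price in boxes.items()
}

for name, hours in break_even.items():
    print(f"{name}: breaks even after ~{hours:.0f} rented hours")
```

If your expected usage over the life of the machine is well below the break-even figure, renting is likely cheaper; if you would blow past it within a few months, buying wins.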

## The bottom line

Same 128 GB, same "fits a 70B model" headline — but the owners who've run both are clear: **token generation is a tie, and the DGX Spark's real edge is prompt processing and CUDA, paid for with roughly double the price and less expandability.** For most people getting into local LLMs, a [Ryzen AI Max+ 395 box](https://www.amazon.com/s?k=Ryzen+AI+Max+395+mini+PC&tag=57eqvt-20&ref=vettedconsumer.com) is the value-maximizing entry point; the DGX Spark earns its premium only for CUDA-bound, long-context, or multi-node workloads.

## Sources & how we researched this

This guide aggregates real owner reports — we have not tested these machines first-hand; everything below is owner sentiment and owner-run benchmarks, linked so you can verify. We prioritized accounts from people who own *both* devices, and we deliberately include critical threads for balance.

- u/Eugr, ["Strix Halo vs DGX Spark — Initial Impressions"](https://www.reddit.com/r/LocalLLaMA/comments/1odk11r/?ref=vettedconsumer.com) (owns both; token-gen and prompt-processing benchmarks)
- u/RockstarVP, ["Disappointed by DGX Spark"](https://www.reddit.com/r/LocalLLaMA/comments/1oo6226/?ref=vettedconsumer.com) (critical, for balance)
- u/simracerman, ["The Ryzen AI MAX+ 395 is a true unicorn"](https://www.reddit.com/r/LocalLLaMA/comments/1nozz23/?ref=vettedconsumer.com) (Strix Halo value / cost analysis)

**Technical & research sources** — the principles behind the numbers:

- Agrawal et al., ["SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills"](https://arxiv.org/abs/2308.16369?ref=vettedconsumer.com) (2023): prefill is compute-bound, decode is memory-bound.
- Kwon et al., ["Efficient Memory Management for LLM Serving with PagedAttention"](https://arxiv.org/abs/2309.06180?ref=vettedconsumer.com) (vLLM, 2023): the serving engine both owners benchmarked.
- Towards Data Science, ["Prefill Is Compute-Bound, Decode Is Memory-Bound"](https://towardsdatascience.com/prefill-is-compute-bound-decode-is-memory-bound-why-your-gpu-shouldnt-do-both/?ref=vettedconsumer.com): a plain-English explainer of the split.
