Strix Halo vs DGX Spark: Running 70B Locally, According to People Who Own Both

Two machines with 128 GB of unified memory — the AMD Strix Halo and the NVIDIA DGX Spark — are being compared by owners who run 70B-parameter language models locally. AI developer u/Eugr benchmarked both and found token generation speeds nearly identical, but prompt processing on the DGX Spark was 2–5 times faster due to NVIDIA's superior compute throughput. The Strix Halo costs roughly half the price at $1,500–$2,000, making the choice depend on whether users prioritize long-context tasks or simple chat workloads.

They look like the same machine on paper: a tiny box with 128 GB of unified memory built to run large language models locally. So the question buyers keep asking on r/LocalLLaMA isn't "is unified memory good"; it's which one actually runs a 70B model better, and is NVIDIA's premium worth roughly double the price?

Rather than guess, we pulled together reports from people who own both the AMD Strix Halo (Ryzen AI Max+ 395) and the NVIDIA DGX Spark, including one AI developer who benchmarked them side by side. Here's what they found.

The two machines

Both pack 128 GB of unified memory, which is what lets either one hold a 70B-class model or a big MoE model that won't fit on a normal 16–24 GB GPU.

- NVIDIA DGX Spark: Grace-Blackwell GB10 silicon, 128 GB unified, the full CUDA stack, and a 200 Gbps QSFP network interface for clustering two of them. Roughly $3,000–$4,000. See our deeper dive: the DGX Spark, according to people who own one (https://vettedconsumer.com/the-nvidia-dgx-spark-according-to-the-people-who-own-one/).
- AMD Strix Halo: the Ryzen AI Max+ 395 with Radeon 8060S, 128 GB of LPDDR5X, sold in mini PCs like the GMKtec EVO-X2 (https://www.amazon.com/s?k=GMKtec+EVO-X2&tag=57eqvt-20&ref=vettedconsumer.com), Framework Desktop, and Beelink GTR9 Pro, typically $1,500–$2,000.

Head-to-head, from someone who owns both

The most useful single account comes from u/Eugr, an AI developer who bought a Strix Halo (GMKtec EVO-X2 128GB) and later a DGX Spark and benchmarked them. His side-by-side impressions (https://www.reddit.com/r/LocalLLaMA/comments/1odk11r/?ref=vettedconsumer.com) are blunt and specific:

"Inference-wise, token generation is nearly identical to Strix Halo both in llama.cpp and vLLM — but prompt processing is 2–5x higher on the DGX Spark. Strix Halo prompt-processing performance degrades much faster with context." (u/Eugr, r/LocalLLaMA)

That quote is the whole decision.
For raw token generation, they're a wash: if you just want a 70B model answering at a few tokens/sec, a Strix Halo box does it for half the money.

Where the DGX Spark pulls ahead is prompt processing, the "time to first token" when you feed it a long document or a big agent context.

If your workload is long-context RAG, agentic loops, or fine-tuning experiments, that 2–5x matters a lot. If you're mostly chatting with short prompts, you'll barely notice it.

Why the benchmarks look that way: the research behind it

Those owner numbers aren't random; they follow directly from a well-established principle in LLM-inference research. Running a model has two phases with opposite hardware demands.

- Prefill (processing your prompt) is compute-bound: it's highly parallel, so it's limited by raw math throughput.
- Decode (generating each new token) is memory-bandwidth-bound: to produce even one token the hardware must stream the entire model's weights through memory, so it's limited by memory speed, not compute (Agrawal et al., "SARATHI," 2023, https://arxiv.org/abs/2308.16369?ref=vettedconsumer.com; Towards Data Science, https://towardsdatascience.com/prefill-is-compute-bound-decode-is-memory-bound-why-your-gpu-shouldnt-do-both/?ref=vettedconsumer.com).

Map that onto the two boxes and the owner results snap into focus:

- Token generation (decode) was a tie because it's memory-bound and the two machines have similar memory bandwidth: roughly 256 GB/s on Strix Halo's LPDDR5X versus ~273 GB/s on the DGX Spark. Similar bandwidth → similar tokens/sec. It's also why a Mac Studio M3 Ultra, at ~800 GB/s, pulls ahead on decode for big models.
- Prompt processing (prefill) favored the DGX Spark by 2–5x because it's compute-bound, and NVIDIA's Blackwell tensor cores have far more raw compute than Strix Halo's RDNA 3.5 iGPU. The same principle explains why Strix Halo's prefill "degrades faster with context": a longer prompt is simply more prefill compute, exactly where it's weakest.
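To make the "similar bandwidth → similar tokens/sec" point concrete, here is a minimal back-of-the-envelope sketch. It assumes a dense model whose weights are all read once per generated token, and it uses the bandwidth figures quoted above plus the article's ~40 GB figure for a 4-bit 70B model. The function name and the dictionary are illustrative, not taken from any benchmark tool, and the result is a theoretical ceiling rather than a prediction.

```python
# Rough ceiling on decode speed for a dense model: every generated token
# requires streaming all weights through memory once, so
#   tokens/sec <= memory bandwidth / model size.
# Real throughput lands below this (KV-cache reads, kernel overheads, etc.).


def decode_ceiling_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Return the bandwidth-bound upper limit on tokens per second."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Approximate bandwidth figures as quoted in the article (GB/s).
MACHINES_GB_S = {
    "Strix Halo": 256.0,
    "DGX Spark": 273.0,
    "Mac Studio M3 Ultra": 800.0,
}

# The article's ~40 GB estimate for a 70B model at 4-bit quantization.
MODEL_SIZE_GB = 40.0

for name, bandwidth in MACHINES_GB_S.items():
    ceiling = decode_ceiling_tokens_per_sec(bandwidth, MODEL_SIZE_GB)
    print(f"{name}: at most ~{ceiling:.1f} tokens/sec")
```

Under these assumptions the two boxes land within half a token per second of each other (roughly 6 to 7 tokens/sec), while the Mac Studio's ceiling is about 20. That matches the "few tokens/sec" and "nearly identical" owner reports in shape, though not necessarily in exact value. MoE models only read their active experts per token, so this dense-model bound understates what they can reach.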
The practical takeaway: if your workload is short prompts and lots of generation (chat), bandwidth rules and the cheaper Strix Halo box keeps pace. If it's long prompts, RAG, or agents (heavy prefill), you're paying for compute, the DGX Spark's home turf. The serving engines both owners used, like vLLM (PagedAttention, Kwon et al., 2023, https://arxiv.org/abs/2309.06180?ref=vettedconsumer.com), are built around exactly this prefill/decode split.

What each actually fits (capacity is a tie)

A point that trips up a lot of buyers: at 128 GB, both machines fit the same models, so capacity is not the differentiator here; speed is.

The quick math: a model's memory footprint is roughly parameters × bytes-per-weight. At the common 4-bit (Q4) quantization (~0.5 bytes/param), a 70B model needs ~40 GB and a 120B-class MoE model ~60 GB; both fit comfortably inside 128 GB with room for context.

Capacity only becomes the deciding factor at the extremes: a 512 GB Mac Studio can hold a 400B-class model neither of these boxes can touch. So if your target models top out around 70–120B, choose on speed and ecosystem (above); if you need to hold something enormous in a single box, that's a different unified-memory (https://vettedconsumer.com/tag/unified-memory-ai/) conversation.

Beyond speed: ecosystem, I/O, and the catch

Two more differences owners stress.

First, CUDA: on the DGX Spark "the whole ecosystem just works" (vLLM, TRT-LLM, fine-tuning libraries), whereas Strix Halo runs on ROCm/Vulkan, which has improved fast but still throws the occasional compatibility wall.

Second, I/O: Eugr notes the DGX Spark is "the most minimalist mini-PC I've ever used" (a single M.2 2242 slot, no USB4, one HDMI), but it has that 200 Gbps networking for linking two units. Strix Halo boxes are far more expandable (multiple M.2, USB4) but cap out at 10 GbE.
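Returning briefly to the capacity arithmetic (parameters × bytes-per-weight), a small sketch of that rule of thumb follows. The function names and the 16 GB reserve for context and the operating system are hypothetical choices made for illustration; only the ~0.5 bytes/param figure and the model and memory sizes come from the article.

```python
# Weights-only footprint estimate: parameters x bytes per weight.
# With parameters in billions and 1 GB taken as 1e9 bytes, the factors of
# 1e9 cancel, so the product is already in GB.


def weight_footprint_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate size of the model weights in GB."""
    return params_billions * bytes_per_param


def fits(params_billions: float, bytes_per_param: float,
         memory_gb: float, reserve_gb: float = 16.0) -> bool:
    """True if weights plus a reserve fit in memory.

    reserve_gb is an assumed allowance for KV cache, OS, and runtime;
    16 GB is an illustrative guess, not a measured figure.
    """
    return weight_footprint_gb(params_billions, bytes_per_param) + reserve_gb <= memory_gb


Q4_BYTES_PER_PARAM = 0.5  # the article's rule of thumb for 4-bit quantization

for params in (70, 120, 400):
    size = weight_footprint_gb(params, Q4_BYTES_PER_PARAM)
    print(f"{params}B @ Q4: ~{size:.0f} GB weights, "
          f"fits in 128 GB: {fits(params, Q4_BYTES_PER_PARAM, 128)}, "
          f"fits in 512 GB: {fits(params, Q4_BYTES_PER_PARAM, 512)}")
```

The raw product for a 70B model is 35 GB, a little under the article's ~40 GB; real 4-bit formats also store scaling metadata alongside the weights, so the higher figure is the safer planning number. Either way, the 70B and 120B cases fit in 128 GB and the 400B case does not, which is the article's point about capacity being a tie between these two boxes.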
And the honest counterweight, because the DGX Spark isn't magic: a widely-upvoted owner report titled "Disappointed by DGX Spark" (https://www.reddit.com/r/LocalLLaMA/comments/1oo6226/?ref=vettedconsumer.com) sums up the skeptic case:

"128GB shared RAM still underperforms running Qwen 30B with context on vLLM. For $5k, the 3090 is still king if you value raw speed over design — won't replace my Mac anytime soon." (u/RockstarVP, r/LocalLLaMA)

The reality check both camps agree on: neither box is a substitute for a real GPU on raw speed. As u/No-Refrigerator-1672 put it in that thread, "one glance over the specs is enough to understand it won't outperform real GPUs — the niche is incredibly small." These machines win on capacity (fitting big models cheaply and quietly), not peak throughput.

Who should buy which

- Buy a Strix Halo box (EVO-X2 / Framework Desktop / GTR9 Pro) if you want the cheapest way to run 70B-class and big MoE models locally, you mostly do short-to-medium-context inference, and you value price, quiet, and expandability. As one r/LocalLLaMA owner's cost analysis showed, a DIY equivalent of a 128 GB Strix Halo board ran ~$2,240, more than the prebuilt (https://www.reddit.com/r/LocalLLaMA/comments/1nozz23/?ref=vettedconsumer.com), so the value is real.
- Buy the DGX Spark if you live in the NVIDIA/CUDA ecosystem, you need fast prompt processing for long-context or agentic work, you want the option to cluster two units over 200 Gbps, and the ~2x price premium is justified by your time.
- Buy neither if raw tokens/sec is all you care about: a used 3090/4090 or a Mac Studio with more bandwidth may serve you better.

Price, power, and the cloud alternative

The other half of the decision is what you pay to own and run it. A Strix Halo box (EVO-X2, Framework Desktop) runs roughly $1,500–$2,000; a DGX Spark is closer to $3,000–$4,000, and owners note NVIDIA has nudged that price up since launch.
Both sip power next to a multi-GPU rig (low hundreds of watts under load rather than a 1 kW space heater), which genuinely matters for a machine that may run all day.

And it's worth naming the third option: if you only need a 70B model occasionally, renting a cloud GPU by the hour can beat either purchase. The rough rule is "daily use → buy, bursty use → rent." For steady daily inference, a one-time box wins, and the Strix Halo route gets you there for the least money, while the DGX Spark premium only pays off for the CUDA-bound, prefill-heavy work it's built for.

The bottom line

Same 128 GB, same "fits a 70B model" headline, but the owners who've run both are clear: token generation is a tie, and the DGX Spark's real edge is prompt processing and CUDA, paid for with roughly double the price and less expandability. For most people getting into local LLMs, a Ryzen AI Max+ 395 box (https://www.amazon.com/s?k=Ryzen+AI+Max+395+mini+PC&tag=57eqvt-20&ref=vettedconsumer.com) is the value-maximizing entry point; the DGX Spark earns its premium only for CUDA-bound, long-context, or multi-node workloads.

Sources & how we researched this

This guide aggregates real owner reports. We have not tested these machines first-hand; everything below is owner sentiment and owner-run benchmarks, linked so you can verify. We prioritized accounts from people who own both devices, and we deliberately include critical threads for balance.
- u/Eugr, "Strix Halo vs DGX Spark — Initial Impressions" (https://www.reddit.com/r/LocalLLaMA/comments/1odk11r/?ref=vettedconsumer.com): owns both; token-gen and prompt-processing benchmarks
- u/RockstarVP, "Disappointed by DGX Spark" (https://www.reddit.com/r/LocalLLaMA/comments/1oo6226/?ref=vettedconsumer.com): critical, for balance
- u/simracerman, "The Ryzen AI MAX+ 395 is a true unicorn" (https://www.reddit.com/r/LocalLLaMA/comments/1nozz23/?ref=vettedconsumer.com): Strix Halo value / cost analysis

Technical & research sources (the principles behind the numbers):

- Agrawal et al., "SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills" (https://arxiv.org/abs/2308.16369?ref=vettedconsumer.com), 2023: prefill is compute-bound, decode is memory-bound.
- Kwon et al., "Efficient Memory Management for LLM Serving with PagedAttention" (https://arxiv.org/abs/2309.06180?ref=vettedconsumer.com), vLLM, 2023: the serving engine both owners benchmarked.
- "Prefill Is Compute-Bound, Decode Is Memory-Bound" (https://towardsdatascience.com/prefill-is-compute-bound-decode-is-memory-bound-why-your-gpu-shouldnt-do-both/?ref=vettedconsumer.com): a plain-English explainer of the split.