{"slug": "two-sparks-one-cluster-why-stacking-nvidia-dgx-spark-units-unlocks-local-scale", "title": "Two Sparks, One Cluster: Why Stacking NVIDIA DGX Spark Units Unlocks Local Frontier-Scale Inference", "summary": "NVIDIA's DGX Spark workstation can be linked via a single 200 GbE cable to form a two-node AI cluster, aggregating 256 GB of unified memory to run frontier-scale models like Llama 3.1 405B locally. The setup uses RoCE and NCCL for high-throughput distributed inference, enabling models that exceed single-node capacity and providing additional KV-cache headroom for mid-size models.", "body_md": "# Two Sparks, One Cluster: Why Stacking NVIDIA DGX Spark Units Unlocks Local Frontier-Scale Inference\n\nThe NVIDIA DGX Spark put a Grace Blackwell superchip on the desk for the price of a high-end workstation. A single unit is already a capable local-inference box — 128 GB of unified memory, FP4 tensor cores, a full NVIDIA software stack. But the feature that quietly changes the platform's ceiling is the one most people skip past at unboxing: the pair of **ConnectX-7 200 GbE QSFP ports** on the back. Connect two Sparks through them and you stop owning two workstations and start owning a two-node AI cluster.\n\nThis post walks through what \"Spark Stacking\" actually does at the hardware and software level, and where it earns its keep.\n\n## The one cable that makes a cluster\n\nThere is no proprietary backplane and no switch involved in a two-node setup. Each DGX Spark carries an onboard NVIDIA ConnectX-7 SmartNIC running at 200 GbE, and you link two units with a single **200G QSFP56 passive Direct Attach Copper (DAC) cable**, 0.5 m long, plugged port-to-port. No transceivers, no SFP adapters — just direct copper between two boxes sitting side by side.\n\nThat simplicity is itself an advantage. 
The interconnect is a point-to-point **RoCE (RDMA over Converged Ethernet)** link, which gives the two GPUs a high-throughput, low-latency path for the collective operations that distributed inference depends on. NCCL — NVIDIA's collective communication library — runs its all-reduce and all-gather traffic straight over that 200 Gb/s link while MPI handles inter-process coordination on the CPU side.\n\nOne nuance worth understanding, because it shapes expectations: on the GB10 board the ConnectX-7 is wired as two PCIe Gen5 x4 links rather than a single x8. A single x4 link is roughly 100 Gb/s, so the NIC reaches the full 200 Gb/s by aggregating both x4 paths in multi-host mode. The practical takeaway is that a single cable on a single port can carry full bandwidth, and the OS will surface four logical interface names for the two physical ports (each port has two names). It's a quirk, not a limitation — but it's the kind of detail that separates a clean bring-up from an afternoon of debugging.\n\n## Advantage 1: You can run models that simply don't fit on one node\n\nThis is the headline reason to stack. A single Spark's 128 GB of unified memory already lets it hold models that would never fit in a standard GPU's VRAM — a 70B-parameter model in FP16, or a ~120B model in FP4, runs on one box. But the moment you want to go bigger, you hit a wall that no amount of quantization on a single node can climb.\n\nLinking two units aggregates the memory to **256 GB**, and that is enough to host frontier-scale models locally. NVIDIA's marquee claim for the two-node configuration is **Llama 3.1 405B in FP4** — a 405-billion-parameter model served across the pair using tensor parallelism. Large mixture-of-experts models in the ~200B–235B class (Qwen3-235B-style architectures, MiniMax-M2.5 at 229B) land in the same category: too large for one node, comfortable across two.\n\nThe important mental model: the two nodes do **not** fuse into a single 256 GB GPU. 
The model's weights are *partitioned* across both Sparks — tensor parallelism splits each layer's matrices, pipeline parallelism splits the layer stack — and the nodes exchange activations over the QSFP link every forward pass. What you gain is **capacity**: the ability to load a model whose weights plus KV cache exceed any single node's memory.\n\n## Advantage 2: Tensor-parallel compute and KV-cache headroom for mid-size models\n\nStacking isn't only for 405B monsters. Even a model that fits on one node benefits from being served across two, for reasons that have nothing to do with fitting the weights:\n\n- **More KV-cache space.** Long-context workloads and high concurrency are bottlenecked by KV-cache memory, not weights. Spreading a 120B model across two nodes frees memory on each for a larger cache, which means longer context windows and more simultaneous sequences before you hit an out-of-memory wall.\n- **Tensor-parallel throughput.** With `--tensor-parallel-size 2` in vLLM, both Blackwell GPUs share the matrix multiplications for every token. For concurrent, batched serving this raises aggregate tokens/sec meaningfully.\n- **Continuous batching across the cluster.** vLLM's PagedAttention and continuous batching operate over the distributed setup, so the second node contributes to serving many requests in parallel rather than sitting idle.\n\nReported figures bear this out: a ~120B-class model (GPT-OSS-120B, MXFP4) that runs around 35–50 tok/s single-stream on one node lands roughly in the 55–75 tok/s range on a stacked pair depending on the engine (vLLM, SGLang, or TensorRT-LLM), with the larger gains showing up under concurrency rather than in a single isolated request.\n\n## Advantage 3: A documented, repeatable software path\n\nA clustered setup is only an advantage if it's reliable to stand up. 
NVIDIA publishes the full procedure — physical connection, netplan-based network configuration, passwordless SSH discovery, and a vLLM + Ray cluster launched with tensor parallelism across both nodes. The serving layer exposes an **OpenAI-compatible API**, so anything that already talks to OpenAI's endpoint — Open WebUI, a local chat frontend, an agent framework — points at the head node's `:8000/v1` and works unchanged.\n\nThe orchestration is conventional, not exotic: Ray coordinates the cluster and places the vLLM workers, a Ray dashboard gives live GPU and actor visibility, and a set of environment variables pins every collective library (`NCCL_SOCKET_IFNAME`, `UCX_NET_DEVICES`, `GLOO_SOCKET_IFNAME`, `TP_SOCKET_IFNAME`) to the high-speed QSFP interface so traffic never falls back to the slow management NIC. The same Ray-based pattern also underpins TensorRT-LLM and SGLang multi-node deployments, so the skills transfer.\n\n## Advantage 4: Frontier-scale capability without the cloud\n\nFor teams whose interest in large local models is driven by data residency, privacy, or simply not metering every token through a cloud API, the two-node Spark is a compelling proposition. A pair of compact desktop units — each roughly 150 mm square — gives you a private endpoint capable of 405B-class inference, sitting under a desk, in a lab, or in a location where sending data to a third-party API is off the table. No egress, no per-token billing, no waiting on shared cloud capacity.\n\nIt's also a genuine **develop-to-deploy** path. The DGX Spark runs the same CUDA / NVIDIA AI stack as datacenter Grace Blackwell systems, so a model validated and tuned across two Sparks behaves consistently when promoted to a larger DGX deployment or the cloud. You prototype at frontier scale locally, then scale out without rewriting the stack.\n\n## The honest caveat: capacity scales, single-stream speed doesn't\n\nA technical post owes you the limitation alongside the upside. 
The GB10's unified memory is LPDDR5x with a bandwidth around 273 GB/s **per node**, and linking two units does not pool that bandwidth — each node still reads weights at its own rate. Token generation on memory-bound autoregressive decoding is governed largely by memory bandwidth, so stacking raises the *ceiling on model size* far more than it raises *single-token decode speed*. The very largest models (405B) will run, and that's remarkable for a desk-side pair, but they run at modest tokens/sec, and you'll need to constrain context length and KV-cache settings to load them at all.\n\nIn other words: stack two Sparks to run **bigger** models, to serve **more concurrent** requests, and to get **more KV-cache headroom** — not to make a single chat response stream dramatically faster. Frame the purchase around capacity and concurrency, and the two-node Spark is one of the most cost-effective ways to put frontier-scale inference on local hardware.\n\n## How to set up: stacking two Sparks step by step\n\nTheory aside, here's the full bring-up. The whole process takes well under an hour, and the commands below follow NVIDIA's official *Connect Two Sparks* procedure and the `dgx-spark-playbooks` vLLM multi-node guide. Conventions used throughout: **Node 1 = head = 192.168.100.10**, **Node 2 = worker = 192.168.100.11**, multi-node interface `enp1s0f1np1`. Adapt IPs and the interface name to your own `ibdev2netdev` output.\n\n### Step 0 — What you need\n\n- **2 × DGX Spark** (or an OEM GB10 variant), both on the same, up-to-date DGX OS image. Update the ConnectX-7 / `mlx5` firmware and the `dgx-spark-mlnx-hotplug` package before you start.\n- **1 × 200G QSFP56 passive DAC cable, 0.5 m** (part number `Q56-200G-CU0-5`, or a vendor's DGX-Spark-validated equivalent). 
No switch, no transceivers.\n\n### Step 1 — Connect the cable\n\nPlug the DAC into **port 1 on Node 1 and the matching port 1 on Node 2** — always connect the *same* port number on both units, or the link won't come up. Then confirm on both nodes:\n\n```\nibdev2netdev\n```\n\nYou want the cabled port showing `(Up)` under both of its names:\n\n```\nroceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)\nrocep1s0f1   port 1 ==> enp1s0f1np1   (Up)\n```\n\nEach physical port has two names; use the `enp1...` names for configuration and ignore the `enP2p...` duplicates. If nothing shows `(Up)`, reseat the cable, verify matching ports, and reboot both nodes.\n\n### Step 2 — Match the username on both nodes\n\nThe cluster scripts assume an identical login user. Check with `whoami` on each; if they differ, create a common user (e.g. `nvidia`) on both boxes.\n\n### Step 3 — Configure the network (static IPs)\n\nWith a single cable, static netplan addresses give you a stable cluster.\n\n**Node 1:**\n\n```\nsudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF\nnetwork:\n  version: 2\n  ethernets:\n    enp1s0f1np1:\n      addresses: [192.168.100.10/24]\n      dhcp4: no\nEOF\nsudo chmod 600 /etc/netplan/40-cx7.yaml\nsudo netplan apply\n```\n\n**Node 2:** identical, but with `192.168.100.11/24`. Then verify connectivity:\n\n```\nping -c3 192.168.100.11   # from Node 1\n```\n\nIf you prefer zero-config, netplan `link-local: [ ipv4 ]` on both nodes auto-assigns `169.254.x.x` addresses — convenient, but the IPs can change on reboot, which complicates a static cluster config.\n\n### Step 4 — Passwordless SSH\n\n```\nssh-keygen -t ed25519        # if you don't already have a key\nssh-copy-id -i ~/.ssh/id_ed25519.pub nvidia@192.168.100.10\nssh-copy-id -i ~/.ssh/id_ed25519.pub nvidia@192.168.100.11\n```\n\nConfirm with `ssh 192.168.100.11 hostname`. 
(On some images NVIDIA's `discover-sparks` script automates this discovery and key exchange.)\n\n### Step 5 — Prepare the vLLM containers\n\nOn **both** nodes: install Docker, add your user to the `docker` group, pull a Blackwell/sm100-capable NGC vLLM container (CUDA 13.0+, e.g. the `26.02-py3` image or newer), and authenticate to Hugging Face (`huggingface-cli login`) for model downloads.\n\n### Step 6 — Pin every collective library to the QSFP link\n\nThis is the step that most often makes the difference between a cluster that works and one that hangs. On **both** nodes, export:\n\n```\nexport MN_IF_NAME=enp1s0f1np1\nexport NCCL_SOCKET_IFNAME=$MN_IF_NAME\nexport GLOO_SOCKET_IFNAME=$MN_IF_NAME\nexport TP_SOCKET_IFNAME=$MN_IF_NAME\nexport UCX_NET_DEVICES=$MN_IF_NAME\nexport OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME\nexport RAY_memory_monitor_refresh_ms=0\nexport MASTER_ADDR=192.168.100.10\n```\n\nAlso set `VLLM_HOST_IP=192.168.100.10` on the head and `VLLM_HOST_IP=192.168.100.11` on the worker.\n\n### Step 7 — Start the Ray cluster\n\n**Head (Node 1):**\n\n```\nray start --head --node-ip-address=192.168.100.10 --port=6379 --dashboard-host=0.0.0.0\n```\n\n**Worker (Node 2):**\n\n```\nray start --address=192.168.100.10:6379 --node-ip-address=192.168.100.11\n```\n\nVerify from the head node — you should see two nodes and two Blackwell GPUs:\n\n```\nray status\n```\n\n### Step 8 — Serve the model with tensor parallelism\n\nStart with GPT-OSS-120B to validate the cluster end to end:\n\n```\nvllm serve openai/gpt-oss-120b \\\n  --tensor-parallel-size 2 \\\n  --host 0.0.0.0 --port 8000\n```\n\nFor the maximum-capability case — Llama 3.1 405B in FP4 — keep memory in check; even 256 GB is tight, so constrain context length and KV cache:\n\n```\nvllm serve <hf-org>/Llama-3.1-405B-Instruct-FP4 \\\n  --tensor-parallel-size 2 \\\n  --max-model-len 4096 \\\n  --gpu-memory-utilization 0.92 \\\n  --kv-cache-dtype fp8 \\\n  --host 0.0.0.0 --port 8000\n```\n\n### 
Step 9 — Test the endpoint\n\nvLLM serves an OpenAI-compatible API on the head node:\n\n```\ncurl http://192.168.100.10:8000/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"model\":\"openai/gpt-oss-120b\",\"messages\":[{\"role\":\"user\",\"content\":\"Say hello from a two-node Spark cluster.\"}]}'\n```\n\nPoint any OpenAI-compatible client at `http://192.168.100.10:8000/v1`, and watch the **Ray dashboard** at `http://192.168.100.10:8265` for live GPU utilization and worker placement across both Sparks.\n\n### Quick troubleshooting\n\n- **No `(Up)` interface / QSFP cage won't power** (`insufficient power on PCIe slot (27W)`): the known hotplug issue — toggle `dgx-spark-mlnx-hotplug`, update firmware, and reboot both nodes.\n- **NCCL timeout or hang at model load:** `NCCL_SOCKET_IFNAME` isn't set to the QSFP interface on *both* nodes.\n- **`Connection refused` on Ray join:** the worker can't reach `192.168.100.10:6379` over the QSFP link — recheck IPs and routing.\n- **Out-of-memory at load:** flush the cache with `sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'`, then lower `--max-model-len` and `--gpu-memory-utilization`.\n\n## When stacking is the right call\n\nLink two DGX Spark units if any of these describe you:\n\n- You need to run a model that exceeds 128 GB — 405B in FP4, or a large MoE in the 200B+ class — entirely on local hardware.\n- You're serving a 70B–120B model to multiple users and want more concurrency and longer contexts than one node's KV cache allows.\n- You want a private, frontier-capable inference endpoint with no cloud egress and predictable cost.\n- You're building a develop-to-deploy pipeline and want local behavior to match datacenter Grace Blackwell systems.\n\nIf your workload comfortably fits one node and you only care about fastest single-stream latency, a single Spark — or a higher-bandwidth GPU — may serve you better. 
But for anyone whose constraint is *model size* or *concurrency* rather than raw per-token speed, the second Spark and a 0.5 m copper cable are the cheapest path to a meaningfully larger local AI ceiling.", "url": "https://wpnews.pro/news/two-sparks-one-cluster-why-stacking-nvidia-dgx-spark-units-unlocks-local-scale", "canonical_source": "https://corti.com/two-sparks-one-cluster-why-stacking-nvidia-dgx-spark-units-unlocks-local-frontier-scale-inference/", "published_at": "2026-06-01 10:40:41+00:00", "updated_at": "2026-06-26 12:03:59.023335+00:00", "lang": "en", "topics": ["ai-infrastructure", "ai-chips", "ai-products", "large-language-models", "ai-research"], "entities": ["NVIDIA", "DGX Spark", "Grace Blackwell", "ConnectX-7", "Llama 3.1 405B", "Qwen3-235B", "MiniMax-M2.5", "RoCE"], "alternates": {"html": "https://wpnews.pro/news/two-sparks-one-cluster-why-stacking-nvidia-dgx-spark-units-unlocks-local-scale", "markdown": "https://wpnews.pro/news/two-sparks-one-cluster-why-stacking-nvidia-dgx-spark-units-unlocks-local-scale.md", "text": "https://wpnews.pro/news/two-sparks-one-cluster-why-stacking-nvidia-dgx-spark-units-unlocks-local-scale.txt", "jsonld": "https://wpnews.pro/news/two-sparks-one-cluster-why-stacking-nvidia-dgx-spark-units-unlocks-local-scale.jsonld"}}