{"slug": "pipeline-parallel-llm-inference-across-gpus-on-separate-machines", "title": "Pipeline-parallel LLM inference across GPUs on separate machines", "summary": "A 744-billion-parameter GLM-5.2 model was served at ~30 tokens per second across six prosumer Blackwell GPUs in six US states over a wide-area network using pipeline parallelism and speculative decoding. The system, built on the Shard inference engine, splits the model into contiguous layer blocks across separate machines, with a coordinator holding only the token embedding and a small draft model. This demonstrates that frontier-size models can be run across geographically distributed consumer hardware at usable speeds by hiding WAN latency through async pipelining and CUDA-graphed draft models.", "body_md": "Pipeline-parallel LLM inference across GPUs on separate machines. A model too large for any single card is split into contiguous blocks of layers — one shard per GPU — and a request is served by streaming activations through the shards in order. No datacenter, no single host, and no node ever holds the whole model.\n\nShard is the inference engine for [c0mpute](https://c0mpute.ai).\n\n**A 744-billion-parameter frontier model, served at ~30 tok/s across six\nprosumer Blackwell GPUs in six US states — over WAN, greedy, deterministic.**\nGLM-5.2 (NVFP4, 78 layers) is split **13 layers per node** across 6× RTX PRO 6000;\nno single card holds it, six do. Each node loads **only its own block**. 
A coordinator\nholds no model layers — just the token embedding/head and a small CUDA-graphed\nGLM-4-9B draft that proposes tokens, which the distributed 744B verifies.\n\n| Setup | tok/s (warm) | Output |\n|---|---|---|\n| GLM-5.2 744B NVFP4, 6× RTX PRO 6000 across 6 US states (NV · TX · MN · MO · UT + WA coord), WAN, pipelined spec-decode + CUDA-graphed draft | ~30 | greedy, deterministic |\n\nEvery run emits a **verifiable receipt** — distinct GPU UUIDs / public IPs / regions,\nmeasured WAN edge RTTs (22–75 ms), the output token hash, and a lossless-optimization\ncheck. This run's receipt: [docs/receipts/glm52-nvfp4-wan-20260618.json](/leyten/shard/blob/master/docs/receipts/glm52-nvfp4-wan-20260618.json)\n(see [docs/PROOF.md](/leyten/shard/blob/master/docs/PROOF.md) for how a skeptic checks it).\n\nThat is the whole thesis in one line: a frontier-size model, far too big for any single card, served across machines on different networks — activations crossing the country on every traversal — at a speed that is actually usable.\n\nPlain pipeline decode over WAN is latency-bound: one round-trip per token, ~1–2 tok/s, unusable. The path to 30 was a sequence of measured steps, each committed:\n\n| Step | tok/s | What changed |\n|---|---|---|\n| plain KV decode | 1.87 | latency-bound baseline (one token per round-trip) |\n| + deep-draft spec-decode (GLM-4-9B), relay-back | 1.99 | one traversal commits several tokens |\n| + ring direct-return | 2.94 | tail returns to the coordinator in one hop — 7 ring hops, not a 12-hop relay-back |\n| + async pipelining | 16.6 | overlap many verify traversals in flight → throughput-bound, not latency-bound; the WAN drops to ~5% of the loop |\n| + CUDA-graphed draft | ~30 | with the WAN hidden, the draft was 94% of the loop; CUDA-graphing it (3.8×) lifts the whole pipeline |\n\n**The key insight: over WAN the round-trip is the scarce resource, not compute** —\nso speculative decoding, marginal in a datacenter, becomes the whole game. 
A small\ndraft proposes K tokens; the distributed 744B verifies them in a single pipeline\ntraversal; greedy acceptance commits the verified prefix. Then two compounding wins:\n\n- **Async pipelining over the ring.** Because the ring is direct-return, multiple verify chunks can be in flight at once. The coordinator drafts a continuous stream and pumps overlapping chunks into the pipeline without waiting — so the loop runs at the pipeline's *throughput*, not its *latency*. The WAN, which dominated every prior attempt, drops to ~5% of the loop.\n- **CUDA-graphed draft.** Once the WAN is hidden, the GLM-4-9B draft (single-token decode, launch-overhead-bound) becomes 94% of the loop. Capturing it as a CUDA graph cuts it 3.8× (49.7→13.1 ms/tok). The hard part was making the static KV cache honor speculative rollback under graph capture — solved by driving the write slot through a static-address position tensor; the result is **byte-identical to the eager path**, so the optimization is provably lossless. (`research/glm_swarm_nvfp4_cg.py`, `research/glm_swarm_nvfp4_cg_diff.py`.)\n\nA transformer is a stack of layers. Shard splits the stack into contiguous blocks, one block per GPU. A token is produced by passing activations through the blocks in order; each node keeps a KV-cache for its own layers.\n\n```\ncoordinator (WA) ── GLM-4-9B draft (CUDA-graphed) + embed / lm_head\n     │\n     ├─► stage0 ─► stage1 ─► stage2 ─► stage3 ─► stage4 ─► stage5 ─┐  (verify chunks, pipelined)\n     │   NV         TX         (·)        MN         MO        UT    │\n     │   0–12       13–25      26–38      39–51      52–64     65–77 │\n     └──────────────── direct return (tail → coordinator, 1 hop) ────┘\n```\n\nThe coordinator (entry node) holds **no** 744B layers — only the draft and a thin\ndriver. 
Each round: the draft proposes K tokens; the coordinator ships `[cur, d₁..dₖ]`\ninto stage 0, which embeds them; the chain verifies all K+1 in one forward traversal;\nthe tail returns the argmaxes straight to the coordinator (one hop, not relayed back);\nthe coordinator greedy-accepts the longest matching prefix. Many such chunks are in\nflight at once (the pipeline), and the draft replays a captured CUDA graph against a\nstatic KV cache.\n\nSplitting a model across co-located GPUs is well understood. Doing it across machines on the open internet, fast enough to be usable, is not — and that is the part Shard owns.\n\n- **Latency.** Every token traverses the whole pipeline. Speculative decoding amortizes one round-trip over many committed tokens; pipelining overlaps the traversals so the WAN stops being the floor; the CUDA-graphed draft keeps what's left cheap.\n- **Transport.** The activation tensor crosses the public internet on every step. Shard owns this layer — supervised edges that fail fast and reconnect, per-edge health logging, no opaque \"broken pipe.\" The wire is authenticated and encrypted with pickle-free framing (`phase0/wire.py`; ChaCha20-Poly1305 under a shared `SHARD_PSK`), so a passive observer learns nothing and a forged frame is a parse error, not code execution. (NAT hole-punching + relay fallback for home routers is the remaining Phase 1 work; a direct open port stands in today.)\n\nShard is c0mpute infrastructure, held to its three guarantees:\n\n- **Uncensored.** The engine runs models as-is. No content filter in the inference path.\n- **Decentralized.** Anyone can join a GPU with one command and be assigned a block of layers. No central inference server.\n- **Private.** No node holds the whole model — a real start, not the whole story. The wire is sealed (authenticated encryption, pickle-free), so the leak is not on the path; but a *participating* node must decrypt to run its layer, so it sees the activations it processes. 
Intermediate activations can still leak a fraction of a user's tokens to a malicious node. The plan — pin leaky boundary layers to trusted nodes, per-request trusted routing, never overclaim — is in [docs/ARCHITECTURE.md](/leyten/shard/blob/master/docs/ARCHITECTURE.md). It is the number-one open problem and is treated as one.\n\ngpt-oss-120B (MXFP4, 36 layers) across **3 scattered RTX 4090s in different US states** + a\ncoordinator, **~40 tok/s (peak ~42), greedy, exact**. This is the rig the permissionless\nwork (Phase 3+) is built on — plain 24GB consumer cards, the hardware a real volunteer runs.\nThis run's verifiable receipt (distinct GPU UUIDs / IPs / states, WAN edge RTTs, output\nhash, sync-vs-pipelined token match): [docs/receipts/gpt-oss-120b-wan-20260619.json](/leyten/shard/blob/master/docs/receipts/gpt-oss-120b-wan-20260619.json).\n\nThe climb from a latency-bound ~18 tok/s, each step measured:\n\n| Step | tok/s | What changed |\n|---|---|---|\n| pipelined spec-decode (4-stage) | 25.8 | async-draft overlap + many verify chunks in flight + RTT-optimal ring order |\n| + 3-stage (12-layer) ring | 28.8 | fatter stages → 4 WAN hops instead of 5 (12 layers fits a 24GB card) |\n| + coordinator placed in-region | ~40 (peak ~42) | the coordinator holds no model layers, so it can live anywhere; moving it off the cross-country leg cut the ring 174→102 ms |\n\nThe last step is the one nobody looks for: the cheapest node in the system — the\nlayer-less coordinator — was sitting a continent away from the swarm, paying two long\nround-trips on every token. Putting it next to the stages, on the same scattered nodes,\nwas a ~40% latency cut for free. 
Full record:\n[docs/research/wan-speculative-decoding.md](/leyten/shard/blob/master/docs/research/wan-speculative-decoding.md).\n\nGLM-5.2 (above) remains the **frontier-size** flagship — 6× the parameters at 744B;\ngpt-oss-120B is the faster, consumer-card build target the network is bootstrapped on.\n\n```\nphase0/   transport + deploy: wire.py (sealed framing), mesh.py (edge RTTs),\n          proof_receipt.py (run-receipt build/verify), launch + bench tooling\nresearch/ the swarm drivers — glm_swarm_nvfp4_kv.py (NVFP4 KV-cached stages),\n          glm_swarm_nvfp4_pipe.py (pipelined spec-decode), glm_swarm_nvfp4_cg.py\n          (CUDA-graphed draft), *_cg_diff.py / *_fwdcmp.py (correctness diagnostics)\ndocs/     ARCHITECTURE, ROADMAP, PROOF.md, receipts/, and the research records\nshard/    engine module scaffolding (node, transport, specdec, topology)\n```\n\n- **Phase 0 — Transport, proven.** Reliable serving through a multi-stage split.\n- **Phase 1 — WAN.** Different networks behind NAT: hole-punching, relay fallback, activation quantization, edge supervision.\n- **Phase 2 — Speculative decoding.** Draft-and-verify over the swarm — **done at GLM-5.2 744B scale, ~30 tok/s greedy over WAN** (and gpt-oss-120B at ~18–25, above).\n- **Phase 3 — Permissionless swarm.** One-command join, dynamic layer allocation across heterogeneous GPUs, per-token payouts, fault tolerance.\n\nFull detail, pass/fail criteria, and risks: [docs/ROADMAP.md](/leyten/shard/blob/master/docs/ROADMAP.md).\n\n[Apache License 2.0](/leyten/shard/blob/master/LICENSE) © 2026 leyten", "url": "https://wpnews.pro/news/pipeline-parallel-llm-inference-across-gpus-on-separate-machines", "canonical_source": "https://github.com/leyten/shard", "published_at": "2026-06-19 19:14:34+00:00", "updated_at": "2026-06-19 19:37:33.654465+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-research", "ai-products", "ai-tools"], "entities": ["Shard", "c0mpute", "GLM-5.2", "RTX PRO 6000", "NVIDIA", 
"GLM-4-9B", "Blackwell"], "alternates": {"html": "https://wpnews.pro/news/pipeline-parallel-llm-inference-across-gpus-on-separate-machines", "markdown": "https://wpnews.pro/news/pipeline-parallel-llm-inference-across-gpus-on-separate-machines.md", "text": "https://wpnews.pro/news/pipeline-parallel-llm-inference-across-gpus-on-separate-machines.txt", "jsonld": "https://wpnews.pro/news/pipeline-parallel-llm-inference-across-gpus-on-separate-machines.jsonld"}}