# Jamesob's guide to running SOTA LLMs locally

Jamesob published a guide on building a local system to run state-of-the-art large language models, detailing hardware configurations ranging from $2k to $40k. The setup uses multiple RTX GPUs and PCIe switches for peer-to-peer communication, enabling high-performance local inference of models like GLM-5.2-594B and Qwen3.6-27B. The guide emphasizes cost-effective VRAM investment over expensive PCIe5/DDR5 hardware.

Note: nothing in this README aside from the tables was written by AI.

Have $2k burning a hole in your pocket and want some local, state-of-the-art machine intelligence? How about $40k? If Dario and Altman are giving you heartburn (they should be), read on to figure out how to run this new kind of computing locally.

In this repo you'll find

- the hardware I use to run SOTA locally,
- why I bought what and little-known secrets for configuring it,
- how I run speech-to-text (STT) locally,
- ready-to-run configuration for running models I think are good within Docker containers.

| Section | TL;DR |
|---|---|
| [Base system](#base-system) | |
| [GPUs](#gpus) | |
| [c-payne switch sub-BOM](#c-payne-pcie-gen4-switch-sub-bom-c-paynecom) | [c-payne.com](https://c-payne.com) so GPUs talk peer-to-peer |
| [GPU mount](#gpu-mount) | |
| [Making the switch behave](#getting-the-pci-switches-to-work-properly) | |
| [Kernel / GRUB params](#kernel--grub-parameters) | `iommu=off` or NCCL hangs |
| [ACS disable](#acs-disable-critical-for-switch-p2p) | |
| [GPU power limiting](#gpu-power-limiting) | |
| [Result](#result) | |
| `runners/` [GLM-5.2-594B](/jamesob/local-llm/blob/master/runners/GLM-5.2-594B) | vLLM docker-compose, DCP4+MTP5, ~80 t/s @ 460k ctx |
| `runners/stt` | whisper-large-v3 |
| `tools/` | [measure-gpu-speed.sh](/jamesob/local-llm/blob/master/tools/measure-gpu-speed.sh): P2P bandwidth/latency benchmark |
| [Resources](#resources) | |

I was lucky/dumb enough to buy 4x RTX PRO 6000s back when they were cheaper.
Because RAM is now so expensive, I opted to build a last-gen DDR4 system to host these cards, the parts for which I got off eBay. This allowed me to keep the base system cost reasonable while still getting a lot of VRAM.

Another somewhat unusual thing I did was to use PCIe4 switches from [c-payne.com](https://c-payne.com). This allows the GPUs to communicate with one another "directly" at wire speed during the allreduce step in tensor parallelism, rather than having to send all data through the PCIe root complex. The upshot is reduced latency between the cards, with less of a need for expensive PCIe5 hardware.

Consequently, I'm spending money on VRAM where it counts rather than on a PCIe5/DDR5 base system, which is terrifically expensive as of July 2026. My particular BOM is detailed below.

A great way to go is 2x RTX 3090s for a total of 48GB of VRAM. You can then run [Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B), which is an awesome model.

You can also run SOTA speech-to-text (STT) with [whisper-large-v3](https://huggingface.co/openai/whisper-large-v3). That's the model; you'd then access it with my cross-platform [`stt` harness](https://github.com/jamesob/stt). I've found local STT surprisingly useful, and I feel comfortable using it, unlike a hosted equivalent. You can find a ready-to-run config in [./runners/stt](/jamesob/local-llm/blob/master/runners/stt) that only assumes the presence of ~11GB of VRAM on an Nvidia GPU.

At the higher price level, you get the next step up in model intelligence: something pretty close to Claude Opus. You'd buy 4x RTX PRO 6000s for a total of 384GB of VRAM.

| Date | Best model | My config |
|---|---|---|
| 2026-07 | GLM-5.2-Int8Mix-NVFP4-REAP-594B | [Runner config](/jamesob/local-llm/blob/master/runners/GLM-5.2-594B) |

Note: these are my recommendations, but there are other completely valid ways to spend your money.
For example, there's probably also some regime where, rather than getting 4 RTX PRO 6000s, you allocate most of your money to building out a linked [4x DGX Spark cluster](https://youtu.be/QJqKqxQR36Y) for a total of 512GB VRAM and use that as the slow, big brain to drive Qwen3.6-27B to do the rote tasks quickly.

Here's the hardware I wound up purchasing for the 4x RTX PRO 6000 machine.

### Base system

A modest, last-gen EPYC system purchased in parts almost entirely from eBay.

| Component | Spec | Price |
|---|---|---|
| Motherboard | ASRock Rack ROMED8-2T SP3, 7× PCIe 4.0 x16, dual 10GbE | $715 |
| CPU | AMD EPYC Milan 7313P 16-core 3.0GHz | $504 |
| RAM | 8× 16GB Crucial CT16G4RFD4213 DDR4 ECC RDIMM (128GB total, eBay) | $642 |
| CPU Cooler | Dynatron T17 SP3 tower, 280W TDP | $40 |
| Case | AAAWave Sluice V2 open frame | $100 |
| PSUs | 2× Super Flower 1700W | $750 |
| PCIe Switch | c-payne Microchip Switchtec PM40100 Gen4 (see sub-BOM below) | ~$1,330 |
| Boot NVMe | 4TB M.2 | $291 |
| Storage NVMe | 2× 8TB M.2 (model weights) | $1,200 |
| Fans | 3× 120mm PWM | $15 |
| Total | | $5,587 |

### GPUs

| Component | Spec | Price |
|---|---|---|
| GPUs | 4× NVIDIA RTX PRO 6000 Blackwell Workstation (96GB each, 384GB VRAM total) | ~$46,000 |

### c-payne PCIe gen4 switch sub-BOM (c-payne.com)

| Part | Qty | Unit price (€) | Notes |
|---|---|---|---|
| PCIe gen4 Switch 5× x16, Microchip Switchtec PM40100 | 1 | 1,050 | 2× SlimSAS 8i upstream, 5× x16 quad-width-spaced downstream, aux x4 SlimSAS, 3× 8-pin EPS power |
| SlimSAS PCIe gen4 Host Adapter x16, REDRIVER AIC DS160PR810 | 1 | 140 | Plugs into ROMED8-2T x16 slot, feeds switch upstream |
| SlimSAS SFF-8654 8i cable, PCIe gen4 | 2 | ~30 | Each carries x8; pair = x16 upstream |
| Total | | | |

### GPU mount

I had to custom-fabricate a wood enclosure for the PCIe switch and GPUs, which took about a day. I found the PCIe switch's built-in fan very loud and seemingly useless, so I simply unplugged it from the board.

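As a quick sanity check on the BOM tables above, the listed line items can be summed in a few lines of Python. This is only a sketch: the variable names are mine, the prices are copied from the tables, and the euro figure for the switch is approximate because the cable price is listed as "~30". It ignores currency conversion, shipping, and tax.

```python
# Prices copied from the base-system table (USD).
base_system_usd = {
    "motherboard": 715,
    "cpu": 504,
    "ram": 642,
    "cpu_cooler": 40,
    "case": 100,
    "psus": 750,
    "pcie_switch": 1330,  # listed as ~$1,330
    "boot_nvme": 291,
    "storage_nvme": 1200,
    "fans": 15,
}

# Switch sub-BOM as (quantity, unit price in EUR).
switch_sub_bom_eur = {
    "switch": (1, 1050),
    "host_adapter": (1, 140),
    "slimsas_cable": (2, 30),  # listed as ~30 each
}

gpus_usd = 46000  # listed as ~$46,000 for all four cards

base_total_usd = sum(base_system_usd.values())
switch_total_eur = sum(qty * unit for qty, unit in switch_sub_bom_eur.values())
machine_total_usd = base_total_usd + gpus_usd

print(f"Base system:    ${base_total_usd:,}")
print(f"Switch sub-BOM: ~€{switch_total_eur:,}")
print(f"Whole machine:  ~${machine_total_usd:,}")
print(f"GPU share:      {gpus_usd / machine_total_usd:.0%}")
```

The base-system sum matches the $5,587 total in the table, and the GPUs come out at roughly 89% of the listed cost of the whole machine, which is the "spend on VRAM where it counts" trade-off in numbers.
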
I save all model weights locally on a ZFS filesystem, mounted at `~/storage`, that's replicated across the two 8TB drives. For any model I want to run, I first download the model using hf download