{"slug": "jamesob-s-guide-to-running-sota-llms-locally", "title": "Jamesob's guide to running SOTA LLMs locally", "summary": "Jamesob published a guide on building a local system to run state-of-the-art large language models, detailing hardware configurations ranging from $2k to $40k. The setup uses multiple RTX GPUs and PCIe switches for peer-to-peer communication, enabling high-performance inference of models like GLM-5.2-594B and Qwen3.6-27B locally. The guide emphasizes cost-effective VRAM investment over expensive PCIe5/DDR5 hardware.", "body_md": "*Note: nothing in this README aside from the tables was written by AI.*\n\nHave $2k burning a hole in your pocket and want some local, state-of-the-art machine intelligence?\n\nHow about $40k?\n\nIf Dario and Altman are giving you heartburn (they should be), read on to figure out how to run this new kind of computing locally.\n\nIn this repo you'll find\n\n- the hardware I use to run SOTA locally,\n- why I bought what and little-known *secrets* for configuring it,\n- how I run speech-to-text (STT) locally,\n- ready-to-run configuration for running models I think are good within Docker containers.\n\n| Section | TL;DR |\n|---|---|\n| [Base system](#base-system) | |\n| [GPUs](#gpus) | |\n| [c-payne switch sub-BOM](#c-payne-pcie-gen4-switch-sub-bom-c-paynecom) | [c-payne.com](https://c-payne.com) so GPUs talk peer-to-peer |\n| [GPU mount](#gpu-mount) | |\n| [Making the switch behave](#getting-the-pci-switches-to-work-properly) | |\n| [Kernel / GRUB params](#kernel--grub-parameters) | `iommu=off` or NCCL hangs |\n| [ACS disable](#acs-disable-critical-for-switch-p2p) | |\n| [GPU power limiting](#gpu-power-limiting) | |\n| [Result](#result) | |\n| `runners/` | [GLM-5.2-594B](/jamesob/local-llm/blob/master/runners/GLM-5.2-594B): vLLM docker-compose, DCP4+MTP5, ~80 t/s @ 460k ctx |\n| `runners/stt` | `whisper-large-v3` |\n| `tools/` | [`measure-gpu-speed.sh`: P2P bandwidth/latency benchmark](/jamesob/local-llm/blob/master/tools/measure-gpu-speed.sh) |\n| [Resources](#resources) | |\n\nI was lucky/dumb enough to buy 4x RTX Pro 6000s back when they were cheaper. Because RAM is now so expensive, I opted to build a last-gen DDR4 system to host these cards, the parts for which I got off eBay. This allowed me to keep base system cost reasonable while still getting a lot of VRAM.\n\nAnother somewhat unusual thing I did was to use PCIe4 switches (from\n[c-payne.com](https://c-payne.com)). This allows the GPUs to communicate with one another\n\"directly\" at wire speeds during the allreduce step in tensor parallelism, rather than\nhaving to send all data through the PCI root complex. The upshot of this is reduced\nlatency between the cards with less of a need for expensive PCIe5 hardware.\n\nConsequently, I'm spending money on VRAM (where it counts) rather than on a PCIe5/DDR5 base system, which is terrifically expensive as of July 2026.\n\nMy particular BOM is detailed below.\n\nA great way to go is 2x RTX 3090s for a total of **48GB VRAM**. You can then run\n[Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B), which is an awesome model.\n\nYou can also run SOTA speech-to-text (STT) with\n[whisper-large-v3](https://huggingface.co/openai/whisper-large-v3), which I find very\nuseful. That's the model - you'd then access it with my cross-platform\n[`stt` harness](https://github.com/jamesob/stt).\n\nI've found local STT surprisingly useful - and I feel comfortable using it, unlike a\nhosted equivalent. You can find a ready-to-run config in\n[./runners/stt](/jamesob/local-llm/blob/master/runners/stt) that only assumes the presence of ~11GB of VRAM on an\nNvidia GPU.\n\nAt this price level, you get the next step up in model intelligence. 
Something pretty close to Claude Opus.\n\nYou'd buy 4x RTX 6000 Pros for a total of **384GB of VRAM**.\n\n| Date | Best model | My config |\n|---|---|---|\n| 2026-07 | `GLM-5.2-Int8Mix-NVFP4-REAP-594B` | [Runner config](/jamesob/local-llm/blob/master/runners/GLM-5.2-594B) |\n\nNote: these are my recommendations, but there are other completely valid ways to spend\nyour money. For example, there's probably also some regime where rather than getting 4\nrtx6kpros, you allocate most of your money to building out a [linked 4x DGX Spark\ncluster](https://youtu.be/QJqKqxQR36Y?si=MiKNYtIzut_5pnXy) for a total of 512GB VRAM\nand use that as the slow, big brain to drive Qwen3.7-27b to do the rote tasks quickly.\n\nHere's the hardware I wound up purchasing for the 4x RTX 6000 pro machine.\n\nA modest, last-gen EPYC system purchased in parts almost entirely from eBay.\n\n| Component | Spec | Price |\n|---|---|---|\n| Motherboard | ASRock Rack ROMED8-2T (SP3, 7× PCIe 4.0 x16, dual 10GbE) | $715 |\n| CPU | AMD EPYC Milan 7313P (16-core 3.0GHz) | $504 |\n| RAM | 8× 16GB Crucial CT16G4RFD4213 DDR4 ECC RDIMM (128GB total, eBay) | $642 |\n| CPU Cooler | Dynatron T17 SP3 tower, 280W TDP | $40 |\n| Case | AAAWave Sluice V2 open frame | $100 |\n| PSUs | 2× Super Flower 1700W | $750 |\n| PCIe Switch | c-payne Microchip Switchtec PM40100 Gen4 (see sub-BOM below) | ~$1,330 |\n| Boot NVMe | 4TB M.2 | $291 |\n| Storage NVMe | (2x) 8TB M.2 (model weights) | $1,200 |\n| Fans | 3× 120mm PWM | $15 |\n| Total | | $5,587 |\n\n| Component | Spec | Price |\n|---|---|---|\n| GPUs | 4× NVIDIA RTX PRO 6000 Blackwell Workstation (96GB each, 384GB VRAM total) | ~$46,000 |\n\n| Part | Qty | Unit (€) | Notes |\n|---|---|---|---|\n| PCIe gen4 Switch 5× x16 — Microchip Switchtec PM40100 | 1 | 1,050 | 2× SlimSAS 8i upstream, 5× x16 quad-width-spaced downstream, aux x4 SlimSAS, 3× 8-pin EPS power |\n| SlimSAS PCIe gen4 Host Adapter x16 — REDRIVER AIC (DS160PR810) | 1 | 140 | Plugs into ROMED8-2T x16 slot, feeds 
switch upstream |\n| SlimSAS SFF-8654 8i cable — PCIe gen4 | 2 | ~30 | Each carries x8; pair = x16 upstream |\n| Total | | | |\n\nI had to custom-fabricate a wood enclosure for the PCI switch and GPUs, which took about a day.\n\nI found the PCI switch's built-in fan very loud and seemingly useless, so I simply unplugged that from the board.\n\nI save all model weights locally on a ZFS filesystem that's replicated across the two\n8TB drives, which is mounted at `~/storage`.\n\nFor any model I want to run, I first download the model using\n\n```\nhf download <model-name> --local-dir ~/storage/<model-name>\n```\n\nOnce the model weights are cached locally, I have a specific directory for each model\nthat contains a `docker-compose.yml` file that cordons off the running of each model\nto its own Docker container.\n\nYou can find these configurations in [./runners/](/jamesob/local-llm/blob/master/runners).\n\nEach container mounts in `~/storage/models` in read-only mode to obtain the weights\nthat I've cached locally.\n\nI then use `opencode` hosted on a VM on another machine to access the models once\nthey're serving on `http://clank.j.co:5000`.\n\nI use a network-internal DNS server to point `clank.j.co` to the LLM machine, but you\ncould simply do `http://<llm-machine-ip>:5000` too.\n\nI created a VM and clanked up an application that basically just creates a tmux session\nfor each directory within the VM's `~/src` tree, which then runs an `opencode` instance\nthat backs up to the inference machine's HTTP API (`http://clank.j.co:5000`).\n\nOne key to making the open-source models good is tooling them properly; a summary of my\n`skills/` is:\n\n- camofox, kagi.com API key, and SearXNG for web browsing and search,\n- Telegram bot for communication and alerting,\n- a local private Gitea instance for collaborating on source code.\n\nThe clanker will either work with me interactively in a session, or can be farmed off to work on Gitea issues 
and file PRs there.\n\nAll this happens in a sandboxed VM where the only communication back to the host system happens via a shared filesystem mount, so the thing can go ham and install whatever it wants.\n\nThere was a lot of fiddling with the BIOS in order to make sure the motherboard wasn't downregulating the PCI switch speeds.\n\n| Setting | Value | Why |\n|---|---|---|\n| `Chipset Configuration → AMD PCIE Link Width` (switch slot) | x16 (was x8/x8) | Bifurcation was splitting the slot; upstream link trained at Gen4 x8. Requires both SlimSAS 8i cables connected (each carries x8). |\n| PCIe Link Speed (switch slot) | Gen4 (not Auto) | Blackwell Gen5 devices auto-negotiating down through the Gen4 switch could fail training and fall to Gen1. Forcing Gen4 stabilizes it. |\n| ASPM | Disabled | ASPM L1 drops idle links to 2.5GT/s. This turned out to be the explanation for the \"Gen1 downgraded\" lspci readings — links were actually running Gen4 under load (verified via p2pBandwidthLatencyTest), but disabling ASPM removes the cosmetic scare and any re-train latency. |\n| Re-Size BAR | Enabled | Required for full 96GB VRAM BAR exposure and GPU P2P. |\n| SR-IOV | Disabled | Bare-metal inference; avoids IOMMU overhead and P2P interference. |\n| Preferred IO | Auto | Optionally set Manual → bus `81` (the c-payne switch) for marginal latency gains, but left at Auto — it's a squeeze-more optimization, not a fix, and bus numbers shift after BIOS changes. |\n\nPer c-payne's advice, I did reduce the gain to \"lvl 3\" using [his\ntool](https://c-payne.com/c-payne-tool), which was probably the most finicky part of\nthe process.\n\nThe gain level is going to be a function of how long your SAS connector cables are.\n\nI screwed up and ordered too few of the cables from c-payne directly, so I bought what I thought was the same SAS cable off of Amazon. 
There was actually a slight difference that was causing issues, and I had to reorder cables - so double-check that you're getting the right stuff!\n\n```\n# /etc/default/grub\nGRUB_CMDLINE_LINUX=\"iommu=off amd_iommu=off nomodeset\"\n\n# regenerate the GRUB config after editing the file above\nsudo update-grub\n\n# nvidia_uvm P2P fix\necho 'options nvidia_uvm uvm_disable_hmm=1' | sudo tee /etc/modprobe.d/uvm.conf\nsudo update-initramfs -u\n```\n\nWithout `iommu=off`, NCCL hangs on multi-GPU P2P.\n\nWith ACS enabled (default), P2P traffic gets bounced through the CPU root port\ninstead of staying inside the switch fabric, negating the switch entirely.\n`pcie_acs_override` requires a patched kernel, so we disable via setpci at runtime.\n\n```bash\n#!/bin/bash\n# /usr/local/bin/disable-acs.sh\n# Clears the ACS control register on every PCI device that exposes the ACS capability.\nif [ \"$EUID\" -ne 0 ]; then\n  echo \"ERROR: must be run as root\"\n  exit 1\nfi\n\nfor BDF in $(lspci -d \"*:*:*\" | awk '{print $1}'); do\n  # Skip devices without an ACS extended capability (the read fails).\n  if ! setpci -v -s \"${BDF}\" ECAP_ACS+0x6.w > /dev/null 2>&1; then\n    continue\n  fi\n  echo \"Disabling ACS on $(lspci -s \"${BDF}\")\"\n  setpci -v -s \"${BDF}\" ECAP_ACS+0x6.w=0000\ndone\n```\n\nRun on every boot via systemd oneshot:\n\n```\n# /etc/systemd/system/disable-acs.service\n[Unit]\nDescription=Disable PCIe ACS for GPU P2P\nAfter=multi-user.target\n\n[Service]\nType=oneshot\nExecStart=/usr/local/bin/disable-acs.sh\n\n[Install]\nWantedBy=multi-user.target\n```\n\nVerify: `lspci -vvv | grep ACSCtl` should show all minus signs, and\n`nvidia-smi topo -m` should show **PIX** between all four GPUs (not PHB/NODE).\n\nUse [./tools/measure-gpu-speed.sh](/jamesob/local-llm/blob/master/tools) to measure this easily.\n\nIn order to avoid installing a 220V circuit, I (probably unwisely) run this rig on a single 110V circuit, but I power-regulate the cards.\n\nPersistence mode + power cap applied at boot via systemd (install-gpu-power-limit.sh):\n\n```\nsudo nvidia-smi -pm 1\nsudo nvidia-smi -pl 350    # 350W per GPU (default 600W)\n```\n\n350W/GPU = 1,400W GPU load, sized for the 
PSU budget. During the interim single-1700W-PSU phase (before the 240V circuit), cards ran at ~260W (4×260 = 1,040W GPUs + ~280W system ≈ 1,320W total).\n\nVerify: `nvidia-smi --query-gpu=index,power.limit,power.draw --format=csv`\n\nUpstream: Gen4 x16 (~30 GB/s to CPU). P2P through switch: **27.5 GB/s\nunidirectional / 50.4 GB/s bidirectional, 0.37–0.45 µs latency**, i.e. Gen4 line\nrate. Note: lspci may still show downstream GPU links as \"2.5GT/s (downgraded)\"\nat idle if ASPM is active anywhere; this is cosmetic. Links retrain to Gen4\nunder load.\n\n- A frequently updated repo on getting the most out of 4, 6, or 8 RTX 6000 Pro cards: [https://github.com/local-inference-lab/rtx6kpro](https://github.com/local-inference-lab/rtx6kpro)\n- Indie PCI switches that I use: [https://c-payne.com](https://c-payne.com)\n- RTX6kPRO discord server; lotta guys benching and testing new models: [https://discord.gg/QMNvFkuDN](https://discord.gg/QMNvFkuDN)", "url": "https://wpnews.pro/news/jamesob-s-guide-to-running-sota-llms-locally", "canonical_source": "https://github.com/jamesob/local-llm", "published_at": "2026-07-03 15:03:43+00:00", "updated_at": "2026-07-03 21:33:47.951764+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-tools"], "entities": ["Jamesob", "RTX Pro 6000", "RTX 3090", "Nvidia", "GLM-5.2-594B", "Qwen3.6-27B", "whisper-large-v3", "c-payne.com"], "alternates": {"html": "https://wpnews.pro/news/jamesob-s-guide-to-running-sota-llms-locally", "markdown": "https://wpnews.pro/news/jamesob-s-guide-to-running-sota-llms-locally.md", "text": "https://wpnews.pro/news/jamesob-s-guide-to-running-sota-llms-locally.txt", "jsonld": "https://wpnews.pro/news/jamesob-s-guide-to-running-sota-llms-locally.jsonld"}}