{"slug": "building-vllm-from-source-a-field-guide-with-all-the-pitfalls", "title": "Building vLLM from Source: A Field Guide (with all the pitfalls)", "summary": "A developer building vLLM from source on an AWS g5 instance with Ubuntu 26.04 and Python 3.14 encountered multiple version-skew, driver, and toolchain issues, including a pitfall where missing nvidia-smi falsely indicated no GPU. The field guide provides a working recipe and explains each step to avoid cryptic build failures.", "body_md": "# Building vLLM from Source: A Field Guide (with all the pitfalls)\n\nA step-by-step field guide to building vLLM from source on Ubuntu 26.04, covering Python 3.14 compatibility, CUDA driver issues, and toolchain pitfalls.\n\nBuilding vLLM from source sounds like it is one `pip install -e .` away. In practice, on a fresh machine with a recent OS and a recent Python, you hit a chain of version-skew, driver, and toolchain issues that each fail with a cryptic message. This post walks through a real end-to-end build on an **AWS g5 instance (NVIDIA A10G)** running **Ubuntu 26.04 + Python 3.14**, documenting every error encountered and the fix.\n\nThe target was a CUDA build of a vLLM fork. The same playbook applies to a stock `vllm-project/vllm` checkout.\n\n## TL;DR — the working recipe\n\n```\n# 1. Confirm you actually have a GPU (see \"Pitfall 1\" — easy to get wrong)\nlspci | grep -i nvidia        # hardware present?\nnvidia-smi                    # driver working?\n\n# 2. Driver (if nvidia-smi fails but lspci shows the GPU)\nsudo apt-get install -y nvidia-driver-575-open nvidia-modprobe dkms\nsudo modprobe -r nouveau && sudo modprobe nvidia   # or reboot\n\n# 3. Virtual env\npython3 -m venv ~/go/venv && source ~/go/venv/bin/activate\npip install --upgrade pip\n\n# 4. 
CUDA torch + a CONSISTENT pip CUDA toolkit (critical: one minor version)\npip install torch==2.11.0 torchvision==0.26.0 torchaudio==2.11.0   # default index = CUDA build\npip install \"cuda-toolkit[nvcc]==13.3.0\" \"nvidia-cuda-runtime==13.3.29\" \\\n            \"nvidia-cuda-nvrtc==13.3.33\" \"nvidia-cublas==13.3.0.5\"\n\n# 5. Assemble CUDA_HOME from the pip layout\nexport CUDA_HOME=$(echo $VIRTUAL_ENV/lib/python3.*/site-packages/nvidia/cu13)   # expand the glob once; a bare assignment would keep the literal '*'\nln -sfn $CUDA_HOME/lib $CUDA_HOME/lib64\n( cd $CUDA_HOME/lib && for f in lib*.so.*; do ln -sf \"$f\" \"${f%%.so.*}.so\"; done )\nmkdir -p $CUDA_HOME/lib/stubs\nln -sf /usr/lib/x86_64-linux-gnu/libcuda.so $CUDA_HOME/lib/stubs/libcuda.so\n\n# 6. Build (scope arch to YOUR GPU — A10G is sm_86)\nexport PATH=$CUDA_HOME/bin:$PATH CUDACXX=$CUDA_HOME/bin/nvcc\nexport VLLM_TARGET_DEVICE=cuda TORCH_CUDA_ARCH_LIST=\"8.6+PTX\"\nexport MAX_JOBS=12 NVCC_THREADS=2\nexport CMAKE_ARGS=\"-DCUDAToolkit_ROOT=$CUDA_HOME -DCMAKE_CUDA_COMPILER=$CUDA_HOME/bin/nvcc\"\npip install -v -e . --no-build-isolation\n```\n\nRead on for *why* each line is there and what breaks without it.\n\n## Prerequisites & how to check them\n\nBefore anything else, take an inventory. Getting this wrong wastes the most time — including the most embarrassing pitfall of all.\n\n| Requirement | How to check | Notes |\n|---|---|---|\n| A GPU (and which one) | `lspci \\| grep -i nvidia` | Determines CUDA vs CPU build. Don’t trust `nvidia-smi` alone — see Pitfall 1. |\n| GPU driver loaded | `nvidia-smi` | If it fails but `lspci` shows a GPU, the driver isn’t installed/loaded. |\n| Compute capability | `nvidia-smi --query-gpu=compute_cap --format=csv` | A10G = `8.6`. You build kernels for this. |\n| CPU flags (CPU build only) | `lscpu \\| grep -oE 'avx512f\\|avx2'` | vLLM CPU wants AVX512; AVX2 works with limited features. |\n| Compiler | `gcc --version` | vLLM recommends gcc 12–13; newer (15) mostly works but watch nvcc host-compiler limits. 
|\n| Python | `python3 --version` | Check the repo’s `requires-python` in `pyproject.toml`. |\n| RAM / cores | `nproc; free -h` | CUDA compiles are RAM-hungry (~2–3 GB per parallel job). |\n| Build tools | `cmake --version; ninja --version` | vLLM needs cmake ≥ 3.26. |\n\n### Pitfall 1: “There’s no GPU here” — when there definitely is\n\nThis one cost us a whole CPU build. The very first check was:\n\n```\nnvidia-smi   # → command not found\n```\n\nConclusion drawn: *no GPU, do a CPU build.* **Wrong.** A missing `nvidia-smi` only means the **driver/userspace tools aren’t installed** — it says nothing about the hardware. The actual hardware check is:\n\n```bash\n$ lspci | grep -i nvidia\n00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)\n```\n\nThe A10G was there the whole time; it just had no driver. **Always check `lspci` (or `/proc/driver/nvidia`, `ls /dev/nvidia*`) before concluding “no GPU.”** On cloud instances that aren’t “Deep Learning AMIs,” a bare GPU with no driver is the norm, not the exception.\n\nLesson: `lspci` detects hardware. `nvidia-smi` detects a *working driver*. They answer different questions. Decide CPU-vs-GPU from `lspci`.\n\n## Step 2: Install and load the NVIDIA driver\n\n`lspci` shows the GPU, `nvidia-smi` is missing → install the driver.\n\n```\nsudo apt-get update\nsudo apt-get install -y dkms build-essential linux-headers-$(uname -r) \\\n                        nvidia-driver-575-open\n```\n\nWe used the **open-kernel** variant (`-open`), which is NVIDIA’s recommendation for Ampere and newer (A10G is Ampere). 
The `575` metapackage pulled driver `580.159.03`.\n\n### Pitfall 2: `modprobe nvidia` → “No such device” (nouveau owns the GPU)\n\n```bash\n$ sudo modprobe nvidia\nmodprobe: ERROR: could not insert 'nvidia': No such device\n\n$ dmesg | grep NVRM\nNVRM: GPU 0000:00:1e.0 is already bound to nouveau.\n```\n\nThe open-source **nouveau** driver grabs the GPU at boot. The NVIDIA module can’t bind while nouveau holds it. Fix — blacklist, unbind, and load:\n\n```\necho -e \"blacklist nouveau\\noptions nouveau modeset=0\" | \\\n    sudo tee /etc/modprobe.d/blacklist-nouveau.conf\necho -n \"0000:00:1e.0\" | sudo tee /sys/bus/pci/drivers/nouveau/unbind\nsudo rmmod nouveau\nsudo modprobe nvidia\nsudo update-initramfs -u    # make the blacklist survive reboots\n```\n\nIf `rmmod nouveau` complains it’s in use (e.g. a display manager), a reboot after the blacklist + initramfs update achieves the same thing cleanly.\n\n### Pitfall 3: `nvidia-smi` works but CUDA returns error 999 (“unknown error”)\n\nThis is the subtle one. After loading the module:\n\n```bash\n$ nvidia-smi          # works, shows the A10G\n$ python -c \"import torch; print(torch.cuda.is_available())\"\nRuntimeError: CUDA unknown error ...    # False\n```\n\nA direct driver-API probe confirmed the runtime was broken even though `nvidia-smi` was fine:\n\n```python\nimport ctypes\nctypes.CDLL(\"libcuda.so.1\").cuInit(0)   # → 999 (CUDA_ERROR_UNKNOWN)\n```\n\nTwo distinct causes, both worth knowing:\n\n**Stale/incorrect UVM device nodes.** `nvidia-smi` uses `/dev/nvidia0` + `/dev/nvidiactl` (major 195). CUDA additionally needs `/dev/nvidia-uvm`. After a manual driver bring-up those nodes can be missing or have the wrong major. 
Recreate them against `/proc/devices`:\n\n```\nsudo modprobe nvidia_uvm\nUVM_MAJOR=$(grep nvidia-uvm /proc/devices | awk '{print $1}')\nsudo rm -f /dev/nvidia-uvm /dev/nvidia-uvm-tools\nsudo mknod -m 666 /dev/nvidia-uvm        c $UVM_MAJOR 0\nsudo mknod -m 666 /dev/nvidia-uvm-tools  c $UVM_MAJOR 1\n```\n\n**`nvidia-modprobe` is not installed.** This setuid helper is what the CUDA runtime shells out to in order to create/initialize device nodes for non-root processes. Without it, raw `cuInit` may pass but **torch’s runtime init throws 999**. This was the actual fix for us:\n\n```\nsudo apt-get install -y nvidia-modprobe\nsudo nvidia-modprobe -c 0 -u\n```\n\nAfter this, `torch.cuda.is_available()` returns `True`. A reboot also installs the proper udev rules and avoids the manual `mknod` dance — but if you can’t reboot, the two steps above get you there.\n\nLesson: `nvidia-smi` working ≠ CUDA working. They use different device nodes. If `cuInit` returns 999, look at `/dev/nvidia-uvm` and make sure `nvidia-modprobe` exists.\n\n## Step 3: The virtual environment\n\nNothing exotic here, but keep it isolated from system Python:\n\n```\npython3 -m venv ~/go/venv\nsource ~/go/venv/bin/activate\npip install --upgrade pip\n```\n\nWe used Python **3.14**. Check the repo supports it:\n\n```\ngrep requires-python pyproject.toml\n# requires-python = \">=3.10,<3.15\"   ✅ 3.14 allowed\n```\n\nIt built fine — `torch==2.11.0` and every dependency had `cp314` wheels. But see Pitfall 6: a *bundled submodule* had its own narrower Python check.\n\n## Step 4: CUDA torch + a *consistent* CUDA toolkit\n\nvLLM compiles `.cu` kernels, so it needs `nvcc` — which PyTorch wheels do **not** bundle (they ship runtime libraries only). 
You have two options:\n\n- Install the full CUDA toolkit to `/usr/local/cuda` via NVIDIA’s apt repo, or\n- Assemble a toolkit entirely from pip wheels.\n\nWe went pip-only (no apt repo for Ubuntu 26.04 yet, and it keeps everything in the venv). First, the CUDA build of torch:\n\n```\npip install torch==2.11.0 torchvision==0.26.0 torchaudio==2.11.0\npython -c \"import torch; print(torch.version.cuda)\"   # → 13.0  (wheel tag: 2.11.0+cu130)\n```\n\nThen nvcc and the dev components via the modern unified meta package:\n\n```\npip install \"cuda-toolkit[nvcc]==13.3.0\"\n```\n\n### Pitfall 4: the `nvidia-cuda-nvcc-cu13` package is a stub\n\nThe old naming is a trap:\n\n```bash\n$ pip install nvidia-cuda-nvcc-cu13\nERROR: ... (from versions: 0.0.0a0, 0.0.1)   # placeholder only!\n```\n\nThe real compiler ships via the **`cuda-toolkit[nvcc]`** extra (which pulls `nvidia-cuda-nvcc`, `nvidia-nvvm`, `nvidia-cuda-crt`). Use the meta package’s extras, not the `*-cu13` standalone names.\n\n### Pitfall 5: CUDA toolkit version skew (three separate failures)\n\nThis was the single biggest time sink. The pip CUDA ecosystem is split across many packages (`nvidia-cuda-nvcc`, `nvidia-nvvm`, `nvidia-cuda-crt`, `nvidia-cuda-cccl`, `nvidia-cuda-runtime`, `nvidia-cublas`, …) and pip will happily install **mismatched minor versions**. Each mismatch fails differently:\n\n**5a. ptxas can’t assemble newer PTX:**\n\n```\nptxas fatal : Unsupported .version 9.3; current version is '9.0'\n```\n\nThe nvcc front-end was 13.3 (emits PTX 9.3) but `ptxas` was 13.0 (≤ PTX 9.0). Align them.\n\n**5b. CMake refuses on nvcc-vs-headers mismatch** (PyTorch’s `cuda.cmake`):\n\n```\nCMake Error: FindCUDA says CUDA version is 13.3 (from nvcc), but the CUDA headers\nsay the version is 13.0.\n```\n\n**5c. 
flashinfer’s bundled cccl refuses at runtime** (its JIT compiler):\n\n```\ncccl/.../cuda_toolkit.h:41: error: \"CUDA compiler and CUDA toolkit headers are\nincompatible, please check your include paths\"\n```\n\nThe cccl check requires the minor version in `CUDART_VERSION` to **exactly equal** nvcc’s minor.\n\n**The fix for all three:** pin the *entire* CUDA userspace to one minor version.\n\n**Why 13.3 and not 13.0 (to match torch’s `cu130`)?** Because CUDA 13.0 headers don’t compile on glibc 2.43 (Ubuntu 26.04):\n\n```\n/usr/include/.../mathcalls.h:206: error: exception specification is incompatible\nwith that of previous function \"rsqrt\"\n```\n\nCUDA 13.1+ headers fixed this. So we align *up* to 13.3. torch built for `cu130` still runs on a 13.3 runtime thanks to CUDA 13 minor-version compatibility (any 13.x toolkit runs on an R580+ driver).\n\n```\npip install \"cuda-toolkit==13.3.0\" \"nvidia-cuda-runtime==13.3.29\" \\\n            \"nvidia-cuda-nvcc==13.3.33\" \"nvidia-nvvm==13.3.33\" \\\n            \"nvidia-cuda-crt==13.3.33\"  \"nvidia-cuda-cccl==13.3.3.3.1\" \\\n            \"nvidia-cuda-nvrtc==13.3.33\" \"nvidia-cublas==13.3.0.5\"\n\n# verify nvcc and headers agree:\nnvcc --version | grep release                                   # 13.3\ngrep CUDART_VERSION $CUDA_HOME/include/cuda_runtime_api.h        # 13030  (= 13.3)\n```\n\n`pip` prints a dependency-conflict *warning* (torch pins `cuda-toolkit==13.0.2`) — it’s cosmetic; torch runs fine via minor-version compat. **But beware:** reinstalling vLLM later re-pulls its `requirements/cuda.txt` and **silently downgrades the runtime back to 13.0**, breaking flashinfer’s JIT again. 
Re-run the 13.3 pins after any reinstall.\n\n## Step 5: Assemble a working `CUDA_HOME`\n\nThe pip wheels lay CUDA out under `.../site-packages/nvidia/cu13/{bin,include,lib}`, which is *almost* what CMake and downstream linkers expect — but missing three things:\n\n```\nexport CUDA_HOME=$VIRTUAL_ENV/lib/python3.14/site-packages/nvidia/cu13\n\n# (a) unversioned dev symlinks: wheels ship libcudart.so.13, linkers want libcudart.so\n( cd $CUDA_HOME/lib && for f in lib*.so.*; do ln -sf \"$f\" \"${f%%.so.*}.so\"; done )\n\n# (b) lib64 alias: some tools (flashinfer JIT) hardcode $CUDA_HOME/lib64\nln -sfn $CUDA_HOME/lib $CUDA_HOME/lib64\n\n# (c) a libcuda stub for driver-API linking (pip ships no stubs/)\nmkdir -p $CUDA_HOME/lib/stubs\nln -sf /usr/lib/x86_64-linux-gnu/libcuda.so $CUDA_HOME/lib/stubs/libcuda.so\n```\n\nSanity check before the big build:\n\n```\ncat > /tmp/t.cu <<'EOF'\n#include <cuda_runtime.h>\n__global__ void k(){}\nint main(){k<<<1,1>>>();return cudaDeviceSynchronize();}\nEOF\n$CUDA_HOME/bin/nvcc -arch=sm_86 -I$CUDA_HOME/include -L$CUDA_HOME/lib -lcudart /tmp/t.cu -o /tmp/t.out\n# Also confirm CMake finds it:\ncmake -P <(echo 'find_package(CUDAToolkit REQUIRED); message(\"CTK ${CUDAToolkit_VERSION}\")') 2>&1\n```\n\n## Step 6: Build vLLM\n\nSet the build environment and go. The most important variable is **`TORCH_CUDA_ARCH_LIST`** — scope it to *your* GPU or you’ll compile every architecture and wait 5–10× longer.\n\n```\ncd ~/go/vllm\nexport PATH=$CUDA_HOME/bin:$PATH\nexport CUDACXX=$CUDA_HOME/bin/nvcc\nexport VLLM_TARGET_DEVICE=cuda\nexport TORCH_CUDA_ARCH_LIST=\"8.6+PTX\"     # A10G = sm_86\nexport MAX_JOBS=12                        # ~2-3 GB RAM per job; tune to your box\nexport NVCC_THREADS=2\nexport CMAKE_ARGS=\"-DCUDAToolkit_ROOT=$CUDA_HOME -DCMAKE_CUDA_COMPILER=$CUDA_HOME/bin/nvcc\"\npip install -v -e . 
--no-build-isolation\n```\n\nA few notes:\n\n- `--no-build-isolation` is required so the build sees the torch/CUDA you installed.\n- Architecture warnings like `DeepGEMM/FlashMLA will not compile: unsupported CUDA architecture 8.6` are **expected** on Ampere — those kernels target Hopper (sm_90+) and are simply skipped.\n- On 16 cores / 62 GB this took ~30–40 min and produced `_C.abi3.so` (~117 MB), `_moe_C.abi3.so`, etc.\n\n### Pitfall 6: a bundled submodule rejects your Python\n\nEven though the top-level `pyproject.toml` allowed Python 3.14, the vendored flash-attention CMake had its own allow-list:\n\n```\nCMake Error at .deps/vllm-flash-attn-src/cmake/utils.cmake:20:\n  Python version (3.14) is not one of the supported versions: 3.9;3.10;3.11;3.12;3.13.\n```\n\nFix — add your version to the macro (vLLM points `FETCHCONTENT_BASE_DIR` at `.deps`, so edits there persist; just don’t `rm -rf .deps` before rebuilding):\n\n```\n# .deps/vllm-flash-attn-src/cmake/utils.cmake\nset(_SUPPORTED_VERSIONS_LIST ${SUPPORTED_VERSIONS} ${ARGN} \"3.14\")\n```\n\n**This patch is not permanent.** flash-attn is pulled via CMake FetchContent at a pinned `GIT_TAG`. The moment you `git pull`/update vLLM and that tag changes (or you `rm -rf .deps`), FetchContent re-clones a *fresh* copy and your edit is gone — the 3.14 check fails again at the next configure. 
Re-apply the one-liner after any update that bumps the flash-attn tag.\n\n### Pitfall 7: dependency-resolver deadlock (`ResolutionImpossible`)\n\nOn a recent `main`, `pip install -e .` can die *before compiling anything* with:\n\n```\nERROR: Cannot install cuda-tile[tileiras]==1.4.0, cuda-toolkit==13.0.2 and vllm\nbecause these package versions have conflicting dependencies.\n  torch 2.11.0 depends on cuda-toolkit==13.0.2\n  cuda-tile[tileiras] 1.4.0 depends on cuda-toolkit>=13.2,<13.4\nERROR: ResolutionImpossible\n```\n\nTwo of vLLM’s own dependencies pin **incompatible** CUDA-toolkit ranges (torch wants exactly 13.0.2; a newer kernel package wants ≥13.2). pip’s strict resolver refuses to proceed. This is an upstream packaging conflict, not something you caused — and it’s exactly why we aligned the toolkit to **13.3** earlier (it satisfies the ≥13.2 side, and torch runs fine against it via minor-version compat).\n\nThe fix is to build the package **without re-resolving the whole graph**, since you’ve already curated a working CUDA stack:\n\n```\npip install -v -e . --no-build-isolation --no-deps\n```\n\n`--no-deps` compiles and installs vLLM using the environment you’ve assembled, instead of letting pip try (and fail) to reconcile every transitive pin. Afterwards, install any genuinely-missing runtime deps individually and re-run the smoke test. (Upstream’s own docs use `uv`, whose override/resolution model sidesteps this; with plain pip, `--no-deps` is the escape hatch.)\n\n### Pitfall 8: `MAX_JOBS` and parallelism\n\n`MAX_JOBS` controls ninja’s parallel compile jobs. CUDA compiles use ~2–3 GB each, so `MAX_JOBS × 3 GB` should fit in RAM. On 62 GB you can run 16; we used 12 as a safe default. You’ll notice ninja drops to fewer jobs near the end (`[267/340]`) — that’s dependency ordering on the final heavy template units and the `.so` link, not a misconfiguration. 
`NVCC_THREADS` parallelizes within a single nvcc invocation.\n\n## Step 7: Verify — and the runtime-only pitfalls\n\nA successful build does **not** mean inference works. vLLM’s runtime JIT-compiles more kernels on first use, which surfaces a fresh set of issues.\n\n```python\nfrom vllm import LLM, SamplingParams\nllm = LLM(model=\"facebook/opt-125m\", enforce_eager=True,\n          gpu_memory_utilization=0.5, max_model_len=512)\nprint(llm.generate([\"The capital of France is\"],\n                   SamplingParams(temperature=0, max_tokens=20))[0].outputs[0].text)\n```\n\n### Pitfall 9: `Could not find nvcc and default cuda_home='/usr/local/cuda'`\n\n**flashinfer JIT-compiles sampling kernels at runtime** and needs `nvcc` — but at runtime nobody set `CUDA_HOME`, so it falls back to the nonexistent `/usr/local/cuda`. Because our toolkit lives in the venv, export it (and bake it into `activate` so it’s always present):\n\n```\ncat >> $VIRTUAL_ENV/bin/activate <<'EOF'\nexport CUDA_HOME=\"$VIRTUAL_ENV/lib/python3.14/site-packages/nvidia/cu13\"\nexport PATH=\"$CUDA_HOME/bin:$PATH\"\nEOF\n```\n\nThis is also where Pitfall 5c (cccl version check) and a missing `lib64` symlink (`cannot find -lcudart`) bite — they’re runtime-JIT failures, not build failures, so they only appear here. With the 13.3 alignment + the `lib64` symlink in place, the JIT compile succeeds and you get:\n\n```\nPROMPT: 'The capital of France is'\nOUTPUT: ' the capital of the French Republic...'\n```\n\n🎉\n\n## Step 8: Run the GPU test suite\n\nA `generate()` proves the happy path; the kernel tests exercise the build more broadly. The suite that most directly exercises what you just compiled is `tests/kernels/`. 
Run it with `CUDA_HOME` set and `$CUDA_HOME/bin` on `PATH` (the tests JIT-compile too):\n\n```\nexport CUDA_HOME=\"$VIRTUAL_ENV/lib/python3.14/site-packages/nvidia/cu13\"\nexport PATH=\"$CUDA_HOME/bin:$PATH\"\npython -m pytest tests/kernels/core tests/kernels/attention -q\n```\n\nOn an A10G a focused subset (activation, layernorm, rotary/positional encoding, paged attention, cache) runs in ~1 hr and lands at **2402 passed, 583 skipped, 36 failed**. The 583 skips are arch-gated kernels (Hopper/Blackwell sm_90+) correctly opting out. The 36 failures are **all the same issue** — see Pitfall 10.\n\n### Pitfall 10: FP8 KV-cache tests *fail* (not skip) on SM < 89\n\nEvery one of those 36 failures is `test_reshape_and_cache_flash[...fp8...]` with:\n\n```\nFP8 KV cache needs native fp8e4nv (SM89+). Use --kv-cache-dtype bfloat16 ...\n```\n\nThe A10G is **sm_86**; native FP8 (`fp8e4nv`) needs **sm_89+** (Ada/Hopper). This is a hardware limit, not a broken build — but unlike the cleanly arch-gated kernels, this Triton path `assert`s on unsupported hardware instead of `skip`ping, so it counts as a failure. 
Deselect the FP8 cases to get a fully green run:\n\n```\npython -m pytest tests/kernels/attention/test_cache.py -k \"not fp8\" -q\n# 335 passed, 403 skipped, 477 deselected, 0 failed\n```\n\nTakeaway: on pre-Ada GPUs, treat FP8 KV-cache test failures as expected, and gate them out with `-k \"not fp8\"` rather than chasing them.\n\n## Appendix: every error → one-line fix\n\n| Error | Root cause | Fix |\n|---|---|---|\n| `nvidia-smi: command not found` (assumed no GPU) | driver not installed; hardware was there | `lspci \\| grep nvidia` to detect hardware |\n| `modprobe nvidia: No such device` | nouveau owns the GPU | blacklist + unbind + `rmmod nouveau` |\n| `CUDA unknown error` / `cuInit → 999` | missing/stale UVM nodes; no `nvidia-modprobe` | `apt install nvidia-modprobe`; recreate `/dev/nvidia-uvm` |\n| `nvidia-cuda-nvcc-cu13` has no real version | wrong package name | use `cuda-toolkit[nvcc]` |\n| `ptxas Unsupported .version 9.3` | nvcc/ptxas minor mismatch | pin all CUDA pkgs to one minor |\n| CMake: `nvcc says 13.3 but headers say 13.0` | runtime headers ≠ nvcc | align headers to nvcc version |\n| `mathcalls.h: rsqrt ... incompatible` | CUDA 13.0 headers vs glibc 2.43 | use CUDA ≥ 13.1 headers |\n| flash-attn CMake: Python 3.14 not supported | submodule allow-list | patch `utils.cmake` (re-apply after any update that bumps its tag) |\n| `ResolutionImpossible` (cuda-toolkit 13.0.2 vs ≥13.2) | conflicting CUDA pins across vLLM deps | build with `pip install -e . --no-deps` |\n| `cccl: compiler and toolkit headers incompatible` | runtime downgraded after vLLM reinstall | re-pin CUDA runtime to nvcc’s minor |\n| `cannot find -lcudart` (JIT link) | wheels use `lib/`, tool wants `lib64/` | `ln -sfn $CUDA_HOME/lib $CUDA_HOME/lib64` |\n| `Could not find nvcc ... 
/usr/local/cuda` | `CUDA_HOME` unset at runtime | export `CUDA_HOME` (bake into `activate`) |\n| `FP8 KV cache needs native fp8e4nv (SM89+)` (test fails) | A10G is sm_86; FP8 path asserts instead of skipping | not a build bug — deselect with `-k \"not fp8\"` |\n\n## Updating an existing checkout\n\nPulling a newer vLLM isn’t just `git pull` — an editable source build has moving parts that a pull invalidates. The sequence that works:\n\n```\ngit fetch upstream && git reset --hard upstream/main   # or your target commit\nrm -rf build .deps && find vllm -name '*.abi3.so' -delete   # force a clean rebuild\n# re-apply the flash-attn 3.14 patch (the tag changed → .deps was re-fetched)\npip install -v -e . --no-build-isolation --no-deps     # --no-deps dodges resolver conflicts\n# re-pin the CUDA toolkit to 13.3 if anything got downgraded, then re-run the smoke test\n```\n\nBefore resetting (but after `git fetch upstream`, so the remote ref is current), check the gap with `git diff --name-only HEAD..upstream/main | grep -E '\\.cu|CMakeLists|requirements/'` — if native/build files changed (they usually have), budget for a full recompile (~30–40 min) and re-verification. Also confirm the `torch==` pin and `requires-python` in `pyproject.toml` didn’t move; if torch’s version changed, you’re re-doing the whole CUDA/toolkit alignment, not just a rebuild.\n\n## Key takeaways\n\n- **Detect hardware with `lspci`, not `nvidia-smi`.** Don’t build for CPU because a tool is missing.\n- **`nvidia-smi` working ≠ CUDA working.** UVM nodes + `nvidia-modprobe` matter.\n- **Pin the entire CUDA pip toolkit to one minor version.** Skew fails three different ways at three different stages.\n- **Pick the CUDA minor that’s compatible with your glibc/compiler**, then rely on CUDA minor-version compatibility for the driver/torch.\n- **A green build isn’t done** — runtime JIT (flashinfer) needs `CUDA_HOME` and a couple of symlinks. 
Verify with a real `generate()`.\n- **Scope `TORCH_CUDA_ARCH_LIST` to your GPU** to keep build times sane.\n- **Some test failures are hardware limits, not build bugs.** On pre-Ada GPUs the FP8 KV-cache tests `assert` instead of `skip` — deselect them with `-k \"not fp8\"`.\n\n*Disclaimer: This article was generated using the Gemini 3.1 Pro model.*\n\n[CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) by the author.", "url": "https://wpnews.pro/news/building-vllm-from-source-a-field-guide-with-all-the-pitfalls", "canonical_source": "https://hiraditya.github.io/posts/building-vllm-from-source/", "published_at": "2026-06-19 15:00:00+00:00", "updated_at": "2026-06-19 22:08:31.101919+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "developer-tools"], "entities": ["vLLM", "NVIDIA", "AWS", "Ubuntu", "Python", "CUDA", "A10G"], "alternates": {"html": "https://wpnews.pro/news/building-vllm-from-source-a-field-guide-with-all-the-pitfalls", "markdown": "https://wpnews.pro/news/building-vllm-from-source-a-field-guide-with-all-the-pitfalls.md", "text": "https://wpnews.pro/news/building-vllm-from-source-a-field-guide-with-all-the-pitfalls.txt", "jsonld": "https://wpnews.pro/news/building-vllm-from-source-a-field-guide-with-all-the-pitfalls.jsonld"}}