Building vLLM from Source: A Field Guide (with all the pitfalls)

A developer building vLLM from source on an AWS g5 instance with Ubuntu 26.04 and Python 3.14 encountered multiple version-skew, driver, and toolchain issues, including a pitfall where missing nvidia-smi falsely indicated no GPU. The field guide provides a working recipe and explains each step to avoid cryptic build failures.

A step-by-step field guide to building vLLM from source on Ubuntu 26.04, covering Python 3.14 compatibility, CUDA driver issues, and toolchain pitfalls.

Building vLLM from source sounds like it is one `pip install -e .` away. In practice, on a fresh machine with a recent OS and a recent Python, you hit a chain of version-skew, driver, and toolchain issues that each fail with a cryptic message.

This post walks through a real end-to-end build on an AWS g5 instance (NVIDIA A10G) running Ubuntu 26.04 + Python 3.14, documenting every error encountered and its fix. The target was a CUDA build of a vLLM fork. The same playbook applies to a stock `vllm-project/vllm` checkout.

## TL;DR: the working recipe

```bash
# 1. Confirm you actually have a GPU (see "Pitfall 1"; easy to get wrong)
lspci | grep -i nvidia          # hardware present?
nvidia-smi                      # driver working?

# 2. Driver (if nvidia-smi fails but lspci shows the GPU)
sudo apt-get install -y nvidia-driver-575-open nvidia-modprobe dkms
sudo modprobe -r nouveau && sudo modprobe nvidia   # or reboot

# 3. Virtual env
python3 -m venv ~/go/venv && source ~/go/venv/bin/activate
pip install --upgrade pip

# 4. CUDA torch + a CONSISTENT pip CUDA toolkit (critical: one minor version)
pip install torch==2.11.0 torchvision==0.26.0 torchaudio==2.11.0   # default index = CUDA build
pip install "cuda-toolkit[nvcc]==13.3.0" "nvidia-cuda-runtime==13.3.29" \
            "nvidia-cuda-nvrtc==13.3.33" "nvidia-cublas==13.3.0.5"

# 5. Assemble CUDA_HOME from the pip layout
#    (the glob goes through echo because a plain assignment does not expand it)
export CUDA_HOME=$(echo "$VIRTUAL_ENV"/lib/python3.*/site-packages/nvidia/cu13)
ln -sfn $CUDA_HOME/lib $CUDA_HOME/lib64
# subshell, so the cd does not leak into step 6
(cd $CUDA_HOME/lib && for f in lib*.so.*; do ln -sf "$f" "${f%%.so.*}.so"; done)
mkdir -p $CUDA_HOME/lib/stubs
ln -sf /usr/lib/x86_64-linux-gnu/libcuda.so $CUDA_HOME/lib/stubs/libcuda.so

# 6. Build (scope arch to YOUR GPU; A10G is sm_86)
export PATH=$CUDA_HOME/bin:$PATH CUDACXX=$CUDA_HOME/bin/nvcc
export VLLM_TARGET_DEVICE=cuda TORCH_CUDA_ARCH_LIST="8.6+PTX"
export MAX_JOBS=12 NVCC_THREADS=2
export CMAKE_ARGS="-DCUDAToolkit_ROOT=$CUDA_HOME -DCMAKE_CUDA_COMPILER=$CUDA_HOME/bin/nvcc"
pip install -v -e . --no-build-isolation
```

Read on for why each line is there and what breaks without it.

## Prerequisites & how to check them

Before anything else, take an inventory. Getting this wrong wastes the most time, including the most embarrassing pitfall of all.

| Requirement | How to check | Notes |
|---|---|---|
| A GPU (and which one) | `lspci \| grep -i nvidia` | Determines CUDA vs CPU build. Don't trust `nvidia-smi` alone; see Pitfall 1. |
| GPU driver loaded | `nvidia-smi` | If it fails but `lspci` shows a GPU, the driver isn't installed/loaded. |
| Compute capability | `nvidia-smi --query-gpu=compute_cap --format=csv` | A10G = 8.6. You build kernels for this. |
| CPU flags (CPU build only) | `lscpu \| grep -oE 'avx512f\|avx2'` | vLLM CPU wants AVX512; AVX2 works with limited features. |
| Compiler | `gcc --version` | vLLM recommends gcc 12–13; newer (15) mostly works but watch nvcc host-compiler limits. |
| Python | `python3 --version` | Check the repo's `requires-python` in `pyproject.toml`. |
| RAM / cores | `nproc; free -h` | CUDA compiles are RAM-hungry (~2–3 GB per parallel job). |
| Build tools | `cmake --version; ninja --version` | vLLM needs cmake ≥ 3.26. |

## Pitfall 1: "There's no GPU here" (when there definitely is)

This one cost us a whole CPU build. The very first check was:

```bash
nvidia-smi   # → command not found
```

Conclusion drawn: no GPU, do a CPU build. Wrong. A missing `nvidia-smi` only means the driver/userspace tools aren't installed; it says nothing about the hardware. The actual hardware check is:

```bash
$ lspci | grep -i nvidia
00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
```

The A10G was there the whole time; it just had no driver.
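The distinction between "hardware present" and "driver working" can be sketched as a small Python helper. This is only an illustration: the function names and the three state labels are made up for this example, and `probe()` assumes `lspci` is available on the machine.

```python
import shutil
import subprocess


def classify_gpu_state(lspci_output: str, driver_ok: bool) -> str:
    """Combine the hardware check (lspci) with the driver check (nvidia-smi).

    The state labels are hypothetical names chosen for this sketch.
    """
    has_hardware = any("nvidia" in line.lower() for line in lspci_output.splitlines())
    if not has_hardware:
        return "no-gpu"
    return "gpu-ready" if driver_ok else "gpu-without-driver"


def probe() -> str:
    """Run both checks on the current machine."""
    lspci = shutil.which("lspci")
    if lspci is None:
        return "unknown (lspci not installed)"
    listing = subprocess.run([lspci], capture_output=True, text=True, check=False).stdout
    smi = shutil.which("nvidia-smi")
    # nvidia-smi must both exist and exit cleanly to count as a working driver.
    driver_ok = smi is not None and subprocess.run(
        [smi], capture_output=True, check=False
    ).returncode == 0
    return classify_gpu_state(listing, driver_ok)
```

Calling `probe()` on the instance described here, before the driver install, would be expected to land in the "GPU present, no driver" state rather than "no GPU", which is the case that calls for a CUDA build plus a driver install.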
Always check `lspci` (or `/proc/driver/nvidia`, `ls /dev/nvidia*`) before concluding "no GPU." On cloud instances that aren't "Deep Learning AMIs," a bare GPU with no driver is the norm, not the exception.

Lesson: `lspci` detects hardware. `nvidia-smi` detects a working driver. They answer different questions. Decide CPU-vs-GPU from `lspci`.

## Step 2: Install and load the NVIDIA driver

`lspci` shows the GPU, `nvidia-smi` is missing → install the driver.

```bash
sudo apt-get update
sudo apt-get install -y dkms build-essential linux-headers-$(uname -r) \
    nvidia-driver-575-open
```

We used the open-kernel variant (`-open`), which is NVIDIA's recommendation for Ampere and newer (A10G is Ampere). The 575 metapackage pulled driver 580.159.03.

## Pitfall 2: `modprobe nvidia` → "No such device" (nouveau owns the GPU)

```bash
$ sudo modprobe nvidia
modprobe: ERROR: could not insert 'nvidia': No such device
$ dmesg | grep NVRM
NVRM: GPU 0000:00:1e.0 is already bound to nouveau.
```

The open-source nouveau driver grabs the GPU at boot. The NVIDIA module can't bind while nouveau holds it. Fix: blacklist, unbind, and load:

```bash
echo -e "blacklist nouveau\noptions nouveau modeset=0" | \
  sudo tee /etc/modprobe.d/blacklist-nouveau.conf
echo -n "0000:00:1e.0" | sudo tee /sys/bus/pci/drivers/nouveau/unbind
sudo rmmod nouveau
sudo modprobe nvidia
sudo update-initramfs -u      # make the blacklist survive reboots
```

If `rmmod nouveau` complains it's in use (e.g. a display manager), a reboot after the blacklist + initramfs update achieves the same thing cleanly.

## Pitfall 3: `nvidia-smi` works but CUDA returns error 999 ("unknown error")

This is the subtle one. After loading the module:

```bash
$ nvidia-smi                       # works, shows the A10G
$ python -c "import torch; print(torch.cuda.is_available())"
RuntimeError: CUDA unknown error ...
False
```

A direct driver-API probe confirmed the runtime was broken even though `nvidia-smi` was fine:

```python
import ctypes
ctypes.CDLL("libcuda.so.1").cuInit(0)   # → 999 (CUDA_ERROR_UNKNOWN)
```

Two distinct causes, both worth knowing:

1. Stale/incorrect UVM device nodes. `nvidia-smi` uses `/dev/nvidia0` + `/dev/nvidiactl` (major 195). CUDA additionally needs `/dev/nvidia-uvm`. After a manual driver bring-up those nodes can be missing or have the wrong major. Recreate them against `/proc/devices`:

   ```bash
   sudo modprobe nvidia_uvm
   UVM_MAJOR=$(grep nvidia-uvm /proc/devices | awk '{print $1}')
   sudo rm -f /dev/nvidia-uvm /dev/nvidia-uvm-tools
   sudo mknod -m 666 /dev/nvidia-uvm c $UVM_MAJOR 0
   sudo mknod -m 666 /dev/nvidia-uvm-tools c $UVM_MAJOR 1
   ```

2. `nvidia-modprobe` is not installed. This setuid helper is what the CUDA runtime shells out to in order to create/initialize device nodes for non-root processes. Without it, raw `cuInit` may pass but torch's runtime init throws 999. This was the actual fix for us:

   ```bash
   sudo apt-get install -y nvidia-modprobe
   sudo nvidia-modprobe -c 0 -u
   ```

After this: `torch.cuda.is_available()` → `True`. A reboot also installs the proper udev rules and avoids the manual `mknod` dance, but if you can't reboot, the two steps above get you there.

Lesson: `nvidia-smi` working ≠ CUDA working. They use different device nodes. If `cuInit` returns 999, look at `/dev/nvidia-uvm` and make sure `nvidia-modprobe` exists.

## Step 3: The virtual environment

Nothing exotic here, but keep it isolated from system Python:

```bash
python3 -m venv ~/go/venv
source ~/go/venv/bin/activate
pip install --upgrade pip
```

We used Python 3.14. Check the repo supports it:

```bash
grep requires-python pyproject.toml
# requires-python = ">=3.10,<3.15"   ✅ 3.14 allowed
```

It built fine: `torch==2.11.0` and every dependency had cp314 wheels. But see Pitfall 6: a bundled submodule had its own narrower Python check.
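The `requires-python` check can also be scripted rather than eyeballed. The sketch below is a deliberately minimal comparison, assuming the specifier only uses simple comma-separated clauses such as `>=3.10,<3.15`; `python_allowed` is a hypothetical helper name, and for real projects the `packaging` library's specifier handling is the more robust choice.

```python
import re

# Comparison operators supported by this simplified sketch.
_OPS = {
    ">=": lambda a, b: a >= b,
    "<=": lambda a, b: a <= b,
    "==": lambda a, b: a == b,
    "!=": lambda a, b: a != b,
    ">": lambda a, b: a > b,
    "<": lambda a, b: a < b,
}


def _parse(version: str) -> tuple[int, ...]:
    """Turn '3.10' into (3, 10) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in version.strip().split("."))


def python_allowed(version: tuple[int, int], requires_python: str) -> bool:
    """Return True if (major, minor) satisfies every clause of the specifier."""
    for clause in requires_python.split(","):
        match = re.fullmatch(r"\s*(>=|<=|==|!=|>|<)\s*([\d.]+)\s*", clause)
        if match is None:
            # '~=' and wildcard clauses are intentionally out of scope here.
            raise ValueError(f"unsupported clause: {clause!r}")
        op, bound = match.groups()
        if not _OPS[op](version, _parse(bound)):
            return False
    return True
```

With the specifier quoted in this step, `python_allowed((3, 14), ">=3.10,<3.15")` is true while `(3, 15)` is rejected. Comparing tuples rather than strings matters because `"3.9" > "3.10"` as text.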
## Step 4: CUDA torch + a consistent CUDA toolkit

vLLM compiles `.cu` kernels, so it needs `nvcc`, which PyTorch wheels do not bundle (they ship runtime libraries only). You have two options:

- Install the full CUDA toolkit to `/usr/local/cuda` via NVIDIA's apt repo, or
- Assemble a toolkit entirely from pip wheels.

We went pip-only (no apt repo for Ubuntu 26.04 yet, and it keeps everything in the venv). First, the CUDA build of torch:

```bash
pip install torch==2.11.0 torchvision==0.26.0 torchaudio==2.11.0
python -c "import torch; print(torch.version.cuda)"   # → 13.0  (wheel tag: 2.11.0+cu130)
```

Then `nvcc` and the dev components via the modern unified meta package:

```bash
pip install "cuda-toolkit[nvcc]==13.3.0"
```

## Pitfall 4: the `nvidia-cuda-nvcc-cu13` package is a stub

The old naming is a trap:

```bash
$ pip install nvidia-cuda-nvcc-cu13
ERROR: ... (from versions: 0.0.0a0, 0.0.1)   # placeholder only
```

The real compiler ships via the `cuda-toolkit[nvcc]` extra (which pulls `nvidia-cuda-nvcc`, `nvidia-nvvm`, `nvidia-cuda-crt`). Use the meta package's extras, not the `-cu13` standalone names.

## Pitfall 5: CUDA toolkit version skew (three separate failures)

This was the single biggest time sink. The pip CUDA ecosystem is split across many packages (`nvidia-cuda-nvcc`, `nvidia-nvvm`, `nvidia-cuda-crt`, `nvidia-cuda-cccl`, `nvidia-cuda-runtime`, `nvidia-cublas`, …) and pip will happily install mismatched minor versions. Each mismatch fails differently:

5a. `ptxas` can't assemble newer PTX:

```
ptxas fatal : Unsupported .version 9.3; current version is '9.0'
```

The nvcc front-end was 13.3 (emits PTX 9.3) but `ptxas` was 13.0 (≤ PTX 9.0). → align them.

5b. CMake refuses on nvcc-vs-headers mismatch (PyTorch's `cuda.cmake`):

```
CMake Error: FindCUDA says CUDA version is 13.3 (from nvcc), but the CUDA
headers say the version is 13.0.
```

5c.
flashinfer's bundled cccl refuses at runtime (in its JIT compiler):

```
cccl/.../cuda_toolkit.h:41: error: "CUDA compiler and CUDA toolkit headers
are incompatible, please check your include paths"
```

The cccl check requires `CUDART_VERSION`'s minor to exactly equal nvcc's minor.

The fix for all three: pin the entire CUDA userspace to one minor version.

Why 13.3 and not 13.0 to match torch's cu130? Because CUDA 13.0 headers don't compile on glibc 2.43 (Ubuntu 26.04):

```
/usr/include/.../mathcalls.h:206: error: exception specification is
incompatible with that of previous function "rsqrt"
```

CUDA 13.1+ headers fixed this. So we align up to 13.3. torch built for cu130 still runs on a 13.3 runtime thanks to CUDA 13 minor-version compatibility (any 13.x toolkit runs on an R580+ driver).

```bash
pip install "cuda-toolkit==13.3.0" "nvidia-cuda-runtime==13.3.29" \
  "nvidia-cuda-nvcc==13.3.33" "nvidia-nvvm==13.3.33" \
  "nvidia-cuda-crt==13.3.33" "nvidia-cuda-cccl==13.3.3.3.1" \
  "nvidia-cuda-nvrtc==13.3.33" "nvidia-cublas==13.3.0.5"
# verify nvcc and headers agree:
nvcc --version | grep release                                  # 13.3
grep CUDART_VERSION $CUDA_HOME/include/cuda_runtime_api.h      # 13030 = 13.3
```

pip prints a dependency-conflict warning (torch pins `cuda-toolkit==13.0.2`); it's cosmetic, and torch runs fine via minor-version compat. But beware: reinstalling vLLM later re-pulls its `requirements/cuda.txt` and silently downgrades the runtime back to 13.0, breaking flashinfer's JIT again. Re-run the 13.3 pins after any reinstall.

## Step 5: Assemble a working CUDA_HOME

The pip wheels lay CUDA out under `.../site-packages/nvidia/cu13/{bin,include,lib}`, which is almost what CMake and downstream linkers expect, but missing three things:

```bash
export CUDA_HOME=$VIRTUAL_ENV/lib/python3.14/site-packages/nvidia/cu13

# (a) unversioned dev symlinks: wheels ship libcudart.so.13, linkers want libcudart.so
cd $CUDA_HOME/lib && for f in lib*.so.*; do ln -sf "$f" "${f%%.so.*}.so"; done

# (b) lib64 alias: some tools (flashinfer JIT) hardcode $CUDA_HOME/lib64
ln -sfn $CUDA_HOME/lib $CUDA_HOME/lib64

# (c) a libcuda stub for driver-API linking (pip ships no stubs/)
mkdir -p $CUDA_HOME/lib/stubs
ln -sf /usr/lib/x86_64-linux-gnu/libcuda.so $CUDA_HOME/lib/stubs/libcuda.so
```

Sanity check before the big build:

```bash
cat > /tmp/t.cu <<'EOF'
#include <cuda_runtime.h>
__global__ void k(){}
int main(){k<<<1,1>>>();return cudaDeviceSynchronize();}
EOF
$CUDA_HOME/bin/nvcc -arch=sm_86 -I$CUDA_HOME/include -L$CUDA_HOME/lib -lcudart /tmp/t.cu -o /tmp/t.out
# Also confirm CMake finds it:
cmake -P <(echo 'find_package(CUDAToolkit REQUIRED); message("CTK ${CUDAToolkit_VERSION}")') 2>&1
```

## Step 6: Build vLLM

Set the build environment and go. The most important variable is `TORCH_CUDA_ARCH_LIST`: scope it to your GPU or you'll compile every architecture and wait 5–10× longer.

```bash
cd ~/go/vllm
export PATH=$CUDA_HOME/bin:$PATH
export CUDACXX=$CUDA_HOME/bin/nvcc
export VLLM_TARGET_DEVICE=cuda
export TORCH_CUDA_ARCH_LIST="8.6+PTX"     # A10G = sm_86
export MAX_JOBS=12                         # ~2-3 GB RAM per job; tune to your box
export NVCC_THREADS=2
export CMAKE_ARGS="-DCUDAToolkit_ROOT=$CUDA_HOME -DCMAKE_CUDA_COMPILER=$CUDA_HOME/bin/nvcc"
pip install -v -e . --no-build-isolation
```

A few notes:

- `--no-build-isolation` is required so the build sees the torch/CUDA you installed.
- Arch warnings like `DeepGEMM/FlashMLA will not compile: unsupported CUDA architecture 8.6` are expected on Ampere; those kernels target Hopper (sm_90+) and are simply skipped.
- On 16 cores / 62 GB this took ~30–40 min and produced `_C.abi3.so` (~117 MB), `_moe_C.abi3.so`, etc.

## Pitfall 6: a bundled submodule rejects your Python

Even though the top-level `pyproject.toml` allowed Python 3.14, the vendored flash-attention CMake had its own allow-list:

```
CMake Error at .deps/vllm-flash-attn-src/cmake/utils.cmake:20:
  Python version 3.14 is not one of the supported versions: 3.9;3.10;3.11;3.12;3.13.
```
Fix: add your version to the macro (vLLM points `FETCHCONTENT_BASE_DIR` at `.deps`, so edits there persist; just don't `rm -rf .deps` before rebuilding):

```cmake
# .deps/vllm-flash-attn-src/cmake/utils.cmake
set(_SUPPORTED_VERSIONS_LIST ${SUPPORTED_VERSIONS} ${ARGN} "3.14")
```

This patch is not permanent. flash-attn is pulled via CMake FetchContent at a pinned `GIT_TAG`. The moment you `git pull`/update vLLM and that tag changes (or you `rm -rf .deps`), FetchContent re-clones a fresh copy and your edit is gone; the 3.14 check fails again at the next configure. Re-apply the one-liner after any update that bumps the flash-attn tag.

## Pitfall 7: dependency-resolver deadlock (`ResolutionImpossible`)

On a recent `main`, `pip install -e .` can die before compiling anything with:

```
ERROR: Cannot install cuda-tile[tileiras]==1.4.0, cuda-toolkit==13.0.2 and vllm
because these package versions have conflicting dependencies.
  torch 2.11.0 depends on cuda-toolkit==13.0.2
  cuda-tile[tileiras] 1.4.0 depends on cuda-toolkit>=13.2,<13.4
ERROR: ResolutionImpossible
```

Two of vLLM's own dependencies pin incompatible CUDA-toolkit ranges (torch wants exactly 13.0.2; a newer kernel package wants ≥13.2). pip's strict resolver refuses to proceed. This is an upstream packaging conflict, not something you caused, and it's exactly why we aligned the toolkit to 13.3 earlier (it satisfies the ≥13.2 side, and torch runs fine against it via minor-version compat).

The fix is to build the package without re-resolving the whole graph, since you've already curated a working CUDA stack:

```bash
pip install -v -e . --no-build-isolation --no-deps
```

`--no-deps` compiles and installs vLLM using the environment you've assembled, instead of letting pip try (and fail) to reconcile every transitive pin. Afterwards, install any genuinely-missing runtime deps individually and re-run the smoke test. Upstream's own docs use `uv`, whose override/resolution model sidesteps this; with plain pip, `--no-deps` is the escape hatch.
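Since `--no-deps` means pip no longer checks anything for you, it can help to confirm that the hand-curated CUDA stack really sits on one minor version. The following is a rough sketch, not an official tool: the package list is copied from the pins in this guide, the helper names are invented for the example, and `decode_cudart_version` relies on the usual `major * 1000 + minor * 10` encoding of `CUDART_VERSION`.

```python
from collections import defaultdict
from importlib import metadata

# Package names taken from the pip pins used in this guide (adjust to your setup).
CUDA_PACKAGES = [
    "nvidia-cuda-runtime", "nvidia-cuda-nvcc", "nvidia-nvvm",
    "nvidia-cuda-crt", "nvidia-cuda-cccl", "nvidia-cuda-nvrtc", "nvidia-cublas",
]


def installed_versions(names=CUDA_PACKAGES) -> dict[str, str]:
    """Look up installed versions; packages that are not installed are skipped."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return found


def group_by_minor(versions: dict[str, str]) -> dict[str, list[str]]:
    """Group package names by their 'major.minor' prefix."""
    groups = defaultdict(list)
    for name, version in versions.items():
        major, minor = version.split(".")[:2]
        groups[f"{major}.{minor}"].append(name)
    return {key: sorted(names) for key, names in groups.items()}


def decode_cudart_version(value: int) -> str:
    """Decode a CUDART_VERSION integer such as 13030 into '13.3'."""
    return f"{value // 1000}.{(value % 1000) // 10}"
```

If `group_by_minor(installed_versions())` returns more than one key, some package has drifted (for example after a vLLM reinstall) and the 13.3 pins need to be re-applied before the next build or JIT compile.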
## Pitfall 8: MAX_JOBS and parallelism

`MAX_JOBS` controls ninja's parallel compile jobs. CUDA compiles use ~2–3 GB each, so `MAX_JOBS × 3 GB` should fit in RAM. On 62 GB you can run 16; we used 12 as a safe default. You'll notice ninja drops to fewer jobs near the end (`[267/340]`); that's dependency ordering on the final heavy template units and the `.so` link, not a misconfiguration. `NVCC_THREADS` parallelizes within a single nvcc invocation.

## Step 7: Verify, and the runtime-only pitfalls

A successful build does not mean inference works. vLLM's runtime JIT-compiles more kernels on first use, which surfaces a fresh set of issues.

```python
from vllm import LLM, SamplingParams
llm = LLM(model="facebook/opt-125m", enforce_eager=True,
          gpu_memory_utilization=0.5, max_model_len=512)
print(llm.generate(["The capital of France is"],
                   SamplingParams(temperature=0, max_tokens=20))[0].outputs[0].text)
```

## Pitfall 9: `Could not find nvcc and default cuda_home='/usr/local/cuda'`

flashinfer JIT-compiles sampling kernels at runtime and needs `nvcc`, but at runtime nobody set `CUDA_HOME`, so it falls back to the nonexistent `/usr/local/cuda`. Because our toolkit lives in the venv, export it (and bake it into `activate` so it's always present):

```bash
cat >> $VIRTUAL_ENV/bin/activate <<'EOF'
export CUDA_HOME="$VIRTUAL_ENV/lib/python3.14/site-packages/nvidia/cu13"
export PATH="$CUDA_HOME/bin:$PATH"
EOF
```

This is also where Pitfall 5c (cccl version check) and the `lib64` symlink (`cannot find -lcudart`) bite: they're runtime-JIT failures, not build failures, so they only appear here. With the 13.3 alignment + the `lib64` symlink in place, the JIT compile succeeds and you get:

```
PROMPT: 'The capital of France is'
OUTPUT: ' the capital of the French Republic...'   🎉
```

## Step 8: Run the GPU test suite

A `generate` proves the happy path; the kernel tests prove the build broadly. The suite that most directly exercises what you just compiled is `tests/kernels/`.
Run it with `CUDA_HOME` on `PATH` (the tests JIT-compile too):

```bash
export CUDA_HOME="$VIRTUAL_ENV/lib/python3.14/site-packages/nvidia/cu13"
export PATH="$CUDA_HOME/bin:$PATH"
python -m pytest tests/kernels/core tests/kernels/attention -q
```

On an A10G a focused subset (activation, layernorm, rotary/positional encoding, paged attention, cache) runs in ~1 hr and lands at 2402 passed, 583 skipped, 36 failed. The 583 skips are arch-gated kernels (Hopper/Blackwell, sm_90+) correctly opting out. The 36 failures are all the same issue; see Pitfall 10.

## Pitfall 10: FP8 KV-cache tests fail (not skip) on SM < 89

Every one of those 36 failures is `test_reshape_and_cache_flash[...fp8...]` with:

```
FP8 KV cache needs native fp8e4nv (SM89+). Use --kv-cache-dtype bfloat16 ...
```

The A10G is sm_86; native FP8 (`fp8e4nv`) needs sm_89+ (Ada/Hopper). This is a hardware limit, not a broken build, but unlike the cleanly arch-gated kernels, this Triton path `assert`s on unsupported hardware instead of `skip`ping, so it counts as a failure. Deselect the FP8 cases to get a fully green run:

```bash
python -m pytest tests/kernels/attention/test_cache.py -k "not fp8" -q
# 335 passed, 403 skipped, 477 deselected, 0 failed
```

Takeaway: on pre-Ada GPUs, treat FP8 KV-cache test failures as expected, and gate them out with `-k "not fp8"` rather than chasing them.
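The decision "is FP8 expected to work on this GPU?" reduces to a compute-capability comparison, which can be sketched as below. The function names are hypothetical and the sm_89 threshold is simply the one quoted in the error message above; on a real machine the capability tuple could come from `torch.cuda.get_device_capability()`.

```python
def supports_native_fp8(capability: tuple[int, int]) -> bool:
    """True for sm_89 and newer, the threshold quoted in the FP8 KV-cache error."""
    return capability >= (8, 9)


def pytest_fp8_filter(capability: tuple[int, int]) -> list[str]:
    """Extra pytest arguments: deselect FP8 cases on GPUs without native FP8."""
    return [] if supports_native_fp8(capability) else ["-k", "not fp8"]


# Example with the A10G from this guide (sm_86); on a live system one could use
#   capability = torch.cuda.get_device_capability()
a10g_args = pytest_fp8_filter((8, 6))
```

Comparing `(major, minor)` tuples avoids the trap of comparing `8.6` and `8.10`-style values as floats, and the returned list can be appended directly to a `python -m pytest ...` argument list.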
## Appendix: every error → one-line fix

| Error | Root cause | Fix |
|---|---|---|
| `nvidia-smi: command not found` (assumed no GPU) | driver not installed; hardware was there | `lspci \| grep nvidia` to detect hardware |
| `modprobe nvidia: No such device` | nouveau owns the GPU | blacklist + unbind + `rmmod nouveau` |
| `CUDA unknown error` / `cuInit → 999` | missing/stale UVM nodes; no `nvidia-modprobe` | `apt install nvidia-modprobe`; recreate `/dev/nvidia-uvm` |
| `nvidia-cuda-nvcc-cu13` has no real version | wrong package name | use `cuda-toolkit[nvcc]` |
| `ptxas Unsupported .version 9.3` | nvcc/ptxas minor mismatch | pin all CUDA pkgs to one minor |
| CMake: nvcc says 13.3 but headers say 13.0 | runtime headers ≠ nvcc | align headers to nvcc version |
| `mathcalls.h: rsqrt ... incompatible` | CUDA 13.0 headers vs glibc 2.43 | use CUDA ≥ 13.1 headers |
| flash-attn CMake: Python 3.14 not supported | submodule allow-list | patch `utils.cmake` (re-apply after any update that bumps its tag) |
| `ResolutionImpossible` (cuda-toolkit 13.0.2 vs ≥13.2) | conflicting CUDA pins across vLLM deps | build with `pip install -e . --no-deps` |
| cccl: compiler and toolkit headers incompatible | runtime downgraded after vLLM reinstall | re-pin CUDA runtime to nvcc's minor |
| `cannot find -lcudart` (JIT link) | wheels use `lib/`, tool wants `lib64/` | `ln -sfn $CUDA_HOME/lib $CUDA_HOME/lib64` |
| `Could not find nvcc ... /usr/local/cuda` | `CUDA_HOME` unset at runtime | export `CUDA_HOME` (bake into `activate`) |
| `FP8 KV cache needs native fp8e4nv (SM89+)` (test fails) | A10G is sm_86; FP8 path asserts instead of skipping | not a build bug; deselect with `-k "not fp8"` |

## Updating an existing checkout

Pulling a newer vLLM isn't just `git pull`: an editable source build has moving parts that a pull invalidates.
The sequence that works:

```bash
git fetch upstream && git reset --hard upstream/main      # or your target commit
rm -rf build .deps && find vllm -name '*.abi3.so' -delete  # force a clean rebuild
# re-apply the flash-attn 3.14 patch (the tag changed → .deps was re-fetched)
pip install -v -e . --no-build-isolation --no-deps          # --no-deps dodges resolver conflicts
# re-pin the CUDA toolkit to 13.3 if anything got downgraded, then re-run the smoke test
```

Note that `.deps` is only repopulated during the CMake configure step, so after deleting it the first `pip install` will most likely stop at the Python-version check from Pitfall 6; apply the patch at that point and re-run the same command.

Before pulling, check the gap with `git diff --name-only HEAD..upstream/main | grep -E '\.cu|CMakeLists|requirements/'`. If native/build files changed (they usually have), budget for a full recompile (~30–40 min) and re-verification. Also confirm the `torch==` pin and `requires-python` in `pyproject.toml` didn't move; if torch's version changed, you're re-doing the whole CUDA/toolkit alignment, not just a rebuild.

## Key takeaways

- Detect hardware with `lspci`, not `nvidia-smi`. Don't build for CPU because a tool is missing.
- `nvidia-smi` working ≠ CUDA working. UVM nodes + `nvidia-modprobe` matter.
- Pin the entire CUDA pip toolkit to one minor version. Skew fails three different ways at three different stages.
- Pick the CUDA minor that's compatible with your glibc/compiler, then rely on CUDA minor-version compatibility for the driver/torch.
- A green build isn't done: runtime JIT (flashinfer) needs `CUDA_HOME` and a couple of symlinks. Verify with a real `generate`.
- Scope `TORCH_CUDA_ARCH_LIST` to your GPU to keep build times sane.
- Some test failures are hardware limits, not build bugs. On pre-Ada GPUs the FP8 KV-cache tests assert instead of skip; deselect them with `-k "not fp8"`.

Disclaimer: This article was generated using the Gemini 3.1 Pro model. CC BY 4.0 https://creativecommons.org/licenses/by/4.0/ by the author.