A step-by-step field guide to building vLLM from source on Ubuntu 26.04, covering Python 3.14 compatibility, CUDA driver issues, and toolchain pitfalls.
Building vLLM from source sounds like a pip install -e . away. In practice, on a fresh machine with a recent OS and a recent Python, you hit a chain of version-skew, driver, and toolchain issues that each fail with a cryptic message. This post walks through a real end-to-end build on an AWS g5 instance (NVIDIA A10G) running Ubuntu 26.04 + Python 3.14, documenting every error encountered and the fix.
The target was a CUDA build of a vLLM fork. The same playbook applies to a stock vllm-project/vllm checkout.
TL;DR: the working recipe
lspci | grep -i nvidia # hardware present?
nvidia-smi # driver working?
sudo apt-get install -y nvidia-driver-575-open nvidia-modprobe dkms
sudo modprobe -r nouveau && sudo modprobe nvidia # or reboot
python3 -m venv ~/go/venv && source ~/go/venv/bin/activate
pip install --upgrade pip
pip install torch==2.11.0 torchvision==0.26.0 torchaudio==2.11.0 # default index = CUDA build
pip install "cuda-toolkit[nvcc]==13.3.0" "nvidia-cuda-runtime==13.3.29" \
"nvidia-cuda-nvrtc==13.3.33" "nvidia-cublas==13.3.0.5"
export CUDA_HOME=$(echo "$VIRTUAL_ENV"/lib/python3.*/site-packages/nvidia/cu13) # expand the glob now; a plain assignment keeps the literal *
ln -sfn $CUDA_HOME/lib $CUDA_HOME/lib64
( cd $CUDA_HOME/lib && for f in lib*.so.*; do ln -sf "$f" "${f%%.so.*}.so"; done )
mkdir -p $CUDA_HOME/lib/stubs
ln -sf /usr/lib/x86_64-linux-gnu/libcuda.so $CUDA_HOME/lib/stubs/libcuda.so
export PATH=$CUDA_HOME/bin:$PATH CUDACXX=$CUDA_HOME/bin/nvcc
export VLLM_TARGET_DEVICE=cuda TORCH_CUDA_ARCH_LIST="8.6+PTX"
export MAX_JOBS=12 NVCC_THREADS=2
export CMAKE_ARGS="-DCUDAToolkit_ROOT=$CUDA_HOME -DCMAKE_CUDA_COMPILER=$CUDA_HOME/bin/nvcc"
pip install -v -e . --no-build-isolation
Read on for why each line is there and what breaks without it.
Prerequisites & how to check them
Before anything else, take an inventory. Getting this wrong wastes the most time — including the most embarrassing pitfall of all.
| Requirement | How to check | Notes |
|---|---|---|
| A GPU (and which one) | lspci \| grep -i nvidia | Determines CUDA vs CPU build. Don’t trust nvidia-smi alone; see Pitfall 1. |
| GPU driver loaded | nvidia-smi | If it fails but lspci shows a GPU, the driver isn’t installed/loaded. |
| Compute capability | nvidia-smi --query-gpu=compute_cap --format=csv | A10G = 8.6. You build kernels for this. |
| CPU flags (CPU build only) | lscpu \| grep -oE 'avx512f\|avx2' | vLLM CPU wants AVX512; AVX2 works with limited features. |
| Compiler | gcc --version | vLLM recommends gcc 12–13; newer (15) mostly works but watch nvcc host-compiler limits. |
| Python | python3 --version | Check the repo’s requires-python in pyproject.toml. |
| RAM / cores | nproc; free -h | CUDA compiles are RAM-hungry (~2–3 GB per parallel job). |
| Build tools | cmake --version; ninja --version | vLLM needs cmake ≥ 3.26. |
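The version rows in the table can be checked mechanically. The following is a minimal sketch, not part of the original workflow: the helper names (parse_version, meets_minimum, tool_version) are hypothetical, and it assumes each tool prints a dotted version number on the first line of its --version output.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_version(text: str) -> tuple:
    """Extract the first dotted version number found in --version output."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(version_output: str, minimum: tuple) -> bool:
    """True if the parsed version is >= minimum (plain tuple comparison)."""
    return parse_version(version_output) >= minimum


def tool_version(tool: str) -> Optional[str]:
    """First line of `<tool> --version`, or None if the tool is not on PATH."""
    if shutil.which(tool) is None:
        return None
    result = subprocess.run([tool, "--version"], capture_output=True, text=True)
    lines = result.stdout.splitlines()
    return lines[0] if lines else None
```

For example, meets_minimum(tool_version("cmake"), (3, 26)) checks the cmake requirement from the table once you have confirmed that tool_version did not return None.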
Pitfall 1: “There’s no GPU here” — when there definitely is
This one cost us a whole CPU build. The very first check was:
nvidia-smi # → command not found
Conclusion drawn: no GPU, do a CPU build. Wrong. A missing nvidia-smi only means the driver/userspace tools aren’t installed; it says nothing about the hardware. The actual hardware check is:
$ lspci | grep -i nvidia
00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
The A10G was there the whole time; it just had no driver. Always check lspci before concluding “no GPU.” (/proc/driver/nvidia and ls /dev/nvidia* are useful follow-ups, but they only exist once a driver is loaded.) On cloud instances that aren’t “Deep Learning AMIs,” a bare GPU with no driver is the norm, not the exception.
Lesson: lspci detects hardware. nvidia-smi detects a working driver. They answer different questions. Decide CPU-vs-GPU from lspci.
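The decision rule in this lesson can be written down as a small function. This is only a sketch under stated assumptions: the function names and the returned strings are made up for illustration, and inspect_host assumes lspci (from pciutils) is installed.

```python
import shutil
import subprocess


def has_nvidia_hardware(lspci_output: str) -> bool:
    """True if any PCI device line mentions NVIDIA (case-insensitive, like grep -i)."""
    return any("nvidia" in line.lower() for line in lspci_output.splitlines())


def choose_build_target(lspci_output: str, driver_tools_present: bool) -> str:
    """Decide from the hardware first; the driver only changes the next step."""
    if not has_nvidia_hardware(lspci_output):
        return "cpu"
    return "cuda" if driver_tools_present else "cuda (install the driver first)"


def inspect_host() -> str:
    """Gather both signals on the current machine (needs lspci on PATH)."""
    lspci = subprocess.run(["lspci"], capture_output=True, text=True).stdout
    return choose_build_target(lspci, shutil.which("nvidia-smi") is not None)
```

The point of the split is that a missing nvidia-smi can never produce "cpu" on its own; only the absence of hardware in the lspci output does.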
Step 2: Install and load the NVIDIA driver
lspci shows the GPU, nvidia-smi is missing → install the driver.
sudo apt-get update
sudo apt-get install -y dkms build-essential linux-headers-$(uname -r) \
nvidia-driver-575-open
We used the open-kernel variant (-open), which is NVIDIA’s recommendation for Ampere and newer (A10G is Ampere). The 575 metapackage pulled driver 580.159.03.
Pitfall 2: modprobe nvidia → “No such device” (nouveau owns the GPU)
$ sudo modprobe nvidia
modprobe: ERROR: could not insert 'nvidia': No such device
$ dmesg | grep NVRM
NVRM: GPU 0000:00:1e.0 is already bound to nouveau.
The open-source nouveau driver grabs the GPU at boot. The NVIDIA module can’t bind while nouveau holds it. Fix — blacklist, unbind, and load:
echo -e "blacklist nouveau\noptions nouveau modeset=0" | \
sudo tee /etc/modprobe.d/blacklist-nouveau.conf
echo -n "0000:00:1e.0" | sudo tee /sys/bus/pci/drivers/nouveau/unbind
sudo rmmod nouveau
sudo modprobe nvidia
sudo update-initramfs -u # make the blacklist survive reboots
If rmmod nouveau complains it’s in use (e.g. a display manager), a reboot after the blacklist + initramfs update achieves the same thing cleanly.
Pitfall 3: nvidia-smi works but CUDA returns error 999 (“unknown error”)
This is the subtle one. After the module loads:
$ nvidia-smi # works, shows the A10G
$ python -c "import torch; print(torch.cuda.is_available())"
RuntimeError: CUDA unknown error ... # False
A direct driver-API probe confirmed the runtime was broken even though nvidia-smi was fine:
import ctypes
ctypes.CDLL("libcuda.so.1").cuInit(0) # → 999 (CUDA_ERROR_UNKNOWN)
Two distinct causes, both worth knowing:
- **Stale/incorrect UVM device nodes.** nvidia-smi uses /dev/nvidia0 + /dev/nvidiactl (major 195). CUDA additionally needs /dev/nvidia-uvm. After a manual driver bring-up those nodes can be missing or have the wrong major. Recreate them against /proc/devices:
sudo modprobe nvidia_uvm
UVM_MAJOR=$(grep nvidia-uvm /proc/devices | awk '{print $1}')
sudo rm -f /dev/nvidia-uvm /dev/nvidia-uvm-tools
sudo mknod -m 666 /dev/nvidia-uvm c $UVM_MAJOR 0
sudo mknod -m 666 /dev/nvidia-uvm-tools c $UVM_MAJOR 1
- **nvidia-modprobe is not installed.** This setuid helper is what the CUDA runtime shells out to in order to create/initialize device nodes for non-root processes. Without it, raw cuInit may pass but torch’s runtime init throws 999. This was the actual fix for us:
sudo apt-get install -y nvidia-modprobe
sudo nvidia-modprobe -c 0 -u
After this: torch.cuda.is_available() → True. A reboot also installs the proper udev rules and avoids the manual mknod dance, but if you can’t reboot, the two steps above get you there.
Lesson: nvidia-smi working ≠ CUDA working. They use different device nodes. If cuInit returns 999, look at /dev/nvidia-uvm and make sure nvidia-modprobe exists.
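The checks from this pitfall can be bundled into one probe. The sketch below is an illustration rather than a complete diagnostic: the function names and messages are hypothetical, and the mapping from status code to advice only encodes the two causes described above.

```python
import ctypes
import os
import shutil
from typing import Optional


def cuinit_status() -> Optional[int]:
    """Call the driver API's cuInit(0); None means libcuda.so.1 could not be loaded."""
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return None
    return libcuda.cuInit(0)


def diagnose(status: Optional[int], uvm_node_exists: bool, helper_exists: bool) -> str:
    """Map a cuInit status plus the two host facts to the advice given in the text."""
    if status is None:
        return "libcuda.so.1 not found: driver userspace is not installed"
    if status == 0:
        if helper_exists:
            return "cuInit succeeded"
        return "cuInit passed, but nvidia-modprobe is missing; torch init may still fail"
    if status == 999:
        if not uvm_node_exists:
            return "999: /dev/nvidia-uvm is missing, recreate the UVM nodes"
        if not helper_exists:
            return "999: nvidia-modprobe is not installed"
        return "999: nodes and helper are present, consider a reboot"
    return f"cuInit returned {status}"


def diagnose_host() -> str:
    """Run the probe against the current machine."""
    return diagnose(
        cuinit_status(),
        os.path.exists("/dev/nvidia-uvm"),
        shutil.which("nvidia-modprobe") is not None,
    )
```

Keeping diagnose free of system calls makes the decision table easy to read and to test; diagnose_host is the only part that touches the machine.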
Step 3: The virtual environment
Nothing exotic here, but keep it isolated from system Python:
python3 -m venv ~/go/venv
source ~/go/venv/bin/activate
pip install --upgrade pip
We used Python 3.14. Check the repo supports it:
grep requires-python pyproject.toml
It built fine: torch==2.11.0 and every dependency had cp314 wheels. But see Pitfall 6: a bundled submodule had its own narrower Python check.
Step 4: CUDA torch + a consistent CUDA toolkit
vLLM compiles .cu kernels, so it needs nvcc, which PyTorch wheels do not bundle (they ship runtime libraries only). You have two options:
- Install the full CUDA toolkit to /usr/local/cuda via NVIDIA’s apt repo, or
- Assemble a toolkit entirely from pip wheels.
We went pip-only (no apt repo for Ubuntu 26.04 yet, and it keeps everything in the venv). First, the CUDA build of torch:
pip install torch==2.11.0 torchvision==0.26.0 torchaudio==2.11.0
python -c "import torch; print(torch.version.cuda)" # → 13.0 (wheel tag: 2.11.0+cu130)
Then nvcc and the dev components via the modern unified meta package:
pip install "cuda-toolkit[nvcc]==13.3.0"
Pitfall 4: the nvidia-cuda-nvcc-cu13 package is a stub
The old naming is a trap:
$ pip install nvidia-cuda-nvcc-cu13
ERROR: ... (from versions: 0.0.0a0, 0.0.1) # placeholder only!
The real compiler ships via the **cuda-toolkit[nvcc]** extra (which pulls nvidia-cuda-nvcc, nvidia-nvvm, nvidia-cuda-crt). Use the meta package’s extras, not the *-cu13 standalone names.
Pitfall 5: CUDA toolkit version skew (three separate failures)
This was the single biggest time sink. The pip CUDA ecosystem is split across many packages (nvidia-cuda-nvcc, nvidia-nvvm, nvidia-cuda-crt, nvidia-cuda-cccl, nvidia-cuda-runtime, nvidia-cublas, …) and pip will happily install mismatched minor versions. Each mismatch fails differently:
5a. ptxas can’t assemble newer PTX:
ptxas fatal : Unsupported .version 9.3; current version is '9.0'
The nvcc front-end was 13.3 (emits PTX 9.3) but ptxas was 13.0 (≤ PTX 9.0). → align them.
5b. CMake refuses on nvcc-vs-headers mismatch (PyTorch’s cuda.cmake):
CMake Error: FindCUDA says CUDA version is 13.3 (from nvcc), but the CUDA headers
say the version is 13.0.
5c. flashinfer’s bundled cccl refuses at runtime (its JIT compiler):
cccl/.../cuda_toolkit.h:41: error: "CUDA compiler and CUDA toolkit headers are
incompatible, please check your include paths"
The cccl check requires CUDART_VERSION’s minor to exactly equal nvcc’s minor.
The fix for all three: pin the entire CUDA userspace to one minor version.
Why 13.3 and not 13.0 (to match torch’s cu130)? Because CUDA 13.0 headers don’t compile on glibc 2.43 (Ubuntu 26.04):
/usr/include/.../mathcalls.h:206: error: exception specification is incompatible
with that of previous function "rsqrt"
CUDA 13.1+ headers fixed this. So we align up to 13.3. torch built for cu130 still runs on a 13.3 runtime thanks to CUDA 13 minor-version compatibility (any 13.x toolkit runs on an R580+ driver).
pip install "cuda-toolkit==13.3.0" "nvidia-cuda-runtime==13.3.29" \
"nvidia-cuda-nvcc==13.3.33" "nvidia-nvvm==13.3.33" \
"nvidia-cuda-crt==13.3.33" "nvidia-cuda-cccl==13.3.3.3.1" \
"nvidia-cuda-nvrtc==13.3.33" "nvidia-cublas==13.3.0.5"
nvcc --version | grep release # 13.3
grep CUDART_VERSION $CUDA_HOME/include/cuda_runtime_api.h # 13030 (= 13.3)
pip prints a dependency-conflict warning (torch pins cuda-toolkit==13.0.2). It’s cosmetic; torch runs fine via minor-version compat. But beware: reinstalling vLLM later re-pulls its requirements/cuda.txt and silently downgrades the runtime back to 13.0, breaking flashinfer’s JIT again. Re-run the 13.3 pins after any reinstall.
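Because the downgrade is silent, a quick skew check after every install can save a failed JIT compile. The sketch below is one possible way to do it, not the author's tooling: the function names are hypothetical, and WATCHED lists only the packages pinned in this post.

```python
from importlib import metadata

# Packages pinned in this post; other NVIDIA wheels may follow different numbering.
WATCHED = (
    "cuda-toolkit", "nvidia-cuda-runtime", "nvidia-cuda-nvcc", "nvidia-nvvm",
    "nvidia-cuda-crt", "nvidia-cuda-cccl", "nvidia-cuda-nvrtc", "nvidia-cublas",
)


def minor_of(version: str) -> str:
    """Reduce '13.3.29' to its major.minor prefix, '13.3'."""
    major, minor = version.split(".")[:2]
    return f"{major}.{minor}"


def group_by_minor(versions: dict) -> dict:
    """Map each major.minor to the packages on it; more than one key means skew."""
    groups = {}
    for name, version in sorted(versions.items()):
        groups.setdefault(minor_of(version), []).append(name)
    return groups


def decode_cudart_version(value: int) -> tuple:
    """CUDART_VERSION is major*1000 + minor*10, so 13030 decodes to (13, 3)."""
    return value // 1000, (value % 1000) // 10


def installed_cuda_versions(watched=WATCHED) -> dict:
    """Look up the watched packages in the active environment, skipping absent ones."""
    found = {}
    for name in watched:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return found
```

Running group_by_minor(installed_cuda_versions()) inside the venv and seeing two keys, for example 13.0 and 13.3, is the cue to re-run the pins.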
Step 5: Assemble a working CUDA_HOME
The pip wheels lay CUDA out under .../site-packages/nvidia/cu13/{bin,include,lib}, which is almost what CMake and downstream linkers expect, but it is missing three things:
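export CUDA_HOME=$VIRTUAL_ENV/lib/python3.14/site-packages/nvidia/cu13
# 1. Unversioned libfoo.so names, which the linker looks up for -lfoo
( cd $CUDA_HOME/lib && for f in lib*.so.*; do ln -sf "$f" "${f%%.so.*}.so"; done )
# 2. lib64/, the directory name some tools expect instead of lib/
ln -sfn $CUDA_HOME/lib $CUDA_HOME/lib64
# 3. stubs/libcuda.so, pointing at the driver's library
mkdir -p $CUDA_HOME/lib/stubs
ln -sf /usr/lib/x86_64-linux-gnu/libcuda.so $CUDA_HOME/lib/stubs/libcuda.so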
Sanity check before the big build:
cat > /tmp/t.cu <<'EOF'
#include <cuda_runtime.h>
__global__ void k(){}
int main(){k<<<1,1>>>();return cudaDeviceSynchronize();}
EOF
$CUDA_HOME/bin/nvcc -arch=sm_86 -I$CUDA_HOME/include -L$CUDA_HOME/lib -lcudart /tmp/t.cu -o /tmp/t.out
cmake -P <(echo 'find_package(CUDAToolkit REQUIRED); message("CTK ${CUDAToolkit_VERSION}")') 2>&1
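The layout assembled in this step can also be audited before each build. The following is a rough sketch with hypothetical helper names; it only checks that the paths created above exist and does not verify that the libraries are usable.

```python
from pathlib import Path


def unversioned_name(filename: str) -> str:
    """Mirror the shell's ${f%%.so.*}.so: 'libcudart.so.13.3.29' -> 'libcudart.so'."""
    return filename.split(".so.", 1)[0] + ".so"


def missing_pieces(cuda_home: Path) -> list:
    """List what a pip-assembled CUDA_HOME still lacks, per the steps above."""
    problems = []
    for rel in ("bin/nvcc", "include", "lib", "lib64", "lib/stubs/libcuda.so"):
        # exists() follows symlinks, so a dangling link is reported as missing
        if not (cuda_home / rel).exists():
            problems.append(f"missing {rel}")
    lib = cuda_home / "lib"
    if lib.is_dir():
        for versioned in sorted(lib.glob("lib*.so.*")):
            link = lib / unversioned_name(versioned.name)
            message = f"missing {link.name}"
            if not link.exists() and message not in problems:
                problems.append(message)
    return problems
```

An empty list from missing_pieces(Path(os.environ["CUDA_HOME"])) (with os imported) means the symlinks and stubs from this step are in place.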
Step 6: Build vLLM
Set the build environment and go. The most important variable is **TORCH_CUDA_ARCH_LIST**: scope it to your GPU or you’ll compile every architecture and wait 5–10× longer.
cd ~/go/vllm
export PATH=$CUDA_HOME/bin:$PATH
export CUDACXX=$CUDA_HOME/bin/nvcc
export VLLM_TARGET_DEVICE=cuda
export TORCH_CUDA_ARCH_LIST="8.6+PTX" # A10G = sm_86
export MAX_JOBS=12 # ~2-3 GB RAM per job; tune to your box
export NVCC_THREADS=2
export CMAKE_ARGS="-DCUDAToolkit_ROOT=$CUDA_HOME -DCMAKE_CUDA_COMPILER=$CUDA_HOME/bin/nvcc"
pip install -v -e . --no-build-isolation
A few notes:
- --no-build-isolation is required so the build sees the torch/CUDA you installed.
- Arch warnings like “DeepGEMM/FlashMLA will not compile: unsupported CUDA architecture 8.6” are expected on Ampere; those kernels target Hopper (sm_90+) and are simply skipped.
- On 16 cores / 62 GB this took ~30–40 min and produced _C.abi3.so (~117 MB), _moe_C.abi3.so, etc.
Pitfall 6: a bundled submodule rejects your Python
Even though the top-level pyproject.toml allowed Python 3.14, the vendored flash-attention CMake had its own allow-list:
CMake Error at .deps/vllm-flash-attn-src/cmake/utils.cmake:20:
Python version (3.14) is not one of the supported versions: 3.9;3.10;3.11;3.12;3.13.
Fix: add your version to the macro (vLLM points FETCHCONTENT_BASE_DIR at .deps, so edits there persist; just don’t rm -rf .deps before rebuilding):
set(_SUPPORTED_VERSIONS_LIST ${SUPPORTED_VERSIONS} ${ARGN} "3.14")
This patch is not permanent. flash-attn is pulled via CMake FetchContent at a pinned GIT_TAG. The moment you git pull/update vLLM and that tag changes (or you rm -rf .deps), FetchContent re-clones a fresh copy and your edit is gone, and the 3.14 check fails again at the next configure. Re-apply the one-liner after any update that bumps the flash-attn tag.
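Since the edit has to be re-applied by hand, an idempotent helper is one way to make that painless. This is a sketch under a loud assumption: the unpatched line is taken to be the patched line shown above minus the trailing "3.14", which this post does not confirm. The function names are hypothetical; the path comes from the CMake error message.

```python
import re
from pathlib import Path

# Path taken from the CMake error message above.
UTILS_CMAKE = Path(".deps/vllm-flash-attn-src/cmake/utils.cmake")

# ASSUMPTION: the allow-list is built by a single set(_SUPPORTED_VERSIONS_LIST ...) call.
_SET_CALL = re.compile(r"(set\(_SUPPORTED_VERSIONS_LIST\s+[^)]*?)\)")


def add_supported_python(cmake_text: str, version: str = "3.14") -> str:
    """Append `version` to the first matching set() call, unless already present."""
    def patch(match):
        body = match.group(1)
        if f'"{version}"' in body:
            return match.group(0)  # already patched: leave untouched
        return f'{body} "{version}")'
    return _SET_CALL.sub(patch, cmake_text, count=1)


def reapply_patch(path: Path = UTILS_CMAKE, version: str = "3.14") -> bool:
    """Patch the file in place; returns True only if the file was changed."""
    original = path.read_text()
    patched = add_supported_python(original, version)
    if patched != original:
        path.write_text(patched)
    return patched != original
```

If reapply_patch() returns False on a freshly cloned .deps, the upstream line no longer has the assumed shape and the file needs a manual look.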
Pitfall 7: dependency-resolver deadlock (ResolutionImpossible)
On a recent main, pip install -e . can die before compiling anything with:
ERROR: Cannot install cuda-tile[tileiras]==1.4.0, cuda-toolkit==13.0.2 and vllm
because these package versions have conflicting dependencies.
torch 2.11.0 depends on cuda-toolkit==13.0.2
cuda-tile[tileiras] 1.4.0 depends on cuda-toolkit>=13.2,<13.4
ERROR: ResolutionImpossible
Two of vLLM’s own dependencies pin incompatible CUDA-toolkit ranges (torch wants exactly 13.0.2; a newer kernel package wants ≥13.2). pip’s strict resolver refuses to proceed. This is an upstream packaging conflict, not something you caused — and it’s exactly why we aligned the toolkit to 13.3 earlier (it satisfies the ≥13.2 side, and torch runs fine against it via minor-version compat).
The fix is to build the package without re-resolving the whole graph, since you’ve already curated a working CUDA stack:
pip install -v -e . --no-build-isolation --no-deps
--no-deps compiles and installs vLLM using the environment you’ve assembled, instead of letting pip try (and fail) to reconcile every transitive pin. Afterwards, install any genuinely-missing runtime deps individually and re-run the smoke test. (Upstream’s own docs use uv, whose override/resolution model sidesteps this; with plain pip, --no-deps is the escape hatch.)
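Finding the genuinely missing runtime deps after a --no-deps install can be approximated from package metadata. The sketch below uses hypothetical helper names and makes two simplifying assumptions that should be read as limitations: it compares names only (version pins are ignored), and it skips requirements gated on an extra while ignoring every other environment marker.

```python
import re
from importlib import metadata


def requirement_name(requirement: str) -> str:
    """'cuda-toolkit[nvcc]==13.3.0' -> 'cuda-toolkit' (lowercased, separators unified)."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    if match is None:
        raise ValueError(f"cannot parse requirement {requirement!r}")
    return re.sub(r"[-_.]+", "-", match.group(1)).lower()


def missing_requirements(requires: list, installed: set) -> list:
    """Names required by a package but absent from `installed` (names only)."""
    missing = []
    for requirement in requires:
        if "extra ==" in requirement:  # optional extras are not part of a base install
            continue
        name = requirement_name(requirement)
        if name not in installed and name not in missing:
            missing.append(name)
    return missing


def missing_for(dist_name: str = "vllm") -> list:
    """Compare one installed distribution's requirements with the active environment."""
    installed = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:
            installed.add(re.sub(r"[-_.]+", "-", name).lower())
    return missing_requirements(metadata.requires(dist_name) or [], installed)
```

Each name that missing_for() reports can then be installed individually, leaving the curated CUDA pins untouched.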
Pitfall 8: MAX_JOBS and parallelism
MAX_JOBS controls ninja’s parallel compile jobs. CUDA compiles use ~2–3 GB each, so MAX_JOBS × 3 GB should fit in RAM. On 62 GB you can run 16; we used 12 as a safe default. You’ll notice ninja drops to fewer jobs near the end ([267/340]); that’s dependency ordering on the final heavy template units and the .so link, not a misconfiguration. NVCC_THREADS parallelizes within a single nvcc invocation.
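The sizing rule above is simple arithmetic and can be sketched as a function. The names here are hypothetical, and the headroom factor is an assumption added for illustration: a value of 0.75 happens to reproduce the "12 on a 16-core box" choice, but the post does not say that is how 12 was picked.

```python
import os


def suggest_max_jobs(ram_gb: float, cores: int,
                     gb_per_job: float = 3.0, headroom: float = 1.0) -> int:
    """Largest job count where jobs * gb_per_job fits in RAM, capped at the core count."""
    by_ram = int(ram_gb // gb_per_job)
    jobs = min(cores, by_ram)
    return max(1, int(jobs * headroom))  # never suggest fewer than one job


def host_ram_gb() -> float:
    """Total physical RAM in GiB via sysconf (Linux)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024 ** 3
```

On the build machine from this post, suggest_max_jobs(62, 16) gives 16 (RAM would allow 20, the core count caps it), and a 16 GB box with the same cores would be limited to 5 by RAM.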
Step 7: Verify, and the runtime-only pitfalls
A successful build does not mean inference works. vLLM’s runtime JIT-compiles more kernels on first use, which surfaces a fresh set of issues.
from vllm import LLM, SamplingParams
llm = LLM(model="facebook/opt-125m", enforce_eager=True,
          gpu_memory_utilization=0.5, max_model_len=512)
print(llm.generate(["The capital of France is"],
                   SamplingParams(temperature=0, max_tokens=20))[0].outputs[0].text)
Pitfall 9: Could not find nvcc and default cuda_home='/usr/local/cuda'
flashinfer JIT-compiles sampling kernels at runtime and needs nvcc, but at runtime nobody set CUDA_HOME, so it falls back to the nonexistent /usr/local/cuda. Because our toolkit lives in the venv, export it (and bake it into activate so it’s always present):
cat >> $VIRTUAL_ENV/bin/activate <<'EOF'
export CUDA_HOME="$VIRTUAL_ENV/lib/python3.14/site-packages/nvidia/cu13"
export PATH="$CUDA_HOME/bin:$PATH"
EOF
This is also where Pitfall 5c (cccl version check) and the missing lib64 symlink (cannot find -lcudart) bite. They’re runtime-JIT failures, not build failures, so they only appear here. With the 13.3 alignment + the lib64 symlink in place, the JIT compile succeeds and you get:
PROMPT: 'The capital of France is'
OUTPUT: ' the capital of the French Republic...'
🎉
Step 8: Run the GPU test suite
A generate() proves the happy path; the kernel tests prove the build broadly. The suite that most directly exercises what you just compiled is tests/kernels/. Run it with CUDA_HOME set and its bin directory on PATH (the tests JIT-compile too):
export CUDA_HOME="$VIRTUAL_ENV/lib/python3.14/site-packages/nvidia/cu13"
export PATH="$CUDA_HOME/bin:$PATH"
python -m pytest tests/kernels/core tests/kernels/attention -q
On an A10G a focused subset (activation, layernorm, rotary/positional encoding, paged attention, cache) runs in ~1 hr and lands at 2402 passed, 583 skipped, 36 failed. The 583 skips are arch-gated kernels (Hopper/Blackwell sm_90+) correctly opting out. The 36 failures are all the same issue — see Pitfall 10.
Pitfall 10: FP8 KV-cache tests fail (not skip) on SM < 89
Every one of those 36 failures is test_reshape_and_cache_flash[...fp8...] with:
FP8 KV cache needs native fp8e4nv (SM89+). Use --kv-cache-dtype bfloat16 ...
The A10G is sm_86; native FP8 (fp8e4nv) needs sm_89+ (Ada/Hopper). This is a hardware limit, not a broken build. But unlike the cleanly arch-gated kernels, this Triton path asserts on unsupported hardware instead of skipping, so it counts as a failure. Deselect the FP8 cases to get a fully green run:
python -m pytest tests/kernels/attention/test_cache.py -k "not fp8" -q
Takeaway: on pre-Ada GPUs, treat FP8 KV-cache test failures as expected, and gate them out with -k "not fp8" rather than chasing them.
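The gating rule can be derived from the compute capability that nvidia-smi reports. This is a small sketch with hypothetical function names; the (8, 9) threshold is the SM89+ requirement quoted in the error message.

```python
def parse_compute_cap(text: str) -> tuple:
    """'8.6' (as printed by nvidia-smi --query-gpu=compute_cap) -> (8, 6)."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def supports_native_fp8(compute_cap: str) -> bool:
    """Native fp8e4nv needs SM89 or newer, per the error message above."""
    return parse_compute_cap(compute_cap) >= (8, 9)


def pytest_fp8_filter(compute_cap: str) -> list:
    """Extra pytest arguments: deselect FP8 cases the hardware cannot run."""
    return [] if supports_native_fp8(compute_cap) else ["-k", "not fp8"]


def arch_list(compute_cap: str) -> str:
    """Build a TORCH_CUDA_ARCH_LIST value such as '8.6+PTX' for a single GPU."""
    major, minor = parse_compute_cap(compute_cap)
    return f"{major}.{minor}+PTX"
```

On the A10G, pytest_fp8_filter("8.6") yields the same -k "not fp8" selection used above, while an sm_89 or sm_90 card gets no filter at all.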
Appendix: every error → one-line fix
| Error | Root cause | Fix |
|---|---|---|
| nvidia-smi: command not found (assumed no GPU) | driver not installed; hardware was there | lspci \| grep nvidia to detect hardware |
| modprobe nvidia: No such device | nouveau owns the GPU | blacklist + unbind + rmmod nouveau |
| CUDA unknown error / cuInit → 999 | missing/stale UVM nodes; no nvidia-modprobe | apt install nvidia-modprobe; recreate /dev/nvidia-uvm |
| nvidia-cuda-nvcc-cu13 has no real version | wrong package name | use cuda-toolkit[nvcc] |
| ptxas Unsupported .version 9.3 | nvcc/ptxas minor mismatch | pin all CUDA pkgs to one minor |
| CMake: nvcc says 13.3 but headers say 13.0 | runtime headers ≠ nvcc | align headers to nvcc version |
| mathcalls.h: rsqrt ... incompatible | CUDA 13.0 headers vs glibc 2.43 | use CUDA ≥ 13.1 headers |
| flash-attn CMake: Python 3.14 not supported | submodule allow-list | patch utils.cmake (re-apply after any update that bumps its tag) |
| ResolutionImpossible (cuda-toolkit 13.0.2 vs ≥13.2) | conflicting CUDA pins across vLLM deps | build with pip install -e . --no-deps |
| cccl: compiler and toolkit headers incompatible | runtime downgraded after vLLM reinstall | re-pin CUDA runtime to nvcc’s minor |
| cannot find -lcudart (JIT link) | wheels use lib/, tool wants lib64/ | ln -sfn $CUDA_HOME/lib $CUDA_HOME/lib64 |
| Could not find nvcc ... /usr/local/cuda | CUDA_HOME unset at runtime | export CUDA_HOME (bake into activate) |
| FP8 KV cache needs native fp8e4nv (SM89+) (test fails) | A10G is sm_86; FP8 path asserts instead of skipping | not a build bug; deselect with -k "not fp8" |
Updating an existing checkout
Pulling a newer vLLM isn’t just git pull: an editable source build has moving parts that a pull invalidates. The sequence that works:
git fetch upstream && git reset --hard upstream/main # or your target commit
rm -rf build .deps && find vllm -name '*.abi3.so' -delete # force a clean rebuild
pip install -v -e . --no-build-isolation --no-deps # --no-deps dodges resolver conflicts
Before pulling, check the gap with git diff --name-only HEAD..upstream/main | grep -E '\.cu|CMakeLists|requirements/'. If native/build files changed (they usually have), budget for a full recompile (~30–40 min) and re-verification. Also confirm the torch== pin and requires-python in pyproject.toml didn’t move; if torch’s version changed, you’re re-doing the whole CUDA/toolkit alignment, not just a rebuild.
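The pre-pull check can be scripted so the decision is explicit. This sketch reuses the same pattern as the grep above; the function names and the returned messages are hypothetical.

```python
import re

# Same pattern as the grep -E in the text: CUDA sources, CMake files, requirement pins.
NATIVE_PATTERN = re.compile(r"\.cu|CMakeLists|requirements/")


def native_changes(changed_paths: list) -> list:
    """Paths from `git diff --name-only` that touch native or build inputs."""
    return [path for path in changed_paths if NATIVE_PATTERN.search(path)]


def plan_update(changed_paths: list) -> str:
    """Summarize what the update implies, following the checklist above."""
    if native_changes(changed_paths):
        return "full rebuild"
    if "pyproject.toml" in changed_paths:
        return "check the torch== pin and requires-python"
    return "no native/build files changed"
```

Feeding it the output of git diff --name-only HEAD..upstream/main, split into lines, tells you whether to budget for the full recompile before you pull.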
Key takeaways
- **Detect hardware with lspci, not nvidia-smi.** Don’t build for CPU because a tool is missing.
- **nvidia-smi working ≠ CUDA working.** UVM nodes + nvidia-modprobe matter.
- **Pin the entire CUDA pip toolkit to one minor version.** Skew fails three different ways at three different stages.
- **Pick the CUDA minor that’s compatible with your glibc/compiler**, then rely on CUDA minor-version compatibility for the driver/torch.
- **A green build isn’t done.** Runtime JIT (flashinfer) needs CUDA_HOME and a couple of symlinks. Verify with a real generate().
- **Scope TORCH_CUDA_ARCH_LIST to your GPU** to keep build times sane.
- **Some test failures are hardware limits, not build bugs.** On pre-Ada GPUs the FP8 KV-cache tests assert instead of skip; deselect them with -k "not fp8".
Disclaimer: This article was generated using the Gemini 3.1 Pro model.
CC BY 4.0 by the author.