Building vLLM from Source: A Field Guide (with all the pitfalls)

wpnews.pro

A step-by-step field guide to building vLLM from source on Ubuntu 26.04, covering Python 3.14 compatibility, CUDA driver issues, and toolchain pitfalls.

Building vLLM from source sounds like it is one `pip install -e .` away. In practice, on a fresh machine with a recent OS and a recent Python, you hit a chain of version-skew, driver, and toolchain issues that each fail with a cryptic message. This post walks through a real end-to-end build on an AWS g5 instance (NVIDIA A10G) running Ubuntu 26.04 + Python 3.14, documenting every error encountered and the fix.

The target was a CUDA build of a vLLM fork. The same playbook applies to a stock `vllm-project/vllm` checkout.

TL;DR: the working recipe

lspci | grep -i nvidia        # hardware present?
nvidia-smi                    # driver working?

sudo apt-get install -y nvidia-driver-575-open nvidia-modprobe dkms
sudo modprobe -r nouveau && sudo modprobe nvidia   # or reboot

python3 -m venv ~/go/venv && source ~/go/venv/bin/activate
pip install --upgrade pip

pip install torch==2.11.0 torchvision==0.26.0 torchaudio==2.11.0   # default index = CUDA build
pip install "cuda-toolkit[nvcc]==13.3.0" "nvidia-cuda-runtime==13.3.29" \
            "nvidia-cuda-nvrtc==13.3.33" "nvidia-cublas==13.3.0.5"

export CUDA_HOME=$(echo "$VIRTUAL_ENV"/lib/python3.*/site-packages/nvidia/cu13)   # expand the glob; a plain assignment would keep the literal *
ln -sfn $CUDA_HOME/lib $CUDA_HOME/lib64
( cd $CUDA_HOME/lib && for f in lib*.so.*; do ln -sf "$f" "${f%%.so.*}.so"; done )
mkdir -p $CUDA_HOME/lib/stubs
ln -sf /usr/lib/x86_64-linux-gnu/libcuda.so $CUDA_HOME/lib/stubs/libcuda.so

export PATH=$CUDA_HOME/bin:$PATH CUDACXX=$CUDA_HOME/bin/nvcc
export VLLM_TARGET_DEVICE=cuda TORCH_CUDA_ARCH_LIST="8.6+PTX"
export MAX_JOBS=12 NVCC_THREADS=2
export CMAKE_ARGS="-DCUDAToolkit_ROOT=$CUDA_HOME -DCMAKE_CUDA_COMPILER=$CUDA_HOME/bin/nvcc"
pip install -v -e . --no-build-isolation

Read on for why each line is there and what breaks without it.

Prerequisites & how to check them

Before anything else, take an inventory. Getting this wrong wastes the most time — including the most embarrassing pitfall of all.

Requirement	How to check	Notes
A GPU (and which one)	`lspci \| grep -i nvidia`	Determines CUDA vs CPU build. Don’t trust `nvidia-smi` alone; see Pitfall 1.
GPU driver loaded	`nvidia-smi`	If it fails but `lspci` shows a GPU, the driver isn’t installed/loaded.
Compute capability	`nvidia-smi --query-gpu=compute_cap --format=csv`	A10G = `8.6`. You build kernels for this.
CPU flags (CPU build only)	`lscpu \| grep -oE 'avx512f\|avx2'`	vLLM CPU wants AVX512; AVX2 works with limited features.
Compiler	`gcc --version`	vLLM recommends gcc 12–13; newer (15) mostly works but watch nvcc host-compiler limits.
Python	`python3 --version`	Check the repo’s `requires-python` in `pyproject.toml`.
RAM / cores	`nproc; free -h`	CUDA compiles are RAM-hungry (~2–3 GB per parallel job).
build tools	`cmake --version; ninja --version`	vLLM needs cmake ≥ 3.26.

Pitfall 1: “There’s no GPU here” — when there definitely is

This one cost us a whole CPU build. The very first check was:

nvidia-smi   # → command not found

Conclusion drawn: no GPU, do a CPU build. Wrong. A missing `nvidia-smi` only means the driver/userspace tools aren’t installed. It says nothing about the hardware. The actual hardware check is:

$ lspci | grep -i nvidia
00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)

The A10G was there the whole time; it just had no driver. Always check `lspci` before concluding “no GPU.” (`/proc/driver/nvidia` and `ls /dev/nvidia*` are useful too, but they only exist once a driver is loaded, so they cannot prove the hardware is absent.) On cloud instances that aren’t “Deep Learning AMIs,” a bare GPU with no driver is the norm, not the exception.

Lesson: `lspci` detects hardware. `nvidia-smi` detects a working driver. They answer different questions. Decide CPU-vs-GPU from `lspci`.
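The two checks can be combined into a small decision helper. This is a minimal sketch: the function name `classify_gpu_state` and its return strings are hypothetical, and it works on captured `lspci` text instead of calling the tool, so it can be tried on any machine.

```python
def classify_gpu_state(lspci_output: str, nvidia_smi_works: bool) -> str:
    """Combine the hardware check (lspci) with the driver check (nvidia-smi).

    lspci_output: text captured from `lspci`.
    nvidia_smi_works: True if `nvidia-smi` ran and exited with status 0.
    """
    has_hardware = any("nvidia" in line.lower() for line in lspci_output.splitlines())
    if not has_hardware:
        return "no NVIDIA hardware: CPU build"
    if not nvidia_smi_works:
        return "GPU present, driver missing: install the driver"
    return "GPU present, driver working: CUDA build"


# The lspci line from this post, paired with a failing nvidia-smi.
sample = "00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)"
print(classify_gpu_state(sample, nvidia_smi_works=False))
```

The point of keeping the two inputs separate is that a missing `nvidia-smi` can never produce the "CPU build" answer on its own.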

Step 2: Install and load the NVIDIA driver

If `lspci` shows the GPU but `nvidia-smi` is missing, install the driver.

sudo apt-get update
sudo apt-get install -y dkms build-essential linux-headers-$(uname -r) \
                        nvidia-driver-575-open

We used the open-kernel variant (`-open`), which is NVIDIA’s recommendation for Ampere and newer (A10G is Ampere). The `575` metapackage pulled driver `580.159.03`.

Pitfall 2: `modprobe nvidia` → “No such device” (nouveau owns the GPU)

$ sudo modprobe nvidia
modprobe: ERROR: could not insert 'nvidia': No such device

$ dmesg | grep NVRM
NVRM: GPU 0000:00:1e.0 is already bound to nouveau.

The open-source nouveau driver grabs the GPU at boot. The NVIDIA module can’t bind while nouveau holds it. Fix — blacklist, unbind, and load:

echo -e "blacklist nouveau\noptions nouveau modeset=0" | \
    sudo tee /etc/modprobe.d/blacklist-nouveau.conf
echo -n "0000:00:1e.0" | sudo tee /sys/bus/pci/drivers/nouveau/unbind
sudo rmmod nouveau
sudo modprobe nvidia
sudo update-initramfs -u    # make the blacklist survive reboots

If `rmmod nouveau` complains it’s in use (e.g. a display manager), a reboot after the blacklist + initramfs update achieves the same thing cleanly.

Pitfall 3: `nvidia-smi` works but CUDA returns error 999 (“unknown error”)

This is the subtle one. After the module loads:

$ nvidia-smi          # works, shows the A10G
$ python -c "import torch; print(torch.cuda.is_available())"
RuntimeError: CUDA unknown error ...    # False

A direct driver-API probe confirmed the runtime was broken even though `nvidia-smi` was fine:

import ctypes
ctypes.CDLL("libcuda.so.1").cuInit(0)   # → 999 (CUDA_ERROR_UNKNOWN)
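A slightly fuller version of that probe, as a sketch: it names the return code instead of printing a bare integer and does not crash when `libcuda.so.1` is absent. The helper names `describe_cu_result` and `probe_driver` are made up for illustration; the numeric values are the `CUresult` constants from the CUDA driver API header.

```python
import ctypes

# A few CUresult values from the CUDA driver API (cuda.h).
CU_RESULTS = {
    0: "CUDA_SUCCESS",
    3: "CUDA_ERROR_NOT_INITIALIZED",
    100: "CUDA_ERROR_NO_DEVICE",
    999: "CUDA_ERROR_UNKNOWN",
}


def describe_cu_result(code: int) -> str:
    """Translate a numeric CUresult into its symbolic name when known."""
    return CU_RESULTS.get(code, f"unrecognized CUresult {code}")


def probe_driver() -> str:
    """Call cuInit(0) directly, bypassing torch and the CUDA runtime."""
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError as exc:
        return f"libcuda.so.1 not loadable: {exc}"
    return describe_cu_result(libcuda.cuInit(0))


if __name__ == "__main__":
    print(probe_driver())
```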

Two distinct causes, both worth knowing:

Stale/incorrect UVM device nodes. `nvidia-smi` uses `/dev/nvidia0` + `/dev/nvidiactl` (major 195). CUDA additionally needs `/dev/nvidia-uvm`. After a manual driver bring-up those nodes can be missing or have the wrong major. Recreate them against `/proc/devices`:

sudo modprobe nvidia_uvm
UVM_MAJOR=$(grep nvidia-uvm /proc/devices | awk '{print $1}')
sudo rm -f /dev/nvidia-uvm /dev/nvidia-uvm-tools
sudo mknod -m 666 /dev/nvidia-uvm        c $UVM_MAJOR 0
sudo mknod -m 666 /dev/nvidia-uvm-tools  c $UVM_MAJOR 1

`nvidia-modprobe` is not installed. This setuid helper is what the CUDA runtime shells out to in order to create/initialize device nodes for non-root processes. Without it, raw `cuInit` may pass but torch’s runtime init throws 999. This was the actual fix for us:

sudo apt-get install -y nvidia-modprobe
sudo nvidia-modprobe -c 0 -u
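The major-number lookup in the UVM step (`grep nvidia-uvm /proc/devices | awk '{print $1}'`) can be written more strictly, so that it matches the exact device name and only looks in the character-device section. This is a sketch: `find_char_major` is a hypothetical helper, and the sample text, including the major number 511, is invented for illustration.

```python
from typing import Optional


def find_char_major(proc_devices_text: str, name: str) -> Optional[int]:
    """Return the major number of a character device listed in /proc/devices."""
    in_char_section = False
    for line in proc_devices_text.splitlines():
        stripped = line.strip()
        if stripped == "Character devices:":
            in_char_section = True
            continue
        if stripped == "Block devices:":
            in_char_section = False
            continue
        parts = stripped.split()
        # Each entry is "<major> <name>"; require an exact name match.
        if in_char_section and len(parts) == 2 and parts[1] == name:
            return int(parts[0])
    return None


# Illustrative sample only; real major numbers vary per boot and driver.
sample = """Character devices:
  1 mem
195 nvidia-frontend
511 nvidia-uvm

Block devices:
  8 sd
"""
print(find_char_major(sample, "nvidia-uvm"))
```

On a real machine the input would come from `open("/proc/devices").read()`, and a `None` result means the `nvidia_uvm` module is not loaded yet.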

After this, `torch.cuda.is_available()` → `True`. A reboot also installs the proper udev rules and avoids the manual `mknod` dance, but if you can’t reboot, the two steps above get you there.

Lesson: `nvidia-smi` working ≠ CUDA working. They use different device nodes. If `cuInit` returns 999, look at `/dev/nvidia-uvm` and make sure `nvidia-modprobe` exists.

Step 3: The virtual environment

Nothing exotic here, but keep it isolated from system Python:

python3 -m venv ~/go/venv
source ~/go/venv/bin/activate
pip install --upgrade pip

We used Python 3.14. Check the repo supports it:

grep requires-python pyproject.toml

It built fine: `torch==2.11.0` and every dependency had `cp314` wheels. But see Pitfall 6: a bundled submodule had its own narrower Python check.

Step 4: CUDA torch + a consistent CUDA toolkit

vLLM compiles `.cu` kernels, so it needs `nvcc`, which PyTorch wheels do not bundle (they ship runtime libraries only). You have two options:

- Install the full CUDA toolkit to `/usr/local/cuda` via NVIDIA’s apt repo, or
- Assemble a toolkit entirely from pip wheels.

We went pip-only (no apt repo for Ubuntu 26.04 yet, and it keeps everything in the venv). First, the CUDA build of torch:

pip install torch==2.11.0 torchvision==0.26.0 torchaudio==2.11.0
python -c "import torch; print(torch.version.cuda)"   # → 13.0  (wheel tag: 2.11.0+cu130)

Then nvcc and the dev components via the modern unified meta package:

pip install "cuda-toolkit[nvcc]==13.3.0"

Pitfall 4: the `nvidia-cuda-nvcc-cu13` package is a stub

The old naming is a trap:

$ pip install nvidia-cuda-nvcc-cu13
ERROR: ... (from versions: 0.0.0a0, 0.0.1)   # placeholder only!

The real compiler ships via the `cuda-toolkit[nvcc]` extra (which pulls `nvidia-cuda-nvcc`, `nvidia-nvvm`, `nvidia-cuda-crt`). Use the meta package’s extras, not the `*-cu13` standalone names.

Pitfall 5: CUDA toolkit version skew (three separate failures)

This was the single biggest time sink. The pip CUDA ecosystem is split across many packages (`nvidia-cuda-nvcc`, `nvidia-nvvm`, `nvidia-cuda-crt`, `nvidia-cuda-cccl`, `nvidia-cuda-runtime`, `nvidia-cublas`, …) and pip will happily install mismatched minor versions. Each mismatch fails differently:

5a. ptxas can’t assemble newer PTX:

ptxas fatal : Unsupported .version 9.3; current version is '9.0'

The nvcc front-end was 13.3 (emits PTX 9.3) but `ptxas` was 13.0 (≤ PTX 9.0). The fix: align them.

5b. CMake refuses on nvcc-vs-headers mismatch (PyTorch’s `cuda.cmake`):

CMake Error: FindCUDA says CUDA version is 13.3 (from nvcc), but the CUDA headers
say the version is 13.0.

5c. flashinfer’s bundled cccl refuses at runtime (its JIT compiler):

cccl/.../cuda_toolkit.h:41: error: "CUDA compiler and CUDA toolkit headers are
incompatible, please check your include paths"

The cccl check requires `CUDART_VERSION`’s minor to exactly equal nvcc’s minor.

The fix for all three: pin the entire CUDA userspace to one minor version.

Why 13.3 and not 13.0 (to match torch’s `cu130`)? Because CUDA 13.0 headers don’t compile on glibc 2.43 (Ubuntu 26.04):

/usr/include/.../mathcalls.h:206: error: exception specification is incompatible
with that of previous function "rsqrt"

CUDA 13.1+ headers fixed this. So we align up to 13.3. torch built for `cu130` still runs on a 13.3 runtime thanks to CUDA 13 minor-version compatibility (any 13.x toolkit runs on an R580+ driver).

pip install "cuda-toolkit==13.3.0" "nvidia-cuda-runtime==13.3.29" \
            "nvidia-cuda-nvcc==13.3.33" "nvidia-nvvm==13.3.33" \
            "nvidia-cuda-crt==13.3.33"  "nvidia-cuda-cccl==13.3.3.3.1" \
            "nvidia-cuda-nvrtc==13.3.33" "nvidia-cublas==13.3.0.5"

nvcc --version | grep release                                   # 13.3
grep CUDART_VERSION $CUDA_HOME/include/cuda_runtime_api.h        # 13030  (= 13.3)
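The `13030` reported by the header check reads as 13.3 because `CUDART_VERSION` is encoded as `major * 1000 + minor * 10`. A small sketch of the decoding (the function name is hypothetical):

```python
def decode_cudart_version(value: int) -> tuple:
    """Split a CUDART_VERSION integer into (major, minor).

    The header encodes the version as major * 1000 + minor * 10.
    """
    return value // 1000, (value % 1000) // 10


print(decode_cudart_version(13030))  # → (13, 3)
print(decode_cudart_version(13000))  # → (13, 0)
```

The second value is what the headers report after a silent downgrade to 13.0, which is the state the cccl check in 5c rejects.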

`pip` prints a dependency-conflict warning (torch pins `cuda-toolkit==13.0.2`); it’s cosmetic, and torch runs fine via minor-version compat. But beware: reinstalling vLLM later re-pulls its `requirements/cuda.txt` and silently downgrades the runtime back to 13.0, breaking flashinfer’s JIT again. Re-run the 13.3 pins after any reinstall.
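Because the downgrade is silent, it can help to check the pins mechanically after every reinstall. The sketch below is one way to do that with the standard library; the helper names are hypothetical, and the package list is simply the set of names pinned in this post.

```python
from importlib import metadata

# Package names pinned earlier in this post.
PINNED = [
    "cuda-toolkit", "nvidia-cuda-runtime", "nvidia-cuda-nvcc", "nvidia-nvvm",
    "nvidia-cuda-crt", "nvidia-cuda-cccl", "nvidia-cuda-nvrtc", "nvidia-cublas",
]


def minor_of(version: str) -> str:
    """Reduce '13.3.29' to '13.3'."""
    return ".".join(version.split(".")[:2])


def find_minor_skew(versions: dict, expected_minor: str) -> dict:
    """Return only the packages whose major.minor differs from expected_minor."""
    return {n: v for n, v in versions.items() if minor_of(v) != expected_minor}


def installed_versions(names: list) -> dict:
    """Look up installed versions, skipping packages that are not installed."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return found


# Example with versions quoted in this post: cuda-toolkit was pulled back to 13.0.2.
sample = {"cuda-toolkit": "13.0.2", "nvidia-cuda-nvcc": "13.3.33",
          "nvidia-cuda-runtime": "13.3.29"}
print(find_minor_skew(sample, "13.3"))  # → {'cuda-toolkit': '13.0.2'}

# Against the live environment (prints {} when everything matches or nothing is installed).
print(find_minor_skew(installed_versions(PINNED), "13.3"))
```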

Step 5: Assemble a working `CUDA_HOME`

The pip wheels lay CUDA out under `.../site-packages/nvidia/cu13/{bin,include,lib}`, which is almost what CMake and downstream linkers expect, but it is missing three things:

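export CUDA_HOME=$VIRTUAL_ENV/lib/python3.14/site-packages/nvidia/cu13

# 1. Unversioned libfoo.so symlinks next to the versioned libfoo.so.N files, so -lfoo resolves
( cd $CUDA_HOME/lib && for f in lib*.so.*; do ln -sf "$f" "${f%%.so.*}.so"; done )

# 2. lib64 -> lib, for tools that look in $CUDA_HOME/lib64 (fixes "cannot find -lcudart")
ln -sfn $CUDA_HOME/lib $CUDA_HOME/lib64

# 3. lib/stubs/libcuda.so pointing at the driver's libcuda, for link steps that expect it there
mkdir -p $CUDA_HOME/lib/stubs
ln -sf /usr/lib/x86_64-linux-gnu/libcuda.so $CUDA_HOME/lib/stubs/libcuda.so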
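The `${f%%.so.*}.so` expansion in the symlink loop strips the longest suffix that starts with `.so.` and then re-appends `.so`. The same rule as a sketch in Python, for readers less used to shell parameter expansion (the function name and the example file names are illustrative, not taken from the wheels):

```python
def unversioned_name(filename: str) -> str:
    """Mirror the shell expansion "${f%%.so.*}.so".

    Everything from the first ".so." onward is dropped, then ".so" is added back.
    """
    stem, separator, _version = filename.partition(".so.")
    if not separator:
        raise ValueError(f"{filename!r} has no versioned .so suffix")
    return stem + ".so"


print(unversioned_name("libcudart.so.13"))       # → libcudart.so
print(unversioned_name("libcublas.so.13.3.0"))   # → libcublas.so
```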

Sanity check before the big build:

cat > /tmp/t.cu <<'EOF'
#include <cuda_runtime.h>
__global__ void k(){}
int main(){k<<<1,1>>>();return cudaDeviceSynchronize();}
EOF
$CUDA_HOME/bin/nvcc -arch=sm_86 -I$CUDA_HOME/include -L$CUDA_HOME/lib -lcudart /tmp/t.cu -o /tmp/t.out
cmake -P <(echo 'find_package(CUDAToolkit REQUIRED); message("CTK ${CUDAToolkit_VERSION}")') 2>&1

Step 6: Build vLLM

Set the build environment and go. The most important variable is `TORCH_CUDA_ARCH_LIST`: scope it to your GPU or you’ll compile every architecture and wait 5–10× longer.

cd ~/go/vllm
export PATH=$CUDA_HOME/bin:$PATH
export CUDACXX=$CUDA_HOME/bin/nvcc
export VLLM_TARGET_DEVICE=cuda
export TORCH_CUDA_ARCH_LIST="8.6+PTX"     # A10G = sm_86
export MAX_JOBS=12                        # ~2-3 GB RAM per job; tune to your box
export NVCC_THREADS=2
export CMAKE_ARGS="-DCUDAToolkit_ROOT=$CUDA_HOME -DCMAKE_CUDA_COMPILER=$CUDA_HOME/bin/nvcc"
pip install -v -e . --no-build-isolation

A few notes:

- `--no-build-isolation` is required so the build sees the torch/CUDA you installed.
- `enforce_eager`-style arch warnings like `DeepGEMM/FlashMLA will not compile: unsupported CUDA architecture 8.6` are expected on Ampere; those kernels target Hopper (sm_90+) and are simply skipped.
- On 16 cores / 62 GB this took ~30–40 min and produced `_C.abi3.so` (~117 MB), `_moe_C.abi3.so`, etc.
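Rather than typing the architecture by hand, the value for `TORCH_CUDA_ARCH_LIST` can be derived from the `nvidia-smi --query-gpu=compute_cap --format=csv` output shown in the prerequisites table. A sketch, assuming the CSV has a single header row followed by one capability per GPU; `arch_list_from_csv` is a hypothetical helper:

```python
def arch_list_from_csv(csv_text: str) -> str:
    """Build a TORCH_CUDA_ARCH_LIST value from nvidia-smi's compute_cap CSV.

    Duplicate GPUs collapse to one entry; +PTX is attached to the newest arch.
    """
    rows = [line.strip() for line in csv_text.splitlines() if line.strip()]
    caps = sorted(set(rows[1:]), key=lambda cap: tuple(int(p) for p in cap.split(".")))
    if not caps:
        raise ValueError("no compute capability rows found")
    caps[-1] += "+PTX"
    return ";".join(caps)


print(arch_list_from_csv("compute_cap\n8.6\n"))  # → 8.6+PTX
```

For the single A10G in this post the result is the same `8.6+PTX` exported in the build environment.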

Pitfall 6: a bundled submodule rejects your Python

Even though the top-level `pyproject.toml` allowed Python 3.14, the vendored flash-attention CMake had its own allow-list:

CMake Error at .deps/vllm-flash-attn-src/cmake/utils.cmake:20:
  Python version (3.14) is not one of the supported versions: 3.9;3.10;3.11;3.12;3.13.

Fix: add your version to the macro (vLLM points `FETCHCONTENT_BASE_DIR` at `.deps`, so edits there persist; just don’t `rm -rf .deps` before rebuilding):

set(_SUPPORTED_VERSIONS_LIST ${SUPPORTED_VERSIONS} ${ARGN} "3.14")

This patch is not permanent. flash-attn is pulled via CMake FetchContent at a pinned `GIT_TAG`. The moment you `git pull`/update vLLM and that tag changes (or you `rm -rf .deps`), FetchContent re-clones a fresh copy and your edit is gone; the 3.14 check fails again at the next configure. Re-apply the one-liner after any update that bumps the flash-attn tag.

Pitfall 7: dependency-resolver deadlock (`ResolutionImpossible`)

On a recent `main`, `pip install -e .` can die before compiling anything with:

ERROR: Cannot install cuda-tile[tileiras]==1.4.0, cuda-toolkit==13.0.2 and vllm
because these package versions have conflicting dependencies.
  torch 2.11.0 depends on cuda-toolkit==13.0.2
  cuda-tile[tileiras] 1.4.0 depends on cuda-toolkit>=13.2,<13.4
ERROR: ResolutionImpossible

Two of vLLM’s own dependencies pin incompatible CUDA-toolkit ranges (torch wants exactly 13.0.2; a newer kernel package wants ≥13.2). pip’s strict resolver refuses to proceed. This is an upstream packaging conflict, not something you caused — and it’s exactly why we aligned the toolkit to 13.3 earlier (it satisfies the ≥13.2 side, and torch runs fine against it via minor-version compat).
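To see why no single `cuda-toolkit` version can satisfy both pins, the two constraints from the error message can be checked side by side. This is only a worked illustration with plain tuple comparisons, not pip's resolver logic; the function names are hypothetical.

```python
def parse(version: str) -> tuple:
    """Turn '13.0.2' into (13, 0, 2) for ordered comparison."""
    return tuple(int(part) for part in version.split("."))


def torch_accepts(version: str) -> bool:
    # torch 2.11.0 depends on cuda-toolkit==13.0.2
    return parse(version) == parse("13.0.2")


def cuda_tile_accepts(version: str) -> bool:
    # cuda-tile[tileiras] 1.4.0 depends on cuda-toolkit>=13.2,<13.4
    return parse("13.2") <= parse(version) < parse("13.4")


for candidate in ["13.0.2", "13.2.0", "13.3.0"]:
    print(candidate, torch_accepts(candidate), cuda_tile_accepts(candidate))
# 13.0.2 True False
# 13.2.0 False True
# 13.3.0 False True
```

No row is `True True`, which is exactly what `ResolutionImpossible` reports. Version 13.3.0 satisfies the range side, and the exact-pin side is the one this post deliberately overrides.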

The fix is to build the package without re-resolving the whole graph, since you’ve already curated a working CUDA stack:

pip install -v -e . --no-build-isolation --no-deps

`--no-deps` compiles and installs vLLM using the environment you’ve assembled, instead of letting pip try (and fail) to reconcile every transitive pin. Afterwards, install any genuinely-missing runtime deps individually and re-run the smoke test. (Upstream’s own docs use `uv`, whose override/resolution model sidesteps this; with plain pip, `--no-deps` is the escape hatch.)

Pitfall 8: `MAX_JOBS` and parallelism

`MAX_JOBS` controls ninja’s parallel compile jobs. CUDA compiles use ~2–3 GB each, so `MAX_JOBS × 3 GB` should fit in RAM. On 62 GB you can run 16; we used 12 as a safe default. You’ll notice ninja drops to fewer jobs near the end (`[267/340]`); that’s dependency ordering on the final heavy template units and the `.so` link, not a misconfiguration. `NVCC_THREADS` parallelizes within a single nvcc invocation.
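The sizing rule above is simple enough to write down. A sketch, using the 3 GB-per-job upper estimate from this section; `suggest_max_jobs` is a hypothetical helper, not something vLLM provides:

```python
def suggest_max_jobs(cores: int, ram_gb: int, gb_per_job: int = 3) -> int:
    """Upper bound for MAX_JOBS: limited by cores and by RAM / per-job footprint."""
    return max(1, min(cores, ram_gb // gb_per_job))


print(suggest_max_jobs(cores=16, ram_gb=62))  # → 16
print(suggest_max_jobs(cores=8, ram_gb=16))   # → 5
```

For the 16-core, 62 GB machine in this post the ceiling is 16; choosing 12 leaves headroom for the OS and the link step.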

Step 7: Verify, and the runtime-only pitfalls

A successful build does not mean inference works. vLLM’s runtime JIT-compiles more kernels on first use, which surfaces a fresh set of issues.

from vllm import LLM, SamplingParams
llm = LLM(model="facebook/opt-125m", enforce_eager=True,
          gpu_memory_utilization=0.5, max_model_len=512)
print(llm.generate(["The capital of France is"],
                   SamplingParams(temperature=0, max_tokens=20))[0].outputs[0].text)

Pitfall 9: `Could not find nvcc and default cuda_home='/usr/local/cuda'`

flashinfer JIT-compiles sampling kernels at runtime and needs `nvcc`, but at runtime nobody set `CUDA_HOME`, so it falls back to the nonexistent `/usr/local/cuda`. Because our toolkit lives in the venv, export it (and bake it into `activate` so it’s always present):

cat >> $VIRTUAL_ENV/bin/activate <<'EOF'
export CUDA_HOME="$VIRTUAL_ENV/lib/python3.14/site-packages/nvidia/cu13"
export PATH="$CUDA_HOME/bin:$PATH"
EOF
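When vLLM is launched from a script or service that never sources `activate`, the same path can be computed inside Python before vLLM is imported. This is a sketch under the assumption that the wheels use the `nvidia/cu13` layout described in Step 5; both helper names are hypothetical, and setting the variables early in the process should be enough for any JIT step that reads them later.

```python
import os
import posixpath
import sys


def venv_cuda_home(prefix: str, py_version: tuple) -> str:
    """Path of the pip-assembled toolkit inside a venv (layout from Step 5)."""
    major, minor = py_version[:2]
    return posixpath.join(prefix, "lib", f"python{major}.{minor}",
                          "site-packages", "nvidia", "cu13")


def export_cuda_home() -> str:
    """Set CUDA_HOME (unless already set) and prepend its bin/ to PATH."""
    cuda_home = os.environ.setdefault(
        "CUDA_HOME", venv_cuda_home(sys.prefix, sys.version_info))
    os.environ["PATH"] = (posixpath.join(cuda_home, "bin") + os.pathsep
                          + os.environ.get("PATH", ""))
    return cuda_home


print(venv_cuda_home("/home/user/go/venv", (3, 14)))
# → /home/user/go/venv/lib/python3.14/site-packages/nvidia/cu13
```

Calling `export_cuda_home()` at the top of the entry point, before `import vllm`, also avoids hard-coding `python3.14` in the path.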

This is also where Pitfall 5c (cccl version check) and the missing `lib64` symlink (`cannot find -lcudart`) bite: they’re runtime-JIT failures, not build failures, so they only appear here. With the 13.3 alignment + the `lib64` symlink in place, the JIT compile succeeds and you get:

PROMPT: 'The capital of France is'
OUTPUT: ' the capital of the French Republic...'

🎉

Step 8: Run the GPU test suite

A `generate()` proves the happy path; the kernel tests prove the build broadly. The suite that most directly exercises what you just compiled is `tests/kernels/`. Run it with `CUDA_HOME` exported and `$CUDA_HOME/bin` on `PATH` (the tests JIT-compile too):

export CUDA_HOME="$VIRTUAL_ENV/lib/python3.14/site-packages/nvidia/cu13"
export PATH="$CUDA_HOME/bin:$PATH"
python -m pytest tests/kernels/core tests/kernels/attention -q

On an A10G a focused subset (activation, layernorm, rotary/positional encoding, paged attention, cache) runs in ~1 hr and lands at 2402 passed, 583 skipped, 36 failed. The 583 skips are arch-gated kernels (Hopper/Blackwell sm_90+) correctly opting out. The 36 failures are all the same issue — see Pitfall 10.

Pitfall 10: FP8 KV-cache tests fail (not skip) on SM < 89

Every one of those 36 failures is `test_reshape_and_cache_flash[...fp8...]` with:

FP8 KV cache needs native fp8e4nv (SM89+). Use --kv-cache-dtype bfloat16 ...

The A10G is sm_86; native FP8 (`fp8e4nv`) needs sm_89+ (Ada/Hopper). This is a hardware limit, not a broken build. But unlike the cleanly arch-gated kernels, this Triton path `assert`s on unsupported hardware instead of `skip`ping, so it counts as a failure. Deselect the FP8 cases to get a fully green run:

python -m pytest tests/kernels/attention/test_cache.py -k "not fp8" -q

Takeaway: on pre-Ada GPUs, treat FP8 KV-cache test failures as expected, and gate them out with `-k "not fp8"` rather than chasing them.
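The hardware gate behind these failures is a plain compute-capability comparison. A sketch of it, useful for deciding up front whether to add `-k "not fp8"` (the function name is hypothetical; the sm_89 threshold is the one quoted in the error message):

```python
def supports_native_fp8(compute_cap: str) -> bool:
    """True if the GPU's compute capability is sm_89 or newer."""
    major, minor = (int(part) for part in compute_cap.split("."))
    return (major, minor) >= (8, 9)


print(supports_native_fp8("8.6"))  # A10G → False
print(supports_native_fp8("8.9"))  # Ada  → True
print(supports_native_fp8("9.0"))  # Hopper → True
```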

Appendix: every error → one-line fix

Error	Root cause	Fix
`nvidia-smi: command not found` (assumed no GPU)	driver not installed; hardware was there	`lspci \| grep nvidia` to detect hardware
`modprobe nvidia: No such device`	nouveau owns the GPU	blacklist + unbind + `rmmod nouveau`
`CUDA unknown error` / `cuInit → 999`	missing/stale UVM nodes; no `nvidia-modprobe`	`apt install nvidia-modprobe`; recreate `/dev/nvidia-uvm`
`nvidia-cuda-nvcc-cu13` has no real version	wrong package name	use `cuda-toolkit[nvcc]`
`ptxas Unsupported .version 9.3`	nvcc/ptxas minor mismatch	pin all CUDA pkgs to one minor
CMake: `nvcc says 13.3 but headers say 13.0`	runtime headers ≠ nvcc	align headers to nvcc version
`mathcalls.h: rsqrt ... incompatible`	CUDA 13.0 headers vs glibc 2.43	use CUDA ≥ 13.1 headers
flash-attn CMake: Python 3.14 not supported	submodule allow-list	patch `utils.cmake` (re-apply after any update that bumps its tag)
`ResolutionImpossible` (cuda-toolkit 13.0.2 vs ≥13.2)	conflicting CUDA pins across vLLM deps	build with `pip install -e . --no-deps`
`cccl: compiler and toolkit headers incompatible`	runtime downgraded after vLLM reinstall	re-pin CUDA runtime to nvcc’s minor
`cannot find -lcudart` (JIT link)	wheels use `lib/`, tool wants `lib64/`	`ln -sfn $CUDA_HOME/lib $CUDA_HOME/lib64`
`Could not find nvcc ... /usr/local/cuda`	`CUDA_HOME` unset at runtime	export `CUDA_HOME` (bake into `activate`)
`FP8 KV cache needs native fp8e4nv (SM89+)` (test fails)	A10G is sm_86; FP8 path asserts instead of skipping	not a build bug — deselect with `-k "not fp8"`

Updating an existing checkout

Pulling a newer vLLM isn’t just `git pull`: an editable source build has moving parts that a pull invalidates. The sequence that works:

git fetch upstream && git reset --hard upstream/main   # or your target commit
rm -rf build .deps && find vllm -name '*.abi3.so' -delete   # force a clean rebuild
pip install -v -e . --no-build-isolation --no-deps     # --no-deps dodges resolver conflicts

Before pulling, check the gap with `git diff --name-only HEAD..upstream/main | grep -E '\.cu|CMakeLists|requirements/'`. If native/build files changed (they usually have), budget for a full recompile (~30–40 min) and re-verification. Also confirm the `torch==` pin and `requires-python` in `pyproject.toml` didn’t move; if torch’s version changed, you’re re-doing the whole CUDA/toolkit alignment, not just a rebuild.
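The `grep -E` filter in that check can also be applied to a list of changed paths in a script. A sketch using the same pattern; `files_forcing_rebuild` is a hypothetical helper, and apart from `requirements/cuda.txt` the sample paths are invented for illustration:

```python
import re

# Same pattern as the grep in the text: CUDA sources, CMake files, requirement pins.
NATIVE_OR_BUILD = re.compile(r"\.cu|CMakeLists|requirements/")


def files_forcing_rebuild(changed_paths: list) -> list:
    """Return the changed paths that imply a full native rebuild."""
    return [path for path in changed_paths if NATIVE_OR_BUILD.search(path)]


# Illustrative paths, as printed by `git diff --name-only HEAD..upstream/main`.
changed = [
    "vllm/config.py",
    "csrc/cache_kernels.cu",
    "CMakeLists.txt",
    "requirements/cuda.txt",
]
print(files_forcing_rebuild(changed))
```

An empty result means the pull touched only Python files, and the existing compiled extensions can usually be kept.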

Key takeaways

- Detect hardware with `lspci`, not `nvidia-smi`. Don’t build for CPU because a tool is missing.
- `nvidia-smi` working ≠ CUDA working. UVM nodes + `nvidia-modprobe` matter.
- Pin the entire CUDA pip toolkit to one minor version. Skew fails three different ways at three different stages.
- Pick the CUDA minor that’s compatible with your glibc/compiler, then rely on CUDA minor-version compatibility for the driver/torch.
- A green build isn’t done: runtime JIT (flashinfer) needs `CUDA_HOME` and a couple of symlinks. Verify with a real `generate()`.
- Scope `TORCH_CUDA_ARCH_LIST` to your GPU to keep build times sane.
- Some test failures are hardware limits, not build bugs. On pre-Ada GPUs the FP8 KV-cache tests `assert` instead of `skip`; deselect them with `-k "not fp8"`.

Disclaimer: This article was generated using the Gemini 3.1 Pro model.

CC BY 4.0 by the author.

Source: hiraditya.github.io (original article)