Crossing the Boundary: Custom Kernels and the C++/Python ABI in vLLM

vLLM, a large-model inference serving framework, uses Python for control flow but pushes arithmetic into compiled C++ and CUDA kernels to avoid interpreter overhead. Crossing the Python/C++ boundary incurs fixed costs from argument marshaling, the PyTorch dispatcher, and kernel launches, and at small batch sizes those costs can rival the time the GPU spends on the math itself. vLLM mitigates this with fused kernels and CUDA graphs, which amortize the per-call overhead.

Python is a productive orchestration language for inference serving, but it is the wrong tool for the critical path of token generation. Large-model inference is bound by memory bandwidth and by per-operation latency, and the interpreter cannot meet either constraint. So frameworks like vLLM keep the control flow in Python and push the arithmetic into compiled C++ and CUDA kernels.

That split is not free. Every call has to cross the Python/C++ boundary, and that crossing involves an Application Binary Interface (ABI), a dispatcher, and the overhead of launching work on the GPU.

This post walks through how vLLM crosses that boundary: how kernels are registered with PyTorch, the ABI constraints that make the boundary fragile, the hardware-level decisions inside the kernels themselves, and how vLLM amortizes the per-call overhead with CUDA graphs.

Here is the whole path a single custom op travels, from a Python call to execution on the GPU’s streaming multiprocessors (SMs):

php graph TD A "Python: vllm.paged attention v1 ... " -- |GIL held| B "CPython C-API frame" B -- C "torch op binding