Show HN: Tiny-vLLM - high performance LLM inference engine in C++ and CUDA

A developer has released tiny-vllm, a high-performance LLM inference engine written in C++ and CUDA that serves as a smaller sibling to the vLLM project. The open-source repository includes both the full source code for an inference server capable of running real models like Llama 3.2 1B Instruct, and an educational course that teaches users how to implement the engine from scratch, covering topics from Safetensors loading to PagedAttention. The project aims to serve as a learning tool for individuals and a teaching resource for university lecturers.

You're going to build a high-performance LLM inference engine with C++ and CUDA: tiny-vllm, a younger and smaller sibling of vLLM (https://github.com/vllm-project/vllm). We will learn a lot along the way, make mistakes, and derive the ideas and maths from scratch.

This repository consists of two things: (1) the full source code of the inference server and (2) a course in which I lead you through the process of implementing the engine.
Feel free to use it as a learning tool on your own learning path or, if you are a lecturer, as a teaching resource at your university.

The inference engine consists of:

- loading a real LLM from Safetensors (Llama 3.2 1B Instruct)
- a full LLM forward pass (prefill + decode)
- all computation done with CUDA kernels
- KV cache
- static batching
- continuous batching
- online softmax, FlashAttention-like (https://courses.cs.washington.edu/courses/cse599m/23sp/notes/flashattn.pdf)
- PagedAttention (https://arxiv.org/pdf/2309.06180)

Make yourself a hot beverage and let's begin.

- tiny-vllm
- Intro: LLM, vLLM, models, inference servers
- Technical prerequisites
- Safetensors and your model
- How floating-point numbers work and why we use bfloat16
- GPU and CPU memory
- Single token inference
- Tokenization
- Embeddings
- CUDA kernel engineering - embeddings
- RMSNorm and parallel reduction in CUDA
- RoPE
- Residual connections
- cublasGemmEx
- The column-major to row-major transposition trick
- Prefill vs decode
- Why KV cache exists
- Attention
- GQA
- SiLU
- Softmax
- Causal mask
- Argmax
- Feed forward network
- Buffer reuse
- Static batching
- Continuous batching
- Online softmax
- Paged Attention
- Paged KV cache
- Paged Attention CUDA kernel

It's easy to get lost with so much going on in recent years. Let's unpack it.

An LLM is a model. Physically, an LLM is a file that contains a lot of floating-point numbers.
Conceptually, these numbers represent the weights of operations. Weights are learned (discovered, found) during the training phase. Some of the operations use these weights. Every operation is a function that takes some data as input, does something with it, and produces data as output. The operations and their order are defined by the LLM's architecture. Every model has its own architecture, which is designed by engineers and researchers.

The process of going from zero to an LLM writing text looks like this:

Design the model - Engineers and researchers use a high-level language like Python with a tensor library like PyTorch (https://github.com/pytorch/pytorch) or tinygrad (https://github.com/tinygrad/tinygrad) to design the model's architecture. They train small versions of the model and experiment with different operations, data, and hyperparameters (parameters for operations). This is the phase of figuring out the specification.

Implement the model - Once they decide on the final model architecture and prepare the data for training, they write the code that defines the final model. This can also be done in PyTorch or similar.

Train the model - The chosen model architecture is initialized with dummy weights. They write a script, again using PyTorch or similar, that runs a learning algorithm like backpropagation on a lot of hardware, like GPUs (https://en.wikipedia.org/wiki/Graphics_processing_unit) and TPUs (https://en.wikipedia.org/wiki/Tensor_Processing_Unit). This phase burns a lot of energy, money, and computational power. The product of the training phase is a file with model weights in some format, like the Safetensors format (https://huggingface.co/docs/safetensors/index). So the training phase is about finding a set of weights that produces good text with the given architecture.

Serve the model (we are here) - The file with weights can't be run on a computer. It's not an executable. It's a lot of numbers. The architecture can't be run either: it's just a plan, a blueprint, a description of computation.
To actually run the model, we need a program that turns the architecture and its operations into executable code and loads the weights from the weights file into that architecture. Once you write a program that implements the operations, and once that program loads the weights (they are loaded at runtime, when the program starts), you can finally send prompts to the model and get a meaningful response. Generating output from a model is called inference. That's why what we build here is called an inference server or inference engine.

Knowing why an inference server is needed, let's think about why we build it in C++ and CUDA. It's because we want to use the hardware as efficiently as possible and get high performance. That means we want to get responses fast and to handle multiple prompts at the same time. CUDA is a whole ecosystem, but also a language that you use to write code that runs on GPUs. We need to write code for GPUs because many operations inside an LLM multiply and add many numbers. If you need to do a small amount of math, a CPU is enough. If a lot, a GPU is better. LLMs are mostly about multiplying matrices, which boils down to computing dot products of two vectors, for many numbers and for many vectors. The math of LLMs is simple: we will need the basics of linear algebra, and you can learn while coding and fill the gaps on the go. I find this way of JIT learning the most effective, and perhaps you will like it too.

My take on the relationship between AI and computation, which you may find useful, is that the intelligence comes from a lot of model parameters and a lot of computation on input values using these parameters. There is no single element that you can point to and say: "this is what makes the model intelligent or useful". You can replace every part of the model with a different one and get different tradeoffs in return, like trading accuracy for complexity.
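The earlier remark that matrix multiplication boils down to many dot products can be sketched in plain C++ on the CPU. This is only an illustration under assumptions: the function name `matmul_naive`, the row-major flat-vector layout, and the use of `float` are hypothetical choices made here, not code taken from the tiny-vllm repository, which does this work on the GPU.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the repository): computes C = A * B,
// where A has shape (m x k) and B has shape (k x n).
// All matrices are stored row-major in flat vectors.
std::vector<float> matmul_naive(const std::vector<float>& a,
                                const std::vector<float>& b,
                                std::size_t m, std::size_t k, std::size_t n) {
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t row = 0; row < m; ++row) {
        for (std::size_t col = 0; col < n; ++col) {
            // Each output element is one dot product:
            // row `row` of A with column `col` of B.
            float acc = 0.0f;
            for (std::size_t i = 0; i < k; ++i) {
                acc += a[row * k + i] * b[i * n + col];
            }
            c[row * n + col] = acc;
        }
    }
    return c;
}
```

Every one of the m * n output elements is an independent dot product of length k, so none of them has to wait for another. That independence is what makes the operation a good fit for the many parallel threads of a GPU.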
I hope I won't forget to get back to these tradeoffs later, when we touch the math of attention, because the default attention mechanism is computationally very expensive: O(n^2 d), where n is the sequence length and d is the embedding dimension. This complexity can be challenged, and people do in fact come up with alternative attention mechanisms, like linear attention (https://haileyschoelkopf.github.io/blog/2024/linear-attn/).

If more people find this course useful, I will think about another one: either about ML compilers (a practical one in Python or C++, plus some SSA theory) or about alternative attention mechanisms (math + CUDA kernels). If you are interested, please let me know. If you find this course valuable, please let other people know about it.

Out of scope: The training phase of an LLM is something we don't do in this course. We take a trained LLM and write a program that runs it fast on an NVIDIA GPU for multiple requests in parallel. If you want to train your own LLM, I strongly recommend sensei Karpathy's repositories, like nanoGPT and llm.c, and his YouTube channel.

Similarly, we don't design the model, but tensor libraries are also a fascinating topic and worth understanding from scratch. George Hotz's tinygrad is a project that implements a tensor library in very little code, so if you want to get inspired and learn the internals, it's a good place to do it (also, their Discord is nice). There is also a somewhat older and smaller project by Andrej Karpathy: micrograd.

And since I brought up Discord, I want to recommend Mark Saroufim's GPU MODE. Many great people hang out there.

If you feel lost with what is going on here and you are new on your AI/ML journey, start with the fastai book by Jeremy Howard and Rachel Thomas.

I conveniently omit the data science and engineering part here, because I don't know much about it. Kaggle is probably a good place to start and to learn hands-on.

Last but not least, we're going to code in C++ and CUDA and use cuBLAS where applicable.
You can learn them on the go. NVIDIA's official resources are good and helpful.

You can build and run it on any platform, with minor changes, assuming you have an NVIDIA GPU. You might need to adjust some paths, like CUDA or GCC in c_cpp_properties.json (/jmaczan/tiny-vllm/blob/main/.vscode/c_cpp_properties.json) or NVCC in CMakeLists.txt (/jmaczan/tiny-vllm/blob/main/CMakeLists.txt). I suggest you fork this repo, make the adjustments needed for it to work on your machine, and then create a pull request to jmaczan/tiny-vllm (https://github.com/jmaczan/tiny-vllm) to upstream your changes for the benefit of other readers.

The exact setup on which I develop and test it:

- Linux 6.19.8, x86_64
- CUDA Toolkit (https://docs.nvidia.com/cuda/cuda-installation-guide-linux/) 13.1
- C++17, GCC (https://gcc.gnu.org/) 15.2.1
- The only external dependency you will pull in is the JSON parser nlohmann/json (https://github.com/nlohmann/json) 3.12.0, which is a single header file, include/json.hpp (/jmaczan/tiny-vllm/blob/main/include/json.hpp)
- AMD CPU: Ryzen 7 9800X3D
- NVIDIA GPU: RTX 5090
- I used Llama 3.2 1B Instruct (https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) from Hugging Face (commit hash 898999bd25b40516fce5a5b8f0948f4c81c650bc). You need just the model.safetensors file from that repository

Install the dependencies and run the program with ./test.sh - it will build it and immediately execute it.

If you fail to build or run it and your AI of choice can't help, please open an issue on GitHub and I will try to help. Make sure to provide all useful context.

The first thing you need to do is download an LLM to run inference on. I chose Llama 3.2 1B Instruct because it's easy, small, tuned for dialogs, and good enough for us. From our perspective, as the engineers who build an inference server, the model is just a single file containing weights. The model is in the Safetensors format (https://huggingface.co/docs/safetensors/index).
Other formats exist, like Pickle (https://docs.python.org/3/library/pickle.html) and Parquet (https://parquet.apache.org/docs/file-format/). Safetensors is simply very popular and widely used, and the model we picked is hosted in Safetensors.

Let's stop for a moment and understand the Safetensors format before we move on. A Safetensors file consists of 3 sections, always in this order: header size, header, and tensor data. The header size is always 8 bytes. These 8 bytes are an unsigned 64-bit little-endian integer that says how many bytes the actual header takes.

std::ifstream safetensors_file("model.safetensors", std::ios_base::binary);
uint64_t header_size;
safetensors_file.read(reinterpret_cast