I built an interactive 11-chapter guide to how LLM inference actually works

A developer built an 11-chapter interactive guide explaining how LLM inference works, centered around nano-vLLM, a 1,200-line Python reimplementation of the vLLM serving engine. The guide covers algorithms like PagedAttention and sampling with interactive simulators, requiring no ML background.

Production vLLM is 100,000+ lines of C++, CUDA, and Python. It powers a large share of the industry's LLM serving, but reading it cold is brutal.

So I built a study series around nano-vLLM, an open-source reimplementation of vLLM's core ideas in ~1,200 lines of pure Python. Every algorithm is visible. Every design decision is legible. It turned out to be the perfect lens for actually understanding how LLMs generate text.

The result is an 11-chapter interactive guide. No ML background required: every piece of jargon is explained from scratch with analogies, diagrams, annotated source code, interactive simulators, and quizzes.

Each chapter is fully self-contained and interactive. A few of the simulators I'm most happy with: a PagedAttention block allocator you can fill up and watch fragment, a live scheduler you step through token by token, and a sampling playground where you reshape the probability distribution with sliders and sample from it.

🔗 Read the full series: https://ashwing.github.io/vllm-guide/

It's free and open. If you've ever wanted to understand what actually happens between sending a prompt and getting tokens back, this is the path I wish I'd had.

Feedback very welcome. Happy to answer questions about any of the concepts in the comments.
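For a taste of what the sampling playground lets you explore, here is a minimal sketch of temperature scaling and top-k filtering in plain Python. It is not taken from nano-vLLM or from the guide: the function names (`softmax`, `sample_token`) and the toy logits are assumptions made up for illustration, and a real engine does this on GPU tensors rather than Python lists.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, top_k=None, rng=random):
    """Sample one token index after temperature scaling and optional top-k filtering."""
    indices = list(range(len(logits)))
    if top_k is not None:
        # Keep only the k highest-scoring token indices.
        indices = sorted(indices, key=lambda i: logits[i], reverse=True)[:top_k]
    probs = softmax([logits[i] for i in indices], temperature)
    return rng.choices(indices, weights=probs, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
    for t in (0.5, 1.0, 2.0):
        print(t, [round(p, 3) for p in softmax(toy_logits, t)])
    print(sample_token(toy_logits, temperature=0.8, top_k=2, rng=random.Random(0)))
```

Running it prints the distribution at three temperatures, which is roughly what dragging a temperature slider does: lower values pile probability onto the top token, higher values flatten it out. Setting `top_k=1` reduces sampling to greedy decoding.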