{"slug": "free-llm-inference-handbook-100-engineers-cloned-it-in-week-1", "title": "Free LLM inference handbook: 100 engineers cloned it in week 1", "summary": "More than 100 engineers cloned a free open-source handbook on large language model inference within its first week of release. The guide consolidates years of production experience and research to address the unique challenges of serving LLMs, including unpredictable latency, growing memory demands, and high costs. The project aims to fill a gap in available resources by providing a comprehensive reference for deploying LLMs in production.", "body_md": "**The definitive guide to serving large language models in production.**\n\n[Quick Start](#-quick-start) •\n[Contents](#-table-of-contents) •\n[Labs](#-labs) •\n[Community](#-community) •\n[Contributing](#-contributing)\n\nLLM inference is hard. Not \"read the docs and figure it out\" hard, but **fundamentally different from everything else in ML** hard.\n\nTraditional ML inference is a solved problem. You batch requests, run a forward pass, and return results. Latency is predictable, memory is fixed, and scaling is linear.\n\nLLM inference breaks all of these assumptions:\n\n- **Latency is unpredictable:** a 10-token response takes 100 ms, a 1000-token response takes 10 seconds\n- **Memory grows during requests:** the KV cache expands with every generated token\n- **Scaling is sub-linear:** communication overhead dominates as you add GPUs\n- **Cost is 100x higher:** $0.001/request becomes $0.10/request\n\nThis handbook exists because we needed it and couldn't find it. The knowledge is scattered across papers, blog posts, tribal knowledge, and source code comments. We've consolidated years of production experience and research into one comprehensive resource.\n\n**This is the guide we wish existed when we started.**\n\n📬 **Follow the build:** New chapters, explained in plain English with production context. 
Subscribe to [The Engineer's Digest](https://harshuljain.substack.com) to get notified when new content drops.\n\n## 💬 [Join the discussion](https://github.com/harshuljain13/llm-inference-at-scale/discussions)\n\nQuestions, feedback, and corrections welcome.\n\n- Python 3.10+\n- CUDA 12.0+ (for GPU labs)\n- Basic PyTorch familiarity\n\n```\ngit clone https://github.com/harshuljain13/llm-inference-at-scale.git\ncd llm-inference-at-scale\n\n# Create virtual environment\npython -m venv .venv\nsource .venv/bin/activate  # On Windows: .venv\\Scripts\\activate\n\n# Install dependencies\npip install -r requirements.txt\n\n# Open the first chapter (open is macOS-only; on Linux use xdg-open)\nopen content/00_foundations/00.0_what_is_llm_inference/what_is_llm_inference.md\n```\n\nOr browse the [Table of Contents](#-table-of-contents) below.\n\n**Part I: Foundations**\n\n| Chapter | Title |\n|---|---|\n| 0.0 | [What is LLM Inference?](/harshuljain13/llm-inference-at-scale/blob/master/content/00_foundations/00.0_what_is_llm_inference/what_is_llm_inference.md) |\n| 0.1 | [Why LLM Inference is Different](/harshuljain13/llm-inference-at-scale/blob/master/content/00_foundations/00.1_why_llm_inference_is_different/why_llm_inference_is_different.md) |\n| 0.2 | [Transformer Inference Mechanics](/harshuljain13/llm-inference-at-scale/blob/master/content/00_foundations/00.2_transformer_inference_basics/transformer_inference_basics.md) |\n\n**Part II: GPU Fundamentals**\n\n| Chapter | Title |\n|---|---|\n| 1.1 | |\n| 1.2 | [Roofline Model](/harshuljain13/llm-inference-at-scale/blob/master/content/01_gpu_fundamentals/01.2_roofline_model/roofline_model.md) |\n| 1.3 | [FlashAttention](/harshuljain13/llm-inference-at-scale/blob/master/content/01_gpu_fundamentals/01.3_flash_attention/flash_attention.md) |\n\n**Part III: Attention & KV Cache**\n\n| Chapter | Title |\n|---|---|\n| 2.1 | |\n| 2.2 | [Attention Mechanisms](/harshuljain13/llm-inference-at-scale/blob/master/content/02_attention_and_kv/02.2_attention_mechanisms/attention_mechanisms.md) |\n| 2.3 | [PagedAttention](/harshuljain13/llm-inference-at-scale/blob/master/content/02_attention_and_kv/02.3_paged_attention/paged_attention.md) |\n| 2.4 | [KV Cache Compression](/harshuljain13/llm-inference-at-scale/blob/master/content/02_attention_and_kv/02.4_kv_cache_compression/kv_cache_compression.md) |\n\n**Part IV: Optimization Techniques**\n\n| Chapter | Title |\n|---|---|\n| 3.1 | [Quantization](/harshuljain13/llm-inference-at-scale/blob/master/content/03_optimization/03.1_quantization/quantization.md) |\n| 3.2 | [TurboQuant](/harshuljain13/llm-inference-at-scale/blob/master/content/03_optimization/03.2_turboquant/turboquant.md) |\n| 3.3 | [Continuous Batching](/harshuljain13/llm-inference-at-scale/blob/master/content/03_optimization/03.3_continuous_batching/continuous_batching.md) |\n| 3.4 | [Speculative Decoding](/harshuljain13/llm-inference-at-scale/blob/master/content/03_optimization/03.4_speculative_decoding/speculative_decoding.md) |\n| 3.5 | [Chunked Prefill](/harshuljain13/llm-inference-at-scale/blob/master/content/03_optimization/03.5_chunked_prefill/chunked_prefill.md) |\n\n**Part V: Inference Engines**\n\n| Chapter | Title |\n|---|---|\n| 4.1 | [vLLM](/harshuljain13/llm-inference-at-scale/blob/master/content/04_engines/04.1_vllm/vllm.md) |\n| 4.2 | [SGLang](/harshuljain13/llm-inference-at-scale/blob/master/content/04_engines/04.2_sglang/sglang.md) |\n| 4.3 | [TensorRT-LLM](/harshuljain13/llm-inference-at-scale/blob/master/content/04_engines/04.3_tensorrt_llm/tensorrt_llm.md) |\n\n**Part VI: Scaling**\n\n| Chapter | Title |\n|---|---|\n| 5.1 | |\n| 5.2 | [MoE Inference](/harshuljain13/llm-inference-at-scale/blob/master/content/05_scaling/05.2_moe_inference/moe_inference.md) |\n| 5.3 | [Distillation](/harshuljain13/llm-inference-at-scale/blob/master/content/05_scaling/05.3_distillation) |\n\n**Part VII: Production Serving**\n\n| Chapter | Title |\n|---|---|\n| 6.1 | |\n| 6.2 | [EKS + KServe](/harshuljain13/llm-inference-at-scale/blob/master/content/06_serving/06.2_eks_kserve/eks_kserve.md) |\n| 6.3 | [SageMaker](/harshuljain13/llm-inference-at-scale/blob/master/content/06_serving/06.3_sagemaker/sagemaker.md) |\n| 6.4 | [Disaggregated Serving](/harshuljain13/llm-inference-at-scale/blob/master/content/06_serving/06.4_disaggregated_serving/disaggregated_serving.md) |\n| 6.5 | [Cold Start](/harshuljain13/llm-inference-at-scale/blob/master/content/06_serving/06.5_cold_start/cold_start.md) |\n\n**Part VIII: Operations**\n\n| Chapter | Title |\n|---|---|\n| 7.1 | |\n| 7.2 | [Structured Output](/harshuljain13/llm-inference-at-scale/blob/master/content/07_operations/07.2_structured_output) |\n| 7.3 | [Edge Deployment](/harshuljain13/llm-inference-at-scale/blob/master/content/07_operations/07.3_edge_deployment/edge_deployment.md) |\n\nHands-on exercises to reinforce each concept. Each lab includes starter code, step-by-step instructions, and solutions.\n\n| Lab | Title |\n|---|---|\n| 01 | |\n| 02 | [VRAM Calculation](/harshuljain13/llm-inference-at-scale/blob/master/labs/lab_02_vram_calculation) |\n| 03 | [Quantization Comparison](/harshuljain13/llm-inference-at-scale/blob/master/labs/lab_03_quantization_comparison) |\n| 04 | [vLLM Deployment](/harshuljain13/llm-inference-at-scale/blob/master/labs/lab_04_vllm_deployment) |\n| 05 | [SGLang Structured Output](/harshuljain13/llm-inference-at-scale/blob/master/labs/lab_05_sglang_structured_output) |\n| 06 | [Tensor Parallelism](/harshuljain13/llm-inference-at-scale/blob/master/labs/lab_06_tensor_parallelism) |\n| 07 | [Ray Serve Deployment](/harshuljain13/llm-inference-at-scale/blob/master/labs/lab_07_ray_serve_deployment) |\n| 08 | [EKS + KServe](/harshuljain13/llm-inference-at-scale/blob/master/labs/lab_08_eks_kserve_deployment) |\n| 09 | [SageMaker Production](/harshuljain13/llm-inference-at-scale/blob/master/labs/lab_09_sagemaker_production) |\n| 10 | [Benchmarking Suite](/harshuljain13/llm-inference-at-scale/blob/master/labs/lab_10_benchmarking_monitoring) |\n\n**Hardware requirements:** Most labs run on a single GPU (g5.xlarge or equivalent). 
Labs 06 and 08 require multi-GPU instances.\n\nFormulas you'll use constantly when working with LLM inference:\n\nThe theoretical maximum decode speed for a single sequence, limited by how fast you can read the model weights from GPU memory (every weight is read once per generated token):\n\n```\nmax_tokens_per_second = memory_bandwidth / model_size_bytes\n```\n\n**Example:** Llama 8B (16 GB in FP16) on A100 (2 TB/s) → 2 TB/s ÷ 16 GB = 125 tokens/sec maximum\n\nMemory required for the key-value cache (the leading 2 accounts for storing both keys and values):\n\n```\nkv_cache_bytes = 2 × num_layers × num_kv_heads × head_dim × seq_len × batch_size × dtype_bytes\n```\n\n**Example:** Llama 8B (assuming 32 layers, 8 KV heads, head_dim 128), batch=1, seq=4096, FP16 → 2 × 32 × 8 × 128 × 4096 × 1 × 2 bytes = 512 MiB\n\nArithmetic intensity determines whether a workload is compute-bound or memory-bound:\n\n```\narithmetic_intensity = FLOPs / bytes_transferred\n```\n\n**Rule of thumb:** Below the ridge point (~156 FLOPs/byte on A100, i.e. 312 TFLOPS FP16 ÷ 2 TB/s) = memory-bound\n\n```\nllm-inference-at-scale/\n├── content/                      # 📖 Handbook chapters\n│   ├── 00_foundations/           #    Part I: Foundations\n│   ├── 01_gpu_fundamentals/      #    Part II: GPU Fundamentals\n│   ├── 02_attention_and_kv/      #    Part III: Attention & KV Cache\n│   ├── 03_optimization/          #    Part IV: Optimization Techniques\n│   ├── 04_engines/               #    Part V: Inference Engines\n│   ├── 05_scaling/               #    Part VI: Scaling\n│   ├── 06_serving/               #    Part VII: Production Serving\n│   ├── 07_operations/            #    Part VIII: Operations\n│   └── utils/                    #    Visualization utilities\n├── labs/                         # 🧪 Hands-on exercises\n├── reference/                    # 📋 Quick references\n│   ├── cheat_sheet.md            #    One-page summary\n│   ├── glossary.md               #    Terminology\n│   ├── vllm_quick_reference.md   #    vLLM commands\n│   └── cost_calculator.py        #    Inference cost estimation\n├── assets/                       # 🎨 Images and diagrams\n└── slides/                       # 📊 Presentation materials\n```\n\nFor engineers who need to deploy an LLM this week:\n\n- [0.0 What is LLM Inference?](/harshuljain13/llm-inference-at-scale/blob/master/content/00_foundations/00.0_what_is_llm_inference/what_is_llm_inference.md) (15 min)\n- [0.1 Why LLM Inference is Different](/harshuljain13/llm-inference-at-scale/blob/master/content/00_foundations/00.1_why_llm_inference_is_different/why_llm_inference_is_different.md) (20 min)\n- [3.1 Quantization](/harshuljain13/llm-inference-at-scale/blob/master/content/03_optimization/03.1_quantization/quantization.md) (20 min)\n- [4.1 vLLM](/harshuljain13/llm-inference-at-scale/blob/master/content/04_engines/04.1_vllm/vllm.md) (30 min)\n- [Lab 04: vLLM Deployment](/harshuljain13/llm-inference-at-scale/blob/master/labs/lab_04_vllm_deployment) (45 min)\n\nFor engineers building inference infrastructure:\n\n- **Morning:** Part I (Foundations) + Part II (GPU Fundamentals)\n- **Afternoon:** Part IV (Optimization) + Part V (Engines)\n- **Labs:** 01, 02, 03, 04\n\nFor teams standardizing on LLM serving:\n\n- **Day 1:** Parts I, II, III (Foundations through KV Cache)\n- **Day 2:** Parts IV, V, VI (Optimization through Scaling)\n- **Day 3:** Parts VII, VIII (Production Serving and Operations)\n- **Labs:** All 10 labs\n\nThis material has been presented at:\n\n*More talks coming. If you'd like this at your conference or meetup, open an issue.*\n\nIf you use this material in research or internal documentation, please cite:\n\n```\n@misc{llm-inference-at-scale,\n  title={LLM Inference at Scale: A Practitioner's Handbook},\n  author={Jain, Harshul},\n  year={2025},\n  url={https://github.com/harshuljain13/llm-inference-at-scale}\n}\n```\n\nIf you find this useful, please ⭐ the repo; it helps others discover it.\n\nContributions are welcome. 
This is a living document.\n\n- **Fix errors:** Typos, outdated information, incorrect formulas\n- **Improve clarity:** Better explanations, additional examples\n- **Add content:** New chapters, labs, or reference materials\n\n1. Fork the repository\n2. Create a feature branch (`git checkout -b improve-kv-cache-chapter`)\n3. Make your changes\n4. Submit a pull request\n\nTo report errors or suggest corrections, open a GitHub Issue.\n\n**Harshul Jain** is a Senior ML Infrastructure Engineer at Audible (Amazon), where he owns the ML Feature Store, a GenAI semantic search platform serving millions of customers, and real-time streaming pipelines at scale. He has been building and operating ML infrastructure in production for 4+ years and mentors 300+ engineers through an eMentoring program.\n\n- GitHub: [@harshuljain13](https://github.com/harshuljain13)\n- Newsletter: [The Engineer's Digest](https://harshuljain.substack.com) (LLM inference, deeply explained)\n\nThe views, techniques, and opinions expressed in this handbook are solely those of the author and **do not represent the views of Audible, Amazon, or any affiliated organization**. No proprietary, confidential, or internal Amazon/Audible systems, data, or information has been included. All content is based on publicly available research, open-source tooling, and the author's independent experience and analysis.\n\nThis handbook is provided for **educational purposes only**. Production infrastructure decisions should be validated against your specific workload, hardware, and organizational constraints. The author makes no guarantees about the accuracy, completeness, or fitness for purpose of any content herein.\n\n© 2026 Harshul Jain. All rights reserved.\n\nNo part of this work, including the framework, diagrams, models, terminology, chapter structure, or related materials, may be reproduced, distributed, modified, adapted, or used in whole or in part without prior written permission from the author. 
This includes but is not limited to use in courses, training programs, consulting engagements, publications, presentations, software, or organizational materials.\n\nThe framework presented in this work is the intellectual property of Harshul Jain. It may not be copied, adapted, taught, commercialized, incorporated into derivative works, or used in any professional, commercial, or organizational context (including consulting, training, software, presentations, publications, or organizational materials) without prior written permission.\n\nTo request permission, open a GitHub Issue or contact via the profile above.\n\nThis handbook builds on the work of many researchers and engineers:\n\n- The [vLLM](https://github.com/vllm-project/vllm) team for PagedAttention and continuous batching\n- The [SGLang](https://github.com/sgl-project/sglang) team for RadixAttention\n- Tri Dao for [FlashAttention](https://github.com/Dao-AILab/flash-attention)\n- The authors of foundational papers: Attention Is All You Need, GQA, Medusa, EAGLE, and many others\n\n📬 **Stay updated:** [Subscribe to The Engineer's Digest](https://harshuljain.substack.com) for chapter releases and build-in-public updates.\n\n**Built with ❤️ for the ML infrastructure community**", "url": "https://wpnews.pro/news/free-llm-inference-handbook-100-engineers-cloned-it-in-week-1", "canonical_source": "https://github.com/harshuljain13/llm-inference-at-scale", "published_at": "2026-06-06 12:37:09+00:00", "updated_at": "2026-06-06 13:18:14.497470+00:00", "lang": "en", "topics": ["large-language-models", "machine-learning", "artificial-intelligence", "ai-infrastructure", "generative-ai"], "entities": ["The Engineer's Digest"], "alternates": {"html": "https://wpnews.pro/news/free-llm-inference-handbook-100-engineers-cloned-it-in-week-1", "markdown": "https://wpnews.pro/news/free-llm-inference-handbook-100-engineers-cloned-it-in-week-1.md", "text": 
"https://wpnews.pro/news/free-llm-inference-handbook-100-engineers-cloned-it-in-week-1.txt", "jsonld": "https://wpnews.pro/news/free-llm-inference-handbook-100-engineers-cloned-it-in-week-1.jsonld"}}