The open-source LLM inference landscape in 2025–2026 has split into three tiers of specialization. At one end are throughput-oriented serving engines (vLLM, SGLang, TensorRT-LLM) designed for production-scale GPU clusters. At the other are portability-focused runtimes (llama.cpp, LM Studio, Ollama, GPT4All, llamafile) built for consumer hardware, edge devices, and offline deployment. A third tier of compilation-driven frameworks (MLC LLM, tinygrad, LightLLM) targets cross-platform execution and developer extensibility.
The dominant architectural pattern across the serving tier is continuous batching combined with KV-cache virtualization. The latter was pioneered by vLLM’s PagedAttention, and comparable KV-cache management now appears in SGLang (RadixAttention, which uses a radix tree to reuse cached prefixes), TensorRT-LLM, and TGI. On the consumer side, llama.cpp’s GGUF format for quantized models has become the de facto standard, powering Ollama, LM Studio, GPT4All, and dozens of downstream tools.
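To make the serving-tier pattern concrete, here is a minimal Python sketch of the two ideas working together: a block allocator that hands out fixed-size KV-cache blocks on demand (the paging idea behind PagedAttention) and a scheduler loop that admits waiting requests as soon as a running one finishes (continuous batching). It is a toy model, not code from vLLM or any other engine. The names `BlockAllocator`, `Sequence`, `run_continuous_batching`, and `BLOCK_SIZE` are hypothetical, no real attention is computed, and preemption under memory pressure is left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV block; an assumed toy value


class BlockAllocator:
    """Hands out fixed-size physical KV-cache blocks from a free list."""

    def __init__(self, num_blocks: int) -> None:
        self.free: list[int] = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            # A real engine would preempt or swap a sequence here.
            raise MemoryError("KV cache exhausted")
        return self.free.pop()

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)


@dataclass
class Sequence:
    seq_id: int
    remaining: int  # tokens still to generate
    num_tokens: int = 0
    # Logical block i of this sequence lives in physical block block_table[i].
    block_table: list[int] = field(default_factory=list)

    def append_token(self, allocator: BlockAllocator) -> None:
        # Allocate a new block only when the last one is full, so memory
        # grows with the actual sequence length rather than a reserved maximum.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1
        self.remaining -= 1


def run_continuous_batching(
    waiting: list[Sequence], allocator: BlockAllocator, max_batch: int
) -> tuple[list[int], int]:
    """Return (ids in completion order, number of decode steps)."""
    waiting = list(waiting)
    running: list[Sequence] = []
    finished: list[int] = []
    steps = 0
    while waiting or running:
        # Refill free batch slots before every step, not once per batch.
        while waiting and len(running) < max_batch:
            running.append(waiting.pop(0))
        for seq in running:  # one decode step for the whole batch
            seq.append_token(allocator)
        steps += 1
        still_running = []
        for seq in running:
            if seq.remaining == 0:
                allocator.release(seq.block_table)  # blocks reusable at once
                finished.append(seq.seq_id)
            else:
                still_running.append(seq)
        running = still_running
    return finished, steps


if __name__ == "__main__":
    requests = [Sequence(0, remaining=2), Sequence(1, remaining=6), Sequence(2, remaining=3)]
    order, steps = run_continuous_batching(requests, BlockAllocator(8), max_batch=2)
    print(order, steps)
```

In this toy run the third request joins the batch as soon as the first one finishes, so the work completes in 6 decode steps. Static batching with the same batch size would need 9: six for the first pair, then three for the last request.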