A GPU-Poor’s Guide to Local LLM Inference in 2026

A 35-billion-parameter Mixture-of-Experts model runs at 28 tokens per second with full 128K context on a 2019 gaming laptop with a GTX 1660 Ti and 6 GB of VRAM, using llama.cpp's --n-cpu-moe flag and Turboquant's asymmetric KV cache quantization. Local LLM inference on consumer hardware with 4 to 12 GB of VRAM is now viable.

This guide covers what works on consumer hardware right now if your GPU is in the 4 to 12 GB VRAM band. Techniques come first, then a worked example: a 35-billion-parameter Mixture-of-Experts model running at 28 tokens per second with the full 128K context window, on a 2019 gaming laptop with a GTX 1660 Ti and 6 GB of VRAM. The gap between “you need a 4090” and “your old laptop will do” has narrowed in 2025 and 2026 in ways that aren’t widely written about yet.

The target is any consumer setup with 4 to 12 GB of VRAM. That covers a 6 GB GTX 1660 Ti, an 8 or 12 GB RTX 3060, an 8 GB 4060, most M-series Macs with 8 to 16 GB of unified RAM, and most laptop GPUs from the last six years. The conventional wisdom said you needed a 24 GB 3090 or a Mac Studio. Not anymore.

A Mixture-of-Experts model has many more total parameters than active parameters per token. Take Qwen3.6 35B-A3B: 35 billion total, ~3 billion active at any token (the “A3B” suffix means “Activated 3 Billion”). The router picks a small subset of experts; the rest stay quiet. Compute load per token is closer to a 3B dense model than a 35B one, and inactive experts can live somewhere slower than VRAM as long as the active path stays fast.

A 35B dense model needed 24 GB of VRAM minimum a couple of years ago. A 35B-A3B with the right tensor placement fits on 6 GB.

For dense models, the standard advice is “fit as many layers as you can on GPU, push the rest to CPU.” That fails for MoE because every layer’s attention path needs the GPU. The flag for MoE in mainline llama.cpp is --n-cpu-moe.
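The sizing argument above can be made concrete with a back-of-the-envelope calculation. This is a rough sketch, not a measurement: the 4.8 bits-per-weight figure is an assumed average for a 4-bit K-quant, the function name is made up for illustration, and KV cache and runtime buffers are ignored.

```python
# Rough weight-size estimate for a quantized model.
# ASSUMPTION: ~4.8 bits per weight on average for a 4-bit K-quant;
# real quants mix tensor types, so actual file sizes will differ.
ASSUMED_BITS_PER_WEIGHT = 4.8


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# All 35B parameters have to be stored somewhere (VRAM or system RAM).
total_gb = weight_gb(35, ASSUMED_BITS_PER_WEIGHT)

# Only ~3B parameters are touched for any single token.
active_gb = weight_gb(3, ASSUMED_BITS_PER_WEIGHT)

print(f"all weights:      {total_gb:.1f} GB")
print(f"active per token: {active_gb:.1f} GB")
print(f"active fraction:  {active_gb / total_gb:.1%}")
```

Under these assumptions the full weight set is far larger than 6 GB while the slice touched per token is small, which is why where the inactive experts live matters more than how many whole layers fit on the GPU.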
With --n-cpu-moe, every layer's attention and shared tensors stay on the GPU and the expert tensors are pushed out to system RAM:

    -ngl 99 --n-cpu-moe 36

The numbers fall out nicely on a 35B-A3B Q4_K_M quant. Attention plus shared weights fit comfortably under 5 GB of VRAM. Expert weights pushed to CPU sit around 16 GB, which is easy on 32 GB of system RAM. The GPU handles attention work and the CPU handles expert matmuls.

Granularity is per-layer-of-experts, not per-individual-expert: all routed experts of a selected layer go to CPU together. For per-expert placement, fall back to hand-written -ot regexes against blk.