Local AI - How to Run Open Source AI Models Locally A developer's guide explains how to run open-source AI models locally on consumer hardware in 2026, highlighting that a mid-range laptop can now run models once considered frontier-class. The guide covers key concepts like quantization, memory requirements, and tooling such as Ollama and LM Studio, emphasizing that local AI offers privacy, cost savings, and offline capability despite tradeoffs at the high end. There is a particular moment that hooks every developer on local AI. You type a question into a terminal, hit enter, and watch a coherent answer stream back — with your Wi-Fi off, no API key, no usage meter ticking, nothing leaving your laptop. The model is just there , running on silicon you already own. Getting to that moment used to require a research-lab pedigree. It no longer does. In 2026, a mid-range laptop can run models that would have been considered frontier-class a couple of years ago, and the tooling has matured from finicky Python scripts into one-line installers. The catch is that the landscape is now wide : a dozen serious tools, hundreds of models, and a thicket of jargon — GGUF, quantization, KV cache, MoE, offloading — standing between you and that first streamed token. This guide is the map. I'll assume you're a competent developer but new to running models locally, and I'll take you from vocabulary to a working setup, with enough depth that intermediate and senior engineers get the why behind each decision, not just the how . By the end you'll be able to do three things with confidence: pick the right open source model for a given job, configure it for your specific hardware, and run it successfully — whether you're on a MacBook Air, a gaming rig with an NVIDIA card, or a CPU-only workstation. One promise up front: I won't pretend local always beats the cloud it doesn't, at the very high end , and I won't bury the tradeoffs. Local AI is the right call for privacy, cost, offline capability, and control. Let's make those wins real. The one-paragraph version: If you read nothing else: install Ollama or LM Studio if you want a GUI , pull a 7–8B model in Q4 K M quantization, and you’re running local AI in ten minutes. The single number that decides what you can run is memory — VRAM on a GPU, or unified memory on a Mac. Everything else in this guide is detail on top of those two facts. This field has a dialect, and most tutorials assume you already speak it. Let's fix that first. Skim this section now, then refer back when a term trips you up later — it's designed as a glossary you can return to, not a wall to memorize in one pass. LLM Large Language Model . A neural network trained to predict the next token of text. That simple objective, at scale, produces the chat, code generation, and reasoning we find so useful. Everything you’ll run is an LLM or a close cousin. Open source vs. open weight. This distinction matters more than most people realize. Open weight means the trained parameters are downloadable and you can run them yourself. Open source , in the strict sense, additionally requires open training data and code, and a license with no restrictions on who can use it or for what. Most "open" models — Llama, Qwen, Gemma, DeepSeek — are open weight . Only some, typically those under Apache 2.0 or MIT licenses, approach genuine open source. Parameters weights . The learned numbers inside the network. "7B" means seven billion parameters. More parameters generally means more capability — and more memory required to hold the model. Tokens and tokenization. Models don’t read words; they read tokens. A token is roughly four characters or about three-quarters of a word. When you see "tokens per second," that’s the unit of generation speed. Context window context length . How many tokens the model can hold in its attention at once — your prompt plus its output combined. Older models maxed out around 4,000 tokens; modern ones reach 128,000, 256,000, and in a few cases over a million. Inference. Running a trained model to produce output. This is distinct from training creating the model and fine-tuning adapting it . Everything in this guide is about inference. Quantization. The most important concept after parameter count. Models are trained in 16-bit precision, but you can compress the weights to 8-bit or 4-bit to shrink memory use and speed up inference, trading a little quality. The common levels: FP16 / BF16 — full half precision, the uncompressed baseline. Two bytes per parameter. Q8 0 — 8-bit, essentially indistinguishable from the original. One byte per parameter. Q5 K M — 5-bit, a high-quality middle ground. Q4 K M — 4-bit, the universally recommended sweet spot: about 75% smaller than FP16 with only a 1–3% quality drop. GGUF — not a quantization level but the file format that packages a quantized model into a single file. It’s what Ollama, LM Studio, and llama.cpp all consume. GPTQ / AWQ / EXL2 — alternative quantization schemes optimized for GPU-based serving. VRAM vs. RAM. VRAM is the dedicated memory on a discrete graphics card; RAM is your system memory. On a machine with an NVIDIA or AMD GPU, the model must fit in VRAM to run at full speed. Unified memory. On Apple Silicon and a few new AMD chips , the CPU and GPU share one fast pool of memory. The GPU can use almost all of your system memory — which is why a 64GB MacBook can punch far above a gaming GPU on large models. GPU offloading layer offloading . When a model is too big for your VRAM, you can keep some of its layers on the GPU and push the rest to system RAM. The model still runs — but the offloaded portion is dramatically slower. Metal / CUDA / ROCm / Vulkan / SYCL. The hardware-acceleration backends: Apple, NVIDIA, AMD, a cross-vendor fallback, and the Intel path respectively. MoE Mixture of Experts . An architecture where only a fraction of the total parameters — the "active" parameters — fire for any given token. You get the quality of a big model with the compute cost of a small one. The catch: you still have to hold all the parameters in memory. Plan memory by total parameters; plan speed by active parameters. Mental model: A Mixture-of-Experts model is like a hospital with fifty specialists on staff but only four seeing any given patient. You pay the rent on the whole building memory , but each visit is fast because only a few doctors are involved compute . A 30B-A3B model has 30 billion parameters total but only ~3 billion active per token. KV cache. As a model generates, it stores the attention keys and values for every previous token so it doesn’t recompute them. This cache grows with context length, and at long contexts it can consume as much memory as the model weights themselves. Temperature, top-p, top-k. Sampling controls that govern randomness. Lower temperature produces more deterministic output; higher is more creative. Top-p and top-k limit the pool of candidate tokens. System prompt. A hidden instruction that sets the model’s role and behavior before the conversation begins. Throughput vs. latency. Throughput is total tokens per second across all requests; latency is how fast a single response comes back. Tokens per second is the headline speed number, and time to first token measures how snappy the model feels. Fine-tuning vs. RAG. Two ways to make a model "know" your data. Fine-tuning retrains the model on your examples; RAG retrieval-augmented generation https://dev.to/blog/building-an-llm-project-from-scratch-in-2026 leaves the model untouched and feeds it relevant documents at query time. For most use cases, RAG is the cheaper, faster, more maintainable choice. Embeddings. Numerical vector representations of text that capture meaning, used for semantic search and as the backbone of RAG systems. Distillation. Training a smaller model to imitate a larger one. DeepSeek’s R1 "distill" models bring large-model reasoning to consumer hardware. Multimodal / vision-language models. Models that accept images and sometimes audio or video alongside text. Reasoning models. Models trained to "think out loud" — producing an explicit chain of reasoning before their final answer. DeepSeek-R1 and OpenAI’s gpt-oss are leading examples. If you internalize one section of this guide, make it this one. Almost every question you’ll have — "Can I run this model?" "Why is it so slow?" "Which quantization should I pick?" — reduces to a single question: does the model fit in fast memory, and if not, how much are you willing to spill into slow memory? The weights of a model take up a predictable amount of space based on parameter count and quantization. The rule of thumb: Memory GB ≈ parameters billions × bytes-per-parameter × 1.2 , where bytes-per-parameter ≈ 2.0 FP16 , 1.0 Q8 0 , ~0.7 Q5 K M , ~0.55 Q4 K M . The 1.2 accounts for overhead. So a 7B model needs about 14GB at full precision, ~7.7GB at Q8, and ~4.5GB at Q4 K M. The handy shortcut: at Q4 K M, every billion parameters costs roughly 0.55–0.7GB . Here’s the reference table: | | | | |---|---|---|---| | ~2 GB | ~3.5 GB | Almost anything, even 4GB | | ~4.5–5 GB | ~8 GB | 8GB cards comfortably | | ~8 GB | ~14 GB | 12GB cards | | ~18–20 GB | ~34 GB | 24GB cards 3090/4090 | | ~40 GB | ~75 GB | 48GB+, or a high-RAM Mac | The weights are only part of the story. The KV cache grows linearly with context length, and at long contexts it can rival or exceed the weights. A Llama-3-8B at 32K context burns roughly 4GB on KV cache alone. Push to 128K and the cache can dwarf the model. Two things rescue you. First, nearly every model released in 2025 and 2026 uses Grouped-Query Attention , which cuts cache size by 50–75% for free. Second, you can quantize the KV cache itself — setting it to 8-bit or 4-bit — to roughly halve its footprint. Common trap: When a model card says "runs in 8GB," that almost always means weights only , at a short context. Budget an extra 1–2GB for the KV cache and overhead at modest context lengths — and far more at long ones. Mixture-of-Experts models break the simple mental model. Take Qwen3-30B-A3B : 30 billion total parameters but only ~3 billion active per token. It generates as fast as a 3B model, but you still need enough memory to hold all 30 billion. So: size your memory by total parameters, size your speed expectations by active parameters. When a model exceeds your VRAM, tools like llama.cpp and Ollama automatically offload the excess layers to system RAM. This prevents a crash, but it’s slow — system RAM bandwidth roughly 50–70 GB/s on a dual-channel DDR5 desktop is an order of magnitude below GPU VRAM around 1,000 GB/s on an RTX 4090 . Offloading 10–20% is often tolerable; offloading half will make you wish you hadn’t. This leads to the most important hardware insight in the guide: token generation speed is governed by memory bandwidth, not raw compute. A model generates roughly as fast as your memory bandwidth divided by the model’s size in memory. The ecosystem looks chaotic until you see its structure. There are really three layers: engines that do the actual math llama.cpp, MLX ; experiences that wrap an engine in convenience Ollama, LM Studio, Jan, GPT4All ; and servers for high-throughput, multi-user production vLLM, TGI, SGLang . Because most consumer tools wrap llama.cpp, their raw single-user speed differs by only a few percent. So choose based on workflow , not on a myth that one is dramatically faster than another. The foundation of the entire consumer local-AI world. Created by Georgi Gerganov, its first commit landed on March 10, 2023 https://github.com/ggml-org/llama.cpp — just two weeks after Meta released the original LLaMA weights. It’s a dependency-free C/C++ inference library that reads GGUF and runs on essentially everything: CPU, CUDA, Metal, ROCm, Vulkan, SYCL. Its superpower is being first: new architectures usually land here before anywhere else. It exposes every tuning knob. The tradeoff is that you compile it yourself and manage flags. Pick it if you want maximum control, the newest models the day they drop, or the last few percent of performance. If one tool is the default recommendation for developers, it’s this one. Ollama is CLI-first, runs as a background daemon, and exposes both a REST API and an OpenAI-compatible endpoint on port 11434. The workflow: ollama pull , ollama run , done. It stores models by content hash and automatically manages VRAM. Pick it if you’re a developer who wants local AI to behave like a service you forget is running. This is the one most people should start with. The friendliest on-ramp, and free for personal use. LM Studio gives you a built-in model browser that shows memory estimates before you download, a chat playground, RAG over your local documents, and an OpenAI-compatible server — all in a clean desktop app. It runs both the llama.cpp and MLX backends. Pick it if you want the smoothest discover-download-experiment loop, or you’re on a Mac and want MLX speed without touching the command line. Each of these earns its place for a specific job: | | | |---|---|---| | Open-source, offline-first ChatGPT-style desktop app; can bridge to cloud APIs | You want a clean assistant UI and value fully open-source software | | Point-and-click RAG over a folder of documents, fully offline, near-zero config | You want private document Q&A with no setup | | Single-executable llama.cpp fork built for creative writing and roleplay | Fiction or roleplay with rich world/character memory | | Packs an entire model plus runtime into one cross-platform executable | You want maximum portability or to ship a model as a single file | | Apple’s native framework; exploits unified memory and supports on-device fine-tuning | You’re on a Mac and want peak performance or local LoRA training | | "Swiss Army knife" — multiple loaders behind one UI, plus fine-tuning and RAG | You want to experiment broadly across model formats | | A router, not a runner: one OpenAI-compatible endpoint in front of many backends | You’re orchestrating several model types behind a single API | Everything above is built for one user at a desk. The moment you need to serve a model to many concurrent users, you cross into a different category — and the leader is vLLM . Its PagedAttention manages the KV cache in non-contiguous blocks like an OS manages virtual memory, cutting memory waste from 60–80% down to under 4%, and continuous batching slots new requests into the running batch the instant a slot frees. Its launch benchmarks reported up to 24× the throughput https://blog.vllm.ai/2023/06/20/vllm.html of naive Hugging Face Transformers serving. The tradeoff: vLLM needs a Python environment and a capable GPU, doesn’t run GGUF it uses safetensors with AWQ or GPTQ , and is heavier to set up. Its cousins TGI and SGLang compete in the same space. Rule of thumb: Ollama for your laptop, vLLM for your server. If exactly one person or process talks to the model at a time, use a llama.cpp-based tool. If many do at once, move to vLLM or TGI. The reference implementation everything else is measured against. Transformers with Accelerate gives you maximum model coverage and flexibility — the standard for research and fine-tuning — but carries the most setup and isn’t optimized for consumer single-user inference. Pick it if you’re doing research or need to run a brand-new model before anyone has produced a GGUF for it. | | |---|---| I’m a developer and want one default | Ollama Invisible infrastructure with a clean API | I’m a beginner or non-developer | LM Studio or GPT4All | I’m on a Mac | LM Studio or Ollama with the MLX backend | I have a powerful NVIDIA card | llama.cpp for control; vLLM to serve | I need to serve many users | vLLM or TGI / SGLang | I want document chat | GPT4All or LM Studio built-in RAG | I’m on low-end hardware | Ollama with small models; Llamafile for portability | Creative writing / roleplay | KoboldCpp | Now we get practical. Find your hardware below and follow the path. The through-line is always the memory math from the previous section — here we apply it to real silicon. The surprise winner for individual developers. Because of unified memory https://www.apple.com/newsroom/2025/10/apple-unleashes-m5-the-next-big-leap-in-ai-performance-for-apple-silicon/ , a 64GB MacBook can load a 70B model at Q4 with no copying between CPU and GPU. Use an MLX-backed runtime; on the newest chips, MLX is meaningfully faster than the older Metal path. The rule for Macs: your usable model memory is roughly total unified RAM minus about 8GB for the OS. A 16GB Mac handles 7–8B comfortably, 32GB reaches ~30B, 64GB runs 70B, and 128GB+ opens the big MoE models. The one place a Mac loses to a discrete GPU is raw speed on models that already fit comfortably in that GPU’s VRAM. The best-supported platform, full stop. Every tool works on NVIDIA first. Plan by your VRAM tier: | | | |---|---|---| | RTX 4060, 3060 Ti | 7–8B at Q4 K M, 40+ tok/s. The popular real-world floor. | | RTX 3060 12GB, 4070 | 12–14B at Q4 K M with room for context | | RTX 4060 Ti 16GB, 5060 Ti 16GB | 14B at Q5, or gpt-oss-20b — the "16GB sweet spot" | | RTX 3090, 4090 | 27–32B at Q4/Q5 fully resident, or 70B with offloading | | RTX 5090, workstation cards | Larger 32B at high quant; dual 24GB reach 70B fully in VRAM | The RTX 3090 deserves special mention: thanks to its wide 384-bit bus and ~936 GB/s bandwidth, it often out-generates the newer RTX 4080 despite being a generation older — a perfect illustration of the bandwidth-over-compute principle. A used 3090 remains one of the best value buys in 2026. The story has genuinely improved. In 2026, ROCm has matured enough that AMD is a real choice for inference. On Linux, ROCm/HIP runs llama.cpp and Ollama at roughly 70–80% of CUDA speed at equivalent bandwidth. On Windows, Vulkan through LM Studio or Ollama is the least-friction path. The RX 7900 XTX 24GB is a credible, cheaper alternative to a 4090 for inference, and AMD’s Strix Halo chips bring Apple-style unified memory to the PC world, with up to 128GB shared. Where NVIDIA still wins decisively is fine-tuning and FP8 production serving. Workable, but the roughest software story of the bunch. The A770 16GB is a genuine budget VRAM bargain. Your paths are llama.cpp’s SYCL or Vulkan backend, IPEX-LLM’s portable Ollama build, or Intel’s vLLM-based stack. Buy Intel only if you enjoy the setup adventure. Realistic, but manage expectations. A modern CPU does 3–13 tok/s on a quantized 7B — fine for batch jobs, frustrating for interactive chat anything under ~15 tok/s feels laggy . Stick to small models: Phi-4-mini, Llama 3.2 3B, or Gemma 3 4B at Q4, on 8GB systems. You have an underrated option: run a big MoE model say, gpt-oss-120b keeping the attention layers on the GPU and the experts in system RAM. Because only a few experts activate per token, you can hit 10–30 tok/s with surprisingly little VRAM. In llama.cpp the --cpu-moe flag does exactly this. Approximate tokens/second for a Q4 K M model, single user, via llama.cpp or Ollama: | | | |---|---|---| RTX 4090 | ~135 tok/s | Faster than you can read | RTX 3090 | ~95 tok/s | The value champion | RTX 4070 Super | ~75 tok/s | Excellent mid-range | RTX 4060 8GB | ~25–37 tok/s | Perfectly usable | M3 Max 64–128GB | fast on 7B; ~7–14 on 70B | Holds the whole 70B — so it beats a 4090 there | Modern CPU | ~12–13 tok/s | Batch jobs, not chat | And the headline number that breaks the pattern: a 30B MoE like Qwen3-30B-A3B can sustain 100+ tok/s on a 4090 — nothing like the slowdown you’d expect from 30 billion total parameters — because only ~3B are active per token. You’ve got a tool and you know your memory budget. Now: which of the hundreds of available models should you actually download? Start with the size tiers, then match a family to your task. Read this first: This space moves monthly . Specific version numbers and "best in class" claims go stale fast. Treat the families below as durable and the exact version numbers as a snapshot — always check the current model card on Hugging Face or the Ollama library before committing. 1–3B small / on-device : Phi-4-mini, Llama 3.2 1B/3B, Gemma 3 1B/4B. For edge devices, autocomplete, classification, and simple chat. 7–8B the workhorse : Llama 3.1/3.3 8B, Qwen3 8B, Mistral 7B. The best cost-to-capability ratio for most laptops. 13–14B: Phi-4 14B, Qwen3 14B. Meaningfully smarter; needs ~12GB. 27–34B: Gemma 3 27B, Qwen3 32B, Mistral Small 3 24B . The single-24GB-GPU sweet spot. 70B+: Llama 3.3 70B, Qwen2.5 72B. Needs 48GB+ or a high-RAM Mac. MoE giants: Llama 4 Scout/Maverick, DeepSeek-V3/R1, Qwen3-235B, gpt-oss-120b. Big-model quality at small-model compute — but huge memory footprints. Llama Meta . The largest ecosystem and the most community fine-tunes. The 3.x series are dependable dense workhorses; Llama 4 April 2025 https://ai.meta.com/blog/llama-4-multimodal-intelligence/ brought Mixture-of-Experts to the line, with Scout offering a headline 10M-token context. The caveat: the Llama Community License is not open source — it carries a 700M-monthly-active-user cap and EU restrictions. Qwen Alibaba . For many developers, the default answer to "what should I run locally?" in 2025–2026. Apache 2.0 licensed, dense and MoE from ~1.7B to 235B, with strong coding, math, and multilingual ability. Gemma Google . Gemma 3 spans 1B to 27B, is multimodal from 4B up, and runs beautifully on consumer hardware. The 4B is a superb laptop default; the 27B is competitive on a 24GB GPU. DeepSeek. DeepSeek-V3 for general use and the reasoning-focused DeepSeek-R1, released January 2025 under a clean MIT license. The full model needs a server, but the distilled variants 1.5B to 70B bring R1-style reasoning to consumer hardware. Mistral / Mixtral. Mistral 7B and Mixtral 8x7B remain heavily deployed; Mistral Small 3 24B rivals models three times its size, and the Mistral 3 family moved fully to Apache 2.0. Microsoft Phi. Phi-4 14B and Phi-4-mini 3.8B , MIT-licensed, punch well above their weight per parameter — ideal for budget and edge deployments. OpenAI gpt-oss. Released August 2025 under Apache 2.0 https://openai.com/index/introducing-gpt-oss/ — OpenAI’s first open-weight models since GPT-2. Both are MoE with a 128K context and three configurable reasoning-effort levels. The gpt-oss-20b needs only ~16GB a standout for a 16GB GPU or Mac , and gpt-oss-120b runs within 80GB on a single high-end GPU. Also worth watching: Cohere’s Command R/R+ RAG-focused , Falcon 3, GLM, Kimi, and coding specialists Qwen-Coder and Mistral’s Devstral. | | |---|---| | Qwen3, Gemma 3, Llama 3.3 | | Qwen-Coder, Devstral, DeepSeek-Coder, Code Llama | | DeepSeek-R1 or its distills , gpt-oss at high reasoning effort | | Llama 4 Scout, Qwen3, Gemma 3 | | Mid-size instruct models plus an embedding model | | Qwen3 strongest, especially CJK , Gemma 3, Mistral | | Gemma 3, Qwen-VL, Llama 4, Mistral Small 3.1 | | Phi-4-mini, Llama 3.2 3B, Gemma 3 4B | You’ll find models in two main places: the Ollama library curated, one-command pulls and Hugging Face everything, including community quantizations — "bartowski" and "Unsloth" are prolific and reliable . On any model card, check five things: parameter count, context length, license, intended use, and which quantizations are available. The licensing nuance matters the moment you build something real. Apache 2.0 Qwen, Gemma’s permissive releases, Mistral 3, gpt-oss and MIT DeepSeek, Phi are the cleanest for commercial use. Llama’s license permits commercial use but with that 700M-MAU cap and EU restrictions. Always read the actual card before shipping a product, and involve legal for anything at scale. Here’s the playbook, roughly in the order you should apply it. Q4 K M is the default for a reason — ~75% smaller than FP16 with only a 1–3% quality drop. Step up to Q5 K M or Q8 0 when you have headroom. The most useful heuristic of all: a larger model at lower precision usually beats a smaller model at higher precision . A 14B at Q4 typically outperforms a 7B at Q8 — if both fit. In llama.cpp, set -ngl 999 to push every layer onto the GPU. If you run out of memory, lower the number until the model fits. Partial offload is far faster than running everything on the CPU. Don’t set your context window higher than you actually need; the KV cache scales linearly with it. When context is tight, quantize the cache — setting the KV cache type to q8 0 roughly halves its footprint. In Ollama, this is the OLLAMA KV CACHE TYPE environment variable. Flash attention reduces the memory cost of attention and speeds up long-context inference. It’s standard on modern setups — turn it on with --flash-attn on in llama.cpp. Pair a large "main" model with a small "draft" model from the same family — say, Llama 3.2 1B drafting for Llama 3.1 8B. When acceptance is high, you get a 1.5–3× speedup with no quality loss, because the large model still has the final say. Keep the draft model much smaller than the main one. A 30B-A3B MoE generates as fast as a 3B dense model while approaching the quality of something far larger — provided you have the memory to hold all 30B. Memory bandwidth GB/s predicts token-generation speed better than raw compute TFLOPS . This is why the RTX 3090 often beats the newer 4080, and why the M3 Max beats the M4 Pro on generation. Match your purchase to two things: enough memory capacity to fit the model, and enough memory bandwidth to run it fast. Optimization order of operations: 1 Pick the largest model that fits at Q4 K M. 2 Max out GPU layers. 3 Enable flash attention. 4 Quantize the KV cache if context is tight. 5 Add speculative decoding if you need more speed. 6 Only then consider buying hardware. Enough theory. Here are the exact commands to get running on each of the three paths most people take. All three expose an OpenAI-compatible API, so any code you’ve written against the OpenAI SDK https://dev.to/blog/integrate-openai-api-production-express-nodejs will work against your local model with a one-line change to the base URL. Works on macOS, Windows, and Linux. On macOS and Windows, download the installer; on Linux, one command does it: Linux install macOS/Windows: download the app curl -fsSL https://ollama.com/install.sh | sh Pull and run a model — downloads on first run ollama run qwen3:8b type your prompt in the chat; /bye to exit Handy commands ollama list installed models ollama ps models currently in memory ollama pull gemma3:4b download without running ollama rm