How to Run Open-Weight AI Models Locally with Ollama and LM Studio

Ollama and LM Studio now allow users to run open-weight AI models like Qwen 3, Gemma 3, and DeepSeek-R1 locally on consumer hardware without API keys or monthly fees. The tools support quantized models in GGUF format, enabling users to run a 7B parameter model on as little as 4–5GB of VRAM by compressing weights from 16-bit to 4-bit precision. This gives developers, researchers, and privacy-conscious users direct control over model execution and data, bypassing cloud dependencies.

Why Running LLMs Locally Is Worth Your Time

Running open-weight AI models locally has gone from a niche hobby to a practical option for developers, researchers, and privacy-conscious users. With tools like Ollama and LM Studio, you can run models like Qwen 3, Gemma 3, and DeepSeek-R1 on consumer hardware: no API key, no monthly bill, no data leaving your machine.

This guide covers everything you need to get started: which tools to use, how to pick and configure models, what quantization means for performance, and realistic expectations for what your hardware can handle.

What “Open-Weight” Actually Means

Before getting into setup, it’s worth being precise about terminology. “Open-weight” means the model weights are publicly available: you can download and run them yourself. It does not necessarily mean fully open-source (some models restrict commercial use or fine-tuning).
Popular open-weight models right now include:

- Meta Llama 3.1 and 3.3: Strong general-purpose models, widely supported
- Qwen 3 (Alibaba): Excellent multilingual performance, comes in sizes from 0.6B to 235B
- Gemma 3 (Google): Efficient, well-documented, strong at reasoning tasks
- DeepSeek-R1: Reasoning-focused model with strong benchmark scores
- Mistral and Mixtral: Fast inference, solid for instruction following
- Phi-4 (Microsoft): Surprisingly capable at small sizes (14B and under)

These models vary widely in size, capability, and licensing. Choosing the right one depends on your hardware and use case.

Understanding Quantization

The Short Version

Unquantized LLMs require enormous amounts of VRAM. A 70B parameter model at 16-bit precision takes around 140GB, well beyond what most consumer GPUs can handle.

Quantization compresses the model by reducing the precision of its weights. Instead of storing each weight as a 16-bit float, quantized models use 4-bit or 8-bit integers. The tradeoff is a small reduction in quality for a dramatic reduction in memory use.

Common Quantization Formats

| Format | Bits per weight | Memory reduction | Quality impact |
|---|---|---|---|
| F16 | 16 | None (baseline) | None |
| Q8_0 | 8 | ~50% | Minimal |
| Q5_K_M | 5 | ~69% | Very low |
| Q4_K_M | 4 | ~75% | Low |
| Q3_K_M | 3 | ~81% | Moderate |
| Q2_K | 2 | ~87% | Significant |

For most use cases, Q4_K_M is the sweet spot: it gives good quality with roughly 4–5GB of memory needed for a 7B model. Ollama and LM Studio both handle quantized models in the GGUF format, which is the standard for local inference.

How Much VRAM Do You Actually Need?
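As a sanity check on the figures in the quantization table, the weight-only footprint can be estimated from the parameter count and the nominal bits per weight. The sketch below is a simplification: the function names are illustrative, the bit widths are the nominal values from the table, and real GGUF files store extra scaling metadata, so actual file sizes tend to run somewhat larger. It also ignores the context cache and runtime overhead, which is one reason a 7B model at 4 bits needs more than the 3.5GB computed here.

```python
# Nominal bits per weight, copied from the quantization table.
# Assumption: every weight uses exactly this many bits (real files do not).
NOMINAL_BITS = {
    "F16": 16,
    "Q8_0": 8,
    "Q5_K_M": 5,
    "Q4_K_M": 4,
    "Q3_K_M": 3,
    "Q2_K": 2,
}


def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    # billions of parameters * bits each / 8 bits per byte = billions of bytes
    return params_billion * bits_per_weight / 8


def reduction_vs_f16(bits_per_weight: float) -> float:
    """Fraction of memory saved relative to 16-bit weights."""
    return 1 - bits_per_weight / 16


if __name__ == "__main__":
    for name, bits in NOMINAL_BITS.items():
        size = weight_memory_gb(7, bits)
        saved = reduction_vs_f16(bits)
        print(f"{name:7s} 7B weights: {size:5.2f} GB ({saved:.1%} smaller than F16)")
```

Calling `weight_memory_gb(70, 16)` reproduces the 140GB figure quoted earlier for an unquantized 70B model.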
A rough rule of thumb: for a Q4-quantized model, multiply the parameter count in billions by 0.6 to get an approximate VRAM requirement in GB.

- 7B model at Q4 ≈ 4–5GB VRAM
- 13B model at Q4 ≈ 8–9GB VRAM
- 34B model at Q4 ≈ 20–22GB VRAM
- 70B model at Q4 ≈ 40–45GB VRAM (or needs RAM offloading)

If your GPU doesn’t have enough VRAM, both Ollama and LM Studio can offload layers to system RAM, but this comes at a significant speed penalty.

Setting Up Ollama

Ollama is a command-line tool that makes running local LLMs as straightforward as pulling a Docker image. It handles model downloads, quantization selection, and serving a local API automatically.

Installation

Ollama supports macOS, Linux, and Windows. Installation is a single download or command:

- macOS: Download the .dmg from ollama.com (https://ollama.com) and run it. Ollama runs as a menu bar app.
- Linux: Run `curl -fsSL https://ollama.com/install.sh | sh` in your terminal.
- Windows: Download the Windows installer from the same site.

After installation, Ollama runs a local server on port 11434.

Pulling and Running Your First Model

Open a terminal and pull a model:

    ollama pull qwen3:8b

Ollama will download the default quantized version. To run it interactively:

    ollama run qwen3:8b

You’ll get a prompt where you can type messages directly. To exit, type `/bye`.

Choosing Specific Quantizations

By default, Ollama picks a sensible quantization for the model. But you can specify one explicitly:

    ollama pull llama3.3:70b-instruct-q4_K_M

Use `ollama list` to see what’s downloaded, and ollama show