{"slug": "why-and-how-to-run-local-models-in-zed", "title": "Why and How to Run Local Models in Zed", "summary": "Zed users can now run local AI models directly within the editor, offering complete data privacy, predictable costs, and always-available access without reliance on cloud providers. The company reports local model usage has grown 3x in the past 10 weeks, and supports tools including LM Studio, Ollama, and llama.cpp for running models like Qwen 3.6 35B A3B on developer laptops. While local models cannot match frontier cloud models in capability or speed, they provide developers with full control over system prompts, context windows, and model selection.", "body_md": "For many tasks, I prefer to use local models.\n\nWhen I need the best possible model, I still reach for frontier options, but a lot of the time I don't need that. I prefer something that runs on my machine, keeps my data on hardware I control, and won't disappear because a provider changed their pricing or limits.\n\nOpen-weight models are getting\nbetter, too. Tools like LM Studio, Ollama, and llama.cpp keep getting easier to use,\nand in the last 10 weeks, [local model usage has grown 3x in Zed's agent](/agent-metrics).\n\nAt Zed, [we're not building AI features for\nthe money](/blog/not-building-ai-for-the-money), and we're not in the business of locking devs into one way of using AI. 
We make it easy to use whatever provider you prefer, whether that's Codex over ACP, your own API key, or a direct subscription to Zed Pro.\n\nIn this post I want to walk through why local models can be great, where they fall short, and how to get set up in Zed.\n\n[Why Local?](#why-local)\n\nLocal models have a number of advantages over cloud-hosted models:\n\n*They're totally private.*\nWhile most cloud providers offer zero-data-retention policies, local models provide absolute certainty.\nThe data never leaves your network, or even your machine if you so choose.\n\n*They can be much cheaper to run.*\nThere is the up-front hardware cost, but, as we'll see in this post, your current developer laptop may be more than capable of running a competent model.\nAnd you don't have to worry about [unexpected price changes](https://x.com/ClaudeDevs/status/2054610152817619388) 1.\nThe price is consistent, transparent, and low.\n\n*You get more control*.\nYou can set your own system prompt, enable or disable features (e.g. image support), change the context window, and more.\nYou can also discover fine-tuned versions of popular models tailored to your use case.\nAnd since you own the full pipeline, you can be sure that you aren't being secretly served a lower-cost model under the same name.\n\nFinally, and most importantly for me, *local LLMs are always available*.\nLike many developers, I worry about becoming too reliant on providers that operate like SaaS platforms where a change of pricing or setup makes them unfeasible to use.\nWith a local model, you *always* have access.\n\n[Local Model Shortcomings](#local-model-shortcomings)\n\nIf local models were perfect, cloud providers wouldn't exist (at least not at the scale they currently do).\n\nThere is no getting around the fact that the hardware required to run frontier models at acceptable speeds is simply out of reach for consumers. 
Models you will be able to run locally are not as capable as what you can get from the top AI labs. You will also likely get fewer tokens per second.\n\nThat said, you can get good results even on a developer laptop. Just don't expect frontier-level results.\n\n[How to Run Local Models](#how-to-run-local-models)\n\nThere are a number of free and open source projects that let you run models locally.\nI have had the most success with [LM Studio](https://lmstudio.ai/), but [Ollama](https://ollama.com/) and [llama.cpp](https://github.com/ggml-org/llama.cpp) are also popular choices.\nZed [supports all three](https://zed.dev/docs/ai/llm-providers) out of the box.\n\nOnce you have a runtime, you need to choose a model.\nI've been using `Qwen 3.6 35B A3B`.\nThat name is a bit of a mouthful, but each part tells you something useful:\n\n- `Qwen 3.6` is the model family. Qwen models are made by Alibaba, and 3.6 is their latest release as of the time of writing. Models in the same family can differ by size, speed, feature support, and more.\n- `35B` means the model has 35 billion parameters. A parameter is one of the values the model learned during training. When you run the model, those values need to be loaded into memory.\n- `A3B` is short for \"active 3 billion\".\n\nThis is a \"Mixture of Experts\" model, or MoE.\nThat means the model has 35 billion parameters in total, but only about 3 billion are active for each generated token.\nA dense model works differently, because all of its weights are active all the time.\nIn practice, MoE models usually trade a small amount of intelligence for a dramatic increase in performance. As a very crude rule of thumb, the time to generate a token scales linearly with the number of active parameters. In a dense model, all parameters are active, so the number of parameters is just the size of the model. 
In a model like `Qwen 3.6 35B A3B`, there are 3 billion active parameters, so it runs roughly 10x faster than a dense 35B model.\n\nSome chips, such as Apple's M series or AMD Strix Halo, support \"unified memory\". With unified memory, the GPU can access system memory directly, although it is much slower than memory on a dedicated GPU. MoE models are particularly compelling on these systems, since the lower memory bandwidth hurts less when fewer parameters are active.\n\nFinally, you should consider quantization, which is a way to make a model smaller by storing each parameter with fewer bits.\nIf you need 35 billion parameters in memory, how much VRAM does that require?\nIt depends on how big each parameter is.\nModels are usually trained with 16-bit floating point parameters, but those parameters can be compressed.\nThe Qwen 3.6 model I tested with is a `Q4` model, which means each parameter is 4 bits, or half a byte.\nSince it has 35 billion parameters, that's about 17.5 GB of VRAM (plus overhead for the context and other assorted bits and bobs).\nLM Studio has a nice UI that shows whether a model is likely to fit on your GPU.\n\n[Configuring Zed](#configuring-zed)\n\nOnce you have your provider set up, you can point Zed at it.\nSince I'm using LM Studio, I just add an LM Studio config, pointing at `http://localhost:1234/api/v0`, and make sure LM Studio's server is running with `lms server start`.\n\nIf you're using Ollama, `llama.cpp`, or any other OpenAI-compatible system, you can use the built-in Ollama provider.\n\nFrom there, you should see your downloaded models in the model selector within the Zed agent.\n\n[Working with Non-Frontier Models](#working-with-non-frontier-models)\n\nOnce a model is selected, it should be a familiar experience: send a prompt, and the model can respond, edit your code, and use tools.\n\nBut if you're used to frontier models, there are two things you will need to be extra careful about when using local models:\n\n- They're not 
as \"clever\" as the frontier models\n- They typically have smaller context windows\n\nBecause of this, they require more attention and discipline to be used effectively. Best practices become more necessary.\n\nFor example, if you see the model going down an incorrect path, or getting stuck in a loop, it's often better to edit your previous message to guide against the bad path, rather than sending a new message correcting it. This ensures that the context window doesn't get filled with unhelpful information.\n\nYou also may want to encourage them to use subagents more. Subagents can be a powerful tool for limiting the impact on the context window of small, menial changes.\n\nFinally, experiment!\nGo nuts!\nTest different models from different providers.\nTweak the context window size or the temperature.\nMaybe you have a fancy gaming PC with a dedicated GPU - maybe try a dense model.\nFind a cool combination that works well for you?\nShare it in our [Discord](https://discord.com/invite/zedindustries).\n\nHappy hacking!\n\n[Footnotes](#footnote-label)\n\n-\nModulo local energy prices.\n\n[↩](#user-content-fnref-1)", "url": "https://wpnews.pro/news/why-and-how-to-run-local-models-in-zed", "canonical_source": "https://zed.dev/blog/local-ai-in-zed", "published_at": "2026-05-26 05:24:05+00:00", "updated_at": "2026-05-26 05:38:08.916122+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "ai-tools", "ai-products", "ai-infrastructure"], "entities": ["Zed", "LM Studio", "Ollama", "llama.cpp", "Codex", "ACP", "Zed Pro"], "alternates": {"html": "https://wpnews.pro/news/why-and-how-to-run-local-models-in-zed", "markdown": "https://wpnews.pro/news/why-and-how-to-run-local-models-in-zed.md", "text": "https://wpnews.pro/news/why-and-how-to-run-local-models-in-zed.txt", "jsonld": "https://wpnews.pro/news/why-and-how-to-run-local-models-in-zed.jsonld"}}