{"slug": "qwen-3-6-27b-is-the-sweet-spot-for-local-development", "title": "Qwen 3.6 27B is the sweet spot for local development", "summary": "Qwen 3.6 27B, a dense local language model from Alibaba's Qwen team, impresses developers with its general intelligence and practical coding abilities, running efficiently on consumer hardware via llama.cpp. The model punches above its weight in tasks like constrained writing and code generation, making it a viable option for local development despite high computational demands.", "body_md": "I’ve been disappointed by local models in the past. But then I tried Qwen 3.6, and I was in awe. For me it’s the first local model that actually makes sense as a general intelligence.\n\nIt comes in two variants: a mixture-of-experts model [Qwen 3.6 35B A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B), and a dense [Qwen 3.6 27B](https://huggingface.co/Qwen/Qwen3.6-27B) - slower, but more powerful. The one I recommend!\n\nLet me share my impressions, and show that you can run it too.\n\nQwen 3.6, rightfully, [got a lot of coverage on Hacker News](https://hn.algolia.com/?dateEnd=1782305498&dateRange=custom&dateStart=1775001600&page=0&prefix=true&query=qwen&sort=byPopularity&type=story). The most common statement about Qwen 3.6 27B is that it punches above its weight - see [Will it Mythos?](https://swelljoe.com/post/will-it-mythos/). And I think it is a well-deserved sentiment.\nIt will make your computer hot, but it’s worth it!\n\n## Testing the waters\n\nSimon Willison uses “a pelican riding a bicycle” as a smoke test (see his results for [Qwen 3.6 35B A3B](https://simonwillison.net/2026/Apr/16/qwen-beats-opus/) and then [Qwen 3.6 27B](https://simonwillison.net/2026/Apr/22/qwen36-27b/)). 
I usually go with constrained writing.\n\nI also asked it to write an 8-line poem about Zouk dance and quantum physics, see [the transcript](https://gist.github.com/stared/bac79cd053ea5443abcf58e622c083b7).\nThe thought process made sense, both in terms of deliberation on quantum terms, and rhymes.\n\nThen I asked it in OpenCode to create a hexagonal minesweeper using `pnpm`.\n\nIt worked on the first go, from a single prompt, with a proper Node package.\nThe mixture-of-experts Qwen 3.6 35B A3B was faster… but ignored my instruction to create a package, and did it in a single `index.html`.\n\n## Real work\n\nSure, creative writing about quantum mechanics, or yet another clone of a minesweeper, is rarely a day job. But Qwen 3.6 27B is decent at regular tasks as well.\n\nIt worked for a few minutes and created this:\n\nBy the standards of current frontier models, it’s unremarkable. But it is already a practical job. It worked, was reactive, and the defaults were nice - all from a single, short prompt.\n\n## Running Qwen 3.6 locally with llama.cpp\n\nRunning local models is easier than ever. A few CLI lines and you’re off.\n\nI recommend [llama.cpp](https://github.com/ggml-org/llama.cpp) - a direct, open source tool that allows running models on various devices. You don’t need Ollama, and frankly - [I would recommend against using that on ethical grounds](https://sleepingrobots.com/dreams/stop-using-ollama/).\n\nFirst, we go to Hugging Face to get a proper quantization, i.e. a model with reduced size - popular ones are by [unsloth](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF) or [bartowski](https://huggingface.co/bartowski/Qwen_Qwen3.6-27B-GGUF), among others.\nDefault models usually come with `BF16` precision. A common 8-bit quantization saves half the space at almost no cost to quality. 
Going further down the road, models get smaller (and potentially faster), but at the cost of quality, see [this comparison for 27B](https://www.reddit.com/r/LocalLLaMA/comments/1tr9vzn/qwen3627b_quantization_benchmark/) and another one for [35B A3B](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/discussions/10).\n\nWe grab [unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF), an 8-bit quantization with support for multi-token prediction (MTP).\n\n```\nllama-server -hf unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0 \\\n    --spec-type draft-mtp -ngl 999 -fa on -c 65536 --jinja --port 8080\n```\n\nWhat it does:\n\n- `-hf unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0` - grabs the model from Hugging Face; subsequent runs reuse the download\n- `-m ~/models/Qwen3.6-27B-Q8_0.gguf` - use this instead if you already have the file\n- `--spec-type draft-mtp` - we use a fast model to predict subsequent tokens, which speeds things up\n- `-ngl 999` - puts all layers on the GPU\n- `-fa on` - turns flash attention on\n- `-c 65536` - sets the context size to 64k tokens (this we can tweak, as the Qwen 3.6 27B native context is 256k)\n- `--jinja` - for tool calling support\n- `--port 8080` - better to pin the port, as it will be used by other configs\n\nIf you open `http://127.0.0.1:8080`, you can chat with it directly.\n\nPrecisely the same server can be used for vibe coding. 
The choice of agent depends both on one’s goal and subjective taste - for example the all-around OpenCode, the minimalistic Pi, or the self-improving Hermes.\n\nFor OpenCode, it is as simple as adding to `~/.config/opencode/opencode.jsonc`:\n\n```\n{\n  \"$schema\": \"https://opencode.ai/config.json\",\n  \"provider\": {\n    \"llama\": {\n      \"name\": \"llama.cpp (local)\",\n      \"npm\": \"@ai-sdk/openai-compatible\",\n      \"options\": {\n        \"baseURL\": \"http://127.0.0.1:8080/v1\",\n        \"apiKey\": \"local\"\n      },\n      \"models\": {\n        \"qwen3.6-27b\": { \"name\": \"Qwen3.6-27B Q8 +MTP\" }\n      }\n    }\n  },\n  \"model\": \"llama/qwen3.6-27b\"\n}\n```\n\nIf you just want to chat and are a big fan of the terminal, use `llama-cli` instead of `llama-server`:\n\n```\nllama-cli -hf unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0 \\\n    -ngl 999 -fa on -c 65536 --jinja\n```\n\n## Measuring performance\n\nIs it fast enough?\n\nI ran a few tests ([source is here](https://github.com/stared/benching-local-llms-on-apple-silicon)) on my MacBook Max M5 128 GB, running it with and without multi-token prediction, and comparing both with the 35B A3B model, and also a quantized DeepSeek V4 Flash version [DwarfStar4](https://github.com/antirez/ds4).\n\n30 tokens per second is not bad, [well within typical frontier model API range](https://openrouter.ai/openai/gpt-5.5#performance).\nWhile [mlx-lm](https://github.com/ml-explore/mlx-lm) is precisely targeted at Apple Silicon devices, and AI agents heavily recommend it, llama.cpp turned out to be faster.\nIt was using 95% of the GPU, which means it is efficiently using available resources.\n\nThe MacBook Max M5 is a beast (at least for a laptop), but on other devices it should also work decently. 
For consumer Nvidia RTX cards, on the one hand models need to be quantized; on the other, they run even faster.\n\n> I set this up today on my 5090 at Q6_K quantization and Q4_0 KV, got 50 tokens/s consistently at 123k context, using ~28/32gb vram through LM Studio.\n>\n> gfosco on Hacker News\n\nWhile 35B A3B is 3x faster, I prefer 27B. I’d rather generate a third as much code, but of higher quality.\n\n## How do they relate to previous state-of-the-art models?\n\nManual inspection is great, but benchmarks help with grounding intuitions. Here is the score from [Artificial Analysis](https://artificialanalysis.ai/), comparing it with frontier models:\n\nA few more benchmarks are in [these notes](https://github.com/stared/benching-local-llms-on-apple-silicon), but the spirit is similar.\nI also included [Gemma 4 31B](https://deepmind.google/models/gemma/gemma-4/), as a lot of people use it as the default for local coding. But both benchmarks and general sentiment online favour Qwen 3.6 27B by a large margin.\n\nThere is a caveat here - 8-bit quantization likely does not affect results much, but DwarfStar4 uses a much more aggressive one for DeepSeek V4 Flash, 2-4 bit. For sure it is worse than the full model. My personal impression is that at these quantizations Qwen 3.6 27B is as good as (or maybe slightly better than) DwarfStar4. Though, I won’t be surprised if for longer context projects DS4 has an edge.\n\n## What’s next\n\nI think we are entering a fascinating era, when it becomes feasible to run one’s own models.\n\nThe change will be propelled further by the state of proprietary frontier models. Claude Fable 5 was taken down. Other frontier models run at a massive subsidy, where paying $100 a month gives us thousands of dollars’ worth of tokens. Let’s use the discount while it lasts!\n\nA locally run model can be fine-tuned to our needs, and cannot be taken away. Businesses can use them for proprietary and sensitive data. 
We can use them personally for offline projects, or when we don’t feel comfortable sharing our deepest secrets, or medical data, with the US or China.\n\nWith the release of [frontier-level open-weight GLM 5.2](https://artificialanalysis.ai/articles/glm-5-2-is-the-new-leading-open-weights-model-on-the-artificial-analysis-intelligence-index), a new era begins.\nWhile Qwen 3.6 was the stepping stone, even frontier [GLM 5.2 can be run locally](https://unsloth.ai/docs/models/glm-5.2). It won’t run on your MacBook or a single RTX 5090. But still, it is manageable with a company budget.\n\nMoreover, I strongly believe that we will have models smarter than the current state of the art, while runnable on local devices, maybe even smartphones. Current models combine both raw intelligence and factual knowledge in the same weights. Future models will likely separate that, offloading a lot of knowledge to tool calling.\n\nStay tuned for future posts and releases.", "url": "https://wpnews.pro/news/qwen-3-6-27b-is-the-sweet-spot-for-local-development", "canonical_source": "https://quesma.com/blog/qwen-36-is-awesome/", "published_at": "2026-06-29 17:05:16+00:00", "updated_at": "2026-06-29 17:19:56.659328+00:00", "lang": "en", "topics": ["large-language-models", "ai-tools", "developer-tools", "ai-research"], "entities": ["Qwen", "Alibaba", "llama.cpp", "Hugging Face", "Simon Willison", "OpenCode", "unsloth", "bartowski"], "alternates": {"html": "https://wpnews.pro/news/qwen-3-6-27b-is-the-sweet-spot-for-local-development", "markdown": "https://wpnews.pro/news/qwen-3-6-27b-is-the-sweet-spot-for-local-development.md", "text": "https://wpnews.pro/news/qwen-3-6-27b-is-the-sweet-spot-for-local-development.txt", "jsonld": "https://wpnews.pro/news/qwen-3-6-27b-is-the-sweet-spot-for-local-development.jsonld"}}