{"slug": "how-to-fit-qwen-3-6-35b-a3b-into-16gb-of-vram-run-it-with-llama-cpp-on-an-rtx", "title": "How to fit Qwen 3.6 35B A3B into 16GB of VRAM, & run it with Llama.cpp on an RTX 3080", "summary": "A guide explains how to run the Qwen 3.6 35B A3B model on an RTX 3080 with 16GB VRAM using Llama.cpp, offloading most layers to CPU to fit within memory constraints. The author details steps for installing Llama.cpp, downloading a quantized GGUF model from HuggingFace, and using the --n-gpu-layers flag to avoid CUDA errors.", "body_md": "# How to fit Qwen 3.6 35B A3B into 16GB of VRAM, & run it with Llama.cpp on an RTX 3080\n\n# The belly hangs over the belt, but it fits\n\nI have been informed by the internet that using Ollama is passé. Though it’s easy to get set up and use, and [popular](https://www.autodidacts.io/ocr-typewritten-manuscripts-with-local-ollama-vision-model/) [with](https://www.autodidacts.io/usable-local-ai-handwriting-recognition/) rank [beginners](https://www.autodidacts.io/increase-ollama-context-length-num-ctx/) like The Autodidacts, power users balk at the fact that:\n\n- Performance is worse than llama.cpp *et al.*\n- It’s a VC-backed startup that’s rapidly selling out on the local-first promise by touting paid cloud offerings\n- It relies on Llama.cpp for the heavy lifting, without giving credit where credit is due\n\nLong ago, I thought, okay, I should just switch to llama.cpp. I’ve used whisper.cpp, how hard can it be?\n\nLet me just say: there’s a *reason* Ollama is so popular.\n\nEvery time I tried to switch to llama.cpp, either a) it wouldn’t compile, or b) I got cryptic CUDA memory allocation errors. (Even when I went to run the commands to test for this article, it was broken, because of an interrupted brew upgrade.)\n\nBut now I am past that. 
After fruitless Googling, I threw command line arguments at it until it worked, and now this post is fruit for you to Google.\n\n### Step 0: Pre-requisites\n\nYou will need a system that can run the model you want to run. Two websites that are useful for getting a general idea of what’s likely to fit (though they aren’t the last word!):\n\n(Which one is ripping off which? I can’t tell.)\n\nFor this post, I’m assuming 16GB of VRAM. You will also want a fast-enough processor and plenty of system RAM, since we will be offloading some of the work to CPU + system RAM, and some to GPU + VRAM.\n\n(Find out how much VRAM you have with `nvidia-smi`, `rocm-smi`, or `lspci -v | grep -i vga -A 12`.)\n\n### Step 1: Install Llama.cpp\n\nOnce you figure it out, compiling it is easy. But llama.cpp changes so fast, I prefer to use the version packaged by brew, for simplicity and automatic updates.\n\nOnce you have Linuxbrew installed:\n\n```\nbrew install llama.cpp\n```\n\nAs long as your $PATH is set correctly, `llama-cli` and `llama-server` will now be available everywhere.\n\n### Step 2: Download the Model\n\nObviously, we’re going to be using a quantized model. 4-bit quantization seems to generally be considered a good compromise, and I’m not fancy, so I went with the crowd.\n\nLlama.cpp can download directly from HuggingFace. Unsloth seems like a reasonably trustworthy provider, so I went with [unsloth_Qwen3.6-35B-A3B-GGUF_Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/resolve/main/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf?ref=autodidacts.io) and [`mmproj-BF16.gguf`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/resolve/main/mmproj-BF16.gguf?ref=autodidacts.io).\n\n(FWIW I have no idea what I’m doing. You should probably go read Simon Willison or something. Except, they don’t write about stuff this basic. 
So you’re stuck with me!)\n\n### Step 3: Make it fit!\n\nAfter reading various things about offloading the MoE (mixture-of-experts) layers to CPU, and trying all kinds of things that didn’t work, I found one that did: offloading only ~16-24 of 40 layers to GPU. Otherwise, I got CUDA allocation errors (i.e., not enough VRAM).\n\nThe relevant flag:\n\n```\n--n-gpu-layers 16 # or -ngl 16\n```\n\nI was able to get up to 29 layers on GPU. 32 was too many. You can estimate how many will fit, if you want (layer_size ≈ model_size / num_layers), but it also depends on what else is using VRAM, so aim conservative.\n\nIt’s also instructive to see what Llama can find in the way of devices. It might be `CUDA0` or `Vulkan1` (or possibly `Klingon2`). Find out what you’ve got with: `llama-cli --list-devices`\n\n### Step 4: Run it!\n\nHere’s the full command (adjust the paths to match where you downloaded the model):\n\n```\nllama-cli --model /path/to/models/Qwen3.6-35B-A3B-GGUF/unsloth_Qwen3.6-35B-A3B-GGUF_Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --mmproj /path/to/Qwen3.6-35B-A3B-GGUF/mmproj-BF16.gguf --n-gpu-layers 24 -p \"Write a 100 word sermon on the moral and practical benefits of supporting independent bloggers such as Autodidacts.io with micro-donations, as a thank you when they write an article that improves your life in some tiny way.\"\n```\n\nHere is the turgid homily Qwen produced:\n\n*Beloved, heed the call of gratitude. When an independent voice on Autodidacts.io sharpens your mind or eases your burden, however slightly, return the blessing. A micro-donation is no small matter; it is moral alchemy, transforming gratitude into justice for the creator.*\n\n*Practically, these scattered offerings fuel the engine of truth. They ensure the lights remain on, servers hum, and wisdom flows free from corporate chains. By tipping for tiny improvements, you sustain the ecosystem of ideas. Support the independent blogger. Honor the craft. 
Let your thanks keep the wellspring fresh. Amen.*\n\nIt’s so bad, it kind of makes it *better*, because it makes its own point!\n\n### Or run a jazzy web interface\n\nIt turns out that `llama-server` is super easy to run, and much more pleasant to use. Really, the only reason I can think of to use `llama-cli` is for one-off questions and scripting in bash pipelines, when you don’t want to write a Python wrapper.\n\nStart it like this:\n\n```\nllama-server --model /path/to/models/Qwen3.6-35B-A3B-GGUF/unsloth_Qwen3.6-35B-A3B-GGUF_Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --mmproj /path/to/Qwen3.6-35B-A3B-GGUF/mmproj-BF16.gguf --n-gpu-layers 24\n```\n\nThen, go to [0.0.0.0:8080](http://0.0.0.0:8080/?ref=autodidacts.io) (the default host and port), and you get a nice web UI. I didn’t even know it existed, but it does, and it’s as good as Jan.ai or Lumo.\n\n### Conclusion\n\nI’m getting 14-22 t/s, which is pretty feeble, but adequate for my limited needs (mostly, looking up syntax I’ve forgotten when I’m offline). The low speed is probably because my CUDA install is, as usual, broken, and I’m using Vulkan.\n\n*[Every time I buy a laptop, I decide that* next time *I’m buying AMD graphics. And then, because I’m a cheapskate, I buy Nvidia, and spend the next half-decade fighting with my graphics drivers.]*\n\nSoon, I’ll write about using Llama.cpp for handwriting OCR. (Spoiler: Qwen 3.6-35B-A3B works even better than [Qwen3-VL:8b](https://www.autodidacts.io/usable-local-ai-handwriting-recognition/).)\n\nOther people use the 8-bit quant of Qwen 3.6 27B, and like it. I don’t know where the sweet spot is between a bigger model with more aggressive quantization, and a smaller model with less aggressive quantization. I haven’t tried 27B Q8 yet, but I probably will soon. 
If you’ve tried both, let me know your impressions!", "url": "https://wpnews.pro/news/how-to-fit-qwen-3-6-35b-a3b-into-16gb-of-vram-run-it-with-llama-cpp-on-an-rtx", "canonical_source": "https://www.autodidacts.io/how-to-fit-qwen3-6-35b-a3b-into-16gb-vram-run-with-llama-cpp-rtx-3080/", "published_at": "2026-06-13 21:49:20+00:00", "updated_at": "2026-06-13 22:01:52.474856+00:00", "lang": "en", "topics": ["large-language-models", "ai-tools", "developer-tools"], "entities": ["Qwen", "Llama.cpp", "Ollama", "HuggingFace", "Unsloth", "NVIDIA", "RTX 3080"], "alternates": {"html": "https://wpnews.pro/news/how-to-fit-qwen-3-6-35b-a3b-into-16gb-of-vram-run-it-with-llama-cpp-on-an-rtx", "markdown": "https://wpnews.pro/news/how-to-fit-qwen-3-6-35b-a3b-into-16gb-of-vram-run-it-with-llama-cpp-on-an-rtx.md", "text": "https://wpnews.pro/news/how-to-fit-qwen-3-6-35b-a3b-into-16gb-of-vram-run-it-with-llama-cpp-on-an-rtx.txt", "jsonld": "https://wpnews.pro/news/how-to-fit-qwen-3-6-35b-a3b-into-16gb-of-vram-run-it-with-llama-cpp-on-an-rtx.jsonld"}}