cd /news/large-language-models/how-to-fit-qwen-3-6-35b-a3b-into-16g… · home topics large-language-models article
[ARTICLE · art-26521] src=autodidacts.io ↗ pub= topic=large-language-models verified=true sentiment=· neutral

How to fit Qwen 3.6 35B A3B into 16GB of VRAM, & run it with Llama.cpp on an RTX 3080

A guide explains how to run the Qwen 3.6 35B A3B model on an RTX 3080 with 16GB VRAM using Llama.cpp, offloading most layers to CPU to fit within memory constraints. The author details steps for installing Llama.cpp, downloading a quantized GGUF model from HuggingFace, and using the --n-gpu-layers flag to avoid CUDA errors.

read5 min publishedJun 13, 2026

I have been informed by the internet that using Ollama is passé. Though it’s easy to get set up and use, and popular with rank beginners like The Autodidacts, power users balk at the fact that:

  • Performance is worse than llama.cpp et al. - It’s a VC-backed startup that’s rapidly selling-out on the local-first promise by touting paid cloud offerings
  • It relies on Llama.cpp for the heavy lifting, without giving credit where credit is due

Long ago, I thought, okay, I should just switch to llama.cpp. I’ve used whisper.cpp, how hard can it be?

Let me just say: there’s a reason Ollama is so popular.

Every time I tried to switch to llama.cpp, it either a) it wouldn’t compile, or b) I got cryptic CUDA memory allocation errors. (Even when I went to run the commands to test for this article, it was broken, because of an interrupted brew upgrade.)

But now I am past that. After fruitless Googling, I threw command line arguments at it until it worked, and now this post is fruit for you to Google.

Step 0: Pre-requisites

You will need a system that can run the model you want to run. Two websites that are useful for getting a general idea of what’s likely to fit (though they aren’t the last word!):

(Which one is ripping off which? I can’t tell.)

For this post, I’m assuming 16gb of VRAM. You will also want a fast-enough processor and plenty of system RAM, since we will be off some of the work to CPU + system ram, and some to GPU + VRAM.

(Find out how much VRAM you have with nvidia-smi

, rocm-smi

, or lspci -v | grep -i vga -A 12

)

Step 1: Install Llama.cpp

Once you figure it out, compiling it is easy. But llama.cpp changes so fast, I prefer to use the version packaged by brew, for simplicity and automatic updates.

Once you have Linuxbrew installed:

brew install llama.cpp

As long as your $PATH is set correctly, llama-cli and llama-server will now be available everywhere.

Step 2: Download the Model

Obviously, we’re going to be using a quantized model. 4 bit quantization seems to generally be considered a good compromise, and I’m not fancy, so I went with the crowd.

Llama.cpp can download directly from HuggingFace. Unsloth seems like a reasonably trustworthy provider, so I went with unsloth_Qwen3.6-35B-A3B-GGUF_Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf and

.

mmproj-BF16.gguf

(FWIW I have no idea what I’m doing. You should probably go read Simon Willison or something. Except, they don’t write about stuff this basic. So you’re stuck with me!)

Step 3: Make it fit!

After reading various things about off the MoE (mixture of expert) layers to CPU, and trying all kinds of things that didn’t work, I found one that did: off only ~16-24 of 40 layers to GPU. Otherwise, I got CUDA allocation errors (ie, not enough VRAM).

The relevant flag:

--n-gpu-layers 16 # or -ngl 16

I was able to get up to 29 layers on GPU. 32 was too many. You can calculate exactly how many you can get, if you want (layer_size = model_size / num_layers), but it also depends on what else is using VRAM, so aim conservative.

It’s also instructive to see what Llama can find in the way of devices. It might be CUDA0

or Vulkan1

(or possibly Klingon2

). Find out what you’ve got with: llama-cli –list-devices

Step 4: Run it!

Here’s the full command (adjust the paths to match where you downloaded the model):

llama-cli --model /path/to/models/Qwen3.6-35B-A3B-GGUF/unsloth_Qwen3.6-35B-A3B-GGUF_Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --mmproj /path/to/Qwen3.6-35B-A3B-GGUF/mmproj-BF16.gguf --n-gpu-layers 24 -p "Write a 100 word sermon on the moral and practical benefits of supporting independent bloggers such as Autodidacts.io with micro-donations, as a thank you when they write an article that improves your life in some tiny way."

Here is the turgid homily Qwen produced:

Beloved, heed the call of gratitude. When an independent voice on Autodidacts.io sharpens your mind or eases your burden, however slightly, return the blessing. A micro-donation is no small matter; it is moral alchemy, transforming gratitude into justice for the creator.

Practically, these scattered offerings fuel the engine of truth. They ensure the lights remain on, servers hum, and wisdom flows free from corporate chains. By tipping for tiny improvements, you sustain the ecosystem of ideas. Support the independent blogger. Honor the craft. Let your thanks keep the wellspring fresh. Amen.

It’s so bad, it kind of makes it better, because it makes its own point!

Or run a jazzy web interface

It turns out that llama-server

is super easy to run, and much more pleasant to use. Really, the only reason I can think of to use llama-cli is for one-off questions and scripting in bash pipelines, when you don’t want to write a Python wrapper.

Start it like this:

llama-server --model /path/to/models/Qwen3.6-35B-A3B-GGUF/unsloth_Qwen3.6-35B-A3B-GGUF_Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --mmproj /path/to/Qwen3.6-35B-A3B-GGUF/mmproj-BF16.gguf --n-gpu-layers 24

Then, go to 0.0.0.0:8080 (the default host and port), and you get a nice web UI. I didn’t even know it existed, but it does, and it’s as good as Jan.ai or Lumo.

Conclusion

I’m getting 14-22 t/s, which is pretty feeble, but adequate for my limited needs (mostly, looking up syntax I’ve forgotten when I’m offline). This is probably because my CUDA install is, as usual, broken, and I’m using Vulkan.

[Every time I buy a laptop, I decide that next time I’m buying AMD graphics*. And then, because I’m a cheapskate, I buy Nvidia, and spend the next half-decade fighting with my graphics drivers.]*

Soon, I’ll write about using Llama.cpp for handwriting OCR. (Spoiler: Qwen 3.6-35B-A3B works even better than Qwen3-VL:8b.)

Other people use the 8 bit quant of Qwen 3.6 27B, and like it. I don’t know where the sweet spot is between a bigger model with more aggressive quantization, and a smaller model with less aggressive quantization. I haven’t tried 27B Q8 yet, but I probably will soon. If you’ve tried both, let me know your impressions!

── more in #large-language-models 4 stories · sorted by recency
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/how-to-fit-qwen-3-6-…] indexed:0 read:5min 2026-06-13 ·