The age of local LLMs is here

A developer revisits local LLMs six months after an initial discouraging assessment and finds the landscape transformed. With new models like Qwen3.6-27B, Qwen3.6-35B-A3B, and Qwen-Coder-Next-80B, local performance now rivals or exceeds free inference tiers, which have largely disappeared. Improvements in llama.cpp's router mode and context cache saving further enhance usability.

Half a year ago, I wanted to see for myself what we can currently have with local LLMs. I went down the rabbit hole, learned quite a lot in the process, and shared my results in an article https://dev.to/darkpenguin/local-llms-state-of-the-art-43n . The results were pretty discouraging: even with 32 GB VRAM, the best models I could run were both too slow and too dumb. At the same time, what you could get for free from inference providers was actually decent - and much faster.

I remember my conclusion: "Let's wait for the next generation of models, which looks very promising. If we can run something comparable to full-size Qwen3-Coder-480B locally, that would be ~~year of the Linux Desktop~~ age of fully capable local LLMs." And now this day has arrived.

Half a year later, I'm revisiting this question. And this time, the whole situation has turned upside-down. Almost none of the providers still have a free tier, and anything that's still free is barely good enough even for the simplest tasks, and is rate-limited all over. On the local side, the next Qwen lineup is out. So, that's what I'm going to be looking at.

Once again, I have two RX6800s, 16 GB each, and 64 GB RAM. On one hand, this is more VRAM than any "normal person" can have with one GPU - unless you've got something specifically for AI, like a unified-memory Mac or a DGX Spark. On the other hand, RX6800 is "pre-AI" - anything newer will have much better performance thanks to tensor processors.

Qwen3.6-27B: This is a dense model, so basically you can't run it at all on anything less than 32 GB VRAM. It's the slowest one, but also the best one if you can run it. Its accuracy is claimed to be on par with Claude 4.5 Opus, and better than Qwen3.5-397B-A17B. This is what I've been waiting for. It runs reasonably fast on my setup, so it's very much usable both in terms of performance and accuracy.

Qwen3.6-35B-A3B: This one is MoE, and it's pretty small, so it's the fastest one.
It's good for anything that doesn't require too much (i.e. for agentic tasks that don't need a lot of reasoning), and apparently better than GLM-4.7-Flash or Gemini-3.1-Flash-Lite (which is basically all you can get for free nowadays). So, we don't need all that anymore. And it's FAST.

Qwen-Coder-Next-80B: It's big, but it's also MoE, so you can offload some experts to the CPU. On my setup, its performance is somewhere between 3.6-dense and 3.6-MoE. Its accuracy is claimed to be near the full-sized GLM-4.7, or Kimi K2.5, or DeepSeek-V3.2. It's based on Qwen-Next, fine-tuned specifically for coding.

Other than the new models, there are quite a lot of other improvements. Last time, REAP was just appearing. This time, there is a REAM variant of Qwen-Coder-Next-80B - that's when they merge the weights instead of simply pruning them. And based on the benchmarks, its accuracy is within the margin of error of the full model.

In other news: we no longer need llama-swap - llama.cpp now has an experimental "router mode", where it loads and unloads models itself. You can specify different parameters per model in a config file. The config file format is less robust than what llama-swap had, but there is a very good reason to use it instead. Read on.

In real-life usage, you will probably want to switch models a lot. With a good enough SSD, loading a new model can be brought down to under a minute, and if you have enough RAM, then both smaller models fit into 64 GB of page cache nicely. But here comes another problem: context cache. If you switch to a faster model and then back to a smarter one, you'll have to reprocess your whole conversation. And that can easily take minutes. It would be really nice if there was a way to save that context cache, and restore it after switching back, wouldn't it?
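
The cache in question lives in what llama-server calls a slot, and the server has HTTP endpoints for writing a slot to disk and reading it back by hand. Below is a minimal Python sketch of those two requests. It assumes a single-model llama-server started with `--slot-save-path` and listening on 127.0.0.1:8080; the address, slot number, file name, and the helper names `slot_request` and `send` are illustrative assumptions, and behaviour in router mode may differ.

```python
import json
import urllib.request

# Assumption: a llama-server instance started with --slot-save-path listens here.
BASE_URL = "http://127.0.0.1:8080"


def slot_request(action: str, slot_id: int, filename: str,
                 base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request that saves or restores one slot's cache."""
    if action not in ("save", "restore"):
        raise ValueError(f"unsupported action: {action!r}")
    # The file name is resolved by the server relative to its --slot-save-path.
    url = f"{base_url}/slots/{slot_id}?action={action}"
    body = json.dumps({"filename": filename}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send a prepared request and return the server's JSON reply."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

The general idea is to call `send(slot_request("save", 0, "chat.bin"))` before switching away from a model and `send(slot_request("restore", 0, "chat.bin"))` after switching back. Whether a restored slot alone is sufficient depends on the model architecture.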
Turns out there is an option for saving the context cache in llama.cpp: when you switch models in router mode, it saves your slots to disk, then restores them. But this option does not work anymore, because for the newer models' attention mechanisms, you also have checkpoints, so only restoring slots is not enough. There is a PR to save checkpoints as well, but it's not merged yet (last time I checked). So, I've created my own fork of llama.cpp https://github.com/dark-penguin/llama.cpp/tree/cherrypick/slot-save-restore with this and some other yet-unmerged PRs included, which you are welcome to try.

Now that we've got usable local models, it's time to choose a harness. And there's a lot of choice. Not only coding - lately, there is talk about OpenClaw all around, and Hermes is emerging as a better alternative with some interesting features.

I've spent some time tinkering with Hermes, and quickly noticed two things. First, it feels very vibe-coded - even some basic features don't work. 8000+ open issues on its GitHub tell me that this is unlikely to be fixed any time soon. And second... Sometimes I have no idea what it's doing. I've tried turning on maximum verbosity in reasoning and tool calls, but that's one of the things that are bugged and don't work. I even developed a tiny proxy https://github.com/dark-penguin/llm-tools to intercept the requests that it doesn't show me, but that's a really janky solution. I can literally hear my LLMs working (apparently AMD GPUs are famous for their coil whine, which I consider a great feedback feature). I don't want to interrupt if it's taking so much time to think about something useful, but I don't even know what it is thinking about.

And then I saw Mario Zechner's talk about Pi https://www.youtube.com/watch?v=RjfbvDXpFls . Specifically, how he was annoyed with the same things, and created his own minimalistic harness.
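
For a sense of what an intercepting proxy like the one mentioned above involves, here is a minimal Python sketch. It is not the code from the linked llm-tools repository, just an illustration under assumptions: the harness speaks the OpenAI-style chat completions format, the backend listens on 127.0.0.1:8080, and the names `UPSTREAM`, `summarize`, `LoggingProxy`, and `serve` are made up for this example.

```python
import json
import sys
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Assumption: llama-server (or another OpenAI-compatible backend) listens here.
UPSTREAM = "http://127.0.0.1:8080"


def summarize(body: bytes) -> str:
    """Return a one-line summary of an OpenAI-style chat request body."""
    try:
        payload = json.loads(body)
    except ValueError:
        return f"<non-JSON body, {len(body)} bytes>"
    if not isinstance(payload, dict):
        return f"<unexpected JSON {type(payload).__name__}>"
    roles = [m.get("role", "?") for m in payload.get("messages", [])]
    tools = [t.get("function", {}).get("name", "?")
             for t in payload.get("tools", [])]
    return f"model={payload.get('model')} roles={roles} tools={tools}"


class LoggingProxy(BaseHTTPRequestHandler):
    """Log every POST body to stderr, then pass it on to UPSTREAM unchanged."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        print(summarize(body), file=sys.stderr)

        # Simplification: only the body is forwarded, not auth or other headers.
        request = urllib.request.Request(
            UPSTREAM + self.path,
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        try:
            with urllib.request.urlopen(request) as reply:
                status = reply.status
                content_type = reply.headers.get("Content-Type")
                data = reply.read()  # buffers the reply, so streams arrive at once
        except urllib.error.HTTPError as error:
            # Pass backend error responses through instead of crashing.
            status = error.code
            content_type = error.headers.get("Content-Type")
            data = error.read()

        self.send_response(status)
        self.send_header("Content-Type", content_type or "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port: int = 8081) -> None:
    """Run the proxy until interrupted; point the harness at this port."""
    ThreadingHTTPServer(("127.0.0.1", port), LoggingProxy).serve_forever()
```

Calling `serve()` and pointing the harness at port 8081 instead of 8080 would print one summary line per request. Buffering the whole reply keeps the sketch short, at the cost of losing token-by-token streaming.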
Zechner's mindset sounds very close to what I'm looking for, so the next thing I'm going to do is tinker with Pi to my heart's content. I've listened to his other talks, and in one of them, he also said that he is very impressed with the current state of local LLMs; if all frontier models were turned off today - which in my opinion we can say is already happening with all the price changes - he would be very happy with what we have now. There is also a lot of insight into AI in general.

So, here it is - the moment when AI becomes available to hardcore libre software people who don't want to rely on software running on someone else's machine. Which also means we can experiment for free and without fear of it changing or disappearing. So, let's experiment a lot. And let's remember to use it to learn, and not to contribute to the slop-pocalypse of libre software.