Two Hacker News threads in the same week said the same thing: local models are finally good enough for real coding work. One asked "Has anyone replaced Claude/GPT with a local model for daily coding?" (1,260 points). The other declared "Running local models is good now" (1,350 points). Both triggered hundreds of replies from developers who had switched, and hundreds more pointing out exactly where local models still fall short. This guide covers when local is genuinely good enough, when cloud APIs still win, and why the right answer for most developers is probably both.
What you're actually choosing between #
Most comparisons treat this as a binary: run models on your own GPU or pay for Claude/GPT. There are actually three distinct options, and the middle one gets skipped in most articles.
Self-hosted inference runs on your hardware and your GPU, so zero data leaves your machine. Tools: Ollama, llama.cpp, vLLM. Upfront hardware cost, but no per-token fees and full data control.
Hosted open-weight puts open-source models on cloud infrastructure (Together.ai, Fireworks.ai). No hardware to buy, data does leave your machine, but prices are far below proprietary APIs. Together.ai's gpt-oss-20B runs at $0.05 input / $0.20 output per million tokens. Qwen3 235B FP8 is $0.20 / $0.60.
Proprietary cloud APIs (Claude, GPT-5, Gemini) offer frontier model quality and no setup, with pay-per-token pricing. Claude Sonnet 4.6 is $3.00 / $15.00 per million tokens. GPT-5.4 is $2.50 / $15.00.
The hosted open-weight option is the one most articles ignore. It undercuts proprietary APIs by 5β10x at light to medium usage, requires no hardware investment, and gives you access to models that now rival Claude Sonnet on routine coding tasks.
Ollama vs llama.cpp: which tool to use #
If you're going self-hosted, two tools dominate: Ollama and llama.cpp. Ollama is a user-friendly wrapper built on top of the same inference technology as llama.cpp, so they're more complementary than competing. The difference is how much setup friction you're willing to trade for control and performance.
Ollama gets you running in minutes:
curl -fsSL https://ollama.com/install.sh | shollama pull qwen2.5-coder:7bollama run qwen2.5-coder:7b
No compiling, no configuration files, no manual model downloads. The library includes qwen2.5-coder, qwen3-coder, deepseek-r1, codellama, gemma4, and most other coding-relevant models.
llama.cpp is the raw inference engine. Getting it running takes 30β60 minutes. You compile the project or download a binary, manually fetch model files from Hugging Face, then decide on quantization levels, GPU layer counts, and batch sizes. The payoff is direct control over inference parameters, access to exotic quantization formats like Unsloth quants and NVFP4, and roughly 10β20% faster throughput than Ollama on the same hardware. llama.cpp now ships a built-in web UI server:
llama-server -hf ggml-org/qwen2.5-coder-7b-GGUF --port 8080
For most solo developers, Ollama is the right starting point. The performance gap is small enough for interactive coding work that the setup savings are worth it. Switch to llama.cpp if you need quantization formats outside what Ollama supports, or you're squeezing throughput from dedicated hardware.
How much VRAM you need for coding models #
A consumer GPU at 16GB VRAM is the practical entry point for coding use. An RTX 4070 Ti Super (16GB) can run Qwen2.5-Coder-32B using Q4_K_M quantization, which fits the model into available VRAM with acceptable quality. At 8GB VRAM, you're limited to 7Bβ16B models that score noticeably lower on coding benchmarks. A 24GB card (RTX 4090) lets you run the same 32B model at higher precision.
One caveat most comparisons skip is that Q4_K_M quantization compresses the model significantly. The quality is good, but measurably below full-precision. When you see benchmarks showing a local model at "85β90% of Claude quality," you're comparing a compressed model against Claude running at full precision, which is just the tradeoff you make to run locally.
On Apple Silicon, Ollama's June 2026 MLX engine update delivers 55 tokens/second on an M5 Max using NVFP4 quantization (up from 46 tok/s with Q4_K_M), with roughly half the quality loss of older 4-bit formats.
Consumer GPU throughput tops out around 15β25 tokens/second. Claude's infrastructure runs at 60β80. For short prompts, local is faster (no network round-trip, 50β200ms time-to-first-token vs 200β800ms for cloud APIs). For long outputs (a 200-line function, a test suite), cloud is significantly faster.
Routine tasks: where local models hold up #
In a hands-on benchmark running 50 prompts across five coding task categories (function generation, bug detection, refactoring, multi-file context, explanation) on an RTX 4070 Ti Super ($489) with Ollama, Qwen2.5-Coder-32B reached 85β90% of Claude Sonnet 4's quality on routine tasks. On code explanation specifically, it matched or slightly beat Claude on several individual prompts. On short prompts, CodeStral 22B averaged 1.4 seconds per response versus Claude's 2.1 seconds.
Three months of real-world use from the same developer produced a practical split. Local handled 70β80% of daily coding prompts at acceptable quality. The remaining 20β30% (complex debugging, multi-file refactors, architecture questions) still went to Claude.
Local wins clearly on:
Privacy-sensitive code: nothing leaves your machine. Proprietary codebases, client work, HIPAA-regulated environments.Offline access: no internet dependency, works air-gapped.Autocomplete and short prompts: lower latency than cloud round-trip when consistency matters more than peak throughput.High sustained volume: at 10M+ tokens/day, fixed hardware cost amortizes and per-token cost drops sharply.
Complex tasks: where cloud APIs still lead #
Multi-file context is where the gap becomes significant. In the same benchmark, Claude's larger effective context window and stronger cross-file reasoning gave it a 60% advantage over the best local model on tasks spanning multiple files. Bug detection showed a similar gap, with cloud models handling subtle logic errors that require holding complex state across reasoning steps better than quantized local models.
The setup overhead is also real and often excluded from cost comparisons. Optimizing a local stack (tuning system prompts per model, figuring out context chunking, managing model swaps) takes 20+ hours before it runs smoothly, versus calling an endpoint with a cloud API.
Cloud holds the advantage on:
Multi-file and complex reasoning: 60% advantage in benchmarks on cross-file tasks.Long output generation: 60β80 tok/s vs 15β25 tok/s locally.Frontier models: Claude Fable 5 ($10 / $50 per MTok) and GPT-5.5 ($5 / $30) don't run locally; weights aren't publicly released.Low-volume economics: at 500K tokens/day, local hardware costs $6,457 in year one vs $1,260 for GPT-5.4 via API. Break-even doesn't arrive until months 18β24 at medium volume.Burst scaling: cloud absorbs traffic spikes without capacity planning.
The hosted open-weight middle ground #
Hosted open-weight inference on providers like Together.ai is the option that doesn't fit the local vs cloud framing, and it's often the right call for developers who want open-model pricing without managing hardware.
Your code still leaves your machine, as these are cloud-hosted models. But for code that isn't sensitive enough to require on-machine inference, this path undercuts Claude and GPT pricing by 5β50x while avoiding the upfront hardware investment. At 500K tokens/day, open-weight hosted APIs cost around $360/year vs $1,260 for GPT-5.4.
When to go local and when not to #
The crossover points follow from token volume, data sensitivity, and task type, not preference.
Go self-hosted if:
- Your code can't leave your machine (regulatory requirements, proprietary IP, HIPAA)
- Your daily token volume is above 5M and growing (hardware starts amortizing around months 18β24)
- You need offline or air-gapped access
- Latency consistency for in-editor tooling matters more than peak throughput
Use hosted open-weight if:
- You want open-model pricing without hardware investment
- Code sensitivity doesn't require on-machine inference
- You need scale without managing GPU capacity
Stick with proprietary cloud APIs if:
- Your volume is low (under 500K tokens/day), where hardware cost doesn't pay off in year one
- Multi-file reasoning, complex debugging, or architecture work is your main bottleneck
- You need frontier model quality
- You're still iterating toward product-market fit and speed matters more than cost
The hybrid most developers land on: local or hosted open-weight for routine, short, private tasks. Cloud API for multi-file work, complex debugging, and anything requiring frontier reasoning.
The 2026 hardware and open-weight model ecosystem has moved the self-hosting break-even point roughly 40% lower than it stood in 2024. At light individual usage, cloud APIs remain the rational choice, and the frontier models that can't yet run locally are still where the hardest problems get solved fastest.