JetBrains just open-sourced Mellum2, a 12B Mixture-of-Experts coding model released under Apache 2.0. The pitch is direct: it runs where Claude Code and OpenAI Codex cannot. Air-gapped networks. Compliance-locked infrastructure. Finance and healthcare orgs with data residency requirements that make external API calls a non-starter. If you’ve been stuck choosing between capable AI coding assistance and keeping your source code off third-party servers, Mellum2 is worth a look.
Not What You Think It Is #
Before you run the benchmark comparison in your head: JetBrains isn’t positioning Mellum2 as a frontier model killer. They’re calling it a “focal model” — a fast, specialized component designed to live inside larger AI pipelines, not replace Claude or GPT outright. Think of it as the inner-loop specialist: the router that decides which model handles a given task, the validator that checks another model’s output, the RAG post-processor that summarizes retrieved context at low latency. It coexists with bigger models. It doesn’t race them.
That framing matters because it resets expectations appropriately and reveals the genuine design goal. Mellum2 is built for the infrastructure layer of agentic AI, not the conversational surface. JetBrains published the full announcement on their AI blog alongside the model release.
The Architecture: MoE Makes Inference Cheap #
Mellum2 uses a Mixture-of-Experts design: 12B total parameters, but only 2.5B are active per token. Of its 64 expert subnetworks, only 8 activate for any given token. This matters for self-hosting because inference cost tracks active parameters, not total. You get model quality that competes with full 12B dense models while running closer to a 2.5B inference budget. JetBrains reports 2x+ faster throughput than similarly-sized dense models in batched scenarios, and roughly matched single-request speed against Qwen2.5-7B on an H100 (192 vs 193 tokens/sec).
The context window is 8,192 tokens by default, extended to 131,072 tokens via YaRN in the long-context checkpoint — sufficient for most real-world agentic tasks. The full model family, including GGUF-quantized versions, is available on the JetBrains Hugging Face page.
Benchmarks: The Honest Picture #
Mellum2 ships in two fine-tuned variants worth knowing: Instruct and Thinking. The Thinking variant is where the serious numbers live.
| Benchmark | Instruct | Thinking |
|---|---|---|
| LiveCodeBench v6 | 37.2% | 69.9% |
| AIME 2025+2026 | 41.7% | 58.4% |
The 69.9% on LiveCodeBench v6 is competitive for an open model you can run entirely on-prem. The AIME number is more complicated: the Thinking variant scores 58.4%, which trails Qwen3.5 4B at 68.3%. That’s not a typo — a 4B dense model edges out a 12B MoE on math reasoning. The comparison is misleading without context: MoE active-parameter counts (2.5B here) are what matter for compute, not the 12B total. The math performance gap is real, but the trade-off is inference speed and self-hosted deployment. The New Stack’s breakdown digs into where Mellum2 holds its ground and where it doesn’t.
Running It Today #
The weights are on Hugging Face. JetBrains also ships pre-quantized GGUF versions (Q4_K_M) for llama.cpp and Ollama, though early community reports flag compatibility issues with Ollama’s handling of the custom MoE architecture — test before you build a workflow around it.
For production use, vLLM is the recommended path. The vLLM Recipes page for Mellum2 has the full configuration reference.
vllm serve JetBrains/Mellum2-12B-A2.5B-Thinking --max-model-len 131072 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser hermes
The --tool-call-parser hermes
flag gives you MCP-compatible tool use out of the box — relevant if you’re building agentic pipelines on top. JetBrains IDE users (PyCharm, GoLand, WebStorm) can connect a local Mellum2 instance via a single checkbox in AI Assistant settings, choosing Ollama or LM Studio as the backend.
Who Should Actually Try This #
Mellum2 is a strong candidate if you’re operating in an environment where external API calls to AI providers aren’t an option, or if you’re building multi-model pipelines and want a fast, cheap inner-loop model with code awareness. It’s not the model to reach for if you need maximum reasoning quality on math-heavy tasks — Qwen3.5 or DeepSeek models still have the edge there.
The Apache 2.0 license removes the licensing friction that makes some open models difficult to embed in commercial products. That alone makes Mellum2 worth evaluating for teams that need a production-deployable, compliance-friendly coding model with no external dependencies.