Mellum2: JetBrains Open-Sources a 12B MoE Coding Model

wpnews.pro

cd /news/large-language-models/mellum2-jetbrains-open-sources-a-12b… · home › topics › large-language-models › article

[ARTICLE · art-30137] src=byteiota.com ↗ pub=2026-06-16T20:08Z topic=large-language-models verified=true sentiment=↑ positive

Mellum2: JetBrains Open-Sources a 12B MoE Coding Model

JetBrains open-sourced Mellum2, a 12B Mixture-of-Experts coding model under Apache 2.0, designed for air-gapped and compliance-locked environments where external API calls are prohibited. The model uses a MoE architecture with 2.5B active parameters per token, achieving competitive coding benchmarks while enabling on-premise deployment for finance and healthcare organizations.

read4 min views27 publishedJun 16, 2026

JetBrains just open-sourced Mellum2, a 12B Mixture-of-Experts coding model released under Apache 2.0. The pitch is direct: it runs where Claude Code and OpenAI Codex cannot. Air-gapped networks. Compliance-locked infrastructure. Finance and healthcare orgs with data residency requirements that make external API calls a non-starter. If you’ve been stuck choosing between capable AI coding assistance and keeping your source code off third-party servers, Mellum2 is worth a look.

Not What You Think It Is #

Before you run the benchmark comparison in your head: JetBrains isn’t positioning Mellum2 as a frontier model killer. They’re calling it a “focal model” — a fast, specialized component designed to live inside larger AI pipelines, not replace Claude or GPT outright. Think of it as the inner-loop specialist: the router that decides which model handles a given task, the validator that checks another model’s output, the RAG post-processor that summarizes retrieved context at low latency. It coexists with bigger models. It doesn’t race them.

That framing matters because it resets expectations appropriately and reveals the genuine design goal. Mellum2 is built for the infrastructure layer of agentic AI, not the conversational surface. JetBrains published the full announcement on their AI blog alongside the model release.

The Architecture: MoE Makes Inference Cheap #

Mellum2 uses a Mixture-of-Experts design: 12B total parameters, but only 2.5B are active per token. Of its 64 expert subnetworks, only 8 activate for any given token. This matters for self-hosting because inference cost tracks active parameters, not total. You get model quality that competes with full 12B dense models while running closer to a 2.5B inference budget. JetBrains reports 2x+ faster throughput than similarly-sized dense models in batched scenarios, and roughly matched single-request speed against Qwen2.5-7B on an H100 (192 vs 193 tokens/sec).

The context window is 8,192 tokens by default, extended to 131,072 tokens via YaRN in the long-context checkpoint — sufficient for most real-world agentic tasks. The full model family, including GGUF-quantized versions, is available on the JetBrains Hugging Face page.

Benchmarks: The Honest Picture #

Mellum2 ships in two fine-tuned variants worth knowing: Instruct and Thinking. The Thinking variant is where the serious numbers live.

Benchmark	Instruct	Thinking
LiveCodeBench v6	37.2%	69.9%
AIME 2025+2026	41.7%	58.4%

The 69.9% on LiveCodeBench v6 is competitive for an open model you can run entirely on-prem. The AIME number is more complicated: the Thinking variant scores 58.4%, which trails Qwen3.5 4B at 68.3%. That’s not a typo — a 4B dense model edges out a 12B MoE on math reasoning. The comparison is misleading without context: MoE active-parameter counts (2.5B here) are what matter for compute, not the 12B total. The math performance gap is real, but the trade-off is inference speed and self-hosted deployment. The New Stack’s breakdown digs into where Mellum2 holds its ground and where it doesn’t.

Running It Today #

The weights are on Hugging Face. JetBrains also ships pre-quantized GGUF versions (Q4_K_M) for llama.cpp and Ollama, though early community reports flag compatibility issues with Ollama’s handling of the custom MoE architecture — test before you build a workflow around it.

For production use, vLLM is the recommended path. The vLLM Recipes page for Mellum2 has the full configuration reference.

vllm serve JetBrains/Mellum2-12B-A2.5B-Thinking   --max-model-len 131072   --reasoning-parser qwen3   --enable-auto-tool-choice   --tool-call-parser hermes

The --tool-call-parser hermes

flag gives you MCP-compatible tool use out of the box — relevant if you’re building agentic pipelines on top. JetBrains IDE users (PyCharm, GoLand, WebStorm) can connect a local Mellum2 instance via a single checkbox in AI Assistant settings, choosing Ollama or LM Studio as the backend.

Who Should Actually Try This #

Mellum2 is a strong candidate if you’re operating in an environment where external API calls to AI providers aren’t an option, or if you’re building multi-model pipelines and want a fast, cheap inner-loop model with code awareness. It’s not the model to reach for if you need maximum reasoning quality on math-heavy tasks — Qwen3.5 or DeepSeek models still have the edge there.

The Apache 2.0 license removes the licensing friction that makes some open models difficult to embed in commercial products. That alone makes Mellum2 worth evaluating for teams that need a production-deployable, compliance-friendly coding model with no external dependencies.

source & further reading

byteiota.com — original article VS Code 1.131: See Your Subagents, Speak Your Code EU AI Act Enforcement Is Live: What Developers Must Do OpenAI Terraform Provider v1.0: Manage API Projects as Code

~/api · this article 200

$curl api.wpnews.pro/v1/news/mellum2-jetbrains-open-s…

Read original on byteiota.com → byteiota.com/mellum2-jetbrains-moe-coding-model/

mentioned entities

JetBrains

Mellum2

Claude Code

OpenAI Codex

Hugging Face

vLLM

Ollama

Qwen3.5

metadata

slugmellum2-jetbrains-open-sources-a-12b-moe-coding-model

topic#large-language-models

secondary3 topics

sentimentpositive

canonicalbyteiota.com

navigation

← prevShow HN: ctx is now open-source,…

next →SpaceX valuation balloons to $2.…

── more in #large-language-models 4 stories · sorted by recency

agent-browser.dev · 2 Aug · #large-language-models

Agent-Browser – Browser Automation for AI

github.com · 2 Aug · #large-language-models

AirLLM: Inference 2.8T Kimi K3 on a single 4GB GPU

gizmodo.com · 2 Aug · #large-language-models

OpenAI Smuggled the Announcement of Astra, Its Next AI Model, Into a Blog Post About Math

promptcube3.com · 2 Aug · #large-language-models

How Much VRAM to Fine-Tune an LLM? 12 to 120 GB

── more on @jetbrains 3 stories trending now

wpnews · 1 Aug · #ai-products

OpenAI Atlas Shuts Down August 9: Migration Guide

wpnews · 2 Aug · #developer-tools

Agent-Browser – Browser Automation for AI

wpnews · 1 Aug · #ai-agents

Quality Isn't Accidental — Maker/Checker Separation and Automated Validation

sponsored brought to you by zahid.host 4,200+ EU-deployed projects

reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main

→ Live at https://your-agent.zahid.host ✓

Get free account → Pricing

from €0/mo · no card required