Introducing Mellum2: A 12B Mixture-of-Experts Model by JetBrains


Source: huggingface.co · Published: 2026-06-01T15:45Z · Topic: large-language-models

JetBrains released Mellum2, a 12-billion-parameter Mixture-of-Experts model trained on natural language and code that activates only 2.5 billion parameters per token for efficient inference. The open-source model, licensed under Apache 2.0, achieves more than 2x faster inference than similarly sized models while delivering competitive benchmark performance across code generation, reasoning, and math tasks. Mellum2 is designed for latency-sensitive production workloads including routing, RAG pipelines, sub-agents, and private deployments in multi-model AI systems.

Team Article · Published June 1, 2026

Mellum2 is a 12B-parameter Mixture-of-Experts model trained from scratch on natural language and code.
The model activates only 2.5B parameters per token, making it efficient for high-throughput, low-latency inference. Mellum2 can be used for routing, RAG, summarization, sub-agents, high-throughput coding features, and private deployments.
It is released under the Apache 2.0 license.
Compared with similar-sized models, Mellum2 delivers competitive benchmark performance while achieving more than 2x faster inference.
Download the model on Hugging Face: https://huggingface.co/collections/JetBrains/mellum-2
For architecture details, training setup, benchmarks, and evaluation methodology, read the full technical report: https://arxiv.org/pdf/2605.31268

Today we’re releasing Mellum2, an open Mixture-of-Experts model optimized for low-latency text-and-code workloads. Mellum originally started as a code completion model. With Mellum2, we extend that foundation to a broader set of natural language and software engineering tasks while keeping the model focused on efficient inference and deployability.

Modern AI systems increasingly rely on multiple model calls: routing, retrieval, summarization, planning, validation, and tool use. Many of these operations are latency-sensitive and do not require the largest available model. Mellum2 targets these workloads.

Benchmark highlights

In our technical report, we evaluate Mellum2 across code generation, reasoning, science, and math benchmarks. Mellum2 is competitive with similarly sized open models while delivering more than 2x faster inference, making it suitable for high-throughput production workloads.

Model architecture

Mellum2 is a Mixture-of-Experts model:

Model	Total parameters	Active parameters per token	Modality	License
Mellum2	12B	2.5B	Text and code	Apache 2.0

The MoE architecture keeps total model capacity high while activating only a subset of parameters for each token. This makes inference more efficient and helps reduce serving cost for real-time workloads. Mellum2 is intentionally focused on text and code rather than multimodal tasks. This specialization keeps the model compact and efficient for software engineering workloads.
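To make the sparse-activation idea concrete, here is a minimal sketch of top-k expert routing for a single token. The expert count (8), the number of active experts (2), and the toy scalar "experts" are illustrative assumptions, not Mellum2's actual configuration; the technical report is the place to look for the real architecture. Only the final ratio uses figures stated in this article.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k):
    """Pick the top_k experts for one token and renormalize their gate weights."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}

def moe_forward(x, experts, router_logits, top_k):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = route_token(router_logits, top_k)
    return sum(weight * experts[i](x) for i, weight in gates.items()), gates

# Hypothetical toy setup: 8 experts, 2 active per token (NOT Mellum2's real config).
NUM_EXPERTS, TOP_K = 8, 2
experts = [lambda x, scale=i + 1: scale * x for i in range(NUM_EXPERTS)]

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
output, gates = moe_forward(1.0, experts, logits, TOP_K)

print(f"experts used for this token: {sorted(gates)}")
print(f"gate weights sum to {sum(gates.values()):.3f}")

# Figures from the article: 2.5B of 12B parameters are active per token.
active_fraction = 2.5 / 12
print(f"active share of parameters: {active_fraction:.1%}")
```

The experts that are not selected are never called, which is where the inference savings come from: per-token compute scales with the active parameters rather than the total.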

Key use cases

Routing and orchestration

Mellum2 works well as a lightweight routing and orchestration model in multi-model systems, including prompt classification, tool selection, and intermediate control-flow steps.
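As a rough illustration of prompt classification for routing, the sketch below asks a small model for a label and falls back to a default route when the answer is not recognized. The route names, the prompt wording, and the `complete` callable are all hypothetical; `fake_complete` is an offline stand-in so the example runs without a model, and a real deployment would swap in whatever client serves Mellum2.

```python
from typing import Callable

# Hypothetical route labels, chosen for illustration only.
ROUTES = ("code_completion", "summarization", "deep_reasoning")

def build_prompt(user_request: str) -> str:
    """Build a classification prompt listing the allowed labels."""
    labels = ", ".join(ROUTES)
    return (
        f"Classify the request into exactly one of: {labels}.\n"
        f"Request: {user_request}\n"
        "Answer with the label only."
    )

def route(user_request: str, complete: Callable[[str], str],
          fallback: str = "deep_reasoning") -> str:
    """Ask the small model for a label; fall back if the answer is not a known route."""
    answer = complete(build_prompt(user_request)).strip().lower()
    return answer if answer in ROUTES else fallback

def fake_complete(prompt: str) -> str:
    """Offline stand-in for a model call (keyword matching, not a real model)."""
    request = prompt.split("Request: ", 1)[1].splitlines()[0].lower()
    if "summar" in request:
        return "summarization"
    if "function" in request or "bug" in request:
        return "Code_Completion "  # messy casing/whitespace, as model output often is
    return "not sure"

print(route("Summarize this changelog", fake_complete))
print(route("Finish this function body", fake_complete))
print(route("Prove this theorem", fake_complete))
```

Normalizing the answer and validating it against a closed label set keeps a malformed reply from sending traffic to a route that does not exist.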

RAG pipelines

The model is well suited for latency-sensitive retrieval pipelines, including context compression, summarization, and retrieval post-processing.
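One way to picture context compression in a retrieval pipeline is sketched below: retrieved chunks are summarized in relevance order and packed into a fixed character budget. The `summarize` callable, the score-and-text chunk format, and the character-based budget are assumptions made for illustration; `first_sentence` is an offline stand-in for a model-backed summarizer.

```python
from typing import Callable, List, Tuple

def compress_context(
    chunks: List[Tuple[float, str]],
    summarize: Callable[[str], str],
    budget_chars: int,
) -> str:
    """Summarize retrieved chunks in relevance order until the budget is used up."""
    parts: List[str] = []
    used = 0
    for _score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        summary = summarize(text).strip()
        if not summary:
            continue
        cost = len(summary) + 1  # +1 for the newline separator
        if used + cost > budget_chars:
            break
        parts.append(summary)
        used += cost
    return "\n".join(parts)

def first_sentence(text: str) -> str:
    """Offline stand-in for a model-backed summarizer: keep the first sentence."""
    head = text.split(". ", 1)[0]
    return head if head.endswith(".") else head + "."

# Hypothetical retrieval results as (relevance score, chunk text) pairs.
retrieved = [
    (0.42, "Caching is optional. It is off by default."),
    (0.91, "The parser rejects tabs. Use spaces for indentation."),
    (0.77, "Timeouts default to 30 seconds. They can be raised per request."),
]
print(compress_context(retrieved, first_sentence, budget_chars=70))
```

With a 70-character budget, the two most relevant chunks fit and the least relevant one is dropped.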

Sub-agents

Mellum2 can be used for agent subtasks such as planning, validation, transformation, and context preparation, reducing the need to invoke larger models for intermediate operations.
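The idea of keeping larger models out of intermediate steps can be sketched as a simple escalation pattern: a small model drafts, a validator checks the draft, and the larger model is called only when the check fails. All names here are hypothetical, and the three stand-in functions exist only so the example runs offline; this is one possible arrangement, not a pattern described in the source.

```python
from typing import Callable, Tuple

def answer_with_escalation(
    task: str,
    small_model: Callable[[str], str],
    large_model: Callable[[str], str],
    validate: Callable[[str, str], bool],
) -> Tuple[str, str]:
    """Return (answer, tier), calling the large model only if the draft fails validation."""
    draft = small_model(task)
    if validate(task, draft):
        return draft, "small"
    return large_model(task), "large"

# Offline stand-ins (assumptions for illustration, not real model behavior).
def small_stub(task: str) -> str:
    """Pretend the small model only handles short tasks."""
    return task.upper() if len(task) < 20 else ""

def large_stub(task: str) -> str:
    """Pretend the large model handles everything."""
    return task.upper()

def non_empty(task: str, draft: str) -> bool:
    """Trivial validator: accept any non-blank draft."""
    return bool(draft.strip())

print(answer_with_escalation("rename foo", small_stub, large_stub, non_empty))
print(answer_with_escalation("refactor the whole billing module", small_stub, large_stub, non_empty))
```

In practice the validator could itself be a small-model call, a schema check, or a test run; the structure stays the same.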

Private deployment

Because Mellum2 is open and efficient to serve, it can be deployed in self-hosted environments involving proprietary code or internal data.

Why well-scoped models matter

As AI systems mature, the most effective architectures are becoming less monolithic. A single frontier model can be powerful, but production systems often need several specialized components working together: retrievers, routers, code-aware models, validators, tool callers, and larger reasoning models. We think of Mellum2 as a “focal” model: a fast, well-scoped model optimized for high-frequency tasks inside larger AI systems. The goal is not to replace every model in the stack. The goal is to make the stack faster, cheaper, and easier to control.

Getting started with Mellum2

If you are building AI systems for software engineering – inside an IDE, in a RAG pipeline, as part of an agent workflow, or on private infrastructure – Mellum2 is ready to try.

Read original on huggingface.co → huggingface.co/blog/JetBrains/mellum2-launch
