JetBrains Releases Mellum2: A 12B MoE Model for Fast, Specialized Tasks in Multi-Model AI Pipelines

JetBrains released Mellum2, a 12-billion-parameter Mixture-of-Experts model with 2.5 billion active parameters per token, under the Apache 2.0 license. The model is specialized for software engineering tasks including code generation, debugging, and agentic coding, and is designed as a fast, specialized component for multi-model AI pipelines rather than a standalone frontier model. JetBrains open-sourced six checkpoints covering the full training pipeline, positioning Mellum2 to serve as a "focal model" for low-latency, specialized tasks within larger AI systems.

JetBrains released Mellum2, open-sourcing the weights under the Apache 2.0 license. The first version of Mellum was a completion-focused 4B dense model. Mellum2 is its successor: a general-purpose model specialized in software engineering. It covers code generation and editing, debugging, multi-step reasoning, tool use and function calling, agentic coding, and conversational programming assistance.

The JetBrains team positions Mellum2 as a "focal model": a fast, specialized component inside larger AI systems, not a standalone replacement for frontier models.

Architecture

Mellum2 uses a Mixture-of-Experts (MoE) architecture with 12B total parameters and 2.5B active parameters per token. In MoE models, only a subset of parameters runs on each token. Here, the model has 64 experts and activates 8 per token. This keeps per-token compute equivalent to a 2.5B dense model, while the total parameter count provides higher capacity for specialization.

Key architectural details:

- Layers: 28
- Hidden size: 2304
- MoE experts: 64 total, 8 activated per token
- Attention: Grouped-Query Attention (GQA) with 32 query heads and 4 KV heads
- Sliding Window Attention (SWA): Applied to three of every four layers, with a window size of 1,024. Full attention runs on the remaining layer.
- Context length: 131,072 tokens
- Multi-Token Prediction (MTP) head: Serves as an auxiliary pre-training objective and as a built-in draft model for speculative decoding
- Precision: bfloat16
- Vocabulary size: 98,304

The model handles natural language and code. It is not multimodal: there is no image or video input.

Pre-Training

Pre-training spans approximately 10.6 trillion tokens through a three-phase curriculum. The data mixture progressively shifts from diverse web content toward curated code and mathematical content across the three phases. Training used the Muon optimizer under FP8 hybrid precision, with a Warmup-Hold-Decay learning rate schedule that decays linearly to zero.

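To make the Warmup-Hold-Decay schedule mentioned above concrete, here is a minimal sketch of its general shape: linear warmup, a constant hold at the peak, then linear decay to zero. JetBrains does not state the peak learning rate or the phase lengths here, so the function name `whd_lr` and every numeric value passed to it are illustrative assumptions, not the settings used for Mellum2.

```python
def whd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_steps: int, decay_steps: int) -> float:
    """Learning rate at `step` under a Warmup-Hold-Decay schedule.

    All arguments are hypothetical knobs; the article does not give the
    values used to train Mellum2.
    """
    if warmup_steps < 0 or decay_steps <= 0:
        raise ValueError("warmup_steps must be >= 0 and decay_steps > 0")
    if warmup_steps + decay_steps > total_steps:
        raise ValueError("warmup and decay phases must fit in total_steps")
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")

    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to the peak.
        return peak_lr * step / warmup_steps
    if step < decay_start:
        # Hold: stay at the peak learning rate.
        return peak_lr
    # Decay: fall linearly from the peak to exactly 0 at the last step.
    return peak_lr * (total_steps - step) / decay_steps
```

With illustrative values such as `whd_lr(step, total_steps=1000, peak_lr=1e-3, warmup_steps=100, decay_steps=200)`, the rate climbs for the first 100 steps, stays flat until step 800, and reaches zero at step 1000.
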
After pre-training, the base model's context window was extended to 128K tokens using a layer-selective YaRN method before post-training began.

The Model Family

The JetBrains team released six checkpoints covering the full training pipeline:

| Checkpoint | Description |
|---|---|
| Mellum2-12B-A2.5B-Base-Pretrain | Base checkpoint before long-context extension |
| Mellum2-12B-A2.5B-Base | Final base model after context extension |
| Mellum2-12B-A2.5B-Instruct-SFT | Supervised fine-tuned instruction checkpoint |
| Mellum2-12B-A2.5B-Thinking-SFT | Supervised thinking checkpoint |
| Mellum2-12B-A2.5B-Instruct | RL-tuned instruction model |
| Mellum2-12B-A2.5B-Thinking | RL-tuned thinking model |

Post-training follows two stages: supervised fine-tuning (SFT), then reinforcement learning with verifiable rewards (RLVR) on math, executable coding, tool use, instruction following, reasoning, and knowledge tasks.

The Instruct variant answers directly, without an externalized chain of thought. Use it for low-latency tasks: direct answers, tool use, and instruction following.

The Thinking variant emits an explicit reasoning trace before its final answer. Use it for complex debugging, multi-step planning, or agentic flows where step-by-step reasoning matters.

Benchmark Results

All numbers below are self-reported by JetBrains. The comparison set is open-weight models in the 4B–14B range.

Coding:

| Benchmark | Mellum2 Instruct | Qwen3.5 4B | Qwen3.5 9B | Ministral 3 14B | OLMo-3 7B | Seed-Coder 8B |
|---|---|---|---|---|---|---|
| LiveCodeBench v6 | 37.2 | 51.0 | 63.7 | 42.4 | 28.2 | 28.1 |
| EvalPlus | 78.4 | 69.4 | 71.8 | 74.1 | 67.3 | 73.8 |
| MultiPL-E | 67.1 | 51.0 | 67.1 | 71.5 | 36.1 | 77.0 |

Tool Use:

| Benchmark | Mellum2 Instruct | Qwen3.5 4B | Qwen3.5 9B | Ministral 3 14B | OLMo-3 7B |
|---|---|---|---|---|---|
| BFCL v3 | 66.3 | 64.1 | 70.5 | 52.7 | 41.9 |
| BFCL v4 | 44.2 | 52.0 | 60.6 | 38.8 | 19.8 |

Math:

| Benchmark | Mellum2 Instruct | Qwen3.5 4B | Qwen3.5 9B | Ministral 3 14B | OLMo-3 7B |
|---|---|---|---|---|---|
| AIME 2025+2026 | 41.7 | 38.3 | 58.3 | 33.3 | 40.0 |
| GSM-Plus | 80.5 | 85.2 | 87.9 | 86.6 | 85.8 |

Knowledge and Conversational:

| Benchmark | Mellum2 Instruct | Qwen3.5 4B | Qwen3.5 9B | Ministral 3 14B | OLMo-3 7B |
|---|---|---|---|---|---|
| MMLU-Redux | 78.1 | 87.5 | 91.1 | 85.9 | 71.8 |
| GPQA Diamond | 40.9 | 76.8 | 79.8 | 58.6 | 40.9 |
| IFEval | 75.8 | 82.1 | 83.9 | 67.3 | 83.2 |
| MixEval | 62.2 | 65.9 | 71.1 | 71.2 | 59.4 |

Benchmark notes:

- EvalPlus is the mean of HumanEval+ and MBPP+
- AIME is the mean of AIME 2025 and AIME 2026 (30 questions each)
- BFCL v4 is the macro-average of five subtasks: v1, v2, v3, web search, memory
- Seed-Coder 8B does not support native tool calling; BFCL scores are not listed for it

Use Cases

JetBrains identifies four production scenarios where Mellum2's latency and efficiency profile is relevant:

Routing and orchestration: In a multi-model system, a router analyzes incoming prompts and selects the appropriate model or tool for each task. Mellum2's low per-token compute makes it suitable for this high-frequency classification step.

Low-latency RAG pipelines: Retrieval-Augmented Generation (RAG) systems retrieve relevant context, summarize it, and generate a response. Mellum2 handles retrieval summarization at lower latency than larger dense models.

Sub-agents in complex workflows: Agent pipelines break tasks into steps: context gathering, planning, validation, and execution. Mellum2 can handle repetitive or latency-sensitive steps instead of routing every step through a single large frontier model.

Private and local deployment: The Apache 2.0 license permits self-hosting. Engineers can run Mellum2 on their own infrastructure, keeping code and data under their control.

Strengths and Limitations

Strengths:

- MoE design activates only 2.5B of 12B parameters per token, so per-token compute is equivalent to a 2.5B dense model
- MTP head enables speculative decoding without a separate draft model
- 131,072 token context window
- Full checkpoint set released: base pretrain, base, SFT, and RL-tuned variants for both Instruct and Thinking
- Apache 2.0 license: permits commercial use, self-hosting, and fine-tuning
- Strong EvalPlus (78.4) and BFCL v3 (66.3) scores relative to 4B–14B comparisons
- vLLM support, including optional tool-calling via --tool-call-parser hermes

Limitations:

- Text and code only: no image or multimodal input
- LiveCodeBench v6 (37.2) trails Qwen3.5 4B (51.0), Qwen3.5 9B (63.7), and Ministral 3 14B (42.4)
- GPQA Diamond (40.9) and MMLU-Redux (78.1) are below most models in the comparison set
- GSM-Plus (80.5) is below all comparable models listed
- Not designed for frontier-level tasks: JetBrains explicitly positions Mellum2 as a component model

Getting Started

Serve Mellum2 with vLLM:

    pip install vllm
    vllm serve JetBrains/Mellum2-12B-A2.5B-Instruct --max-model-len 131072

With tool calling enabled:

    vllm serve JetBrains/Mellum2-12B-A2.5B-Instruct \
      --max-model-len 131072 \
      --enable-auto-tool-choice \
      --tool-call-parser hermes

Using the Hugging Face Transformers library:

    from transformers import AutoTokenizer, AutoModelForCausalLM

    # Load the tokenizer and the RL-tuned Instruct checkpoint.
    tokenizer = AutoTokenizer.from_pretrained("JetBrains/Mellum2-12B-A2.5B-Instruct")
    model = AutoModelForCausalLM.from_pretrained("JetBrains/Mellum2-12B-A2.5B-Instruct")

    # Chat messages are a list of role/content dictionaries.
    messages = [
        {"role": "user", "content": "Write a Python function to reverse a string."},
    ]

    # Render the chat template and tokenize in one step.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=512)

    # Decode only the newly generated tokens, skipping the prompt.
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Check out the Model Weights and Technical details (https://blog.jetbrains.com/ai/2026/06/mellum2-goes-open-source-a-fast-model-for-ai-workflows/).