{"slug": "meet-north-mini-code-coheres-30b-open-weight-mixture-of-experts-model-with-3b", "title": "Meet ‘North Mini Code’: Cohere’s 30B Open-Weight Mixture-of-Experts Model With 3B Active Parameters for Agentic Coding", "summary": "Cohere AI released North Mini Code, a 30-billion-parameter open-weight mixture-of-experts model with 3 billion active parameters per token, optimized for code generation, agentic software engineering, and terminal tasks. The model, available under Apache 2.0 on Hugging Face and through the Cohere API, targets sovereign AI by enabling self-hosted coding capabilities without large GPU clusters. Cohere reports the model achieves up to 2.8x higher output throughput and a 30% edge in inter-token latency compared to similar models.", "body_md": "This week, the Cohere AI team shipped its first developer-facing coding model, [North Mini Code](https://huggingface.co/CohereLabs/North-Mini-Code-1.0). North Mini Code is open-weight and aimed at software engineers. It is a mixture-of-experts (MoE) model with 30B total parameters. Only 3B of those parameters activate per token.\n\nThe release is positioned around “sovereign” AI. The idea is simple: run capable models on your own terms. Small, efficient coding models let teams self-host without large GPU clusters. North Mini Code targets that gap directly.\n\n**North Mini Code**\n\nNorth Mini Code is a 30B-A3B parameter model. The A3B stands for three billion active parameters per forward pass. Cohere optimized it for **three jobs: code generation, agentic software engineering, and terminal tasks**. The model is text-in, text-out. There is no image or video input.\n\nThe context window is 256K tokens. Maximum output length is 64K tokens. Cohere lists a minimum hardware bar of one H100 at FP8. Weights ship under Apache 2.0 on Hugging Face. 
You can also reach it through the Cohere API, Model Vault, and OpenRouter.\n\n| Field | North-Mini-Code-1.0 |\n|---|---|\n| License | Apache 2.0 |\n| Model size | 30B total; 3B active |\n| Context length | 256K total; 64K max generation |\n| Optimized for | Code generation, agentic software engineering, terminal tasks |\n| Availability | Hugging Face, Cohere API, Cohere Model Vault, OpenRouter |\n| Hardware (minimum) | 1× H100 @ FP8 |\n\n**The Architecture**\n\nNorth Mini Code is a decoder-only Transformer with sparse MoE layers. Its attention interleaves two types in a 3:1 ratio. Sliding-window attention uses RoPE for positions. Global attention uses no positional embeddings at all. The feed-forward block holds 128 experts. Eight experts activate per token. Each expert is an FFN with SwiGLU activation.\n\nThe router applies a sigmoid before top-k selection. A single dense layer sits before the sparse layers. That mix keeps active compute small while widening total capacity. Cohere released the weights in BF16.\n\nPost-training ran in two phases. First came two-stage cascaded supervised fine-tuning (SFT). Then came reinforcement learning with verifiable rewards (RLVR). The post-training focused on agentic coding. The model also supports interleaved thinking and native tool use.\n\n**Benchmarks**\n\nCohere reports a 33.4 on the Artificial Analysis Coding Index. It describes this as a competitive position among similarly sized models. The company evaluated on SWE-Bench Verified, SWE-Bench Pro, and Terminal-Bench v2. It also used Terminal-Bench Hard, SciCode, and LiveCodeBench v6.\n\nThe methodology is specific. SWE-Bench used the SWE-agent harness v1.1.0. Terminal-Bench v2 used a simple ReAct harness with one terminal tool. Terminal-Bench Hard used the Terminus-2 harness. Each benchmark ran with three seeds, then averaged. 
Sampling used temperature 1.0 and top_p 0.95.\n\n**The Speed**\n\nIn Cohere’s internal tests, North Mini Code reached up to 2.8x higher output throughput than Devstral Small 2. That held at identical concurrency and hardware. It also showed a 30% edge in inter-token latency. Time-to-first-token was closer between the two. Devstral Small 2 kept a slight TTFT lead.\n\n| Metric | North Mini Code vs Devstral Small 2 |\n|---|---|\n| Output throughput | Up to 2.8x higher (same concurrency and hardware) |\n| Inter-token latency | 30% better for North Mini Code |\n| Time-to-first-token | Slightly behind Devstral Small 2 |\n\n**Use Cases With Examples**\n\nCohere built North Mini Code for agentic workflows.\n\n**Three patterns stand out in its own framing**:\n\n- **Sub-agent orchestration**: A main agent delegates subtasks to helpers. Example: one agent writes unit tests while another fixes failing code.\n- **Systems architecture mapping**: The model reads a repository and sketches its structure. Example: tracing how services call each other before a large refactor.\n- **Code reviews**: The model scans a diff for problems. Example: flagging an unguarded null dereference before a merge.\n\nTerminal tasks fit the model as well. Example: listing files, running a build, then parsing the output for errors.\n\n**Getting Started**\n\nThe fastest path is Hugging Face Transformers. Install Transformers from source for this model. 
Recommended sampling is temperature 1.0 and top_p 0.95.\n\n```\n# Install Transformers from source (required for this model):\n# pip install \"git+https://github.com/huggingface/transformers.git\"\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\n\nmodel_id = \"CohereLabs/North-Mini-Code-1.0\"\ntokenizer = AutoTokenizer.from_pretrained(model_id)\nmodel = AutoModelForCausalLM.from_pretrained(model_id, device_map=\"auto\")\n\nprompt = \"Write a python program to check if a string is a palindrome or not.\"\nmessages = [{\"role\": \"user\", \"content\": prompt}]\n\n# return_dict=True yields a dict (input_ids + attention_mask) so **inputs unpacks cleanly\ninputs = tokenizer.apply_chat_template(\n    messages,\n    tokenize=True,\n    add_generation_prompt=True,\n    return_dict=True,\n    return_tensors=\"pt\",\n).to(model.device)\n\ngen_tokens = model.generate(\n    **inputs,\n    max_new_tokens=1024,\n    do_sample=True,\n    temperature=1.0,\n    top_p=0.95,\n)\n\n# Decode only the newly generated tokens, not the prompt\noutput = tokenizer.decode(gen_tokens[0][inputs[\"input_ids\"].shape[-1]:])\nprint(output)\n```\n\nFor serving, vLLM works. You need vLLM main plus Cohere’s melody library. Accurate response parsing depends on it.\n\n```\nuv pip install \"git+https://github.com/vllm-project/vllm.git\"\nuv pip install \"cohere_melody>=0.9.0\"\n\nvllm serve CohereLabs/North-Mini-Code-1.0 \\\n  -tp 2 \\\n  --max-model-len 320000 \\\n  --tool-call-parser cohere_command4 \\\n  --reasoning-parser cohere_command4 \\\n  --enable-auto-tool-choice\n```\n\nQuantized builds exist for Ollama, LM Studio, and llama.cpp. You can also try the model before downloading. 
Cohere offers free access through OpenCode and a hosted Hugging Face Space.\n\n**Key Takeaways**\n\n- Cohere’s first coding model, North Mini Code, is a 30B mixture-of-experts that activates just 3B parameters per token.\n- It runs on a single H100 at FP8, with 256K context and 64K max output.\n- Weights ship under Apache 2.0, though the Hugging Face card adds a non-commercial note.\n- Cohere’s official release reports 33.4 on the Artificial Analysis Coding Index, and up to 2.8x throughput over Devstral Small 2.\n- Built for agentic coding: sub-agent orchestration, architecture mapping, and code reviews, with native tool use.", "url": "https://wpnews.pro/news/meet-north-mini-code-coheres-30b-open-weight-mixture-of-experts-model-with-3b", "canonical_source": "https://www.marktechpost.com/2026/06/11/meet-north-mini-code-coheres-30b-open-weight-mixture-of-experts-model-with-3b-active-parameters-for-agentic-coding/", "published_at": "2026-06-11 08:33:27+00:00", "updated_at": "2026-06-11 18:19:22.777214+00:00", "lang": "en", "topics": ["large-language-models", "ai-products", "ai-tools", "ai-infrastructure", "ai-agents"], "entities": ["Cohere", "North Mini Code", "Hugging Face", "OpenRouter", "Cohere API", "Cohere Model Vault", "Apache 2.0", "H100"], "alternates": {"html": "https://wpnews.pro/news/meet-north-mini-code-coheres-30b-open-weight-mixture-of-experts-model-with-3b", "markdown": "https://wpnews.pro/news/meet-north-mini-code-coheres-30b-open-weight-mixture-of-experts-model-with-3b.md", "text": "https://wpnews.pro/news/meet-north-mini-code-coheres-30b-open-weight-mixture-of-experts-model-with-3b.txt", "jsonld": "https://wpnews.pro/news/meet-north-mini-code-coheres-30b-open-weight-mixture-of-experts-model-with-3b.jsonld"}}