# Meet ‘North Mini Code’: Cohere’s 30B Open-Weight Mixture-of-Experts Model With 3B Active Parameters for Agentic Coding

> Source: <https://www.marktechpost.com/2026/06/11/meet-north-mini-code-coheres-30b-open-weight-mixture-of-experts-model-with-3b-active-parameters-for-agentic-coding/>
> Published: 2026-06-11 08:33:27+00:00

This week, the Cohere AI team shipped its first developer-facing coding model, ‘[North Mini Code](https://huggingface.co/CohereLabs/North-Mini-Code-1.0)’. North Mini Code is open-weight and aimed at software engineers. It is a mixture-of-experts (MoE) model with 30B total parameters, of which only 3B activate per token.

The release is positioned around “sovereign” AI. The idea is simple: run capable models on your own terms. Small, efficient coding models let teams self-host without large GPU clusters. North Mini Code targets that gap directly.

**North Mini Code**

North Mini Code is a 30B-A3B parameter model. The A3B stands for three billion active parameters per forward pass. Cohere optimized it for **three jobs: code generation, agentic software engineering, and terminal tasks**. The model is text-in, text-out. There is no image or video input.

The context window is 256K tokens. Maximum output length is 64K tokens. Cohere lists a minimum hardware bar of one H100 at FP8. Weights ship under Apache 2.0 on Hugging Face. You can also reach it through the Cohere API, Model Vault, and OpenRouter.

| Field | North-Mini-Code-1.0 |
|---|---|
| License | Apache 2.0 |
| Model size | 30B total; 3B active |
| Context length | 256K total; 64K max generation |
| Optimized for | Code generation, agentic software engineering, terminal tasks |
| Availability | Hugging Face, Cohere API, Cohere Model Vault, OpenRouter |
| Hardware (minimum) | 1× H100 @ FP8 |
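
The single-H100 figure can be sanity-checked with rough arithmetic. The sketch below is not from Cohere: it assumes 1 byte per parameter at FP8, 2 bytes at BF16, and an 80 GB H100, and it ignores KV cache and activation memory. Note that all 30B parameters must be resident in memory even though only 3B are used per token.

```python
# Back-of-the-envelope weight memory for a 30B-parameter model.
# Assumptions (not from Cohere): 1 byte/param at FP8, 2 bytes/param at BF16,
# an 80 GB H100, and no allowance for KV cache or activations.
TOTAL_PARAMS = 30e9    # from the article
ACTIVE_PARAMS = 3e9    # from the article
H100_MEMORY_GB = 80    # assumed H100 variant


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


fp8_gb = weight_memory_gb(TOTAL_PARAMS, 1)
bf16_gb = weight_memory_gb(TOTAL_PARAMS, 2)
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"FP8 weights:  {fp8_gb:.0f} GB, headroom on one H100: {H100_MEMORY_GB - fp8_gb:.0f} GB")
print(f"BF16 weights: {bf16_gb:.0f} GB, headroom on one H100: {H100_MEMORY_GB - bf16_gb:.0f} GB")
print(f"Active share of parameters per token: {active_fraction:.0%}")
```

Under these assumptions, the FP8 weights leave far more room for the KV cache than the BF16 weights do, which is consistent with FP8 being listed as the single-GPU minimum.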

**The Architecture**

North Mini Code is a decoder-only Transformer with sparse MoE layers. Its attention interleaves two types in a 3:1 ratio. Sliding-window attention uses RoPE for positions. Global attention uses no positional embeddings at all. The feed-forward block holds 128 experts. Eight experts activate per token. Each expert is an FFN with SwiGLU activation.

The router applies a sigmoid before top-k selection. A single dense layer sits before the sparse layers. That mix keeps active compute small while widening total capacity. Cohere released the weights in BF16.
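
To make the routing step concrete, here is a minimal pure-Python sketch of sigmoid-then-top-k expert selection with 128 experts and 8 active. It is an illustration, not Cohere's implementation: the random logits stand in for a learned router, and renormalizing the selected gates so they sum to 1 is an assumption the source does not confirm.

```python
import math
import random

NUM_EXPERTS = 128  # from the article
TOP_K = 8          # from the article


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def route(logits: list[float], top_k: int = TOP_K) -> dict[int, float]:
    """Return {expert_index: gate_weight} for the top_k highest-scoring experts.

    Each expert is scored by an independent sigmoid rather than a softmax
    across experts. Renormalizing the chosen gates is an assumption.
    """
    scores = [sigmoid(z) for z in logits]
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}


# Hypothetical router logits for a single token (random stand-ins).
rng = random.Random(0)
router_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
gates = route(router_logits)

print(f"{len(gates)} of {NUM_EXPERTS} experts selected")
print(f"gate weights sum to {sum(gates.values()):.6f}")
```

Only the selected experts' FFNs would run for this token, which is how the active parameter count stays far below the total.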

Post-training ran in two phases. First came two-stage cascaded supervised fine-tuning (SFT). Then came reinforcement learning with verifiable rewards (RLVR). The post-training focused on agentic coding. The model also supports interleaved thinking and native tool use.

**Benchmarks**

Cohere reports a score of 33.4 on the Artificial Analysis Coding Index, which it describes as a competitive position among similarly sized models. The company evaluated on SWE-Bench Verified, SWE-Bench Pro, Terminal-Bench v2, Terminal-Bench Hard, SciCode, and LiveCodeBench v6.

The methodology is specific. SWE-Bench used the SWE-agent harness v1.1.0. Terminal-Bench v2 used a simple ReAct harness with one terminal tool. Terminal-Bench Hard used the Terminus-2 harness. Each benchmark was run with three seeds and the results were averaged. Sampling used temperature 1.0 and top_p 0.95.
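
For readers unfamiliar with the sampling settings, the sketch below shows what temperature 1.0 and top_p 0.95 mean in generic terms. It is a textbook version of nucleus sampling written for illustration. The helper names are hypothetical, and it is not the code used by any of the harnesses named above.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature must be > 0."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_indices(probs, top_p=0.95):
    """Smallest set of most-likely tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


def sample_top_p(logits, temperature=1.0, top_p=0.95, rng=random):
    """Sample one token index from the nucleus, weighted by probability."""
    probs = softmax(logits, temperature)
    kept = nucleus_indices(probs, top_p)
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


# Toy four-token vocabulary with made-up logits.
print(sample_top_p([2.0, 1.0, 0.5, -1.0], rng=random.Random(0)))
```

At temperature 1.0 the logits are left unscaled, so the only truncation comes from dropping the least likely 5% tail of the distribution.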

**The Speed**

In Cohere’s internal tests, North Mini Code reached up to 2.8x higher output throughput than Devstral Small 2 at identical concurrency and hardware. It also showed a 30% edge in inter-token latency. Time-to-first-token (TTFT) was closer between the two, with Devstral Small 2 keeping a slight lead.

| Metric | North Mini Code vs Devstral Small 2 |
|---|---|
| Output throughput | Up to 2.8x higher (same concurrency and hardware) |
| Inter-token latency | 30% better for North Mini Code |
| Time-to-first-token | Slightly behind Devstral Small 2 |
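
The three metrics in the table can be computed from per-token arrival times. The sketch below uses one common set of definitions and made-up timestamps. Cohere's exact measurement method is not described in the source, so treat this as an illustration of the terms rather than a reproduction of the benchmark.

```python
def latency_metrics(request_start: float, token_times: list[float]) -> dict[str, float]:
    """Compute serving metrics for one request.

    request_start: wall-clock time the request was sent, in seconds.
    token_times:   wall-clock arrival time of each output token, in seconds.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two output tokens")
    # Time-to-first-token: delay before any output appears.
    ttft = token_times[0] - request_start
    # Inter-token latency: mean gap between consecutive output tokens.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    inter_token_latency = sum(gaps) / len(gaps)
    # Output throughput: tokens produced per second over the whole request.
    throughput = len(token_times) / (token_times[-1] - request_start)
    return {
        "ttft_s": ttft,
        "inter_token_latency_s": inter_token_latency,
        "output_tokens_per_s": throughput,
    }


# Illustrative timestamps only, not measurements of any model.
metrics = latency_metrics(0.0, [0.5, 0.6, 0.7, 0.8, 0.9])
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

These definitions show why a model can trail on TTFT yet lead on throughput: TTFT is dominated by prompt processing, while throughput over long generations is dominated by the per-token gap.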

**Use Cases With Examples**

Cohere built North Mini Code for agentic workflows.

**Three patterns stand out in its own framing**:

- **Sub-agent orchestration**: A main agent delegates subtasks to helpers. Example: one agent writes unit tests while another fixes failing code.
- **Systems architecture mapping**: The model reads a repository and sketches its structure. Example: tracing how services call each other before a large refactor.
- **Code reviews**: The model scans a diff for problems. Example: flagging an unguarded null dereference before a merge.

Terminal tasks fit the model as well. Example: listing files, running a build, then parsing the output for errors.
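
The terminal example above can be pictured as a single tool that an agent loop calls repeatedly. The sketch below is a hypothetical tool function, not part of any Cohere harness: it runs a command, captures the output, and picks out lines containing "error". A small Python one-liner stands in for a real build so the example runs anywhere.

```python
import subprocess
import sys


def run_terminal_command(argv: list[str], timeout_s: float = 60.0) -> dict:
    """Run a command and return its exit code, combined output, and error lines.

    Hypothetical tool shape for illustration. The "error" substring match is
    a deliberately naive way to parse build output.
    """
    completed = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    combined = completed.stdout + completed.stderr
    error_lines = [line for line in combined.splitlines() if "error" in line.lower()]
    return {
        "exit_code": completed.returncode,
        "output": combined,
        "error_lines": error_lines,
    }


# Stand-in for a failing build: prints a progress line, an error, then exits 1.
fake_build = [
    sys.executable,
    "-c",
    "import sys; print('compiling main.c'); "
    "print('main.c:3: error: missing semicolon', file=sys.stderr); sys.exit(1)",
]
result = run_terminal_command(fake_build)
print(result["exit_code"], result["error_lines"])
```

In an agentic setup, the returned dictionary would be serialized and handed back to the model as the tool result, and the model would decide on the next command.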

**Getting Started**

The fastest path is Hugging Face Transformers. Install Transformers from source for this model. Recommended sampling is temperature 1.0 and top_p 0.95.

```
# Install Transformers from source (required for this model):
# pip install "git+https://github.com/huggingface/transformers.git"
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "CohereLabs/North-Mini-Code-1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Write a python program to check if a string is a palindrome or not."
messages = [{"role": "user", "content": prompt}]

# return_dict=True yields a dict (input_ids + attention_mask) so **inputs unpacks cleanly
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

gen_tokens = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
)

# Decode only the newly generated tokens, not the prompt
output = tokenizer.decode(gen_tokens[0][inputs["input_ids"].shape[-1]:])
print(output)
```

For serving, vLLM works. You need vLLM built from main plus Cohere’s melody library (`cohere_melody`), which accurate response parsing depends on.

```
uv pip install "git+https://github.com/vllm-project/vllm.git"
uv pip install "cohere_melody>=0.9.0"

vllm serve CohereLabs/North-Mini-Code-1.0 \
  -tp 2 \
  --max-model-len 320000 \
  --tool-call-parser cohere_command4 \
  --reasoning-parser cohere_command4 \
  --enable-auto-tool-choice
```
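
Once the server is running, it can be queried over vLLM's OpenAI-compatible HTTP API. The sketch below uses only the Python standard library. The base URL assumes vLLM's default port on the local machine, and the helper names and `max_tokens` value are illustrative choices, not taken from Cohere's documentation.

```python
import json
import urllib.request

MODEL_ID = "CohereLabs/North-Mini-Code-1.0"
BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM default host and port


def build_chat_payload(prompt: str, max_tokens: int = 1024) -> dict:
    """Build a chat-completions request body with the recommended sampling settings."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
        "max_tokens": max_tokens,
    }


def chat(prompt: str, base_url: str = BASE_URL) -> str:
    """POST the request to a running server and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Print the payload without sending it; call chat(...) only when a server is up.
print(json.dumps(build_chat_payload("Write a function that reverses a string."), indent=2))
```

Calling `chat("...")` requires the `vllm serve` process above to be reachable at `BASE_URL`.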

Quantized builds exist for Ollama, LM Studio, and llama.cpp. You can also try the model before downloading. Cohere offers free access through OpenCode and a hosted Hugging Face Space.

**Key Takeaways**

- Cohere’s first coding model, North Mini Code, is a 30B mixture-of-experts that activates just 3B parameters per token.
- It runs on a single H100 at FP8, with 256K context and 64K max output.
- Weights ship under Apache 2.0, though the Hugging Face card adds a non-commercial note.
- Cohere’s official release reports 33.4 on the Artificial Analysis Coding Index and up to 2.8x the output throughput of Devstral Small 2.
- It is built for agentic coding: sub-agent orchestration, architecture mapping, and code reviews, with native tool use.

Check out the [model weights and technical details](https://huggingface.co/CohereLabs/North-Mini-Code-1.0).

