# DeepReinforce Releases Ornith-1.0: An Open-Source Coding Model Family That Learns Its Own RL Scaffolds

> Source: <https://www.marktechpost.com/2026/06/25/deepreinforce-releases-ornith-1-0-an-open-source-coding-model-family-that-learns-its-own-rl-scaffolds/>
> Published: 2026-06-25 17:11:37+00:00

DeepReinforce has released ** Ornith-1.0**, an open-source model family built for agentic coding. The lineup spans four sizes, from a 9B dense model to a 397B mixture-of-experts flagship. Every checkpoint ships under the MIT license on Hugging Face. The models are post-trained on top of pretrained Gemma 4 and Qwen 3.5.

Most coding agents pair a model with a fixed, human-designed harness. Ornith-1.0 instead learns to write its own. The DeepReinforce research team reports state-of-the-art results among open models of comparable size.

**TL;DR**

- Ornith-1.0 ships in 9B, 31B, 35B-MoE, and 397B-MoE sizes under MIT, built on Gemma 4 and Qwen 3.5.
- The model learns its own scaffold during RL, jointly optimizing the harness and the solution.
- Ornith-1.0-397B tops Claude Opus 4.7 on both headline benchmarks, but not Opus 4.8 or the larger GLM-5.2-744B.
- Three layers — fixed trust boundary, deterministic monitor, frozen LLM judge — guard against reward hacking.

**What is Ornith-1.0?**

Ornith-1.0 is a set of reasoning models tuned for coding agents. The variants are 9B Dense, 31B Dense, 35B MoE, and 397B MoE. The 35B model is mixture-of-experts and activates roughly 3B parameters per token. FP8 and GGUF builds are also published for faster local serving.

Each model is a reasoning model. Replies open with a `<think>`

block before the final answer. The serving recipes enable a reasoning parser, so that trace returns in a separate `reasoning_content`

field. The models also emit well-formed tool calls for agent loops.

Deployment is straightforward. The 9B model is about 19GB in bf16 and serves on a single 80GB GPU. Serving recipes target vLLM, SGLang, and Transformers. Each model exposes an OpenAI-compatible endpoint. Standard agent frameworks therefore work without code changes.

**Interactive Explainer**

**The Self-Scaffolding Idea**

Most coding agents rely on a scaffold, also called a **harness**. A scaffold wraps the model with memory, tools, error handling, and orchestration logic. AI teams usually hand-design one scaffold per task category.

Ornith-1.0 treats the scaffold as a learnable object instead. During reinforcement learning, the scaffold co-evolves with the model’s policy. **Each RL step runs in two stages**.

**First**, the model reads the task and its previous scaffold. It then proposes a refined scaffold. **Second**, it uses that scaffold and the task to generate a solution rollout. Reward from the rollout flows back to both stages.

So the model is optimized to author orchestration, not just answers. Over training, higher-reward scaffolds are mutated and selected automatically. Per-task strategies emerge without hand-engineered harness design.

Training also runs asynchronously, using a pipeline-RL setup. A staleness weight downweights older, off-policy tokens and drops them past a threshold. The optimization uses a token-level GRPO objective.

**Guarding Against Reward Hacking**

Letting a model write its own scaffold invites reward hacking. A scaffold could read visible test files and hardcode expected outputs. It could also copy an oracle solution sitting in the environment. DeepReinforce team describes three defense layers.

- The outer trust boundary is fixed and immutable. The environment, tool surface, and test isolation stay outside the model’s reach. The model evolves only its inner policy scaffold.
- A deterministic monitor flags banned actions. Reading withheld paths or editing verification scripts earns zero reward. Those trajectories are excluded from the advantage computation.
- A frozen LLM judge acts as a veto. It sits on top of the verifier, not as the primary reward.

**Benchmark**

DeepReinforce reports vendor numbers across several agentic coding benchmarks. At flagship scale, Ornith-1.0-397B posts 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified. On SWE-Bench Verified, that 82.4 trails only Claude Opus 4.8 (87.6) among the listed models. On Terminal-Bench 2.1, the picture is more mixed.

Ornith-1.0-397B beats Claude Opus 4.7 (70.3) on Terminal-Bench 2.1. But it trails Claude Opus 4.8 (85) and the larger GLM-5.2-744B (81.0). So the ‘state-of-the-art’ claim is scoped to open models of comparable size.

The smaller models carry the efficiency case. The 35B model scores 64.2 on Terminal-Bench 2.1, above Qwen 3.5-397B’s 53.5. The 9B model reaches 43.1 on Terminal-Bench 2.1 and 69.4 on SWE-Bench Verified.

| Benchmark | Ornith-1.0-397B | Qwen3.5-397B | Qwen3.7-Max | GLM-5.2-744B | Minimax-M3-428B | DeepSeek-V4-Pro-1.6T | Claude Opus 4.7 | Claude Opus 4.8 |
|---|---|---|---|---|---|---|---|---|
| Terminal-Bench 2.1 | 77.5 | 53.5 | 73.5 | 81.0 | 64 | 64 | 70.3 | 85 |
| SWE-Bench Verified | 82.4 | 76.4 | 80.4 | – | – | 80.6 | 80.8 | 87.6 |
| SWE-Bench Pro | 62.2 | 51.6 | 60.6 | 62.1 | 59 | 55.4 | 64.3 | 69.2 |
| SWE-Bench Multilingual | 78.9 | 69.3 | 78.3 | – | – | 76.2 | – | – |
| NL2Repo | 48.2 | 36.8 | 47.2 | 48.9 | 42.1 | – | – | 69.7 |
| ClawEval Avg | 77.1 | 70.7 | 65.2 | – | – | 75.8 | 78.2 | – |

**Use Cases and a Quick Start**

The models target terminal-native coding agents and repository-scale work. Practical fits include multi-file refactors, bug localization, and test-driven patches. The 9B model suits edge or single-GPU setups where latency and cost matter. The 397B model targets maximum accuracy on long, multi-step tasks.

For example, a dev can run the 9B model locally to triage a failing test suite. A platform team can self-host the 397B model for an internal coding agent.

Serving is a one-liner with vLLM:

```
vllm serve deepreinforce-ai/Ornith-1.0-9B \
    --served-model-name Ornith-1.0-9B \
    --max-model-len 262144 \
    --enable-auto-tool-choice --tool-call-parser qwen3_xml \
    --reasoning-parser qwen3 \
    --trust-remote-code
```

Then call it with any OpenAI client:

``` python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Ornith-1.0-9B",
    messages=[{"role": "user", "content": "Write a Python is_prime(n)."}],
    temperature=0.6, top_p=0.95,
)
msg = resp.choices[0].message
print(getattr(msg, "reasoning_content", None))  # the <think> trace
print(msg.content)                              # the final answer
```

The reasoning trace returns in `reasoning_content`

, with the answer in `content`

. Recommended sampling is `temperature=0.6`

, `top_p=0.95`

, `top_k=20`

. The model also plugs into OpenHands, OpenClaw, and OpenCode.

Check out the ** Model Weights** and

**.**

[Technical details](https://deep-reinforce.com/ornith_1_0.html)**Also, feel free to follow us on**

**and don’t forget to join our**[Twitter](https://x.com/intent/follow?screen_name=marktechpost)

**and Subscribe to**

[150k+ML SubReddit](https://www.reddit.com/r/machinelearningnews/)**. Wait! are you on telegram?**

[our Newsletter](https://www.aidevsignals.com/)

[now you can join us on telegram as well.](https://t.me/machinelearningresearchnews)Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? [Connect with us](https://forms.gle/wbash1wF6efRj8G58)
