DeepReinforce Releases Ornith-1.0: An Open-Source Coding Model Family That Learns Its Own RL Scaffolds

DeepReinforce released Ornith-1.0, an open-source family of coding models that learn their own reinforcement learning scaffolds, achieving state-of-the-art results among open models. The lineup includes four sizes from 9B to 397B parameters, built on Gemma 4 and Qwen 3.5, and is available under the MIT license. The 397B variant outperforms Claude Opus 4.7 on key benchmarks but trails Opus 4.8 and GLM-5.2-744B.

DeepReinforce has released Ornith-1.0, an open-source model family built for agentic coding. The lineup spans four sizes, from a 9B dense model to a 397B mixture-of-experts flagship. Every checkpoint ships under the MIT license on Hugging Face. The models are post-trained on top of pretrained Gemma 4 and Qwen 3.5.

Most coding agents pair a model with a fixed, human-designed harness. Ornith-1.0 instead learns to write its own. The DeepReinforce research team reports state-of-the-art results among open models of comparable size.

TL;DR

- Ornith-1.0 ships in 9B, 31B, 35B-MoE, and 397B-MoE sizes under MIT, built on Gemma 4 and Qwen 3.5.
- The model learns its own scaffold during RL, jointly optimizing the harness and the solution.
- Ornith-1.0-397B tops Claude Opus 4.7 on both headline benchmarks, but trails Opus 4.8 on both and the larger GLM-5.2-744B on Terminal-Bench 2.1.
- Three layers (fixed trust boundary, deterministic monitor, frozen LLM judge) guard against reward hacking.

What is Ornith-1.0?

Ornith-1.0 is a set of reasoning models tuned for coding agents. The variants are 9B Dense, 31B Dense, 35B MoE, and 397B MoE. The 35B model is mixture-of-experts and activates roughly 3B parameters per token. FP8 and GGUF builds are also published for faster local serving.

Replies open with a `<think>` block before the final answer. The serving recipes enable a reasoning parser, so that trace returns in a separate `reasoning_content` field. The models also emit well-formed tool calls for agent loops.

Deployment is straightforward. The 9B model is about 19GB in bf16 and serves on a single 80GB GPU. Serving recipes target vLLM, SGLang, and Transformers. Each model exposes an OpenAI-compatible endpoint, so standard agent frameworks work without code changes.

The Self-Scaffolding Idea

Most coding agents rely on a scaffold, also called a harness. A scaffold wraps the model with memory, tools, error handling, and orchestration logic.
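To make the term concrete, here is a minimal sketch of a fixed, hand-written scaffold. Everything in it is hypothetical: the `fake_model` stub stands in for a real LLM call, and the `run_tool` helper, the `TOOL:`/`DONE:` reply convention, and the step limit are illustrative assumptions, not part of Ornith-1.0 or any specific agent framework.

```python
from typing import Callable, Dict, List

# Hypothetical tool surface: tool name -> function taking one string argument.
TOOLS: Dict[str, Callable[[str], str]] = {
    "echo": lambda arg: arg,
    "upper": lambda arg: arg.upper(),
}

def run_tool(name: str, arg: str) -> str:
    """Error handling: unknown tools and tool crashes become messages, not exceptions."""
    if name not in TOOLS:
        return f"error: unknown tool {name!r}"
    try:
        return TOOLS[name](arg)
    except Exception as exc:
        return f"error: {exc}"

def fake_model(task: str, memory: List[str]) -> str:
    """Stub standing in for an LLM call. Replies 'TOOL:<name>:<arg>' or 'DONE:<answer>'."""
    if not memory:
        return f"TOOL:upper:{task}"
    return f"DONE:{memory[-1]}"

def fixed_scaffold(task: str, model=fake_model, max_steps: int = 5) -> str:
    """Orchestration: loop model -> tool -> memory until the model says DONE."""
    memory: List[str] = []
    for _ in range(max_steps):
        reply = model(task, memory)
        if reply.startswith("DONE:"):
            return reply[len("DONE:"):]
        _, name, arg = reply.split(":", 2)
        memory.append(run_tool(name, arg))
    return "error: step limit reached"

print(fixed_scaffold("fix the failing test"))
```

Here the loop, the tool set, and the stopping rule are all fixed by whoever wrote the harness.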
AI teams usually hand-design one scaffold per task category. Ornith-1.0 treats the scaffold as a learnable object instead. During reinforcement learning, the scaffold co-evolves with the model’s policy.

Each RL step runs in two stages. First, the model reads the task and its previous scaffold, then proposes a refined scaffold. Second, it uses that scaffold and the task to generate a solution rollout. Reward from the rollout flows back to both stages, so the model is optimized to author orchestration, not just answers.

Over training, higher-reward scaffolds are mutated and selected automatically. Per-task strategies emerge without hand-engineered harness design.

Training also runs asynchronously, using a pipeline-RL setup. A staleness weight downweights older, off-policy tokens and drops them past a threshold. The optimization uses a token-level GRPO objective.

Guarding Against Reward Hacking

Letting a model write its own scaffold invites reward hacking. A scaffold could read visible test files and hardcode expected outputs. It could also copy an oracle solution sitting in the environment. The DeepReinforce team describes three defense layers.

- The outer trust boundary is fixed and immutable. The environment, tool surface, and test isolation stay outside the model’s reach. The model evolves only its inner policy scaffold.
- A deterministic monitor flags banned actions. Reading withheld paths or editing verification scripts earns zero reward. Those trajectories are excluded from the advantage computation.
- A frozen LLM judge acts as a veto. It sits on top of the verifier rather than serving as the primary reward.

Benchmark

DeepReinforce reports vendor numbers across several agentic coding benchmarks. At flagship scale, Ornith-1.0-397B posts 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified. On SWE-Bench Verified, that 82.4 trails only Claude Opus 4.8 (87.6) among the listed models. On Terminal-Bench 2.1, the picture is more mixed.
Ornith-1.0-397B beats Claude Opus 4.7 (70.3) on Terminal-Bench 2.1, but it trails Claude Opus 4.8 (85) and the larger GLM-5.2-744B (81.0). So the ‘state-of-the-art’ claim is scoped to open models of comparable size.

The smaller models carry the efficiency case. The 35B model scores 64.2 on Terminal-Bench 2.1, above Qwen3.5-397B’s 53.5. The 9B model reaches 43.1 on Terminal-Bench 2.1 and 69.4 on SWE-Bench Verified.

| Benchmark | Ornith-1.0-397B | Qwen3.5-397B | Qwen3.7-Max | GLM-5.2-744B | Minimax-M3-428B | DeepSeek-V4-Pro-1.6T | Claude Opus 4.7 | Claude Opus 4.8 |
|---|---|---|---|---|---|---|---|---|
| Terminal-Bench 2.1 | 77.5 | 53.5 | 73.5 | 81.0 | 64 | 64 | 70.3 | 85 |
| SWE-Bench Verified | 82.4 | 76.4 | 80.4 | – | – | 80.6 | 80.8 | 87.6 |
| SWE-Bench Pro | 62.2 | 51.6 | 60.6 | 62.1 | 59 | 55.4 | 64.3 | 69.2 |
| SWE-Bench Multilingual | 78.9 | 69.3 | 78.3 | – | – | 76.2 | – | – |
| NL2Repo | 48.2 | 36.8 | 47.2 | 48.9 | 42.1 | – | – | 69.7 |
| ClawEval Avg | 77.1 | 70.7 | 65.2 | – | – | 75.8 | 78.2 | – |

Use Cases and a Quick Start

The models target terminal-native coding agents and repository-scale work. Practical fits include multi-file refactors, bug localization, and test-driven patches. The 9B model suits edge or single-GPU setups where latency and cost matter. The 397B model targets maximum accuracy on long, multi-step tasks. For example, a developer can run the 9B model locally to triage a failing test suite, while a platform team can self-host the 397B model for an internal coding agent.
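As a rough sizing aid for those two deployment scenarios, weight memory can be estimated as parameter count times bytes per parameter. The sketch below is back-of-the-envelope only. It uses the nominal parameter counts from the model names, assumes 2 bytes per parameter for bf16 and 1 byte for FP8, and ignores KV cache, activations, and runtime overhead. The function name is illustrative.

```python
# Assumed storage cost per parameter for each weight format.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}

def weight_memory_gb(params_billion: float, dtype: str = "bf16") -> float:
    """Approximate weight-only memory in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1e9

# Nominal sizes taken from the model names in the article.
for name, size in [("9B", 9), ("31B", 31), ("35B-MoE", 35), ("397B-MoE", 397)]:
    bf16 = weight_memory_gb(size, "bf16")
    fp8 = weight_memory_gb(size, "fp8")
    print(f"{name}: ~{bf16:.0f} GB bf16, ~{fp8:.0f} GB fp8")
```

The 9B estimate of about 18 GB is close to the roughly 19GB reported for bf16. For the MoE variants, the estimate covers all experts, since every expert's weights generally need to be loaded even though only a fraction activate per token.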
Serving is a one-liner with vLLM:

```
vllm serve deepreinforce-ai/Ornith-1.0-9B \
  --served-model-name Ornith-1.0-9B \
  --max-model-len 262144 \
  --enable-auto-tool-choice --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3 \
  --trust-remote-code
```

Then call it with any OpenAI client:

```python
from openai import OpenAI

# vLLM ignores the API key, but the client requires a non-empty value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Ornith-1.0-9B",
    messages=[{"role": "user", "content": "Write a Python is_prime(n)."}],
    temperature=0.6,
    top_p=0.95,
)

msg = resp.choices[0].message
print(getattr(msg, "reasoning_content", None))  # the <think> trace
print(msg.content)  # the final answer
```

The reasoning trace returns in `reasoning_content`, with the answer in `content`. Recommended sampling is `temperature=0.6`, `top_p=0.95`, `top_k=20`. The model also plugs into OpenHands, OpenClaw, and OpenCode.

The model weights are on Hugging Face. Technical details: https://deep-reinforce.com/ornith_1_0.html
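Because the serve command above enables automatic tool choice, the same endpoint can also return structured tool calls. The following is a minimal sketch, assuming the quick-start server is running on localhost:8000 and the `openai` package is installed. The `run_tests` tool, its schema, and the `dispatch` and `ask_with_tools` helpers are hypothetical examples, not part of the Ornith-1.0 release.

```python
import json

# Hypothetical tool schema in the OpenAI function-calling format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the test suite under a path and report failures.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

def run_tests(path: str) -> str:
    """Placeholder implementation; a real agent would invoke a test runner here."""
    return f"ran tests under {path}"

def dispatch(name: str, arguments_json: str) -> str:
    """Route one tool call (name plus JSON-encoded arguments) to local code."""
    handlers = {"run_tests": run_tests}
    if name not in handlers:
        return f"error: unknown tool {name!r}"
    return handlers[name](**json.loads(arguments_json))

def ask_with_tools(prompt: str) -> list:
    """Send one request and execute any tool calls the model returns."""
    from openai import OpenAI  # imported lazily so the helpers above work offline

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    resp = client.chat.completions.create(
        model="Ornith-1.0-9B",
        messages=[{"role": "user", "content": prompt}],
        tools=TOOLS,
        temperature=0.6,
        top_p=0.95,
    )
    calls = resp.choices[0].message.tool_calls or []
    return [dispatch(c.function.name, c.function.arguments) for c in calls]

# Offline check of the dispatcher; ask_with_tools() needs the running server.
print(dispatch("run_tests", '{"path": "tests/"}'))
```

A full agent loop would append each tool result to `messages` as a `tool` role message and call the endpoint again until the model stops requesting tools.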