{"slug": "deepreinforce-releases-ornith-1-0-an-open-source-coding-model-family-that-learns", "title": "DeepReinforce Releases Ornith-1.0: An Open-Source Coding Model Family That Learns Its Own RL Scaffolds", "summary": "DeepReinforce released Ornith-1.0, an open-source family of coding models that learn their own reinforcement learning scaffolds, achieving state-of-the-art results among open models of comparable size. The lineup includes four sizes from 9B to 397B parameters, built on Gemma 4 and Qwen 3.5, and is available under the MIT license. The 397B variant outperforms Claude Opus 4.7 on key benchmarks but trails Opus 4.8 and GLM-5.2-744B.", "body_md": "DeepReinforce has released **Ornith-1.0**, an open-source model family built for agentic coding. The lineup spans four sizes, from a 9B dense model to a 397B mixture-of-experts flagship. Every checkpoint ships under the MIT license on Hugging Face. The models are post-trained on top of pretrained Gemma 4 and Qwen 3.5.\n\nMost coding agents pair a model with a fixed, human-designed harness. Ornith-1.0 instead learns to write its own. The DeepReinforce research team reports state-of-the-art results among open models of comparable size.\n\n**TL;DR**\n\n- Ornith-1.0 ships in 9B, 31B, 35B-MoE, and 397B-MoE sizes under MIT, built on Gemma 4 and Qwen 3.5.\n- The model learns its own scaffold during RL, jointly optimizing the harness and the solution.\n- Ornith-1.0-397B tops Claude Opus 4.7 on both headline benchmarks, but not Opus 4.8 or the larger GLM-5.2-744B.\n- Three layers (a fixed trust boundary, a deterministic monitor, and a frozen LLM judge) guard against reward hacking.\n\n**What is Ornith-1.0?**\n\nOrnith-1.0 is a set of reasoning models tuned for coding agents. The variants are 9B Dense, 31B Dense, 35B MoE, and 397B MoE. The 35B model is mixture-of-experts and activates roughly 3B parameters per token. FP8 and GGUF builds are also published for faster local serving.\n\nEach model is a reasoning model. 
Replies open with a `<think>` block before the final answer. The serving recipes enable a reasoning parser, so that trace is returned in a separate `reasoning_content` field. The models also emit well-formed tool calls for agent loops.\n\nDeployment is straightforward. The 9B model is about 19GB in bf16 and serves on a single 80GB GPU. Serving recipes target vLLM, SGLang, and Transformers. Each model exposes an OpenAI-compatible endpoint. Standard agent frameworks therefore work without code changes.\n\n**The Self-Scaffolding Idea**\n\nMost coding agents rely on a scaffold, also called a **harness**. A scaffold wraps the model with memory, tools, error handling, and orchestration logic. AI teams usually hand-design one scaffold per task category.\n\nOrnith-1.0 treats the scaffold as a learnable object instead. During reinforcement learning, the scaffold co-evolves with the model’s policy. **Each RL step runs in two stages**.\n\n**First**, the model reads the task and its previous scaffold. It then proposes a refined scaffold. **Second**, it uses that scaffold and the task to generate a solution rollout. Reward from the rollout flows back to both stages.\n\nSo the model is optimized to author orchestration, not just answers. Over training, higher-reward scaffolds are mutated and selected automatically. Per-task strategies emerge without hand-engineered harness design.\n\nTraining also runs asynchronously, using a pipeline-RL setup. A staleness weight downweights older, off-policy tokens and drops them past a threshold. The optimization uses a token-level GRPO objective.\n\n**Guarding Against Reward Hacking**\n\nLetting a model write its own scaffold invites reward hacking. A scaffold could read visible test files and hardcode expected outputs. It could also copy an oracle solution sitting in the environment. The DeepReinforce team describes three defense layers.\n\n- The outer trust boundary is fixed and immutable. 
The environment, tool surface, and test isolation stay outside the model’s reach. The model evolves only its inner policy scaffold.\n- A deterministic monitor flags banned actions. Reading withheld paths or editing verification scripts earns zero reward. Those trajectories are excluded from the advantage computation.\n- A frozen LLM judge acts as a veto. It sits on top of the verifier, not as the primary reward.\n\n**Benchmark**\n\nDeepReinforce reports vendor numbers across several agentic coding benchmarks. At flagship scale, Ornith-1.0-397B posts 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified. On SWE-Bench Verified, that 82.4 trails only Claude Opus 4.8 (87.6) among the listed models. On Terminal-Bench 2.1, the picture is more mixed.\n\nOrnith-1.0-397B beats Claude Opus 4.7 (70.3) on Terminal-Bench 2.1. But it trails Claude Opus 4.8 (85) and the larger GLM-5.2-744B (81.0). So the ‘state-of-the-art’ claim is scoped to open models of comparable size.\n\nThe smaller models carry the efficiency case. The 35B model scores 64.2 on Terminal-Bench 2.1, above Qwen 3.5-397B’s 53.5. The 9B model reaches 43.1 on Terminal-Bench 2.1 and 69.4 on SWE-Bench Verified.\n\n| Benchmark | Ornith-1.0-397B | Qwen3.5-397B | Qwen3.7-Max | GLM-5.2-744B | Minimax-M3-428B | DeepSeek-V4-Pro-1.6T | Claude Opus 4.7 | Claude Opus 4.8 |\n|---|---|---|---|---|---|---|---|---|\n| Terminal-Bench 2.1 | 77.5 | 53.5 | 73.5 | 81.0 | 64 | 64 | 70.3 | 85 |\n| SWE-Bench Verified | 82.4 | 76.4 | 80.4 | – | – | 80.6 | 80.8 | 87.6 |\n| SWE-Bench Pro | 62.2 | 51.6 | 60.6 | 62.1 | 59 | 55.4 | 64.3 | 69.2 |\n| SWE-Bench Multilingual | 78.9 | 69.3 | 78.3 | – | – | 76.2 | – | – |\n| NL2Repo | 48.2 | 36.8 | 47.2 | 48.9 | 42.1 | – | – | 69.7 |\n| ClawEval Avg | 77.1 | 70.7 | 65.2 | – | – | 75.8 | 78.2 | – |\n\n**Use Cases and a Quick Start**\n\nThe models target terminal-native coding agents and repository-scale work. Practical fits include multi-file refactors, bug localization, and test-driven patches. 
The 9B model suits edge or single-GPU setups where latency and cost matter. The 397B model targets maximum accuracy on long, multi-step tasks.\n\nFor example, a dev can run the 9B model locally to triage a failing test suite. A platform team can self-host the 397B model for an internal coding agent.\n\nServing is a one-liner with vLLM:\n\n```\nvllm serve deepreinforce-ai/Ornith-1.0-9B \\\n    --served-model-name Ornith-1.0-9B \\\n    --max-model-len 262144 \\\n    --enable-auto-tool-choice --tool-call-parser qwen3_xml \\\n    --reasoning-parser qwen3 \\\n    --trust-remote-code\n```\n\nThen call it with any OpenAI client:\n\n``` python\nfrom openai import OpenAI\n\nclient = OpenAI(base_url=\"http://localhost:8000/v1\", api_key=\"EMPTY\")\n\nresp = client.chat.completions.create(\n    model=\"Ornith-1.0-9B\",\n    messages=[{\"role\": \"user\", \"content\": \"Write a Python is_prime(n).\"}],\n    temperature=0.6, top_p=0.95,\n)\nmsg = resp.choices[0].message\nprint(getattr(msg, \"reasoning_content\", None))  # the <think> trace\nprint(msg.content)                              # the final answer\n```\n\nThe reasoning trace returns in `reasoning_content`, with the answer in `content`. Recommended sampling is `temperature=0.6`, `top_p=0.95`, `top_k=20`. The model also plugs into OpenHands, OpenClaw, and OpenCode.\n\nCheck out the **Model Weights** and **[Technical details](https://deep-reinforce.com/ornith_1_0.html)**.", "url": "https://wpnews.pro/news/deepreinforce-releases-ornith-1-0-an-open-source-coding-model-family-that-learns", "canonical_source": "https://www.marktechpost.com/2026/06/25/deepreinforce-releases-ornith-1-0-an-open-source-coding-model-family-that-learns-its-own-rl-scaffolds/", "published_at": "2026-06-25 17:11:37+00:00", "updated_at": "2026-06-25 17:18:47.095194+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-agents", "ai-research", "ai-products"], "entities": ["DeepReinforce", "Ornith-1.0", "Gemma 4", "Qwen 3.5", "Claude Opus 4.7", "Claude Opus 4.8", "GLM-5.2-744B", "Hugging Face"], "alternates": {"html": "https://wpnews.pro/news/deepreinforce-releases-ornith-1-0-an-open-source-coding-model-family-that-learns", "markdown": "https://wpnews.pro/news/deepreinforce-releases-ornith-1-0-an-open-source-coding-model-family-that-learns.md", "text": "https://wpnews.pro/news/deepreinforce-releases-ornith-1-0-an-open-source-coding-model-family-that-learns.txt", "jsonld": "https://wpnews.pro/news/deepreinforce-releases-ornith-1-0-an-open-source-coding-model-family-that-learns.jsonld"}}