Arbor framework outperforms Claude Code and Codex by 2.5x in AI optimization benchmarks

Source: cryptobriefing.com · Published: 2026-06-18T18:23Z · Topic: artificial-intelligence

Researchers at Renmin University of China and Microsoft Research released Arbor, an open-source framework that outperformed OpenAI's Codex and Anthropic's Claude Code by more than 2.5 times in average relative held-out gains across six autonomous optimization tasks. The framework uses Hypothesis-Tree Refinement to structure AI trial-and-error into cumulative learning, achieving the best held-out test results on all evaluated tasks.

A new open-source system from Renmin University and Microsoft Research turns AI trial-and-error into structured, cumulative learning

Researchers at Renmin University of China’s Gaoling School of Artificial Intelligence and Microsoft Research released Arbor on June 10, 2026. The open-source framework outperformed both OpenAI’s Codex and Anthropic’s Claude Code by more than 2.5 times in average relative held-out gains across six autonomous optimization tasks, and it achieved the best held-out test results on every task evaluated.

How Arbor actually works

Arbor uses Hypothesis-Tree Refinement (HTR), which organizes optimization work into a branching tree structure of hypotheses, experiments, evidence, and insights, where each branch builds on what came before rather than treating each attempt as a standalone experiment.

The architecture splits into two layers. A long-lived coordinator agent handles strategy, deciding which hypotheses are worth pursuing and how to sequence experiments. Short-lived executor agents then run those experiments in controlled environments. When an executor finishes its job and reports back, the coordinator absorbs the findings and refines its approach for the next round.
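The coordinator and executor split can be illustrated with a toy sketch. This is not Arbor's code: the names `HypothesisNode`, `coordinate`, `toy_experiment`, and `toy_propose`, the greedy "expand the best unexpanded node" strategy, and the score table are all assumptions made up for illustration. It only shows how each branch can build on the evidence of its parent instead of starting from scratch.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class HypothesisNode:
    """One node of a hypothesis tree: an idea plus the evidence gathered for it."""
    hypothesis: str
    parent: Optional["HypothesisNode"] = None
    score: Optional[float] = None  # evidence reported by an executor
    insight: str = ""              # what the coordinator learned from it
    children: List["HypothesisNode"] = field(default_factory=list)

    def lineage(self) -> List[str]:
        """Hypotheses from the root down to this node (what the branch builds on)."""
        node, path = self, []
        while node is not None:
            path.append(node.hypothesis)
            node = node.parent
        return path[::-1]


def coordinate(root: HypothesisNode,
               propose: Callable[[HypothesisNode], List[str]],
               experiment: Callable[[List[str]], float],
               rounds: int) -> HypothesisNode:
    """Long-lived coordinator: picks a branch, dispatches experiments, absorbs results."""
    root.score = experiment(root.lineage())
    evaluated, frontier = [root], [root]
    for _ in range(rounds):
        if not frontier:
            break
        best = max(frontier, key=lambda n: n.score)  # most promising unexpanded node
        frontier.remove(best)
        for idea in propose(best):
            child = HypothesisNode(idea, parent=best)
            # Stand-in for a short-lived executor running one experiment.
            child.score = experiment(child.lineage())
            child.insight = "helped" if child.score > best.score else "did not help"
            best.children.append(child)
            evaluated.append(child)
            frontier.append(child)
    return max(evaluated, key=lambda n: n.score)


# Hypothetical score contributions; a real executor would train and evaluate a model.
GAINS = {"baseline": 0.50, "lower learning rate": 0.05,
         "add augmentation": 0.08, "bigger batch": -0.02}


def toy_experiment(lineage: List[str]) -> float:
    return round(sum(GAINS[h] for h in lineage), 2)


def toy_propose(node: HypothesisNode) -> List[str]:
    used = set(node.lineage())
    return [h for h in GAINS if h not in used]


best = coordinate(HypothesisNode("baseline"), toy_propose, toy_experiment, rounds=2)
print(best.lineage(), best.score)
```

In this toy run the second round expands the "add augmentation" branch because it scored best in the first round, so the winning node stacks two changes on top of the baseline rather than testing each in isolation.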

The benchmark numbers

Across six autonomous optimization tasks spanning model training and data synthesis, Arbor delivered more than 2.5 times the average relative held-out gain of either Codex or Claude Code. It also posted the best held-out test results on all evaluated tasks.
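The article does not define "average relative held-out gain". One plausible reading, assumed here, is the improvement over a starting baseline on held-out data, divided by that baseline and averaged across tasks. The sketch below uses that assumed definition with made-up scores; the function names and every number are hypothetical and are not taken from the Arbor results.

```python
def relative_gain(baseline: float, score: float) -> float:
    """Improvement over the baseline, as a fraction of the baseline (assumed definition)."""
    return (score - baseline) / abs(baseline)


def average_relative_gain(pairs):
    """Mean relative gain over (baseline, held-out score) pairs, one pair per task."""
    gains = [relative_gain(b, s) for b, s in pairs]
    return sum(gains) / len(gains)


# Hypothetical held-out scores for two systems on the same two tasks.
system_a = [(0.50, 0.60), (0.40, 0.50)]  # gains of about 20% and 25%
system_b = [(0.50, 0.54), (0.40, 0.44)]  # gains of about 8% and 10%

ratio = average_relative_gain(system_a) / average_relative_gain(system_b)
print(f"system A's average relative gain is {ratio:.2f}x system B's")
```

Under this reading, a "2.5x" figure compares how much each system improved over its starting point, not the raw scores themselves.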

On MLE-Bench Lite, a standardized benchmark for machine learning engineering, Arbor running on GPT-5.5 achieved an Any-Medal score of 86.36%. That score measures the percentage of tasks where the system performed well enough to earn at least a bronze-level result.
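As a rough check on how such a percentage arises, the sketch below computes an any-medal rate from a list of per-task outcomes. The article gives no task counts; the example assumes a 22-task split with 19 medal-level results, which happens to reproduce 86.36%, and the breakdown into gold, silver, and bronze is invented purely for illustration.

```python
def any_medal_rate(medals):
    """Percentage of tasks that earned at least a bronze-level result."""
    earned = sum(1 for m in medals if m in {"bronze", "silver", "gold"})
    return 100.0 * earned / len(medals)


# Hypothetical outcomes: 19 medals out of 22 tasks (the medal mix is made up).
results = ["gold"] * 8 + ["silver"] * 6 + ["bronze"] * 5 + [None] * 3
print(f"{any_medal_rate(results):.2f}%")  # prints 86.36%
```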

The BrowseComp accuracy comparison adds another data point: Arbor scored 67.67 versus Claude Code’s 53.33.

The framework is publicly available through its GitHub repository at RUC-NLPIR/Arbor. It ships with a command-line interface runtime and skill sets designed to integrate with other coding agents.
