AI coding agents tend to conduct research in isolation, running experiments and generating ideas that are forgotten when the context window resets. That wastes tokens, because the model then repeats the same mistakes and hits the same dead ends.
But new research argues that the fix lies not in the model itself but in the overarching ‘tree’ that organizes its work. To that end, data scientists from the Gaoling School of Artificial Intelligence, Renmin University of China, and Microsoft Research have introduced Arbor, a “persistent hypothesis tree” that helps agents remember and refine learnings over long research sessions.
A long-lived coordinator manages research strategy across the tree, while short-lived executors spin up isolated worktrees to test different hypotheses. As results come back, the tree updates, narrowing and refining throughout experimentation.
In practical tests, this technique delivered more than two-fold performance gains over standard AI coding agents across real-world engineering tasks, for the same budget.
This is because, said Mahmoud Ramin, a research director at Info-Tech Research Group, “Arbor accumulates information over time and allows agents to build upon prior discoveries just as humans do, through learning, adaptation, and eventually building upon what they have learned in the past.”
Arbor’s builders argued that longer execution on its own does not guarantee research progress. The challenge is maintaining a state that turns many individual attempts into “cumulative hypothesis refinement.”
Further, progress should not depend on human overseers regularly stepping in to dictate logical next steps or interpret the meaning of previous trials, they noted. To be truly autonomous, agentic research frameworks must maintain connections between experiments, data, results, and failures over time.
Arbor is built to fulfill three system requirements. First, it must be able to branch, so that sub-trees can test competing hypotheses that are all potentially plausible. At the same time, unrestricted branching can cause the whole framework to degenerate, so branching must be controlled to keep the tree organized. The researchers call this “branching with coherence.”
Second, the infrastructure must separate local execution from overarching strategy. Testing a single hypothesis requires short-horizon tasks like editing, debugging, and evaluation. But these should not “obscure” the larger, tree-level decisions that are based on evidence gathered across the whole run.
Finally, the system must be able to distinguish exploratory improvement from verified improvement. This keeps the agent from overfitting to its own trial-and-error results rather than learning iteratively from underlying patterns.
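To make the third requirement concrete, here is a minimal Python sketch of one way such a distinction could be drawn. It is not Arbor's actual logic: the `Result` fields, the `classify_improvement` function, the `min_gain` threshold, and the scores are all hypothetical, chosen only to illustrate the idea that a gain counts as verified when it also shows up on data the search never used.

```python
from dataclasses import dataclass


@dataclass
class Result:
    dev_score: float       # score on the split used during trial and error
    held_out_score: float  # score on data never used to guide the search


def classify_improvement(baseline: Result, candidate: Result,
                         min_gain: float = 0.0) -> str:
    """Label a candidate relative to a baseline (hypothetical rule)."""
    dev_gain = candidate.dev_score - baseline.dev_score
    held_out_gain = candidate.held_out_score - baseline.held_out_score
    if dev_gain <= min_gain:
        return "no_improvement"
    if held_out_gain > min_gain:
        return "verified"
    # Better on the development split only: possibly overfitting.
    return "exploratory"


# Made-up numbers, for illustration only.
baseline = Result(dev_score=0.70, held_out_score=0.68)
overfit = Result(dev_score=0.75, held_out_score=0.67)
genuine = Result(dev_score=0.74, held_out_score=0.71)

print(classify_improvement(baseline, overfit))
print(classify_improvement(baseline, genuine))
```

Under this rule, the first candidate is labeled exploratory because its gain appears only on the development split, while the second is labeled verified.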
Persistence is at the core; the tree links hypotheses and ideas, the code or configuration artifacts used to test them, experimental evidence (results, metrics), and distilled insights (such as “this data filter helped, but this learning rate scheduler didn’t”).
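As a rough illustration of what one node in such a tree might hold, the Python sketch below links a hypothesis to its artifact, evidence, and insights. The class name `HypothesisNode`, its fields, the worktree paths, and the metric values are all assumptions made for this example; the researchers' actual schema may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class HypothesisNode:
    """One node in a persistent hypothesis tree (illustrative fields only)."""
    hypothesis: str                                          # the idea being tested
    artifact_ref: Optional[str] = None                       # code or config used to test it
    metrics: Dict[str, float] = field(default_factory=dict)  # experimental evidence
    insights: List[str] = field(default_factory=list)        # distilled lessons
    children: List["HypothesisNode"] = field(default_factory=list)

    def add_child(self, child: "HypothesisNode") -> "HypothesisNode":
        self.children.append(child)
        return child

    def lessons(self) -> List[str]:
        """Collect insights from this node and all descendants, depth-first."""
        collected = list(self.insights)
        for child in self.children:
            collected.extend(child.lessons())
        return collected


# Hypothetical example data; the scores are made up.
root = HypothesisNode("Improve held-out performance of the baseline")
root.add_child(HypothesisNode(
    "Filter noisy training examples",
    artifact_ref="worktrees/data-filter",
    metrics={"dev_score": 0.71},
    insights=["this data filter helped"],
))
root.add_child(HypothesisNode(
    "Swap the learning rate scheduler",
    artifact_ref="worktrees/lr-scheduler",
    metrics={"dev_score": 0.64},
    insights=["this learning rate scheduler didn't help"],
))

print(root.lessons())
```

Because insights live on the nodes rather than in a context window, a later step can recover them by walking the tree.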
Once a project kicks off, the short-lived executors run code in their worktrees, log their work, and collect metrics. The long-lived coordinator above them serves as the de facto head of research, keeping an eye on the process, updating nodes, selecting “promising leaves,” pruning or merging branches, propagating reusable lessons, and deciding which hypotheses to pursue next.
“The tree therefore acts as the operational research state of the system,” Arbor’s builders wrote. “It is simultaneously the search frontier, the memory of past attempts, and the audit trail for verified artifact improvement.”
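The division of labor between coordinator and executors can be sketched as a simple loop. The Python below is a toy under stated assumptions, not Arbor's implementation: `Node`, `run_coordinator`, `propose`, `execute`, and the `prune_below` threshold are hypothetical names, the executor is a stub that returns made-up scores instead of running code in a worktree, and merging branches and propagating lessons are omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    hypothesis: str
    score: Optional[float] = None  # None until an executor reports back
    pruned: bool = False
    children: List["Node"] = field(default_factory=list)


def leaves(node: Node) -> List[Node]:
    """Return all leaf nodes under `node`."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def best_leaf(root: Node) -> Optional[Node]:
    """Pick the highest-scoring leaf that is scored and not pruned."""
    frontier = [n for n in leaves(root) if not n.pruned and n.score is not None]
    return max(frontier, key=lambda n: n.score) if frontier else None


def run_coordinator(root: Node,
                    propose: Callable[[Node], List[str]],
                    execute: Callable[[Node], float],
                    rounds: int, prune_below: float) -> Node:
    """Long-lived loop: select a promising leaf, branch, test, prune."""
    for _ in range(rounds):
        parent = best_leaf(root)
        if parent is None:
            break
        for idea in propose(parent):
            child = Node(idea)
            child.score = execute(child)  # stand-in for a short-lived executor
            child.pruned = child.score < prune_below
            parent.children.append(child)
    return root


# Stubs with made-up scoring, for illustration only.
def propose(node: Node) -> List[str]:
    return [node.hypothesis + " + data filter", node.hypothesis + " + new scheduler"]


def execute(node: Node) -> float:
    h = node.hypothesis
    return round(0.60 + 0.05 * h.count("data filter") - 0.03 * h.count("new scheduler"), 2)


root = Node("baseline", score=0.60)
run_coordinator(root, propose, execute, rounds=2, prune_below=0.60)
print(best_leaf(root).hypothesis)
```

In this toy run, the scheduler-only branch scores below the baseline and is pruned after the first round, so the second round spends its budget extending the data-filter branch instead.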
To test how well this process works, the researchers evaluated Arbor in an autonomous optimization (AO) setting: the agent was given an initial research artifact (a data pipeline, harness, or training script) and was tasked with improving its “held-out performance” through iterative experimentation, without human steering. Held-out performance is a machine learning (ML) metric that evaluates how well a model generalizes to data it hasn’t seen before.
The tree-based architecture was tested on several real research tasks across model training (its ability to improve training recipes and hyperparameters), harness engineering (how well it can upgrade evaluation or training harnesses), and data synthesis (its capacity to generate better data for training or evals).
Ultimately, Arbor outperformed the average held-out gains of Codex and Claude Code by 2.5x, for the same resource budget.
The takeaway, said the researchers: Keeping a structured, evolving hypothesis tree yields greater performance improvements than running the same models as ‘memoryless’ coding agents.
Arbor’s most innovative feature is its ability to maintain the agent’s memory and retain relevant data from prior attempts and hypotheses, Info-Tech’s Ramin pointed out. “The next step for autonomous agents may be accumulating evidence over time,” he said.
However, this does raise concerns about the auditability of robust research environments at a large scale, he noted. “As autonomous agents become more capable of performing work without human operators overseeing them, enterprises will need transparency into how and/or why an agent took a specific action or reached a certain conclusion.”