SIA (Self Improving AI), released by Hexo Labs on May 26, 2026 , is the first open-source framework that co-evolves both an agent's scaffold and its model weights inside a single iterative loop. The MIT-licensed code is on github.com/hexo-ai/sia. This tutorial walks through the feedback loop logic, prerequisites, and a runnable five-generation LawBench experiment.
SIA's Feedback-Agent reads full execution trajectories, reward metrics, and task descriptions each generation, then decides whether the next step should be a scaffold edit, a LoRA weight update, or both — and selects the RL algorithm automatically based on the reward shape of the current task . Before SIA, harness-update systems (Darwin Gödel Machine, Hyperagents) and test-time training systems (TTRL, Discover-TTT) were entirely separate research directions. SIA is the first framework to combine both levers in a single self-improving loop, per the SIA paper (arXiv:2605.27276).
Quick Answer: SIA (arXiv:2605.27276, MIT license, May 2026) co-evolves agent scaffold and LoRA weights in a single loop. Run sia --task lawbench --max_gen 5
; the Feedback-Agent picks PPO+GAE, GRPO, or Entropic Advantage Weighting based on reward shape — no RL algorithm choice required. On LawBench, the combined harness+weights variant reached 70.1% accuracy , 25.1 percentage points over prior SOTA.
The three-agent loop: Meta-Agent generates the initial scaffold from a task description and reference implementation; Task-Specific Agent executes against the eval dataset in a sandbox with every step logged as a trajectory; Feedback-Agent (Claude Sonnet 4.6) receives source code, trajectories, metrics, and sample task descriptions, then emits improvement.md
and the next-generation agent .
RL algorithm selection is driven by reward shape:
SIA benchmark results, May 2026
| Task | Baseline | Prior SOTA | SIA-H (harness only) | SIA-W+H (harness + weights) |
|---|---|---|---|---|
| LawBench (191-class accuracy) | 13.5% | 45.0% | 50.0% | |
| 70.1% (+25.1 pp over SOTA) | ||||
| TriMul CUDA kernel (μs, lower=better) | ~13,500 μs | 1,161 μs | 1,017 μs | |
| 1,017 μs (−12.4% vs SOTA) | ||||
| MAGIC scRNA-seq denoising (mse_norm, higher=better) | 0.048 | 0.240 | 0.241 | |
| 0.289 (+20.4% over SOTA) |
"Harness changes and weight updates do not overlap in their effect space: harness iterations produce externalized infrastructure improvements — better parsing, tools, retry logic — while weight updates encode internalized domain knowledge that no prompt engineering alone can reach." — Hexo Labs research team,
[SIA: Self Improving AI (arXiv:2605.27276v2)]
The Claude backend runs entirely on CPU — no local GPU required. Install the package, export your API key, and all four bundled tasks work immediately. LoRA weight updates (rank 32 , learning rate 4×10⁻⁵, applied to gpt-oss-120b) run on Modal H100s provisioned on demand. Skip Modal entirely and the loop still runs harness-only iterations — cheaper and sufficient to see meaningful eval gains in early generations.
Claude backend (all bundled tasks, no GPU needed):
pip install 'sia-agent[claude]'
export ANTHROPIC_API_KEY="sk-ant-..."
OpenHands backend (multi-provider task execution):
pip install 'sia-agent[openhands]'
export ANTHROPIC_API_KEY="..."
export GEMINI_API_KEY="..."
export OPENAI_API_KEY="..."
Prerequisites at a glance:
--backend openhands
Three commands take you from a clean environment to a live five-generation self-improving loop on the bundled LawBench task .
python3 -m venv .venv && source .venv/bin/activate
pip install 'sia-agent[claude]'
sia --task lawbench --max_gen 5 --run_id 1
Each generation writes output to runs/run_1/gen_N/
:
target_agent.py
— the evolved scaffold for this generationagent_execution.json
— full execution log and per-step trajectoryimprovement.md
— Feedback-Agent's rationale for the next change (appears from generation 2 onward)All four bundled tasks run with --task <name>
: gpqa
, lawbench
, longcot-chess
, spaceship-titanic
. Key flags to know:
--max_gen
— number of self-improvement generations (default: 3)--backend claude|openhands
--meta_model
— model for Feedback/Meta agents (default: haiku
)--task_model
— model for the task-specific agent (default: claude-haiku-4-5-20251001
)The snippet below is a runnable illustration of the core mechanism — the Feedback loop maintaining a live reward signal for each available algorithm and switching when one accumulates a better signal. This code ran to completion (exit 0):
import random
def epsilon_greedy(scores, pulls, t):
return max(scores, key=scores.get) if t % 3 else random.randrange(3)
def ucb(scores, pulls, t):
return max(scores, key=lambda a: scores[a] + (2 * (t + 1) / (pulls[a] + 1)) ** 0.5)
algorithms = {"epsilon_greedy": epsilon_greedy, "ucb": ucb}
scores = {0: 0.0, 1: 0.0, 2: 0.0}
pulls = {0: 0, 1: 0, 2: 0}
feedback = {name: 0.0 for name in algorithms}
random.seed(7)
for t in range(12):
chosen_algo = max(feedback, key=feedback.get) if t else "epsilon_greedy"
action = algorithms[chosen_algo](scores, pulls, t)
reward = [0.15, 0.55, 0.8][action] + random.uniform(-0.08, 0.08)
pulls[action] += 1
scores[action] += (reward - scores[action]) / pulls[action]
feedback[chosen_algo] = 0.7 * feedback[chosen_algo] + 0.3 * reward
if t == 5:
feedback["ucb"] += 0.5 # new feedback changes the controller's choice
print(f"step={t:02d} sia_selected={chosen_algo:15s} action={action} reward={reward:.2f}")
print("Takeaway: you provide feedback; SIA's loop chooses the RL algorithm.")
Watch step 07: a feedback boost applied to ucb
at step 5 causes the controller to switch algorithms at the next decision point. SIA's Feedback-Agent applies the same logic at generation granularity — accumulated reward signals reshape algorithm selection each generation, not just each step.
To run SIA on your own benchmark, create a directory with this minimum structure and point --task_dir
at it:
my-task/
├── data/
│ ├── public/
│ │ ├── task.md # scoring function + evaluation loop
│ │ └── ...
│ └── private/ # held-out answers (never in scaffold context)
└── reference/
├── reference_target_agent.py # working baseline for Meta-Agent
└── SAMPLE_TASK_DESCRIPTIONS.md
sia --task_dir ./my-task --max_gen 5 --run_id 1
Three things worth knowing about this layout:
task.md
defines the scoring function and evaluation loop — this is what tells SIA what a correct answer looks like, and it is the primary lever for guiding the Feedback loop.reference_target_agent.py
gives the Meta-Agent a working starting point. Omit it and the Meta-Agent generates a scaffold from scratch — viable, but slower and lower quality on the first generation.data/private/
stays outside the scaffold's context window at all times. Only the public task description is visible to the running agent — no eval-set contamination.Four patterns that appear reliably in early runs, and what to do about them:
improvement.md
starts repeating the same edits verbatim, switch to --meta_model claude-sonnet-4-5-20251001
. Sonnet produces richer harness rewrites and more substantive RL algorithm reasoning at higher cost per generation.agent_execution.json
for trajectory length before pushing --max_gen
beyond 10. Trajectory length is the main driver of per-generation wall time.For independent analysis of SIA's architecture and benchmark methodology, see the MarkTechPost writeup and the Moonlight review.
No. Harness edits run entirely on CPU via the Claude API — install sia-agent[claude]
, export ANTHROPIC_API_KEY
, and run. LoRA weight updates require a Modal account with H100 credits. Skip weight updates entirely by not configuring Modal; the loop still runs and improves the scaffold across generations at no GPU cost.
PPO with GAE. LawBench produces dense step-level rewards, and the Feedback loop consistently selects PPO for tasks with that reward structure. GRPO and Entropic Advantage Weighting appear on tasks with sparse or right-skewed reward distributions — RNA denoising and GPU kernel optimization respectively.
Not out-of-the-box. The LoRA RL loop targets gpt-oss-120b by default. Substituting a different base requires editing the run config and ensuring Modal can load those weights. The MIT license keeps the door open for community contributions supporting alternative bases.
Read runs/run_{id}/gen_{n}/improvement.md
for the Feedback loop's rationale for that generation. Compare eval scores in agent_execution.json
across generation directories. Flat scores paired with shallow or repetitive improvement notes are the signal to switch to --meta_model sonnet
or enable weight updates.
Cost and latency. Haiku is cheap enough to run across many generations without API costs dominating the experiment budget. Override with --meta_model claude-sonnet-4-5-20251001
when you need richer harness rewrites or more substantive RL algorithm reasoning — typically after generation 3 or 4 when haiku's improvement reports start repeating themselves.
Start with a harness-only run on a bundled task — gpqa
or lawbench
— to calibrate generation cost and see what improvement.md
looks like before enabling Modal. The harness-only variant already reaches 50.0% on LawBench against a 13.5% baseline , so it is worth knowing your harness ceiling before spending GPU time on weight updates.
Once harness gains plateau — flat scores for 2–3 consecutive generations — enable weight updates and compare SIA-H
vs SIA-W+H
performance directly. For custom domains, invest time in task.md
first: a well-specified verifier is what gives the Feedback loop a meaningful signal. A weak or noisy scoring function limits how far either harness edits or weight updates can go, regardless of how many generations you run.
Full paper: arXiv:2605.27276. Code, task authoring guide, and bundled tasks: github.com/hexo-ai/sia. Background on Hexo Labs' research program (Stanford, UC Santa Barbara, Oxford partnerships): tFiR interview with Hexo Labs.
Last updated: 2026-06-01. Article reflects SIA arXiv:2605.27276v2, revised May 28, 2026 .