You don't pick the RL algorithm — SIA's Feedback loop does


SIA (Self Improving AI), released by Hexo Labs on May 26, 2026, is the first open-source framework that co-evolves both an agent's scaffold and its model weights inside a single iterative loop. The MIT-licensed code is at github.com/hexo-ai/sia. This tutorial walks through the feedback loop logic, prerequisites, and a runnable five-generation LawBench experiment.

SIA's Feedback-Agent reads full execution trajectories, reward metrics, and task descriptions each generation, then decides whether the next step should be a scaffold edit, a LoRA weight update, or both. It also selects the RL algorithm automatically based on the reward shape of the current task. Before SIA, harness-update systems (Darwin Gödel Machine, Hyperagents) and test-time training systems (TTRL, Discover-TTT) were separate research directions. Per the SIA paper (arXiv:2605.27276), SIA is the first framework to combine both levers in a single self-improving loop.

Quick Answer: SIA (arXiv:2605.27276, MIT license, May 2026) co-evolves agent scaffold and LoRA weights in a single loop. Run sia --task lawbench --max_gen 5; the Feedback-Agent picks PPO+GAE, GRPO, or Entropic Advantage Weighting based on reward shape, so you never choose the RL algorithm yourself. On LawBench, the combined harness+weights variant reached 70.1% accuracy, 25.1 percentage points over prior SOTA.

The three-agent loop: the Meta-Agent generates the initial scaffold from a task description and reference implementation; the Task-Specific Agent executes against the eval dataset in a sandbox, with every step logged as a trajectory; the Feedback-Agent (Claude Sonnet 4.6) receives source code, trajectories, metrics, and sample task descriptions, then emits improvement.md and the next-generation agent.
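The data flow between the three agents can be pictured with a small sketch. Everything in it is a hypothetical stand-in: the function names, the Generation record, and the placeholder scoring are assumptions made for illustration, not SIA's API. The only point is how each generation's output feeds the next.

```python
from dataclasses import dataclass, field


@dataclass
class Generation:
    """One pass of the loop: the scaffold that ran and what it produced."""
    scaffold: str
    trajectories: list = field(default_factory=list)
    metrics: dict = field(default_factory=dict)
    improvement_notes: str = ""


def meta_agent(task_description, reference_impl):
    # Hypothetical stand-in: builds the initial scaffold from the task
    # description and the reference implementation.
    return f"# scaffold v0 for: {task_description}\n{reference_impl}"


def task_agent(scaffold, eval_items):
    # Hypothetical stand-in: "runs" the scaffold on each eval item and logs
    # every step as a trajectory entry. The score is a placeholder that
    # grows with the number of improvements appended to the scaffold.
    trajectories = [{"item": item, "steps": [f"ran on {item}"]} for item in eval_items]
    metrics = {"accuracy": min(1.0, 0.1 * scaffold.count("# improvement"))}
    return trajectories, metrics


def feedback_agent(gen, task_description):
    # Hypothetical stand-in: reads code, trajectories, and metrics, then
    # returns improvement notes plus the next-generation scaffold.
    notes = f"accuracy={gen.metrics['accuracy']:.2f}; add one improvement"
    next_scaffold = gen.scaffold + "\n# improvement"
    return notes, next_scaffold


def run_loop(task_description, reference_impl, eval_items, max_gen=3):
    scaffold = meta_agent(task_description, reference_impl)
    history = []
    for _ in range(max_gen):
        trajectories, metrics = task_agent(scaffold, eval_items)
        gen = Generation(scaffold, trajectories, metrics)
        gen.improvement_notes, scaffold = feedback_agent(gen, task_description)
        history.append(gen)
    return history


history = run_loop("classify legal charges", "def solve(x): ...", ["q1", "q2"])
for n, gen in enumerate(history, start=1):
    print(n, gen.metrics, gen.improvement_notes)
```

In the real system the Feedback-Agent may also request a weight update; this sketch models only the scaffold path.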

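RL algorithm selection is driven by reward shape:

- PPO with GAE: dense step-level rewards (LawBench)
- GRPO: sparse rewards (RNA denoising)
- Entropic Advantage Weighting: right-skewed rewards (GPU kernel optimization)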
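One way to picture selection by reward shape is a small classifier over observed rewards. This is a sketch under stated assumptions, not SIA's selection logic: in SIA the decision is made by the Feedback-Agent, whereas here a fixed heuristic stands in for it. The function names and both thresholds are hypothetical. Only the pairing of shapes to algorithms comes from this article.

```python
from statistics import mean, pstdev


def reward_shape(rewards, sparse_zero_frac=0.8, skew_threshold=1.0):
    """Classify a non-empty list of rewards as dense, sparse, or right_skewed.

    Both thresholds are illustrative assumptions, not values from SIA.
    """
    zero_frac = sum(1 for r in rewards if r == 0) / len(rewards)
    if zero_frac >= sparse_zero_frac:
        return "sparse"
    mu, sigma = mean(rewards), pstdev(rewards)
    # Population skewness: the third standardized moment.
    skew = 0.0 if sigma == 0 else mean(((r - mu) / sigma) ** 3 for r in rewards)
    return "right_skewed" if skew > skew_threshold else "dense"


# Pairing of reward shape to algorithm, as described in this article.
ALGORITHM_FOR_SHAPE = {
    "dense": "PPO+GAE",
    "sparse": "GRPO",
    "right_skewed": "Entropic Advantage Weighting",
}


def select_algorithm(rewards):
    return ALGORITHM_FOR_SHAPE[reward_shape(rewards)]


print(select_algorithm([0.4, 0.5, 0.6, 0.5, 0.45]))  # dense rewards
print(select_algorithm([0] * 9 + [1]))               # mostly zero rewards
print(select_algorithm([0.1] * 9 + [5.0]))           # one large outlier
```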

SIA benchmark results, May 2026

Task	Baseline	Prior SOTA	SIA-H (harness only)	SIA-W+H (harness + weights)
LawBench (191-class accuracy)	13.5%	45.0%	50.0%	70.1% (+25.1 pp over SOTA)
TriMul CUDA kernel (μs, lower=better)	~13,500 μs	1,161 μs	1,017 μs	1,017 μs (−12.4% vs SOTA)
MAGIC scRNA-seq denoising (mse_norm, higher=better)	0.048	0.240	0.241	0.289 (+20.4% over SOTA)
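The deltas quoted in the results can be checked by hand. The short script below recomputes them from the table's own figures: LawBench is an absolute gain in percentage points, while the other two are relative changes against prior SOTA. The helper names are illustrative.

```python
def pp_gain(new_pct, old_pct):
    """Absolute gain in percentage points."""
    return new_pct - old_pct


def relative_change(new, old):
    """Relative change, expressed as a percentage of the old value."""
    return (new - old) / old * 100


print(f"LawBench: {pp_gain(70.1, 45.0):+.1f} pp")          # → +25.1 pp
print(f"TriMul:   {relative_change(1017, 1161):+.1f}%")    # → -12.4%
print(f"MAGIC:    {relative_change(0.289, 0.240):+.1f}%")  # → +20.4%
```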

"Harness changes and weight updates do not overlap in their effect space: harness iterations produce externalized infrastructure improvements — better parsing, tools, retry logic — while weight updates encode internalized domain knowledge that no prompt engineering alone can reach." — Hexo Labs research team,

SIA: Self Improving AI (arXiv:2605.27276v2)

The Claude backend runs entirely on CPU, so no local GPU is required. Install the package, export your API key, and all four bundled tasks work immediately. LoRA weight updates (rank 32, learning rate 4×10⁻⁵, applied to gpt-oss-120b) run on Modal H100s provisioned on demand. If you skip Modal entirely, the loop still runs harness-only iterations, which are cheaper and sufficient to see meaningful eval gains in early generations.
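To see why a rank-32 LoRA update is far lighter than full fine-tuning, it helps to count parameters for a single weight matrix. The layer size below is a hypothetical round number chosen for the arithmetic, not a dimension of gpt-oss-120b.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def full_param_count(d_in, d_out):
    """Parameters in the frozen weight matrix the adapter is attached to."""
    return d_in * d_out


d_in = d_out = 4096  # hypothetical layer size, for illustration only
rank = 32            # the rank quoted above

lora = lora_param_count(d_in, d_out, rank)
full = full_param_count(d_in, d_out)
print(f"LoRA params: {lora:,}")             # → LoRA params: 262,144
print(f"Full params: {full:,}")             # → Full params: 16,777,216
print(f"Fraction:    {lora / full:.4%}")    # → Fraction:    1.5625%
```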

Claude backend (all bundled tasks, no GPU needed):

pip install 'sia-agent[claude]'
export ANTHROPIC_API_KEY="sk-ant-..."

OpenHands backend (multi-provider task execution):

pip install 'sia-agent[openhands]'
export ANTHROPIC_API_KEY="..."
export GEMINI_API_KEY="..."
export OPENAI_API_KEY="..."

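Prerequisites at a glance:

- Claude backend: sia-agent[claude] and ANTHROPIC_API_KEY; no local GPU
- OpenHands backend: sia-agent[openhands], the provider keys above, and --backend openhands
- LoRA weight updates (optional): a Modal account with H100 credits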
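A small preflight check can catch a missing key before a run starts. This is a sketch, not part of SIA: the helper name is hypothetical, and treating all three provider keys as required for the OpenHands backend is an assumption based on the export lines above (your setup may need only a subset).

```python
import os

# Keys per backend, taken from the export lines above. Whether OpenHands
# needs all three at once is an assumption of this sketch.
REQUIRED_KEYS = {
    "claude": ["ANTHROPIC_API_KEY"],
    "openhands": ["ANTHROPIC_API_KEY", "GEMINI_API_KEY", "OPENAI_API_KEY"],
}


def missing_keys(backend, env=None):
    """Return the environment variables that are unset or empty for a backend."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS[backend] if not env.get(key)]


missing = missing_keys("claude")
if missing:
    print("Set these before running:", ", ".join(missing))
else:
    print("Claude backend keys are present.")
```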

Three commands take you from a clean environment to a live five-generation self-improving loop on the bundled LawBench task.

python3 -m venv .venv && source .venv/bin/activate
pip install 'sia-agent[claude]'
sia --task lawbench --max_gen 5 --run_id 1
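If you prefer to launch runs from a script, for example to vary the run ID, the same command can be assembled in Python. This sketch uses only the flags shown in the command above; the helper names are hypothetical, and run_sia simply returns None when the sia executable is not on PATH.

```python
import shlex
import shutil
import subprocess


def sia_command(task, max_gen=5, run_id=1):
    """Build the argument list for one run, using only the flags shown above."""
    return ["sia", "--task", task, "--max_gen", str(max_gen), "--run_id", str(run_id)]


def run_sia(cmd):
    """Run the command if its executable is on PATH; return the exit code or None."""
    if shutil.which(cmd[0]) is None:
        return None  # sia-agent is not installed in this environment
    return subprocess.run(cmd).returncode


cmd = sia_command("lawbench")
print(shlex.join(cmd))  # → sia --task lawbench --max_gen 5 --run_id 1
```

Call run_sia(cmd) to actually start the run; it is left out of the example so the snippet has no side effects.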

Each generation writes output to runs/run_1/gen_N/:

- target_agent.py: the evolved scaffold for this generation
- agent_execution.json: full execution log and per-step trajectory
- improvement.md: Feedback-Agent's rationale for the next change (appears from generation 2 onward)

All four bundled tasks run with --task <name>: gpqa, lawbench, longcot-chess, spaceship-titanic. Key flags to know:

- --max_gen: number of self-improvement generations (default: 3)
- --backend claude|openhands: selects the Claude or OpenHands backend
- --meta_model: model for Feedback/Meta agents (default: haiku)
- --task_model: model for the task-specific agent (default: claude-haiku-4-5-20251001)

The snippet below is a toy, runnable illustration of the core idea: a controller keeps a running reward signal for each available algorithm and switches when another one accumulates a better signal. It is not SIA's own code. It ran to completion (exit 0):

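import random

# Toy three-armed bandit. Two action-selection strategies compete, and a
# controller picks between them using a running feedback score.

def epsilon_greedy(scores, pulls, t):
    # Explore a random arm on every third step (t % 3 == 0); otherwise exploit.
    return max(scores, key=scores.get) if t % 3 else random.randrange(len(scores))

def ucb(scores, pulls, t):
    # Simplified optimism bonus that shrinks as an arm is pulled more often
    # (not the textbook UCB1 formula, which uses a log term).
    return max(scores, key=lambda a: scores[a] + (2 * (t + 1) / (pulls[a] + 1)) ** 0.5)

algorithms = {"epsilon_greedy": epsilon_greedy, "ucb": ucb}
scores = {0: 0.0, 1: 0.0, 2: 0.0}  # running mean reward per arm
pulls = {0: 0, 1: 0, 2: 0}  # pull count per arm
feedback = {name: 0.0 for name in algorithms}  # running reward per algorithm

random.seed(7)
for t in range(12):
    # The controller picks the algorithm with the best feedback so far.
    chosen_algo = max(feedback, key=feedback.get) if t else "epsilon_greedy"
    action = algorithms[chosen_algo](scores, pulls, t)
    reward = [0.15, 0.55, 0.8][action] + random.uniform(-0.08, 0.08)
    pulls[action] += 1
    scores[action] += (reward - scores[action]) / pulls[action]  # incremental mean
    # Exponential moving average of the reward credited to the chosen algorithm.
    feedback[chosen_algo] = 0.7 * feedback[chosen_algo] + 0.3 * reward

    if t == 5:
        feedback["ucb"] += 0.5  # external boost that can change the controller's choice

    print(f"step={t:02d} sia_selected={chosen_algo:15s} action={action} reward={reward:.2f}")

print("Takeaway: you provide feedback; SIA's loop chooses the RL algorithm.")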

Watch the step after the boost: the extra feedback added to ucb at step 5 can flip the controller's choice at the next decision point (step 06), provided the running feedback for epsilon_greedy is below the boosted value. SIA's Feedback-Agent applies the same idea at generation granularity: accumulated reward signals reshape algorithm selection each generation, not just each step.

To run SIA on your own benchmark, create a directory with this minimum structure and point --task_dir at it:

my-task/
├── data/
│   ├── public/
│   │   ├── task.md          # scoring function + evaluation loop
│   │   └── ...
│   └── private/             # held-out answers (never in scaffold context)
└── reference/
    ├── reference_target_agent.py   # working baseline for Meta-Agent
    └── SAMPLE_TASK_DESCRIPTIONS.md

sia --task_dir ./my-task --max_gen 5 --run_id 1
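The layout above can be stubbed out with a few lines of Python so you start from the right directory shape. This is a convenience sketch, not a tool shipped with SIA: the helper name is hypothetical, and it writes empty placeholder files that you still need to fill in.

```python
import tempfile
from pathlib import Path

# Files from the minimum layout shown above; data/private/ is created empty.
LAYOUT_FILES = [
    "data/public/task.md",
    "reference/reference_target_agent.py",
    "reference/SAMPLE_TASK_DESCRIPTIONS.md",
]


def scaffold_task(root):
    """Create the minimum task layout under root; return the files newly created."""
    root = Path(root)
    (root / "data" / "private").mkdir(parents=True, exist_ok=True)
    created = []
    for rel in LAYOUT_FILES:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        if not path.exists():  # never overwrite existing work
            path.write_text("")
            created.append(path)
    return created


# Demonstrate in a throwaway directory; point it at ./my-task for real use.
with tempfile.TemporaryDirectory() as tmp:
    demo_created = [p.relative_to(tmp).as_posix() for p in scaffold_task(Path(tmp) / "my-task")]
    print(demo_created)
```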

Three things worth knowing about this layout:

- task.md defines the scoring function and evaluation loop. It tells SIA what a correct answer looks like, and it is the primary lever for guiding the Feedback loop.
- reference_target_agent.py gives the Meta-Agent a working starting point. Omit it and the Meta-Agent generates a scaffold from scratch, which is viable but slower and lower quality on the first generation.
- data/private/ stays outside the scaffold's context window at all times. Only the public task description is visible to the running agent, which prevents eval-set contamination.

Patterns that appear reliably in early runs, and what to do about them:

- If improvement.md starts repeating the same edits verbatim, switch to --meta_model claude-sonnet-4-5-20250929. Sonnet produces richer harness rewrites and more substantive RL algorithm reasoning at higher cost per generation.
- Check agent_execution.json for trajectory length before pushing --max_gen beyond 10. Trajectory length is the main driver of per-generation wall time.

For independent analysis of SIA's architecture and benchmark methodology, see the MarkTechPost writeup and the Moonlight review.

You do not need a GPU for harness-only runs. Harness edits run entirely on CPU via the Claude API: install sia-agent[claude], export ANTHROPIC_API_KEY, and run. LoRA weight updates require a Modal account with H100 credits. Skip weight updates entirely by not configuring Modal; the loop still runs and improves the scaffold across generations at no GPU cost.

On LawBench, the Feedback loop selects PPO with GAE. LawBench produces dense step-level rewards, and the loop consistently selects PPO for tasks with that reward structure. GRPO and Entropic Advantage Weighting appear on tasks with sparse or right-skewed reward distributions: RNA denoising and GPU kernel optimization, respectively.
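For readers who have not met these algorithms, the sketch below shows the two advantage estimators in their textbook form: Generalized Advantage Estimation, which suits dense per-step rewards, and the group-normalized advantage used by GRPO, which needs only one scalar reward per sampled completion. These are standard definitions written for illustration, not SIA's training code; the use of the population standard deviation is an assumption, and Entropic Advantage Weighting is left out because this article does not define it.

```python
from statistics import mean, pstdev


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    values must have len(rewards) + 1 entries; the last one is the value
    estimate of the state after the final step (0.0 if terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantage: each reward standardized within its group."""
    mu, sigma = mean(group_rewards), pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]


# Dense rewards: every step carries signal, so per-step advantages differ.
print([round(a, 3) for a in gae_advantages([0.2, 0.1, 0.4], [0.3, 0.3, 0.3, 0.0])])

# Sparse rewards: one 0/1 outcome per sampled answer to the same prompt.
print([round(a, 3) for a in grpo_advantages([1.0, 0.0, 0.0, 1.0])])
```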

Swapping in a different base model does not work out of the box. The LoRA RL loop targets gpt-oss-120b by default. Substituting a different base requires editing the run config and ensuring Modal can load those weights. The MIT license keeps the door open for community contributions supporting alternative bases.

To judge whether a generation helped, read runs/run_{id}/gen_{n}/improvement.md for the Feedback loop's rationale for that generation, and compare eval scores in agent_execution.json across generation directories. Flat scores paired with shallow or repetitive improvement notes are the signal to switch to --meta_model sonnet or enable weight updates.
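Comparing scores across generation directories is easy to script. The sketch below walks the gen_N folders in numeric order and pulls one field from each agent_execution.json. The schema is an assumption: this article does not document the file's structure, so the top-level "accuracy" key is a hypothetical placeholder that you should replace after inspecting your own logs.

```python
import json
import tempfile
from pathlib import Path


def scores_by_generation(run_dir, score_key="accuracy"):
    """Map each gen_N directory name to the score stored in its execution log.

    score_key is hypothetical; check agent_execution.json for the real field.
    """
    gen_dirs = [p for p in Path(run_dir).glob("gen_*") if p.name[4:].isdigit()]
    scores = {}
    for gen_dir in sorted(gen_dirs, key=lambda p: int(p.name[4:])):
        log_path = gen_dir / "agent_execution.json"
        if not log_path.exists():
            continue  # generation still running, or it failed before logging
        data = json.loads(log_path.read_text())
        scores[gen_dir.name] = data.get(score_key)
    return scores


# Demonstrate against fake logs in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for n, acc in [(1, 0.14), (2, 0.31), (10, 0.50)]:
        gen_dir = Path(tmp) / f"gen_{n}"
        gen_dir.mkdir()
        (gen_dir / "agent_execution.json").write_text(json.dumps({"accuracy": acc}))
    demo_scores = scores_by_generation(tmp)
    print(demo_scores)
```

Sorting on the integer suffix keeps gen_10 after gen_2, which a plain string sort would get wrong.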

The meta model defaults to haiku for cost and latency. Haiku is cheap enough to run across many generations without API costs dominating the experiment budget. Override with --meta_model claude-sonnet-4-5-20250929 when you need richer harness rewrites or more substantive RL algorithm reasoning, typically after generation 3 or 4 when haiku's improvement reports start repeating themselves.
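Spotting repeated improvement reports does not have to be done by eye. One simple approach, sketched here with the standard library's difflib, is to compare the text of the two most recent reports and flag them when they are nearly identical. The function names and the 0.9 threshold are assumptions for illustration; tune the threshold on your own runs.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio in [0, 1]; 1.0 means the two texts are identical."""
    return SequenceMatcher(None, a, b).ratio()


def is_repeating(reports, threshold=0.9):
    """True when the two most recent reports are near-duplicates."""
    if len(reports) < 2:
        return False
    return similarity(reports[-2], reports[-1]) >= threshold


reports = [
    "Add retry logic around the answer parser.",
    "Tighten the output format and add a validation step.",
    "Tighten the output format and add a validation step.",
]
print(is_repeating(reports[:2]))  # → False
print(is_repeating(reports))      # → True
```

In practice you would load each generation's improvement.md into the list before calling is_repeating.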

Start with a harness-only run on a bundled task (gpqa or lawbench) to calibrate generation cost and see what improvement.md looks like before enabling Modal. The harness-only variant already reaches 50.0% on LawBench against a 13.5% baseline, so it is worth knowing your harness ceiling before spending GPU time on weight updates.

Once harness gains plateau (flat scores for 2–3 consecutive generations), enable weight updates and compare SIA-H vs SIA-W+H performance directly. For custom domains, invest time in task.md first: a well-specified verifier is what gives the Feedback loop a meaningful signal. A weak or noisy scoring function limits how far either harness edits or weight updates can go, regardless of how many generations you run.
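The plateau rule can be made concrete with a small check over per-generation scores. This is one possible reading of "flat scores for 2–3 consecutive generations", not a rule taken from SIA: the function name, the window of 3, and the minimum gain of 0.005 are all assumptions. It also assumes a higher-is-better metric; negate the scores first for metrics such as kernel runtime.

```python
def has_plateaued(scores, window=3, min_gain=0.005):
    """True when the last `window` generations fail to beat the earlier best.

    Compares the best score inside the recent window against the best score
    before it, and reports a plateau if the improvement is below min_gain.
    """
    if len(scores) <= window:
        return False  # not enough history to judge
    best_before = max(scores[:-window])
    best_recent = max(scores[-window:])
    return best_recent - best_before < min_gain


improving = [0.135, 0.22, 0.31, 0.40, 0.47]
flat = [0.135, 0.30, 0.45, 0.50, 0.50, 0.501, 0.50]
print(has_plateaued(improving))  # → False
print(has_plateaued(flat))       # → True
```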

Full paper: arXiv:2605.27276. Code, task authoring guide, and bundled tasks: github.com/hexo-ai/sia. Background on Hexo Labs' research program (Stanford, UC Santa Barbara, Oxford partnerships): tFiR interview with Hexo Labs.

Last updated: 2026-06-01. Article reflects SIA arXiv:2605.27276v2, revised May 28, 2026.

source & further reading

dev.to: original article