You don't pick the RL algorithm: SIA's feedback loop does

Hexo Labs released SIA (Self Improving AI), the first open-source framework that co-evolves both an agent's scaffold and its model weights in a single iterative loop. On the LawBench task, SIA achieved 70.1% accuracy, surpassing the prior state of the art by 25.1 percentage points. The framework automatically selects the reinforcement learning algorithm based on reward shape, eliminating the need for manual algorithm choice.

SIA was released on May 26, 2026 under the MIT license; the code is on [github.com/hexo-ai/sia](https://github.com/hexo-ai/sia). This tutorial walks through the feedback loop logic, prerequisites, and a runnable five-generation LawBench experiment.

SIA's Feedback-Agent reads full execution trajectories, reward metrics, and task descriptions each generation, then decides whether the next step should be a scaffold edit, a LoRA weight update, or both. It also selects the RL algorithm automatically, based on the reward shape of the current task. Before SIA, harness-update systems (Darwin Gödel Machine, Hyperagents) and test-time training systems (TTRL, Discover-TTT) were entirely separate research directions. SIA is the first framework to combine both levers in a single self-improving loop, per the SIA paper ([arXiv:2605.27276](https://arxiv.org/abs/2605.27276)).

Quick Answer: SIA (arXiv:2605.27276, MIT license, May 2026) co-evolves agent scaffold and LoRA weights in a single loop. Run `sia --task lawbench --max_gen 5`; the Feedback-Agent picks PPO+GAE, GRPO, or Entropic Advantage Weighting based on reward shape, so no manual RL algorithm choice is required. On LawBench, the combined harness+weights variant reached 70.1% accuracy, 25.1 percentage points over prior SOTA.
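The headline deltas mix two kinds of comparison: the LawBench gain is quoted in percentage points (an absolute difference), while the TriMul and MAGIC figures in the benchmark table are relative changes. A minimal Python check, using only the prior-SOTA and SIA-W+H values this article reports, shows how each number is derived. The dictionary and function names are illustrative and not part of SIA.

```python
# Figures as reported in this article's benchmark table:
# (prior SOTA, SIA-W+H). Names here are illustrative, not SIA identifiers.
REPORTED = {
    "LawBench accuracy (%)": (45.0, 70.1),
    "TriMul kernel time (us)": (1161.0, 1017.0),
    "MAGIC mse_norm": (0.240, 0.289),
}


def absolute_delta(prior: float, new: float) -> float:
    """Difference in the metric's own units (percentage points for accuracy)."""
    return new - prior


def relative_change_pct(prior: float, new: float) -> float:
    """Change relative to the prior value, expressed in percent."""
    return (new - prior) / prior * 100.0


if __name__ == "__main__":
    for name, (prior, new) in REPORTED.items():
        print(
            f"{name}: absolute {absolute_delta(prior, new):+.3f}, "
            f"relative {relative_change_pct(prior, new):+.1f}%"
        )
```

Read this way, the 25.1-point LawBench gain is an absolute difference in accuracy, which corresponds to roughly a 56% relative improvement over the prior SOTA figure.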
The three-agent loop:

- Meta-Agent generates the initial scaffold from a task description and reference implementation.
- Task-Specific Agent executes against the eval dataset in a sandbox, with every step logged as a trajectory.
- Feedback-Agent (Claude Sonnet 4.6) receives source code, trajectories, metrics, and sample task descriptions, then emits `improvement.md` and the next-generation agent.

RL algorithm selection is driven by reward shape.

SIA benchmark results, May 2026

| Task | Baseline | Prior SOTA | SIA-H (harness only) | SIA-W+H (harness + weights) |
|---|---|---|---|---|
| LawBench (191-class accuracy) | 13.5% | 45.0% | 50.0% | 70.1% (+25.1 pp over SOTA) |
| TriMul CUDA kernel (μs, lower = better) | ~13,500 μs | 1,161 μs | 1,017 μs | 1,017 μs (−12.4% vs SOTA) |
| MAGIC scRNA-seq denoising (`mse_norm`, higher = better) | 0.048 | 0.240 | 0.241 | 0.289 (+20.4% over SOTA) |

> "Harness changes and weight updates do not overlap in their effect space: harness iterations produce externalized infrastructure improvements (better parsing, tools, retry logic) while weight updates encode internalized domain knowledge that no prompt engineering alone can reach."
>
> Hexo Labs research team, SIA: Self Improving AI (arXiv:2605.27276v2)

The Claude backend runs entirely on CPU, with no local GPU required. Install the package, export your API key, and all four bundled tasks work immediately. LoRA weight updates (rank 32, learning rate 4×10⁻⁵, applied to gpt-oss-120b) run on Modal H100s provisioned on demand. Skip Modal entirely and the loop still runs harness-only iterations, which are cheaper and sufficient to see meaningful eval gains in early generations.

Claude backend (all bundled tasks, no GPU needed):

```bash
pip install 'sia-agent[claude]'
export ANTHROPIC_API_KEY="sk-ant-..."
```

OpenHands backend (multi-provider task execution):

```bash
pip install 'sia-agent[openhands]'
export ANTHROPIC_API_KEY="..."
export GEMINI_API_KEY="..."
export OPENAI_API_KEY="..."
```
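Before starting a run, it can help to confirm that the keys exported in the install snippets above are actually visible to the process. The following is a small preflight sketch in Python, not part of SIA: `REQUIRED_KEYS` and `missing_keys` are hypothetical names, and treating all three OpenHands keys as mandatory is an assumption, since the source only shows them being exported together.

```python
import os
from typing import Mapping

# Keys per backend, taken from the export lines in the install snippets.
# Assumption: the OpenHands backend needs all three; the source does not
# say whether a subset is enough.
REQUIRED_KEYS = {
    "claude": ("ANTHROPIC_API_KEY",),
    "openhands": ("ANTHROPIC_API_KEY", "GEMINI_API_KEY", "OPENAI_API_KEY"),
}


def missing_keys(backend: str, env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the variables for `backend` that are unset or empty in `env`."""
    if backend not in REQUIRED_KEYS:
        raise ValueError(f"unknown backend: {backend!r}")
    return [key for key in REQUIRED_KEYS[backend] if not env.get(key)]


if __name__ == "__main__":
    for backend in REQUIRED_KEYS:
        gaps = missing_keys(backend)
        status = "ready" if not gaps else "missing " + ", ".join(gaps)
        print(f"{backend}: {status}")
```

Passing the environment as a parameter keeps the check easy to exercise with a plain dictionary, without touching the real shell environment.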
Prerequisites at a glance: `--backend openhands`

Three commands take you from a clean environment to a live five-generation self-improving loop on the bundled LawBench task.

```bash
python3 -m venv .venv && source .venv/bin/activate
pip install 'sia-agent[claude]'
sia --task lawbench --max_gen 5 --run_id 1
```

Each generation writes output to `runs/run_1/gen_N/`:

- `target_agent.py`: the evolved scaffold for this generation
- `agent_execution.json`: full execution log and per-step trajectory
- `improvement.md`: Feedback-Agent's rationale for the next change (appears from generation 2 onward)

All four bundled tasks run with --task