{"slug": "you-don-t-pick-the-rl-algorithm-sia-s-feedback-loop-does", "title": "You don't pick the RL algorithm — SIA's Feedback loop does", "summary": "Hexo Labs released SIA (Self Improving AI), the first open-source framework that co-evolves both an agent's scaffold and its model weights in a single iterative loop. On the LawBench task, SIA achieved 70.1% accuracy, surpassing the prior state-of-the-art by 25.1 percentage points. The framework automatically selects the reinforcement learning algorithm based on reward shape, eliminating the need for manual algorithm choice.", "body_md": "SIA (Self Improving AI), released by Hexo Labs on May 26, 2026, is the first open-source framework that co-evolves both an agent's scaffold and its model weights inside a single iterative loop. The MIT-licensed code is on [github.com/hexo-ai/sia](https://github.com/hexo-ai/sia). This tutorial walks through the feedback loop logic, prerequisites, and a runnable five-generation LawBench experiment.\n\nSIA's Feedback-Agent reads full execution trajectories, reward metrics, and task descriptions each generation, then decides whether the next step should be a scaffold edit, a LoRA weight update, or both — and selects the RL algorithm automatically based on the reward shape of the current task. Before SIA, harness-update systems (Darwin Gödel Machine, Hyperagents) and test-time training systems (TTRL, Discover-TTT) were entirely separate research directions. SIA is the first framework to combine both levers in a single self-improving loop, per the [SIA paper (arXiv:2605.27276)](https://arxiv.org/abs/2605.27276).\n\n**Quick Answer:** SIA (arXiv:2605.27276, MIT license, May 2026) co-evolves agent scaffold and LoRA weights in a single loop. Run `sia --task lawbench --max_gen 5`; the Feedback-Agent picks PPO+GAE, GRPO, or Entropic Advantage Weighting based on reward shape — no RL algorithm choice required. 
On LawBench, the combined harness+weights variant reached 70.1% accuracy, 25.1 percentage points over prior SOTA.\n\nThe three-agent loop: **Meta-Agent** generates the initial scaffold from a task description and reference implementation; **Task-Specific Agent** executes against the eval dataset in a sandbox with every step logged as a trajectory; **Feedback-Agent** (Claude Sonnet 4.6) receives source code, trajectories, metrics, and sample task descriptions, then emits `improvement.md` and the next-generation agent.\n\nRL algorithm selection is driven by reward shape: PPO with GAE for dense step-level rewards, and GRPO or Entropic Advantage Weighting for sparse or right-skewed reward distributions.\n\nSIA benchmark results, May 2026\n\n| Task | Baseline | Prior SOTA | SIA-H (harness only) | SIA-W+H (harness + weights) |\n|---|---|---|---|---|\n| LawBench (191-class accuracy) | 13.5% | 45.0% | 50.0% | 70.1% (+25.1 pp over SOTA) |\n| TriMul CUDA kernel (μs, lower=better) | ~13,500 μs | 1,161 μs | 1,017 μs | 1,017 μs (−12.4% vs SOTA) |\n| MAGIC scRNA-seq denoising (mse_norm, higher=better) | 0.048 | 0.240 | 0.241 | 0.289 (+20.4% over SOTA) |\n\n\"Harness changes and weight updates do not overlap in their effect space: harness iterations produce externalized infrastructure improvements — better parsing, tools, retry logic — while weight updates encode internalized domain knowledge that no prompt engineering alone can reach.\" — Hexo Labs research team, [SIA: Self Improving AI (arXiv:2605.27276v2)](https://arxiv.org/abs/2605.27276)\n\nThe Claude backend runs entirely on CPU — no local GPU required. Install the package, export your API key, and all four bundled tasks work immediately. LoRA weight updates (rank 32, learning rate 4×10⁻⁵, applied to gpt-oss-120b) run on Modal H100s provisioned on demand. 
Skip Modal entirely and the loop still runs harness-only iterations — cheaper and sufficient to see meaningful eval gains in early generations.\n\n**Claude backend (all bundled tasks, no GPU needed):**\n\n```\npip install 'sia-agent[claude]'\nexport ANTHROPIC_API_KEY=\"sk-ant-...\"\n```\n\n**OpenHands backend (multi-provider task execution):**\n\n```\npip install 'sia-agent[openhands]'\nexport ANTHROPIC_API_KEY=\"...\"\nexport GEMINI_API_KEY=\"...\"\nexport OPENAI_API_KEY=\"...\"\n```\n\nPrerequisites at a glance: the Claude backend needs only `ANTHROPIC_API_KEY`, while the OpenHands backend is selected with `--backend openhands` and uses the provider keys exported above.\n\nThree commands take you from a clean environment to a live five-generation self-improving loop on the bundled LawBench task.\n\n```\npython3 -m venv .venv && source .venv/bin/activate\npip install 'sia-agent[claude]'\nsia --task lawbench --max_gen 5 --run_id 1\n```\n\nEach generation writes output to `runs/run_1/gen_N/`:\n\n- `target_agent.py` — the evolved scaffold for this generation\n- `agent_execution.json` — full execution log and per-step trajectory\n- `improvement.md` — Feedback-Agent's rationale for the next change (appears from generation 2 onward)\n\nAll four bundled tasks run with `--task <name>`: `gpqa`, `lawbench`, `longcot-chess`, `spaceship-titanic`. Key flags to know:\n\n- `--max_gen` — number of self-improvement generations (default: 3)\n- `--backend claude|openhands`\n- `--meta_model` — model for Feedback/Meta agents (default: `haiku`)\n- `--task_model` — model for the task-specific agent (default: `claude-haiku-4-5-20251001`)\n\nThe snippet below is a toy illustration of the selection idea rather than SIA's own code: a controller keeps a live reward signal for each available algorithm and switches when another accumulates a better one. 
This code ran to completion (exit 0):\n\n```python\nimport random\n\ndef epsilon_greedy(scores, pulls, t):\n    # Simplified stand-in for epsilon-greedy: explore on every third step, otherwise exploit.\n    return max(scores, key=scores.get) if t % 3 else random.randrange(3)\n\ndef ucb(scores, pulls, t):\n    # UCB-style rule: running mean plus an exploration bonus that shrinks as an action is pulled.\n    return max(scores, key=lambda a: scores[a] + (2 * (t + 1) / (pulls[a] + 1)) ** 0.5)\n\nalgorithms = {\"epsilon_greedy\": epsilon_greedy, \"ucb\": ucb}\nscores = {0: 0.0, 1: 0.0, 2: 0.0}  # running mean reward per action\npulls = {0: 0, 1: 0, 2: 0}  # number of times each action was taken\nfeedback = {name: 0.0 for name in algorithms}  # smoothed reward credited to each algorithm\n\nrandom.seed(7)\nfor t in range(12):\n    # Toy stand-in for SIA's feedback loop: pick the algorithm with the best live reward signal.\n    chosen_algo = max(feedback, key=feedback.get) if t else \"epsilon_greedy\"\n    action = algorithms[chosen_algo](scores, pulls, t)\n    reward = [0.15, 0.55, 0.8][action] + random.uniform(-0.08, 0.08)\n    pulls[action] += 1\n    scores[action] += (reward - scores[action]) / pulls[action]  # incremental mean\n    feedback[chosen_algo] = 0.7 * feedback[chosen_algo] + 0.3 * reward  # exponential moving average\n\n    if t == 5:\n        feedback[\"ucb\"] += 0.5  # new feedback changes the controller's choice\n\n    print(f\"step={t:02d} sia_selected={chosen_algo:15s} action={action} reward={reward:.2f}\")\n\nprint(\"Takeaway: you provide feedback; SIA's loop chooses the RL algorithm.\")\n```\n\nWatch step 07: a feedback boost applied to `ucb` at step 5 causes the controller to switch algorithms at the next decision point. 
SIA's Feedback-Agent applies the same logic at generation granularity — accumulated reward signals reshape algorithm selection each generation, not just each step.\n\nTo run SIA on your own benchmark, create a directory with this minimum structure and point `--task_dir` at it:\n\n```\nmy-task/\n├── data/\n│   ├── public/\n│   │   ├── task.md          # scoring function + evaluation loop\n│   │   └── ...\n│   └── private/             # held-out answers (never in scaffold context)\n└── reference/\n    ├── reference_target_agent.py   # working baseline for Meta-Agent\n    └── SAMPLE_TASK_DESCRIPTIONS.md\n\nsia --task_dir ./my-task --max_gen 5 --run_id 1\n```\n\nThree things worth knowing about this layout:\n\n- `task.md` defines the scoring function and evaluation loop — this is what tells SIA what a correct answer looks like, and it is the primary lever for guiding the Feedback loop.\n- `reference_target_agent.py` gives the Meta-Agent a working starting point. Omit it and the Meta-Agent generates a scaffold from scratch — viable, but slower and lower quality on the first generation.\n- `data/private/` stays outside the scaffold's context window at all times. Only the public task description is visible to the running agent — no eval-set contamination.\n\nPatterns that appear reliably in early runs, and what to do about them:\n\n- If `improvement.md` starts repeating the same edits verbatim, switch to `--meta_model claude-sonnet-4-5-20251001`. Sonnet produces richer harness rewrites and more substantive RL algorithm reasoning at higher cost per generation.\n- Check `agent_execution.json` for trajectory length before pushing `--max_gen` beyond 10. 
Trajectory length is the main driver of per-generation wall time.\n\nFor independent analysis of SIA's architecture and benchmark methodology, see the [MarkTechPost writeup](https://www.marktechpost.com/2026/05/29/hexo-labs-open-sources-sia-a-self-improving-agent-that-updates-both-the-harness-and-the-model-weights/) and the [Moonlight review](https://www.themoonlight.io/en/review/sia-self-improving-ai-with-harness-weight-updates).\n\nNo. Harness edits run entirely on CPU via the Claude API — install `sia-agent[claude]`, export `ANTHROPIC_API_KEY`, and run. LoRA weight updates require a Modal account with H100 credits. Skip weight updates entirely by not configuring Modal; the loop still runs and improves the scaffold across generations at no GPU cost.\n\nPPO with GAE. LawBench produces dense step-level rewards, and the Feedback loop consistently selects PPO for tasks with that reward structure. GRPO and Entropic Advantage Weighting appear on tasks with sparse or right-skewed reward distributions — RNA denoising and GPU kernel optimization respectively.\n\nNot out-of-the-box. The LoRA RL loop targets gpt-oss-120b by default. Substituting a different base requires editing the run config and ensuring Modal can load those weights. The MIT license keeps the door open for community contributions supporting alternative bases.\n\nRead `runs/run_{id}/gen_{n}/improvement.md` for the Feedback loop's rationale for that generation. Compare eval scores in `agent_execution.json` across generation directories. Flat scores paired with shallow or repetitive improvement notes are the signal to switch to `--meta_model sonnet` or enable weight updates.\n\nCost and latency. Haiku is cheap enough to run across many generations without API costs dominating the experiment budget. 
Override with `--meta_model claude-sonnet-4-5-20251001` when you need richer harness rewrites or more substantive RL algorithm reasoning — typically after generation 3 or 4 when haiku's improvement reports start repeating themselves.\n\nStart with a harness-only run on a bundled task — `gpqa` or `lawbench` — to calibrate generation cost and see what `improvement.md` looks like before enabling Modal. The harness-only variant already reaches 50.0% on LawBench against a 13.5% baseline, so it is worth knowing your harness ceiling before spending GPU time on weight updates.\n\nOnce harness gains plateau — flat scores for 2–3 consecutive generations — enable weight updates and compare `SIA-H` vs `SIA-W+H` performance directly. For custom domains, invest time in `task.md` first: a well-specified verifier is what gives the Feedback loop a meaningful signal. A weak or noisy scoring function limits how far either harness edits or weight updates can go, regardless of how many generations you run.\n\nFull paper: [arXiv:2605.27276](https://arxiv.org/abs/2605.27276). Code, task authoring guide, and bundled tasks: [github.com/hexo-ai/sia](https://github.com/hexo-ai/sia). Background on Hexo Labs' research program (Stanford, UC Santa Barbara, Oxford partnerships): [tFiR interview with Hexo Labs](https://tfir.io/self-improving-ai-sia-hexo-labs-kunal-bhatia/).\n\n*Last updated: 2026-06-01. 
Article reflects SIA arXiv:2605.27276v2, revised May 28, 2026.*", "url": "https://wpnews.pro/news/you-don-t-pick-the-rl-algorithm-sia-s-feedback-loop-does", "canonical_source": "https://dev.to/creeta/you-dont-pick-the-rl-algorithm-sias-feedback-loop-does-48ki", "published_at": "2026-06-18 10:37:33+00:00", "updated_at": "2026-06-18 10:51:36.283757+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "ai-research", "ai-agents"], "entities": ["Hexo Labs", "SIA", "LawBench", "Claude Sonnet 4.6", "Modal", "gpt-oss-120b", "arXiv", "MIT"], "alternates": {"html": "https://wpnews.pro/news/you-don-t-pick-the-rl-algorithm-sia-s-feedback-loop-does", "markdown": "https://wpnews.pro/news/you-don-t-pick-the-rl-algorithm-sia-s-feedback-loop-does.md", "text": "https://wpnews.pro/news/you-don-t-pick-the-rl-algorithm-sia-s-feedback-loop-does.txt", "jsonld": "https://wpnews.pro/news/you-don-t-pick-the-rl-algorithm-sia-s-feedback-loop-does.jsonld"}}