AI agent that updates its harness and model weights at inference time

Researchers from the SIA project have released a self-improving AI framework that enables language-model agents to autonomously update both their operational harness and model weights during inference. The system, detailed in a 2026 paper by Hebbar and colleagues, achieved a 56.6% accuracy gain on LawBench, a 91.9% runtime reduction on GPU kernel optimization, and a 502% improvement on single-cell RNA denoising compared to baseline methods.

Official implementation of SIA: Self Improving AI with Harness & Weight Updates (https://arxiv.org/abs/2605.27276, Hebbar et al., 2026): a self-improving loop where a language-model agent updates both the harness and the weights of a task-specific agent. The paper reports a 56.6% gain on LawBench, a 91.9% runtime reduction on GPU kernels, and a 502% improvement on single-cell RNA denoising over baseline.

SIA is a Self Improving AI framework that autonomously improves the performance of any AI system (model or agent) on a benchmark task.

Just want to try it? Skip to Run SIA locally.

Figure: Control flow between Meta, Target, and Feedback agents over successive generations.

SIA operates by coordinating three main types of AI agents that work together to continuously improve task performance:

- Meta-Agent: Reads the task description and generates an initial Target Agent tailored to the task.
- Target / Task-Specific Agent: Attempts to complete the task and records its actions and results.
- Feedback / Improvement Agent: Reviews the Target Agent's performance logs, identifies improvements, and updates the Target Agent accordingly.

This iterative process allows the system to autonomously refine and enhance its ability to solve scientific tasks.

Benchmark results reported for SIA:

- OpenAI MLE-Bench Hard: a gauntlet of real Kaggle ML competitions where agents must write, run, and iterate full ML pipelines. SIA ranks first across all generations tested.
- LawBench: predict the criminal charge from Chinese court case descriptions across 191 charge categories. SIA-W+H reaches 70.1% Top-1 accuracy, beating the prior SOTA of 45%.
- AlphaFold-3 TriMul Triton Kernel: implement and optimize the Triangle Multiplicative Update as a Triton kernel, preserving correctness while hitting H100 latency targets. SIA-W+H achieves a 14x speedup over baseline.
- scRNA-seq Denoising: impute missing gene expression values in single-cell RNA sequencing data. SIA-W+H scores 0.289 MSE norm, surpassing the prior SOTA of 0.220.
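To make the Meta / Target / Feedback control flow concrete, here is a minimal toy sketch of a generation loop. It is not SIA's implementation: every name in it (`TargetAgent`, `ExecutionLog`, `meta_agent`, `run_target`, `feedback_agent`, `run_loop`, `HIDDEN_VALUE`) is hypothetical, the "agent" is reduced to a single numeric guess, and no LLM calls, harness edits, or weight updates are involved. It only illustrates the order in which the three roles hand off to each other.

```python
from dataclasses import dataclass


@dataclass
class TargetAgent:
    """Hypothetical stand-in for a generated task-specific agent."""
    generation: int
    guess: float  # stands in for the agent's harness and weights


@dataclass
class ExecutionLog:
    """Hypothetical record of one generation's run."""
    generation: int
    guess: float
    error: float  # signed: hidden value minus guess
    score: float  # higher is better


HIDDEN_VALUE = 8.0  # toy task: get the guess as close to this as possible


def meta_agent(task_description: str) -> TargetAgent:
    # Meta role: read the task and emit an initial target agent.
    return TargetAgent(generation=1, guess=0.0)


def run_target(agent: TargetAgent) -> ExecutionLog:
    # Target role: attempt the task and record what happened.
    error = HIDDEN_VALUE - agent.guess
    return ExecutionLog(agent.generation, agent.guess, error, -abs(error))


def feedback_agent(agent: TargetAgent, log: ExecutionLog) -> TargetAgent:
    # Feedback role: read the log and produce the next-generation agent.
    return TargetAgent(agent.generation + 1, agent.guess + 0.5 * log.error)


def run_loop(task_description: str, max_gen: int) -> list[ExecutionLog]:
    agent = meta_agent(task_description)
    history = []
    for _ in range(max_gen):
        log = run_target(agent)
        history.append(log)
        agent = feedback_agent(agent, log)
    return history


if __name__ == "__main__":
    for log in run_loop("Match the hidden value.", max_gen=5):
        print(log.generation, log.guess, log.score)
```

In this toy, each generation halves the remaining error, so the score improves monotonically. In a real run the update step would be whatever the feedback agent decides after reading the execution logs.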
Run SIA locally

SIA ships with four built-in tasks: `gpqa`, `lawbench`, `longcot-chess`, `spaceship-titanic`.

Pick the agent backend that matches the LLMs you want to run.

Claude backend (Claude Agent SDK, Claude models only):

    python3 -m venv .venv && source .venv/bin/activate
    pip install 'sia-agent[claude]'
    export ANTHROPIC_API_KEY="..."

OpenHands backend (multi-provider: Gemini, OpenAI, Anthropic, etc.):

    python3 -m venv .venv && source .venv/bin/activate
    pip install 'sia-agent[openhands]'

Export the key(s) for the provider(s) you'll use:

    export ANTHROPIC_API_KEY="..."   # for anthropic/ models
    export GEMINI_API_KEY="..."      # for gemini/ models (or GOOGLE_API_KEY)
    export OPENAI_API_KEY="..."      # for openai/ models

Full provider/model reference: [docs/configuration.md](/hexo-ai/sia/blob/main/docs/configuration.md#api-keys).

Run a bundled task:

    sia --task gpqa --max_gen 5 --run_id 1

Swap `--task` for any of the four bundled tasks. Artifacts land in `runs/run_{run_id}/gen_{n}/`:

- `target_agent.py`: the agent for that generation
- `agent_execution.json`: execution logs
- `improvement.md`: diff rationale (gen 2+)

| Flag | Default | Description |
|---|---|---|
| `--task` | - | Bundled task name (mutually exclusive with `--task_dir`) |
| `--task_dir` | - | Path to an external task directory |
| `--max_gen` | 3 | Number of self-improvement generations |
| `--run_id` | 1 | Unique run identifier |
| `--backend` | claude | `claude` (Claude Agent SDK) or `openhands` (multi-provider) |
| `--meta_model` | haiku | Meta/feedback model (e.g. `haiku`, `sonnet`, `opus`, or `gemini/...`, `openai/...` with `openhands`) |
| `--task_model` | claude-haiku-4-5-20251001 | Target agent model |

Full backend, model, and API-key reference: [docs/configuration.md](/hexo-ai/sia/blob/main/docs/configuration.md). Hit a snag? See [docs/troubleshooting.md](/hexo-ai/sia/blob/main/docs/troubleshooting.md).

Prepare a task directory with the layout below and point `--task_dir` at it:

    my-task/
    ├── data/
    │   ├── public/
    │   │   ├── task.md                      # Task description; SIA reads this
    │   │   └── ...                          # Inputs the agent is allowed to see
    │   └── private/                         # Held-out eval data; never exposed to the agent
    └── reference/
        ├── reference_target_agent.py        # Template; copy from sia/tasks/_shared/
        └── SAMPLE_TASK_DESCRIPTIONS.md      # Optional: example tasks for the meta-agent

    sia --task_dir ./my-task --max_gen 5 --run_id 1

Or bring an MLE-Bench competition. SIA can bootstrap a task directory directly from any [MLE-Bench](https://github.com/openai/mle-bench) competition: it pulls the dataset via the Kaggle API, sets up the public/private split, and drops in the reference agent template:

    python -m sia.prepare_mlebench_dataset -c "spaceship-titanic"
    sia --task_dir ./tasks/spaceship-titanic --max_gen 5 --run_id 1

Full step-by-step for both paths: [docs/walkthrough.md](/hexo-ai/sia/blob/main/docs/walkthrough.md).

- [docs/architecture.md](/hexo-ai/sia/blob/main/docs/architecture.md): directory layout, generation flow, prompt customization
- [docs/walkthrough.md](/hexo-ai/sia/blob/main/docs/walkthrough.md): detailed custom-task walkthrough
- [docs/configuration.md](/hexo-ai/sia/blob/main/docs/configuration.md): backends, models, API keys, CLI reference
- [docs/troubleshooting.md](/hexo-ai/sia/blob/main/docs/troubleshooting.md): common errors and fixes

If you use SIA in your research, please cite:

    @article{hebbar2026sia,
      title   = {SIA: Self Improving AI with Harness \& Weight Updates},
      author  = {Hebbar, Prannay and Manawat, Yogendra and Verboomen, Samuel and Ivanova, Alesia and Palanimalai, Selvam and Bhatia, Kunal and Baskaran, Vignesh},
      journal = {arXiv preprint arXiv:2605.27276},
      year    = {2026},
      url     = {https://arxiv.org/abs/2605.27276}
    }
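For the custom-task layout described earlier, the following is a minimal sketch of a helper that creates the directory skeleton and reports what is still missing before a run. It is not part of SIA: `scaffold_task_dir`, `missing_entries`, and `REQUIRED_ENTRIES` are hypothetical names, and the check covers only the `data/public/task.md`, `data/private/`, and `reference/` entries. The reference agent template is deliberately not created here, since the README says to copy it from the repository.

```python
from pathlib import Path

# Entries from the task layout that this sketch checks (assumption: these
# are the minimum needed; the template in reference/ is copied separately).
REQUIRED_ENTRIES = ("data/public/task.md", "data/private", "reference")


def scaffold_task_dir(root, task_description: str) -> Path:
    """Create the directory skeleton and a task.md if none exists yet."""
    root = Path(root)
    public = root / "data" / "public"
    public.mkdir(parents=True, exist_ok=True)
    (root / "data" / "private").mkdir(parents=True, exist_ok=True)
    (root / "reference").mkdir(parents=True, exist_ok=True)

    task_md = public / "task.md"
    if not task_md.exists():  # never overwrite an existing description
        task_md.write_text(task_description, encoding="utf-8")
    return root


def missing_entries(root) -> list[str]:
    """Return the required entries that are absent under root."""
    root = Path(root)
    return [entry for entry in REQUIRED_ENTRIES if not (root / entry).exists()]


if __name__ == "__main__":
    task_root = scaffold_task_dir("my-task", "Describe the task here.\n")
    print("missing:", missing_entries(task_root))
```

Keeping held-out data under `data/private/` and agent-visible inputs under `data/public/` mirrors the split the layout describes; the helper only creates empty directories and leaves populating them to you.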