{"slug": "ai-agent-that-at-inference-time-updates-it-s-harness-and-model-weights", "title": "AI Agent that at inference time updates its harness and model weights", "summary": "Researchers from the SIA project have released a self-improving AI framework that enables language-model agents to autonomously update both their operational harness and model weights during inference. The system, detailed in a 2026 paper by Hebbar and colleagues, achieved a 56.6% accuracy gain on LawBench, a 91.9% runtime reduction on GPU kernel optimization, and a 502% improvement on single-cell RNA denoising compared to baseline methods.", "body_md": "Official implementation of [SIA: Self Improving AI with Harness & Weight Updates](https://arxiv.org/abs/2605.27276) (Hebbar et al., 2026), a self-improving loop in which a language-model agent updates both the harness and the weights of a task-specific agent. The paper reports a 56.6% gain on LawBench, a 91.9% runtime reduction on GPU kernels, and a 502% improvement on single-cell RNA denoising over baseline.\n\nSIA is a Self Improving AI framework that autonomously improves the performance of any AI system (model or agent) on a benchmark task.\n\nJust want to try it? Skip to [Run SIA locally].\n\n*Control flow between Meta, Target, and Feedback agents over successive generations.*\n\nSIA coordinates three types of AI agents that work together to continuously improve task performance:\n\n- **Meta-Agent**: Reads the task description and generates an initial Target Agent tailored to the task.\n- **Target / Task-Specific Agent**: Attempts to complete the task and records its actions and results.\n- **Feedback / Improvement Agent**: Reviews the Target Agent's performance logs, identifies improvements, and updates the Target Agent accordingly.\n\nThis iterative process lets the system autonomously refine its ability to solve scientific tasks.\n\n*OpenAI MLE-Bench Hard: a gauntlet of real Kaggle ML competitions where agents must 
write, run, and iterate full ML pipelines. SIA ranks #1 across all generations tested.*\n\n*LawBench: predict the criminal charge from Chinese court case descriptions across 191 charge categories. SIA-W+H reaches 70.1% Top-1 accuracy, beating the prior SOTA of 45%.*\n\n*AlphaFold-3 TriMul Triton Kernel: implement and optimize the Triangle Multiplicative Update as a Triton kernel, preserving correctness while hitting H100 latency targets. SIA-W+H achieves a 14x speedup over baseline.*\n\n*scRNA-seq Denoising: impute missing gene expression values in single-cell RNA sequencing data. SIA-W+H scores 0.289 MSE norm, surpassing the prior SOTA of 0.220.*\n\nSIA ships with four built-in tasks: `gpqa`, `lawbench`, `longcot-chess`, `spaceship-titanic`.\n\nPick the agent backend that matches the LLMs you want to run.\n\n**Claude backend** (Claude Agent SDK, Claude models only):\n\n```\npython3 -m venv .venv && source .venv/bin/activate\npip install 'sia-agent[claude]'\nexport ANTHROPIC_API_KEY=\"...\"\n```\n\n**OpenHands backend** (multi-provider: Gemini, OpenAI, Anthropic, etc.):\n\n```\npython3 -m venv .venv && source .venv/bin/activate\npip install 'sia-agent[openhands]'\n\n# Export the key(s) for the provider(s) you'll use:\nexport ANTHROPIC_API_KEY=\"...\"   # for anthropic/* models\nexport GEMINI_API_KEY=\"...\"      # for gemini/* models (or GOOGLE_API_KEY)\nexport OPENAI_API_KEY=\"...\"      # for openai/* models\n```\n\nFull provider/model reference: [docs/configuration.md](/hexo-ai/sia/blob/main/docs/configuration.md#api-keys).\n\n```\nsia --task gpqa --max_gen 5 --run_id 1\n```\n\nSwap the `--task` value for any of the four bundled tasks.\n\nArtifacts land in `runs/run_{run_id}/gen_{n}/`:\n\n- `target_agent.py`: the agent for that generation\n- `agent_execution.json`: execution logs\n- `improvement.md`: diff rationale (gen 2+)\n\n| Flag | Default | Description |\n|---|---|---|\n| `--task` | — | Bundled task name (mutually exclusive with `--task_dir`) 
|\n| `--task_dir` | — | Path to an external task directory |\n| `--max_gen` | 3 | Number of self-improvement generations |\n| `--run_id` | 1 | Unique run identifier |\n| `--backend` | `claude` | `claude` (Claude Agent SDK) or `openhands` (multi-provider) |\n| `--meta_model` | `haiku` | Meta/feedback model (e.g. `haiku`, `sonnet`, `opus`, or `gemini/...`, `openai/...` with openhands) |\n| `--task_model` | `claude-haiku-4-5-20251001` | Target agent model |\n\nFull backend, model, and API-key reference: [docs/configuration.md](/hexo-ai/sia/blob/main/docs/configuration.md). Hit a snag? See [docs/troubleshooting.md](/hexo-ai/sia/blob/main/docs/troubleshooting.md).\n\nPrepare a task directory with the layout below and point `--task_dir` at it:\n\n```\nmy-task/\n├── data/\n│   ├── public/\n│   │   ├── task.md          # Task description — SIA reads this\n│   │   └── ...              # Inputs the agent is allowed to see\n│   └── private/             # Held-out eval data; never exposed to the agent\n└── reference/\n    ├── reference_target_agent.py     # Template; copy from sia/tasks/_shared/\n    └── SAMPLE_TASK_DESCRIPTIONS.md   # Optional: example tasks for the meta-agent\n```\n\n```\nsia --task_dir ./my-task --max_gen 5 --run_id 1\n```\n\n**Or bring an MLE-Bench competition.** SIA can bootstrap a task directory directly from any [MLE-Bench](https://github.com/openai/mle-bench) competition: it pulls the dataset via the Kaggle API, sets up the public/private split, and drops in the reference agent template:\n\n```\npython -m sia.prepare_mlebench_dataset -c \"spaceship-titanic\"\nsia --task_dir ./tasks/spaceship-titanic --max_gen 5 --run_id 1\n```\n\nFull step-by-step for both paths: [docs/walkthrough.md](/hexo-ai/sia/blob/main/docs/walkthrough.md).\n\n- [docs/architecture.md](/hexo-ai/sia/blob/main/docs/architecture.md): directory layout, generation flow, prompt customization\n- [docs/walkthrough.md](/hexo-ai/sia/blob/main/docs/walkthrough.md): detailed custom-task 
walkthrough\n- [docs/configuration.md](/hexo-ai/sia/blob/main/docs/configuration.md): backends, models, API keys, CLI reference\n- [docs/troubleshooting.md](/hexo-ai/sia/blob/main/docs/troubleshooting.md): common errors and fixes\n\nIf you use SIA in your research, please cite:\n\n```\n@article{hebbar2026sia,\n  title   = {SIA: Self Improving AI with Harness \\& Weight Updates},\n  author  = {Hebbar, Prannay and Manawat, Yogendra and Verboomen, Samuel and Ivanova, Alesia and Palanimalai, Selvam and Bhatia, Kunal and Baskaran, Vignesh},\n  journal = {arXiv preprint arXiv:2605.27276},\n  year    = {2026},\n  url     = {https://arxiv.org/abs/2605.27276}\n}\n```\n\n", "url": "https://wpnews.pro/news/ai-agent-that-at-inference-time-updates-it-s-harness-and-model-weights", "canonical_source": "https://github.com/hexo-ai/sia", "published_at": "2026-05-31 11:13:43+00:00", "updated_at": "2026-05-31 11:47:25.018571+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "ai-agents", "ai-research"], "entities": ["SIA", "LawBench", "OpenAI", "MLE-Bench", "Hebbar"], "alternates": {"html": "https://wpnews.pro/news/ai-agent-that-at-inference-time-updates-it-s-harness-and-model-weights", "markdown": "https://wpnews.pro/news/ai-agent-that-at-inference-time-updates-it-s-harness-and-model-weights.md", "text": "https://wpnews.pro/news/ai-agent-that-at-inference-time-updates-it-s-harness-and-model-weights.txt", "jsonld": "https://wpnews.pro/news/ai-agent-that-at-inference-time-updates-it-s-harness-and-model-weights.jsonld"}}