{"slug": "relent-less-ai-self-evolution", "title": "Relent less AI self-evolution", "summary": "Harness Forge, a Claude Code skill, reimplements the Meta-Harness optimization loop natively, reducing code from ~1,260 lines to ~75 lines by leveraging Claude Code's built-in agent runtime. The skill iteratively proposes, scores, and Pareto-optimizes harness code around a fixed model, achieving a reported +7.7 accuracy points at ~4× fewer context tokens on text classification. It is available via a one-line install or as a Claude Code plugin.", "body_md": "Harness Forge is a [Claude Code](https://claude.com/claude-code) skill that runs an end-to-end\n**harness-optimization loop** — propose → score → keep the Pareto-best → repeat — to improve the\ncode *around* a fixed model: its memory, retrieval, context construction, summarization, prompt\ntemplates, and tool-selection logic. The model never changes; the scaffolding gets better.\n\nIt is a **native reimplementation** of the method in\n[Meta-Harness: End-to-End Optimization of Model Harnesses](https://arxiv.org/abs/2603.28052)\n(Lee, Nair, Zhang, Lee, Khattab & Finn, 2026). The original\n[reference repo](https://github.com/stanford-iris-lab/meta-harness) ships ~1,260 lines of Python\n(`claude_wrapper.py` + `meta_harness.py`) whose job is to *drive a headless Claude*: spawn a session, parse its output, track tool calls, log everything, loop.\n\n**Inside Claude Code, that runtime already exists as first-class tools.** So Harness Forge keeps only the irreducible domain logic — a cheap scorer — and expresses the entire outer loop as native orchestration. 
The whole search becomes ~75 lines instead of ~1,260.\n\n```\nseed the frontier with the incumbent harness (the thing to beat)\nrepeat:\n    PROPOSE   k candidate harness variants     ← parallel proposer agents write code\n    VALIDATE  each imports / type-checks\n    SCORE     each on a held-out-protected eval ← a $0, deterministic scorer\n    FRONTIER  Pareto-merge: quality up, cost down, floor-respecting\nfinal: score the frontier once on the untouched test split\n```\n\nThe proposer is the mutation operator. The frontier is the search memory. The model is frozen\nthroughout — which is exactly why this fits a fixed / off-the-shelf-API deployment, where you\n*can't* change the weights and the gain has to come from the harness.\n\nThe paper's headline result was **+7.7 accuracy points at ~4× fewer context tokens** on text\nclassification — a pure harness-side win. Harness Forge reproduces that shape of result natively.\n\n`claude_wrapper.py` is a hand-rolled agent runtime. Claude Code *is* an agent runtime. So every\norchestration piece has a native equivalent, and the Python driver becomes redundant:\n\n| Meta-Harness (Python) | Harness Forge (native) |\n|---|---|\n| `claude_wrapper.run()` — drive a headless Claude | `Agent` / `agent()` inside a `Workflow` |\n| `meta_harness.py` outer loop | a `Workflow` script (`parallel` / `while`) |\n| `pending_eval.json` handshake | a typed `schema` return — no file round-trip |\n| `evolution_summary.jsonl` / `frontier.json` | workflow variables + a results JSONL |\n| `SKILL.md` proposer prior | a skill / prior file the proposer agent reads |\n| \"run N iterations\" | the workflow loop, `/loop`, or `CronCreate` |\n| 3 candidates / iteration (serial) | `parallel()` — proposers run concurrently |\n| `inner_loop.py` scorer | stays a script — the one irreducible piece |\n\nThe only thing you still write is the **cheap scorer + rubric + candidate interface**. Everything\norchestration-shaped is free.\n\n**1. 
Install the skill** — one line:\n\n```\ncurl -fsSL https://raw.githubusercontent.com/001TMF/harness-forge/main/install.sh | bash\n```\n\nOr as a Claude Code **plugin** (inside Claude Code):\n\n```\n/plugin marketplace add 001TMF/harness-forge\n/plugin install harness-forge@tmf-skills\n```\n\n## Other ways\n\n```\n# project-scoped (./.claude/skills, this repo only)\ncurl -fsSL https://raw.githubusercontent.com/001TMF/harness-forge/main/install.sh | bash -s -- --project\n\n# via skills.sh (vercel-labs/skills)\nnpx skills add 001TMF/harness-forge --skill meta-harness -a claude-code\n\n# manual\ngit clone https://github.com/001TMF/harness-forge.git\ncp -r harness-forge/skills/meta-harness ~/.claude/skills/meta-harness\n```\n\nIt auto-triggers when you talk about optimizing a harness, scaffold, prompt system, memory or\nretrieval policy, or summarizer — or invoke it directly as the `meta-harness` skill.\n\n**2. Run the worked example** ($0, no model, no network):\n\n```\ncd harness-forge/examples/memory-summary\npython score_baselines.py\n# -> baseline_incumbent  fidelity=1.000 chars=269   (the system to beat)\n```\n\n**3. Run a real search** — invoke the `Workflow` tool with the example's loop script:\n\n```\nWorkflow({ scriptPath: \"<abs>/examples/memory-summary/native_meta_harness_workflow.js\",\n           args: { dir: \"<abs>/examples/memory-summary\", rounds: 2, k: 3 } })\n```\n\nProposer agents run on your Claude subscription; the scorer is $0; there is **no solver model and\nno metered API**. A successful round produces a compressor holding fidelity at **< 269 chars**.\n\nThe loop is native; the **domain** is yours. 
Templates are in\n[skills/meta-harness/assets/](/001TMF/harness-forge/blob/main/skills/meta-harness/assets); how-to is in\n[`references/building-blocks.md`](/001TMF/harness-forge/blob/main/skills/meta-harness/references/building-blocks.md):\n\n- **Candidate interface** — one clean, swappable boundary (an ABC / Protocol).\n- **A $0 deterministic scorer + rubric** — the inner loop; runs hundreds of times, so no LLM, no network. It **must vary with the candidate** (see the trap below).\n- **An eval corpus with a held-out split.**\n- **A proposer prior** — a mini-skill steering proposers toward *mechanism-level* changes (not constant-tuning) and forbidding eval-set leakage.\n- **A frontier + run log** — the state. `scripts/pareto.py` computes the floor-respecting frontier deterministically.\n\n**The frozen-replay defect.** If your scorer *replays cached outputs* (a recorded run, a frozen\ntrace), a scaffolding candidate **cannot change the recorded result** — only the cost axis moves.\nA naive \"maximize quality, minimize cost\" search then wins by emptying the context while the\nfrozen quality score never drops, producing a confident, meaningless frontier.\n\n**Test:** \"If I swap in a wildly different candidate, can this number change for a *quality* reason?\" If only cost can move, you are replaying frozen outputs.\n\n**Fix:** grade something the candidate genuinely controls (retrieval relevance, compression\nfidelity, a counterfactual decision), and/or run quality as a one-sided *do-no-harm floor* rather\nthan a maximize axis. The skill makes this — plus held-out discipline, an anti-Goodhart floor, and\nanti-leakage — load-bearing. 
Full treatment in\n[references/method.md](/001TMF/harness-forge/blob/main/skills/meta-harness/references/method.md).\n\n```\nharness-forge/\n├── .claude-plugin/marketplace.json   # installable as a Claude Code plugin\n├── install.sh                        # one-line curl|bash install\n├── skills/\n│   └── meta-harness/             # the installable skill\n│       ├── SKILL.md              #   what/when, the loop, the 5 blocks, the guardrails\n│       ├── references/           #   method · native-execution · building-blocks · worked example\n│       ├── assets/               #   templates: workflow loop, scorer, interface, proposer prior\n│       └── scripts/pareto.py     #   reusable floor-respecting Pareto frontier\n└── examples/\n    └── memory-summary/           # a complete, runnable search (the $0 demo + the native loop)\n```\n\n**Use it** when the base model is fixed, there are repeated tasks, and a cheap measurable eval\nexists (or can be built) — i.e. the gain has to come from the harness. Classic targets: context\nbloat, weak retrieval, lossy summarization, brittle prompt scaffolds.\n\n**Don't** when the gain must come from the model weights (do RL / fine-tuning instead), or when\nthere is no stable evaluation loop. Meta-Harness and RL are complementary: in a fixed-base-model\nphase, Harness Forge is the *only* available optimizer — and it forces the eval-hardening a later\nRL phase also depends on, at near-zero cost. See\n[references/method.md](/001TMF/harness-forge/blob/main/skills/meta-harness/references/method.md) §6.\n\nThe method is **Meta-Harness** by Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar\nKhattab, and Chelsea Finn. This repo is an independent native reimplementation as a Claude Code\nskill; it vendors **no** code from the original repo. 
If you use it, please cite the paper:\n\n```\n@misc{lee2026metaharnessendtoendoptimizationmodel,\n  title={Meta-Harness: End-to-End Optimization of Model Harnesses},\n  author={Yoonho Lee and Roshen Nair and Qizheng Zhang and Kangwook Lee and Omar Khattab and Chelsea Finn},\n  year={2026},\n  eprint={2603.28052},\n  archivePrefix={arXiv},\n  primaryClass={cs.AI},\n  url={https://arxiv.org/abs/2603.28052},\n}\n```\n\n- Paper: [https://arxiv.org/abs/2603.28052](https://arxiv.org/abs/2603.28052)\n- Reference implementation: [https://github.com/stanford-iris-lab/meta-harness](https://github.com/stanford-iris-lab/meta-harness)\n\n[MIT](/001TMF/harness-forge/blob/main/LICENSE) © 2026 Tristan Farmer", "url": "https://wpnews.pro/news/relent-less-ai-self-evolution", "canonical_source": "https://github.com/001TMF/harness-forge", "published_at": "2026-06-14 11:29:23+00:00", "updated_at": "2026-06-14 11:41:43.137139+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-tools", "ai-agents", "developer-tools"], "entities": ["Harness Forge", "Claude Code", "Meta-Harness", "Stanford IRIS Lab", "Lee", "Nair", "Zhang", "Khattab"], "alternates": {"html": "https://wpnews.pro/news/relent-less-ai-self-evolution", "markdown": "https://wpnews.pro/news/relent-less-ai-self-evolution.md", "text": "https://wpnews.pro/news/relent-less-ai-self-evolution.txt", "jsonld": "https://wpnews.pro/news/relent-less-ai-self-evolution.jsonld"}}