
Relentless AI self-evolution

Harness Forge, a Claude Code skill, reimplements the Meta-Harness optimization loop natively, reducing code from ~1,260 lines to ~75 lines by leveraging Claude Code's built-in agent runtime. The skill iteratively proposes, scores, and Pareto-optimizes harness code around a fixed model, achieving a reported +7.7 accuracy points at ~4× fewer context tokens on text classification. It is available via a one-line install or as a Claude Code plugin.

6 min read · published Jun 14, 2026

Harness Forge is a Claude Code skill that runs an end-to-end harness-optimization loop (propose → score → keep the Pareto-best → repeat) to improve the code around a fixed model: its memory, retrieval, context construction, summarization, prompt templates, and tool-selection logic. The model never changes; the scaffolding gets better.

It is a native reimplementation of the method in Meta-Harness: End-to-End Optimization of Model Harnesses (Lee, Nair, Zhang, Lee, Khattab & Finn, 2026). The original reference repo ships ~1,260 lines of Python (claude_wrapper.py and meta_harness.py) whose job is to drive a headless Claude: spawn a session, parse its output, track tool calls, log everything, loop.

Inside Claude Code, that runtime already exists as first-class tools. So Harness Forge keeps only the irreducible domain logic (a cheap scorer) and expresses the entire outer loop as native orchestration. The whole search becomes ~75 lines instead of ~1,260.

seed the frontier with the incumbent harness (the thing to beat)
repeat:
    PROPOSE   k candidate harness variants     ← parallel proposer agents write code
    VALIDATE  each imports / type-checks
    SCORE     each on a held-out-protected eval ← a $0, deterministic scorer
    FRONTIER  Pareto-merge: quality up, cost down, floor-respecting
final: score the frontier once on the untouched test split

The proposer is the mutation operator. The frontier is the search memory. The model is frozen throughout, which is exactly why this fits a fixed / off-the-shelf-API deployment, where you can't change the weights and the gain has to come from the harness.
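The loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the skill's actual workflow script: the "harness" is reduced to a single hypothetical parameter (a context budget in characters), `propose` is a deterministic stand-in for the parallel proposer agents, and the scorer is a toy whose quality saturates at 300 characters. All names here are invented for the example, and the final held-out test pass is omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Scored:
    name: str
    budget: int      # toy harness parameter: context budget in characters
    quality: float   # higher is better
    cost: float      # lower is better


def dominates(a: Scored, b: Scored) -> bool:
    """a is at least as good as b on both axes and strictly better on one."""
    return (a.quality >= b.quality and a.cost <= b.cost
            and (a.quality > b.quality or a.cost < b.cost))


def pareto_merge(frontier: List[Scored], scored: List[Scored], floor: float) -> List[Scored]:
    """Drop anything below the quality floor, then keep the non-dominated rest."""
    pool = [c for c in frontier + scored if c.quality >= floor]
    return [c for c in pool if not any(dominates(o, c) for o in pool)]


def score(budget: int) -> Scored:
    # Toy deterministic scorer (assumption): quality saturates at 300 chars.
    return Scored(f"budget-{budget}", budget, min(1.0, budget / 300), float(budget))


def propose(frontier: List[Scored], k: int) -> List[int]:
    # Stand-in for proposer agents: shrink each frontier member's budget.
    shrink_tenths = [9, 7, 5, 3, 1][:k]
    return [member.budget * n // 10 for member in frontier for n in shrink_tenths]


def validate(budget: int) -> bool:
    # Stand-in for "imports / type-checks".
    return budget > 0


def search(incumbent_budget: int, rounds: int, k: int, floor: float) -> List[Scored]:
    frontier = [score(incumbent_budget)]          # seed with the incumbent
    for _ in range(rounds):
        candidates = [b for b in propose(frontier, k) if validate(b)]
        frontier = pareto_merge(frontier, [score(b) for b in candidates], floor)
    return frontier


final = search(incumbent_budget=600, rounds=2, k=3, floor=0.95)
print(final)
```

In this toy run the first round finds a 300-character budget that matches the incumbent's quality at half the cost, and the second round's smaller budgets are rejected by the floor, so the frontier stops shrinking once quality would suffer.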

The paper's headline result was +7.7 accuracy points at ~4× fewer context tokens on text classification, a pure harness-side win. Harness Forge reproduces that shape of result natively.

claude_wrapper.py is a hand-rolled agent runtime. Claude Code is an agent runtime. So every orchestration piece has a native equivalent, and the Python driver becomes redundant:

| Meta-Harness (Python) | Harness Forge (native) |
| --- | --- |
| claude_wrapper.run(): drive a headless Claude | Agent / agent() inside a Workflow |
| meta_harness.py outer loop | a Workflow script (parallel / while) |
| pending_eval.json handshake | a typed schema return, no file round-trip |
| evolution_summary.jsonl / frontier.json | workflow variables + a results JSONL |
| SKILL.md proposer prior | a skill / prior file the proposer agent reads |
| "run N iterations" | the workflow loop, /loop, or CronCreate |
| 3 candidates / iteration (serial) | parallel(): proposers run concurrently |
| inner_loop.py scorer | stays a script, the one irreducible piece |

The only thing you still write is the cheap scorer + rubric + candidate interface. Everything orchestration-shaped is free.

1. Install the skill in one line:

curl -fsSL https://raw.githubusercontent.com/001TMF/harness-forge/main/install.sh | bash

Or as a Claude Code plugin (inside Claude Code):

/plugin marketplace add 001TMF/harness-forge
/plugin install harness-forge@tmf-skills

Other ways

curl -fsSL https://raw.githubusercontent.com/001TMF/harness-forge/main/install.sh | bash -s -- --project

npx skills add 001TMF/harness-forge --skill meta-harness -a claude-code

git clone https://github.com/001TMF/harness-forge.git
cp -r harness-forge/skills/meta-harness ~/.claude/skills/meta-harness

It auto-triggers when you talk about optimizing a harness, scaffold, prompt system, memory or retrieval policy, or summarizer. You can also invoke it directly as the meta-harness skill.

2. Run the worked example ($0, no model, no network):

cd harness-forge/examples/memory-summary
python score_baselines.py

3. Run a real search by invoking the Workflow tool with the example's loop script:

Workflow({ scriptPath: "<abs>/examples/memory-summary/native_meta_harness_workflow.js",
           args: { dir: "<abs>/examples/memory-summary", rounds: 2, k: 3 } })

Proposer agents run on your Claude subscription; the scorer is $0; there is no solver model and no metered API. A successful round produces a compressor holding fidelity at < 269 chars.

The loop is native; the domain is yours. Templates are in skills/meta-harness/assets/; how-to is in references/building-blocks.md:

- Candidate interface: one clean, swappable boundary (an ABC / Protocol).
- A $0 deterministic scorer + rubric: the inner loop; runs hundreds of times, so no LLM, no network. It must vary with the candidate (see the trap below).
- An eval corpus with a held-out split.
- A proposer prior: a mini-skill steering proposers toward mechanism-level changes (not constant-tuning) and forbidding eval-set leakage.
- A frontier + run log: the state. scripts/pareto.py computes the floor-respecting frontier deterministically.
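As a rough illustration of the first two building blocks, the sketch below pairs a candidate interface with a deterministic, offline scorer. Everything in it is hypothetical: the `Compressor` protocol, the two toy candidates, the tiny `EVAL` corpus, and the fact-matching rubric are invented for the example and are not taken from the repo's templates.

```python
from typing import Dict, List, Protocol


class Compressor(Protocol):
    """Hypothetical candidate interface: the one boundary a proposer may swap."""

    def compress(self, text: str) -> str: ...


class TruncateCompressor:
    """Baseline candidate: keep the first max_chars characters."""

    def __init__(self, max_chars: int) -> None:
        self.max_chars = max_chars

    def compress(self, text: str) -> str:
        return text[: self.max_chars]


class DigitSentenceCompressor:
    """Mechanism-level variant: keep only sentences that contain a digit."""

    def compress(self, text: str) -> str:
        sentences = text.split(". ")
        return ". ".join(s for s in sentences if any(ch.isdigit() for ch in s))


# Toy eval corpus (assumption): each item lists the facts a summary must keep.
EVAL: List[Dict] = [
    {"text": "Alice joined in 2019. She likes tea. Her badge id is 4471.",
     "facts": ["2019", "4471"]},
    {"text": "The server runs on port 8080. It was repainted blue last year.",
     "facts": ["8080"]},
]


def score(candidate: Compressor, corpus: List[Dict]) -> Dict[str, float]:
    """Deterministic rubric: fraction of required facts kept, plus mean output size."""
    kept = total = chars = 0
    for item in corpus:
        out = candidate.compress(item["text"])
        chars += len(out)
        for fact in item["facts"]:
            total += 1
            kept += fact in out
    return {"fidelity": kept / total, "cost_chars": chars / len(corpus)}


truncate_score = score(TruncateCompressor(20), EVAL)
digit_score = score(DigitSentenceCompressor(), EVAL)
print(truncate_score, digit_score)
```

The point of the design is that the fidelity number is computed from the candidate's own output, so swapping the candidate can change it; no model call or network access is needed.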

The frozen-replay defect. If your scorer replays cached outputs (a recorded run, a frozen trace), a scaffolding candidate cannot change the recorded result; only the cost axis moves. A naive "maximize quality, minimize cost" search then wins by emptying the context while the frozen quality score never drops, producing a confident, meaningless frontier.

Test: "If I swap in a wildly different candidate, can this number change for a quality reason?" If only cost can move, you are replaying frozen outputs.
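One way to automate that test is to score a few deliberately extreme probe candidates and check whether the quality number moves at all. The sketch below is a hypothetical check, not part of the skill: `frozen_quality` imitates a scorer that replays a recorded result, `live_quality` grades the candidate's actual output, and the document and required fact are made up.

```python
from typing import Callable, Sequence

Candidate = Callable[[str], str]

RECORDED_QUALITY = 0.9                      # a cached result from an old run
DOC = "Invoice 118 is due on March 3."      # toy eval input (assumption)
REQUIRED_FACT = "March 3"


def frozen_quality(candidate: Candidate) -> float:
    """Replays the recorded score; the candidate is never consulted."""
    return RECORDED_QUALITY


def live_quality(candidate: Candidate) -> float:
    """Grades something the candidate controls: does its output keep the fact?"""
    return 1.0 if REQUIRED_FACT in candidate(DOC) else 0.0


def quality_can_move(quality_fn: Callable[[Candidate], float],
                     probes: Sequence[Candidate],
                     tol: float = 1e-9) -> bool:
    """True if wildly different candidates produce different quality scores."""
    values = [quality_fn(p) for p in probes]
    return max(values) - min(values) > tol


# Two extreme probes: pass everything through, or empty the context entirely.
PROBES = [lambda text: text, lambda text: ""]

print(quality_can_move(frozen_quality, PROBES))
print(quality_can_move(live_quality, PROBES))
```

If the check reports that quality cannot move, any frontier built on that scorer is only ranking candidates by cost.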

Fix: grade something the candidate genuinely controls (retrieval relevance, compression fidelity, a counterfactual decision), and/or run quality as a one-sided do-no-harm floor rather than a maximize axis. The skill makes this, along with held-out discipline, an anti-Goodhart floor, and anti-leakage, load-bearing. Full treatment in references/method.md.
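A do-no-harm floor can be sketched as a filter applied before cost is compared. The code below is a simplified illustration, not the repo's scripts/pareto.py: the `Result` type, the tolerance value, and the three example scores are assumptions chosen to show the behaviour.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    name: str
    quality: float   # graded on something the candidate controls
    cost: float      # e.g. context tokens or characters


def cheapest_above_floor(results: List[Result], incumbent: Result,
                         tolerance: float = 0.01) -> Result:
    """Quality is a one-sided constraint, not an objective.

    A candidate is admissible if it does not fall more than `tolerance`
    below the incumbent's quality; among admissible candidates the
    cheapest wins. The incumbent is always admissible, so a result exists.
    """
    floor = incumbent.quality - tolerance
    admissible = [r for r in results + [incumbent] if r.quality >= floor]
    return min(admissible, key=lambda r: r.cost)


incumbent = Result("incumbent", quality=0.90, cost=1000)
candidates = [
    Result("compact", quality=0.895, cost=400),   # slightly lower, within tolerance
    Result("empty", quality=0.10, cost=0),        # cheapest, but harms quality
]

winner = cheapest_above_floor(candidates, incumbent)
fallback = cheapest_above_floor([candidates[1]], incumbent)
print(winner.name, fallback.name)
```

With the floor in place, the empty-context candidate can no longer win on cost alone, and when no candidate clears the floor the incumbent is retained.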

harness-forge/
├── .claude-plugin/marketplace.json   # installable as a Claude Code plugin
├── install.sh                        # one-line curl|bash install
├── skills/
│   └── meta-harness/             # the installable skill
│       ├── SKILL.md              #   what/when, the loop, the 5 blocks, the guardrails
│       ├── references/           #   method · native-execution · building-blocks · worked example
│       ├── assets/               #   templates: workflow loop, scorer, interface, proposer prior
│       └── scripts/pareto.py     #   reusable floor-respecting Pareto frontier
└── examples/
    └── memory-summary/           # a complete, runnable search (the $0 demo + the native loop)

Use it when the base model is fixed, there are repeated tasks, and a cheap measurable eval exists (or can be built), i.e. when the gain has to come from the harness. Classic targets: context bloat, weak retrieval, lossy summarization, brittle prompt scaffolds.

Don't use it when the gain must come from the model weights (do RL / fine-tuning instead), or when there is no stable evaluation loop. Meta-Harness and RL are complementary: in a fixed-base-model phase, Harness Forge is the only available optimizer, and it forces the eval-hardening a later RL phase also depends on, at near-zero cost. See references/method.md §6.

The method is Meta-Harness by Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. This repo is an independent native reimplementation as a Claude Code skill; it vendors no code from the original repo. If you use it, please cite the paper:

@misc{lee2026metaharnessendtoendoptimizationmodel,
  title={Meta-Harness: End-to-End Optimization of Model Harnesses},
  author={Yoonho Lee and Roshen Nair and Qizheng Zhang and Kangwook Lee and Omar Khattab and Chelsea Finn},
  year={2026},
  eprint={2603.28052},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2603.28052},
}

MIT © 2026 Tristan Farmer
