Relentless AI self-evolution

Harness Forge, a Claude Code skill, reimplements the Meta-Harness optimization loop natively, reducing code from ~1,260 lines to ~75 lines by leveraging Claude Code's built-in agent runtime. The skill iteratively proposes, scores, and Pareto-optimizes harness code around a fixed model, achieving a reported +7.7 accuracy points at ~4× fewer context tokens on text classification. It is available via a one-line install or as a Claude Code plugin.

Harness Forge is a [Claude Code](https://claude.com/claude-code) skill that runs an end-to-end harness-optimization loop (propose → score → keep the Pareto-best → repeat) to improve the code around a fixed model: its memory, retrieval, context construction, summarization, prompt templates, and tool-selection logic. The model never changes; the scaffolding gets better.

It is a native reimplementation of the method in [Meta-Harness: End-to-End Optimization of Model Harnesses](https://arxiv.org/abs/2603.28052) (Lee, Nair, Zhang, Lee, Khattab & Finn, 2026).

The [original reference repo](https://github.com/stanford-iris-lab/meta-harness) ships ~1,260 lines of Python (`claude_wrapper.py` + `meta_harness.py`) whose job is to drive a headless Claude: spawn a session, parse its output, track tool calls, log everything, loop. Inside Claude Code, that runtime already exists as first-class tools. So Harness Forge keeps only the irreducible domain logic, a cheap scorer, and expresses the entire outer loop as native orchestration. The whole search becomes ~75 lines instead of ~1,260.

    seed the frontier with the incumbent harness    (the thing to beat)
    repeat:
      PROPOSE   k candidate harness variants        ← parallel proposer agents write code
      VALIDATE  each imports / type-checks
      SCORE     each on a held-out-protected eval   ← a $0, deterministic scorer
      FRONTIER  Pareto-merge: quality up, cost down, floor-respecting
    final: score the frontier once on the untouched test split

The proposer is the mutation operator. The frontier is the search memory. The model is frozen throughout, which is exactly why this fits a fixed / off-the-shelf-API deployment, where you can't change the weights and the gain has to come from the harness.

The paper's headline result was +7.7 accuracy points at ~4× fewer context tokens on text classification, a pure harness-side win. Harness Forge reproduces that shape of result natively.

`claude_wrapper.py` is a hand-rolled agent runtime. Claude Code is an agent runtime.
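To make the propose, validate, score, and Pareto-merge loop described in this section concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the skill's workflow script or the repo's `scripts/pareto.py`: the names `Scored`, `dominates`, `pareto_merge`, `search`, and the `toy_*` helpers are all hypothetical, the "harness" is just a string, and the proposer is a deterministic stand-in for what would be parallel proposer agents.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Scored:
    name: str
    quality: float  # higher is better
    cost: int       # lower is better (here: characters of context)


def dominates(a: Scored, b: Scored) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.quality >= b.quality and a.cost <= b.cost
    strictly_better = a.quality > b.quality or a.cost < b.cost
    return no_worse and strictly_better


def pareto_merge(frontier: List[Scored], new: List[Scored], floor: float) -> List[Scored]:
    """Merge candidates into the frontier, discarding anything below the quality floor."""
    pool = [c for c in frontier + new if c.quality >= floor]
    return [c for c in pool if not any(dominates(other, c) for other in pool)]


def search(incumbent: str,
           propose: Callable[[List[Scored], int], List[str]],
           validate: Callable[[str], bool],
           score: Callable[[str], Scored],
           rounds: int, k: int, floor: float) -> List[Scored]:
    frontier = [score(incumbent)]                         # seed: the thing to beat
    for _ in range(rounds):
        candidates = propose(frontier, k)                 # PROPOSE
        valid = [c for c in candidates if validate(c)]    # VALIDATE
        scored = [score(c) for c in valid]                # SCORE
        frontier = pareto_merge(frontier, scored, floor)  # FRONTIER
    return frontier


# --- Toy domain (hypothetical): keep three facts while shrinking the text ---
REQUIRED = ("alice", "paris", "2024")


def toy_score(text: str) -> Scored:
    kept = sum(word in text for word in REQUIRED)
    return Scored(name=text, quality=kept / len(REQUIRED), cost=len(text))


def toy_validate(text: str) -> bool:
    return bool(text.strip())


def toy_propose(frontier: List[Scored], k: int) -> List[str]:
    # Stand-in for proposer agents: drop one word from the cheapest frontier member.
    base = min(frontier, key=lambda c: c.cost).name.split()
    variants = [" ".join(base[:i] + base[i + 1:]) for i in range(len(base))]
    return variants[:k]


if __name__ == "__main__":
    result = search("alice flew to paris in 2024", toy_propose, toy_validate,
                    toy_score, rounds=2, k=3, floor=1.0)
    print([c.name for c in result])  # → ['alice paris in 2024']
```

With `floor=1.0`, any variant that loses a fact is rejected before the dominance check, so cost can only fall while quality holds. That is one way to read "floor-respecting" in the loop above.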
Every orchestration piece of the Python driver therefore has a native equivalent in Claude Code, and the driver becomes redundant:

| Meta-Harness (Python) | Harness Forge (native) |
|---|---|
| `claude_wrapper.run`: drive a headless Claude | `Agent` / `agent` inside a Workflow |
| `meta_harness.py` outer loop | a Workflow script (`parallel` / `while`) |
| `pending_eval.json` handshake | a typed schema return, no file round-trip |
| `evolution_summary.jsonl` / `frontier.json` | workflow variables + a results JSONL |
| `SKILL.md` proposer prior | a skill / prior file the proposer agent reads |
| "run N iterations" | the workflow loop, `/loop`, or `CronCreate` |
| 3 candidates / iteration (serial) | parallel: proposers run concurrently |
| `inner_loop.py` scorer | stays a script, the one irreducible piece |

The only thing you still write is the cheap scorer + rubric + candidate interface. Everything orchestration-shaped is free.

1. Install the skill, one line:

       curl -fsSL https://raw.githubusercontent.com/001TMF/harness-forge/main/install.sh | bash

   Or as a Claude Code plugin (inside Claude Code):

       /plugin marketplace add 001TMF/harness-forge
       /plugin install harness-forge@tmf-skills

   Other ways:

   - Project-scoped (`./.claude/skills`, this repo only): `curl -fsSL https://raw.githubusercontent.com/001TMF/harness-forge/main/install.sh | bash -s -- --project`
   - Via skills.sh (vercel-labs/skills): `npx skills add 001TMF/harness-forge --skill meta-harness -a claude-code`
   - Manual: `git clone https://github.com/001TMF/harness-forge.git`, then `cp -r harness-forge/skills/meta-harness ~/.claude/skills/meta-harness`

   It auto-triggers when you talk about optimizing a harness, scaffold, prompt system, memory or retrieval policy, or summarizer; you can also invoke it directly as the `meta-harness` skill.

2. Run the worked example ($0, no model, no network):

       cd harness-forge/examples/memory-summary
       python score_baselines.py

   It reports the baseline incumbent at `fidelity=1.000 chars=269`, the system to beat.

3. Run a real search. Invoke the Workflow tool with the example's loop script:

       Workflow({ scriptPath: "<abs>/examples/memory-summary/native_meta_harness_workflow.js", args: { dir: "<abs>/examples/memory-summary", rounds: 2, k: 3 } })

   Proposer agents run on your Claude subscription; the scorer is $0; there is no solver model and no metered API. A successful round produces a compressor that holds fidelity at fewer than 269 chars.

The loop is native; the domain is yours. Templates are in [`skills/meta-harness/assets/`](/001TMF/harness-forge/blob/main/skills/meta-harness/assets); how-to is in [`references/building-blocks.md`](/001TMF/harness-forge/blob/main/skills/meta-harness/references/building-blocks.md).

- Candidate interface: one clean, swappable boundary (an ABC / Protocol).
- A $0 deterministic scorer + rubric: the inner loop. It runs hundreds of times, so no LLM, no network. It must vary with the candidate (see the trap below).
- An eval corpus with a held-out split.
- A proposer prior: a mini-skill steering proposers toward mechanism-level changes (not constant-tuning) and forbidding eval-set leakage.
- A frontier + run log: the state. `scripts/pareto.py` computes the floor-respecting frontier deterministically.

The frozen-replay defect. If your scorer replays cached outputs (a recorded run, a frozen trace), a scaffolding candidate cannot change the recorded result; only the cost axis moves. A naive "maximize quality, minimize cost" search then wins by emptying the context while the frozen quality score never drops, producing a confident, meaningless frontier.

Test: "If I swap in a wildly different candidate, can this number change for a quality reason?" If only cost can move, you are replaying frozen outputs.

Fix: grade something the candidate genuinely controls (retrieval relevance, compression fidelity, a counterfactual decision), and/or run quality as a one-sided do-no-harm floor rather than a maximize axis.
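The frozen-replay test can be run mechanically before any search starts. The sketch below is a hedged illustration, not code from the repo: the `Compressor` protocol, the `Identity` and `Empty` candidates, both scorers, and `quality_can_move` are hypothetical names, and the "fidelity" measure is a deliberately crude substring check. It scores two wildly different candidates and asks whether the quality number can differ at all.

```python
from typing import Callable, Iterable, Protocol


class Compressor(Protocol):
    """Hypothetical candidate interface: the single swappable boundary."""

    def compress(self, text: str) -> str: ...


class Identity:
    """Keeps the whole context."""

    def compress(self, text: str) -> str:
        return text


class Empty:
    """A wildly different candidate: throws the whole context away."""

    def compress(self, text: str) -> str:
        return ""


SOURCE = "alice flew to paris in 2024"
FACTS = ("alice", "paris", "2024")


def frozen_quality(candidate: Compressor) -> float:
    """Replays a recorded result; the candidate's output is never consulted."""
    return 1.0


def fidelity_quality(candidate: Compressor) -> float:
    """Grades something the candidate controls: which facts survive compression."""
    summary = candidate.compress(SOURCE)
    return sum(fact in summary for fact in FACTS) / len(FACTS)


def quality_can_move(scorer: Callable[[Compressor], float],
                     candidates: Iterable[Compressor]) -> bool:
    """The trap test: do very different candidates ever receive different quality?"""
    return len({scorer(c) for c in candidates}) > 1


if __name__ == "__main__":
    probes = [Identity(), Empty()]
    print(quality_can_move(frozen_quality, probes))    # → False
    print(quality_can_move(fidelity_quality, probes))  # → True
```

A `False` from this probe means the search could only ever optimize cost, which is the situation where emptying the context "wins".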
The skill makes this check, along with held-out discipline, an anti-Goodhart floor, and anti-leakage, load-bearing. Full treatment in [`references/method.md`](/001TMF/harness-forge/blob/main/skills/meta-harness/references/method.md).

    harness-forge/
    ├── .claude-plugin/marketplace.json   installable as a Claude Code plugin
    ├── install.sh                        one-line curl|bash install
    ├── skills/
    │   └── meta-harness/                 the installable skill
    │       ├── SKILL.md                  what/when, the loop, the 5 blocks, the guardrails
    │       ├── references/               method · native-execution · building-blocks · worked example
    │       ├── assets/                   templates: workflow loop, scorer, interface, proposer prior
    │       └── scripts/pareto.py         reusable floor-respecting Pareto frontier
    └── examples/
        └── memory-summary/               a complete, runnable search (the $0 demo + the native loop)

Use it when the base model is fixed, there are repeated tasks, and a cheap measurable eval exists or can be built, i.e. the gain has to come from the harness. Classic targets: context bloat, weak retrieval, lossy summarization, brittle prompt scaffolds.

Don't use it when the gain must come from the model weights (do RL / fine-tuning instead), or when there is no stable evaluation loop.

Meta-Harness and RL are complementary: in a fixed-base-model phase, Harness Forge is the only available optimizer, and it forces the eval-hardening a later RL phase also depends on, at near-zero cost. See [`references/method.md`](/001TMF/harness-forge/blob/main/skills/meta-harness/references/method.md) §6.

The method is Meta-Harness by Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. This repo is an independent native reimplementation as a Claude Code skill; it vendors no code from the original repo.
If you use it, please cite the paper:

    @misc{lee2026metaharnessendtoendoptimizationmodel,
      title={Meta-Harness: End-to-End Optimization of Model Harnesses},
      author={Yoonho Lee and Roshen Nair and Qizheng Zhang and Kangwook Lee and Omar Khattab and Chelsea Finn},
      year={2026},
      eprint={2603.28052},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2603.28052},
    }

- Paper: https://arxiv.org/abs/2603.28052
- Reference implementation: https://github.com/stanford-iris-lab/meta-harness

[MIT](/001TMF/harness-forge/blob/main/LICENSE) © 2026 Tristan Farmer