{"slug": "harness-1-state-externalizing-search-harness", "title": "Harness-1: State-Externalizing Search Harness", "summary": "The Harness-1 paper introduces a 20B reinforcement-learning-trained search agent that externalizes its working memory into a structured harness—candidate pools, evidence links, and verification records—rather than an ever-growing transcript. This design keeps context cost flat as search deepens, with the harness rendering only a budget-bounded slice into the model's context each step. The agent achieves 0.730 average curated recall across 8 retrieval benchmarks, outperforming the next-best open search agent by 11.4 points.", "body_md": "**What:** The **Harness-1 paper** introduces a **20B RL-trained search agent that externalizes its working memory into a structured harness** — candidate pools, evidence links, and verification records — instead of an ever-growing transcript.\n\n**Why:** A deep search agent that **replays its whole history every step** runs the context window dry. Harness-1 makes **context cost stay flat as the search deepens**, which is the harness-as-state idea the agent-engineering world preaches, made concrete and **RL-trained**.\n\n**vs prior:** Earlier search agents **train over a growing transcript**, so every candidate, observation, and verification lands back in context. 
Harness-1 trains over an **external workspace** and renders only a **budget-bounded slice** — the policy decides what to search and verify; the harness owns the memory.\n\nA detective's case-board on the wall, briefed by index card.\n\n```\n                  THE GROWING CASE\n                         │\n              ┌──────────┴──────────┐\n              │                     │\n      ┌───────▼────────┐    ┌───────▼────────┐\n      │  HARNESS-1     │    │ GROWING        │\n      │  case-board    │    │ TRANSCRIPT     │\n      │  on the wall   │    │ lug whole file │\n      └───────┬────────┘    └───────┬────────┘\n              │                     │\n      carry one index card  haul the entire box\n      into each interview    into every interview\n              │                     │\n              ▼                     ▼\n      ✓ desk stays clear    ✗ desk overflows\n        context stays flat     window overruns\n```\n\n**Harness** — The **scaffolding around the model** that owns tools, state, and exactly what gets shown to the model each step. The model is the brain; the harness is the desk, filing cabinet, and notepad.\n\n**Context window** — The **fixed token budget** the model can read on any single step. Anything outside it is invisible to the model — and tokens are not free, so a full window is both a cost and a hard ceiling.\n\n**Growing transcript** — The naïve agent-memory design: **concatenate the full action-and-observation history** and feed it back every step. It grows without bound, so a long search eventually overruns the context window.\n\n**State externalization** — Keeping durable working memory **outside the model's context** — in the harness — so accumulated evidence does not spend context budget. 
The model reads a rendered view, not the raw store.\n\n**Budget-bounded rendering** — Each step, the harness selects only a **token-budgeted slice** of the workspace to render into context, so context size is **constant regardless of search depth**.\n\n**Curated set** — The agent's running shortlist of **importance-tagged, verified evidence** — distinct from the raw candidate pool. Harness-1's headline metric is **curated recall**: how much of the gold evidence lands in this set.\n\n**Curated recall** — The fraction of the gold (correct) evidence that ends up in the curated set, averaged across **8 retrieval benchmarks**. Harness-1 reports **0.730**, +11.4 points over the next-best open search agent.\n\nThe news. On June 1, 2026, Harness-1 (arXiv:2606.02373) introduced a 20B-parameter search agent that separates semantic decision-making from state management. The policy decides what to search, inspect, curate, verify, and when to stop; a state-externalizing harness holds the working memory — candidate pools, importance-tagged curated sets, evidence links, verification records, and compressed observations — and renders only a budget-bounded slice into the model's context each step. Rather than training over an ever-growing transcript, the agent is trained with reinforcement learning over a structured external workspace. It reports 0.730 average curated recall across 8 retrieval benchmarks (web, finance, patents, multi-hop QA), +11.4 points over the next-strongest open search sub-agent.\n\nPicture a detective working a long case. Every lead, photo, and verified alibi gets pinned to the **case-board on the wall** and connected with red string — the board is the durable record, and it only ever grows. When the detective walks into an interview, they don't wheel the entire case file into the room; they carry a single **index-card briefing** with just what this conversation needs. The board stays on the wall; only a briefing walks in. 
A rookie who instead lugs the whole growing file box into every interview eventually runs out of desk space — that is exactly what happens when a search agent replays its entire transcript into a finite context window.\n\nThat is the move Harness-1 makes concrete. The naïve design treats the agent's memory as a **growing transcript**: every observation, every candidate document, every verification step is concatenated and fed back to the model on the next step. It works for a few steps, then the transcript balloons and the search has to stop — not because the agent ran out of leads, but because it ran out of room. Harness-1 instead keeps that durable state in the harness — the case-board — and leaves the policy free to make the semantic calls: what to search, inspect, curate, and verify, and when to stop. Each step, the harness performs **budget-bounded rendering**: it selects a token-bounded slice of the workspace — the briefing — and shows only that to the model. The board can grow to hundreds of items while the briefing stays the same size, so **context cost stays flat no matter how deep the search goes**. Crucially, the agent is trained with reinforcement learning *over this workspace*, not over transcripts, so the policy learns the harness skills — curate, importance-tag, verify, compress, stop — as first-class actions.\n\n| Design | What lives in context | Context cost as search deepens | Failure mode |\n|---|---|---|---|\n| Growing transcript | The full action + observation history, replayed every step | Grows with every step | Overflows the window; the search stalls on length, not leads |\n| State-externalizing harness | A budget-bounded slice of the workspace, rendered each step | Stays flat regardless of search depth | |\n\n*The two rows describe the contrast Harness-1 draws between transcript-style memory and its externalized workspace; the \"budget-bounded slice\" claim is from the paper.*\n\nWalk the budget with some round numbers *(illustrative)*. Say each search step adds about **2,000 tokens** of fresh observations. 
Under the growing-transcript design, those tokens never leave: after 8 steps the model is reading roughly 16,000 tokens of history, after 20 steps about 40,000, and a genuinely deep multi-hop search marches straight past a typical working window. Under the state-externalizing harness, those 2,000-token observations land in the workspace, but the model is only ever shown a fixed ~6,000-token render — step 8 and step 20 cost the *same* **6,000 tokens** in context. The accumulated evidence still exists; it just lives on the case-board instead of in the briefing. That is how Harness-1 can keep curating across deep benchmarks, where it reports **0.730 average curated recall**, while a transcript agent would eventually run out of room — and it's the same lever the agent-engineering track frames as durable state the harness owns, rather than state smeared across a prompt.\n\nIt lands as a sharp companion to the recent push on *how* search agents act — GrepSeek learns a better **action space** (shell commands over a corpus), while Harness-1 learns a better **state substrate** (an externalized workspace). Same RL-trained-search-agent family, orthogonal levers. As the work frames it, the model should make the semantic calls and the harness should own the memory — a clean division that the standard fixes for an overflowing context have been circling, now learned end-to-end.\n\nHarness-1 is a 20B-parameter, RL-trained search agent that separates the model's semantic decisions (what to search, inspect, curate, verify, and when to stop) from state management. A state-externalizing harness holds the durable working memory — candidate pools, importance-tagged curated sets, evidence links, verification records, and compressed observations — and renders only a budget-bounded slice into the model's context each step. 
It reports 0.730 average curated recall across 8 retrieval benchmarks, +11.4 points over the next-strongest open search sub-agent.\n\nA search agent that replays its full transcript into context each step grows that context with every observation, so a deep search eventually overruns the context window and stops on length rather than on evidence. Externalizing state keeps the accumulated evidence in the harness and renders only a fixed-size slice, so context cost stays flat regardless of search depth — letting the agent keep curating across deep, multi-hop benchmarks.\n\nA growing transcript concatenates the entire action-and-observation history and feeds it back every step, so its size scales with the number of steps. Harness-1 instead stores that history in a structured external workspace and trains the policy with reinforcement learning over that workspace — so the model learns to curate, verify, and compress as explicit actions, and the context the model reads is a budget-bounded rendering of the workspace rather than the raw, unbounded log.\n\nOriginally posted on [Learn AI Visually](https://learnaivisually.com/ai-explained/harness-1-externalized-state).", "url": "https://wpnews.pro/news/harness-1-state-externalizing-search-harness", "canonical_source": "https://dev.to/pueding/harness-1-state-externalizing-search-harness-2c9b", "published_at": "2026-06-03 11:16:02+00:00", "updated_at": "2026-06-03 11:43:15.683145+00:00", "lang": "en", "topics": ["ai-agents", "large-language-models", "artificial-intelligence", "ai-research", "machine-learning"], "entities": ["Harness-1"], "alternates": {"html": "https://wpnews.pro/news/harness-1-state-externalizing-search-harness", "markdown": "https://wpnews.pro/news/harness-1-state-externalizing-search-harness.md", "text": "https://wpnews.pro/news/harness-1-state-externalizing-search-harness.txt", "jsonld": "https://wpnews.pro/news/harness-1-state-externalizing-search-harness.jsonld"}}