# Harness-1: State-Externalizing Search Harness

> Source: <https://dev.to/pueding/harness-1-state-externalizing-search-harness-2c9b>
> Published: 2026-06-03 11:16:02+00:00

**What:** The **Harness-1 paper** introduces a **20B RL-trained search agent that externalizes its working memory into a structured harness** — candidate pools, evidence links, and verification records — instead of an ever-growing transcript.

**Why:** A deep search agent that **replays its whole history every step** runs the context window dry. Harness-1 makes **context cost stay flat as the search deepens**, which is the harness-as-state idea the agent-engineering world preaches, made concrete and **RL-trained**.

**vs prior:** Earlier search agents **train over a growing transcript**, so every candidate, observation, and verification lands back in context. Harness-1 trains over an **external workspace** and renders only a **budget-bounded slice** — the policy decides what to search and verify; the harness owns the memory.

The analogy: a detective keeps the case-board on the wall and walks into each interview with a single index card.

```
                  THE GROWING CASE
                         │
              ┌──────────┴──────────┐
              │                     │
      ┌───────▼────────┐    ┌───────▼────────┐
      │  HARNESS-1     │    │ GROWING        │
      │  case-board    │    │ TRANSCRIPT     │
      │  on the wall   │    │ lug whole file │
      └───────┬────────┘    └───────┬────────┘
              │                     │
      carry one index card  haul the entire box
      into each interview    into every interview
              │                     │
              ▼                     ▼
      ✓ desk stays clear    ✗ desk overflows
        context stays flat     window overruns
```

**Harness** — The **scaffolding around the model** that owns tools, state, and exactly what gets shown to the model each step. The model is the brain; the harness is the desk, filing cabinet, and notepad.

**Context window** — The **fixed token budget** the model can read on any single step. Anything outside it is invisible to the model — and tokens are not free, so a full window is both a cost and a hard ceiling.

**Growing transcript** — The naïve agent-memory design: **concatenate the full action-and-observation history** and feed it back every step. It grows without bound, so a long search eventually overruns the context window.

**State externalization** — Keeping durable working memory **outside the model's context** — in the harness — so accumulated evidence does not spend context budget. The model reads a rendered view, not the raw store.

**Budget-bounded rendering** — Each step, the harness selects only a **token-budgeted slice** of the workspace to render into context, so context size is **constant regardless of search depth**.

**Curated set** — The agent's running shortlist of **importance-tagged, verified evidence** — distinct from the raw candidate pool. Harness-1's headline metric is **curated recall**: how much of the gold evidence lands in this set.

**Curated recall** — The fraction of the gold (correct) evidence that ends up in the curated set, averaged across **8 retrieval benchmarks**. Harness-1 reports **0.730**, +11.4 points over the next-best open search agent.
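To make the metric concrete, here is a minimal sketch of how curated recall could be computed. The function names and the toy document IDs are hypothetical, and the unweighted mean across benchmarks is an assumption; the source only says the score is averaged across 8 benchmarks.

```python
def curated_recall(curated_ids, gold_ids):
    """Fraction of gold evidence items that appear in the curated set."""
    gold = set(gold_ids)
    if not gold:
        raise ValueError("gold set must be non-empty")
    return len(gold & set(curated_ids)) / len(gold)


def average_curated_recall(per_benchmark):
    """Unweighted mean of per-benchmark recall (the averaging scheme is an assumption)."""
    scores = [curated_recall(curated, gold) for curated, gold in per_benchmark]
    return sum(scores) / len(scores)


# Toy data: (curated set, gold set) for two made-up benchmarks.
example = [
    (["d1", "d2", "d9"], ["d1", "d2", "d3", "d4"]),  # 2 of 4 gold items curated
    (["a", "b"], ["a", "b"]),                        # all gold items curated
]
print(average_curated_recall(example))  # 0.75
```

Note that extra non-gold items in the curated set (like `d9`) do not lower recall; they would only show up in a precision-style metric.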

**The news.** On June 1, 2026, **Harness-1** (arXiv:2606.02373) introduced a **20B-parameter search agent** that separates semantic decision-making from state management. The policy decides what to search, inspect, curate, verify, and when to stop; a **state-externalizing harness** holds the working memory (candidate pools, importance-tagged curated sets, evidence links, verification records, and compressed observations) and renders only a budget-bounded slice into the model's context each step. Rather than training over an ever-growing transcript, the agent is trained with **reinforcement learning over a structured external workspace**. It reports **0.730 average curated recall across 8 retrieval benchmarks (web, finance, patents, multi-hop QA), +11.4 points** over the next-strongest open search sub-agent.

Picture a detective working a long case. Every lead, photo, and verified alibi gets pinned to the **case-board on the wall** and connected with red string — the board is the durable record, and it only ever grows. When the detective walks into an interview, they don't wheel the entire case file into the room; they carry a single **index-card briefing** with just what this conversation needs. The board stays on the wall; only a briefing walks in. A rookie who instead lugs the whole growing file box into every interview eventually runs out of desk space — that is exactly what happens when a search agent replays its entire transcript into a finite context window.

That is the move Harness-1 makes concrete. The naïve design treats the agent's memory as a **growing transcript**: every observation, every candidate document, every verification step is concatenated and fed back to the model on the next step. It works for a few steps, then the transcript balloons and the search has to stop, not because the agent ran out of leads, but because it ran out of room. Harness-1 instead keeps that durable state in the harness (the case-board) and leaves the policy to make the semantic calls: what to search, inspect, curate, and verify, and when to stop. Each step, the harness performs **budget-bounded rendering**: it selects a token-bounded slice of the workspace (the briefing) and shows only that to the model. The board can grow to hundreds of items while the briefing stays the same size, so **context cost stays flat no matter how deep the search goes**. Crucially, the agent is trained with reinforcement learning *over this workspace*, not over transcripts, so the policy learns the harness skills (curate, importance-tag, verify, compress, stop) as first-class actions.
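The split between a growing workspace and a bounded render can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the `Evidence` and `Workspace` classes, the word-count tokenizer, and the "verified first, then by importance" ranking rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    doc_id: str
    summary: str
    importance: int          # hypothetical tag scale: higher means more important
    verified: bool = False


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: count whitespace-separated words.
    return len(text.split())


@dataclass
class Workspace:
    """Durable store that lives in the harness, outside the model's context."""
    items: list[Evidence] = field(default_factory=list)

    def add(self, item: Evidence) -> None:
        self.items.append(item)  # the board only ever grows

    def render(self, budget: int) -> str:
        """Return a view of the workspace that never exceeds `budget` tokens."""
        # Assumed ranking rule: verified items first, then higher importance.
        ranked = sorted(self.items, key=lambda e: (e.verified, e.importance), reverse=True)
        lines, used = [], 0
        for e in ranked:
            line = f"[{e.doc_id}] imp={e.importance} verified={e.verified} {e.summary}"
            cost = count_tokens(line)
            if used + cost > budget:
                continue  # does not fit; a shorter item later may still fit
            lines.append(line)
            used += cost
        return "\n".join(lines)


ws = Workspace()
for i in range(200):
    ws.add(Evidence(f"d{i}", f"finding number {i} about the case",
                    importance=i % 5, verified=(i % 7 == 0)))

view = ws.render(budget=50)
print(len(ws.items), "items stored;", count_tokens(view), "tokens rendered")
```

The workspace holds 200 items, yet the rendered view stays within the 50-token budget. Adding more items changes which lines are selected, not how large the view is.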

| Design | What lives in context | Context cost as search deepens | Failure mode |
|---|---|---|---|
| Growing transcript | The full action + observation history, replayed every step | Grows with every step | Overflows the window; the search stalls on length, not leads |
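| State-externalizing harness | A budget-bounded slice of the external workspace, rendered each step | Stays flat regardless of search depth | |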

*The two rows describe the contrast Harness-1 draws between transcript-style memory and its externalized workspace; the "budget-bounded slice" claim is from the paper. Token figures in the hero animation are illustrative.*

Walk the budget with some round numbers *(illustrative)*. Say each search step adds about **2,000 tokens** of fresh observations. Under the growing-transcript design, those tokens never leave: after 8 steps the model is reading roughly 16,000 tokens of history, after 20 steps about 40,000, and a genuinely deep multi-hop search marches straight past a typical working window. Under the state-externalizing harness, those 2,000-token observations land in the workspace, but the model is only ever shown a fixed ~6,000-token render — step 8 and step 20 cost the *same* **6,000 tokens** in context. The accumulated evidence still exists; it just lives on the case-board instead of in the briefing. That is why Harness-1 can keep curating to **0.730 recall** across deep benchmarks where a transcript agent would have run out of room — and it's the same lever the agent-engineering track frames as durable state the harness owns, rather than state smeared across a prompt.
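The same arithmetic as a short script, using the illustrative round numbers from the paragraph above (2,000 tokens of observations per step, a fixed 6,000-token render). The constant and function names are hypothetical, and the figures are not measurements from the paper.

```python
TOKENS_PER_STEP = 2_000   # illustrative: fresh observations added each step
RENDER_BUDGET = 6_000     # illustrative: fixed size of the rendered slice


def transcript_context(step: int) -> int:
    """Tokens in context at `step` when the full history is replayed."""
    return step * TOKENS_PER_STEP


def harness_context(step: int) -> int:
    """Tokens in context at `step` when only a fixed render is shown."""
    return RENDER_BUDGET


print(f"{'step':>4} {'transcript':>11} {'harness':>8}")
for step in (1, 4, 8, 20):
    print(f"{step:>4} {transcript_context(step):>11,} {harness_context(step):>8,}")
```

With these numbers the transcript design passes the harness's fixed cost at step 4 (8,000 tokens against 6,000) and keeps climbing, while the harness column never changes.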

It lands as a sharp companion to the recent push on *how* search agents act — GrepSeek learns a better **action space** (shell commands over a corpus), while Harness-1 learns a better **state substrate** (an externalized workspace). Same RL-trained-search-agent family, orthogonal levers. As the work frames it, the model should make the semantic calls and the harness should own the memory — a clean division that the standard fixes for an overflowing context have been circling, now learned end-to-end.


Harness-1 is a 20B-parameter, RL-trained search agent that separates the model's semantic decisions (what to search, inspect, curate, verify, and when to stop) from state management. A state-externalizing harness holds the durable working memory — candidate pools, importance-tagged curated sets, evidence links, verification records, and compressed observations — and renders only a budget-bounded slice into the model's context each step. It reports 0.730 average curated recall across 8 retrieval benchmarks, +11.4 points over the next-strongest open search sub-agent.

A search agent that replays its full transcript into context each step grows that context with every observation, so a deep search eventually overruns the context window and stops on length rather than on evidence. Externalizing state keeps the accumulated evidence in the harness and renders only a fixed-size slice, so context cost stays flat regardless of search depth — letting the agent keep curating across deep, multi-hop benchmarks.

A growing transcript concatenates the entire action-and-observation history and feeds it back every step, so its size scales with the number of steps. Harness-1 instead stores that history in a structured external workspace and trains the policy with reinforcement learning over that workspace — so the model learns to curate, verify, and compress as explicit actions, and the context the model reads is a budget-bounded rendering of the workspace rather than the raw, unbounded log.

Originally posted on [Learn AI Visually](https://learnaivisually.com/ai-explained/harness-1-externalized-state).