When Does Data Help Automated Context Engineering?

Claude Code can improve other AI agents without training data in four of seven tested applications, performing as well as with data. Data helps only where Claude Code's prior knowledge of the task runs out, and the drift between its self-generated inputs and real data predicts this gap with a Spearman correlation of +0.79. The finding suggests that automated agent engineering can often skip data collection when the model already knows the task domain.

When does data help automated agent engineering? Claude Code can often improve another agent with no training data at all. Across seven applications, data helps only where Claude Code’s own prior knowledge of the task runs out, and how far its guesses drift from the real data mostly tells you which cases those are. The catch is that drift reveals when Claude Code is flying blind, not whether flying blind actually costs anything.

An agent is more than a model. It is also the prompts, tools, context management, guardrails, orchestration, and compute infrastructure designed around the model [1, 2]. When the agent does something wrong, an agent engineer tunes one or more of these knobs so that the error never happens again. Automated agent engineering goes meta by putting an AI agent in the role of agent engineer [3: https://www.tensorzero.com/blog/automated-ai-engineer/ ; 4: https://arxiv.org/abs/2603.28052]. Claude Code and Codex can improve agent prompts (/blog/the-engineering-practices-claude-code-and-codex-use-to-improve-ai-agents/) and perform engineering practices like iteratively evaluating their changes.

But what happens without training data? Surprisingly, Claude Code’s improvements performed as well without training data as with it on several of the applications I tested. On three (Wordle, scientific paper reproduction, business simulation), data did improve task success rate by between five and twenty percentage points, but on the other four (entity extraction, contract extraction, customer service, software engineering) the without-data version performed roughly the same.

Looking into the conversation histories explained why. Even without training data, Claude Code still ran ad hoc evaluations: it made inferences on inputs it generated itself and judged whether the output matched its prompt edit. So the operative question is not whether Claude Code has data; it is whether Claude Code already knows what the data looks like.
Data helps exactly when Claude Code’s prior knowledge of the application runs out, and that prior runs out in specific, identifiable ways: the corpus is not memorized, the task shape is unfamiliar, or the harness never says what the agent will actually be shown. That suggests a way to measure the missing prior from the outside: how far Claude Code’s self-generated inputs drift from the real data. Across the seven applications, this drift tracks the data-ablation gap (the test-set success rate with data minus the rate without it) at a Spearman rank correlation of +0.79 (exact two-sided p = 0.040; Pearson r = +0.96). The lone clean exception is the instructive part: drift measures whether Claude Code is flying blind, not whether flying blind hurts the metric. Those are two different axes, and one task scores high on the first while shrugging off the second.

The experiment

The seven applications I explored are: named entity extraction (NER; https://arxiv.org/abs/1909.01441v1), NDA clause extraction (https://github.com/applicaai/kleister-nda), Wordle, customer service (https://taubench.com/ home), software engineering (https://www.tbench.ai/), business management (https://www.ycbench.com/), and scientific paper reproduction (Science; https://arxiv.org/abs/2510.24591).

For each application agent I ran the same experiment under two conditions, with data and without, using five independent seeds per condition. Claude Code was given configuration files for the agent (prompts, models, tool lists, …). It was instructed to improve the initial prompts. The new prompts were scored on a held-out set of test tasks. The only difference was whether Claude Code was given 100 real traces as training data. In the without-data condition, the same baseline-data paths existed but the trace files were empty.

Claude Code as Agent Harness Engineer

- The optimizer is Claude Code running against an internal harness that lets it edit the application agent configuration file.
Tool access is Read , Bash , Edit , Write . No external MCP servers. - The optimizer runs inside an isolated Docker container base image node:24-slim containing only the Claude Code CLI, curl , and git . There is no Python and no eval source code on the filesystem. The container shares a Docker network with the gateway, so Bash can curl http://gateway:3000/inference ... to test prompts but has no other route to the application code. - Claude Code is running claude-sonnet-4-6 . The application agent model is gpt-5.4-mini across all seven applications. Claude Code is given the following instruction at the start of every run, with placeholders like {config dir} and {function name} resolved per application before the run begins. The contents are held constant across all conditions. TensorZero Function Optimizer You are optimizing a TensorZero function to improve its performance metric. Environment - T0 config files: {config dir}/ only these and the baseline data below are relevant — don't explore elsewhere - Gateway URL: {gateway url} - Pre-dumped baseline data: {baseline data dir}/ read-only; direct DB access is not available - Restart after config edits: curl -sf -X POST http://eval:5111/restart-gateway - Isolated container. No Python or pip ; node and curl are on $PATH ; jq is not installed. Use node -e "..." for JSONL parsing readline + JSON.parse + project to stdout — prefer it over shell pipelines when you need fields per row. - Don't set temperature on any variant some models reject non-default values . Keep an initial variant as a baseline reference. - Don't run evaluation episodes yourself — the harness does that after you exit. Task - Function: {function name} - Metric: {metric name} . Check the metric's optimize field in tensorzero.toml for direction boolean and float metrics may minimize or maximize . 
- Baseline performance: {baseline metrics} Available Models {model list} Baseline data - {baseline data dir}/inferences.jsonl — one row per inference what the model said per task . - {baseline data dir}/feedback.jsonl — one row per metric value. - {baseline data dir}/initial config/ — read-only copy of the starting T0 config tree. Files are often 20+ MB. Don't cat them whole. Start by head -3 on each to learn the row shape field names and nesting vary by env , then project out the fields you need. The projection pattern grep first to narrow, then node -e to project: bash grep $TARGET ID {baseline data dir}/inferences.jsonl \ | node -e " require 'readline' .createInterface {input: process.stdin} .on 'line', l = { const r = JSON.parse l ; console.log r.id, r.variant name, JSON.stringify r.output .slice 0,200 ; } ;" cat inferences.jsonl | ... loads the whole file; grep -first keeps the pipeline cheap. Cross-record one-liners Adapt the failure predicate to your metric — boolean uses "value":0 / "value":1 ; float values depend on optimize direction. bash Inferences per episode grep -o '"episode id":" ^" "' {baseline data dir}/inferences.jsonl | sort | uniq -c | sort -rn | head Last inference of a failing episode grep $FAIL ID {baseline data dir}/inferences.jsonl | tail -1 Which metrics are present grep -o '"metric name":" ^" "' {baseline data dir}/feedback.jsonl | sort | uniq -c target ids of failures boolean example — adapt the predicate for float metrics grep '"metric name":"{metric name}"' {baseline data dir}/feedback.jsonl \ | node -e " require 'readline' .createInterface {input: process.stdin} .on 'line', l = { const r = JSON.parse l ; if r.value === 0 || r.value === false console.log r.target id ; } ;" /tmp/failed.txt head -5 /tmp/failed.txt | while read id; do grep "$id" {baseline data dir}/inferences.jsonl | head -1; done Templates, schemas, and the required content shape TensorZero has two co-existing config styles. 
Check which one the function uses in tensorzero.toml : Legacy per-role : toml functions."my fn" user schema = "functions/my fn/user schema.json" and system schema, assistant schema functions."my fn".variants.initial user template = "functions/my fn/initial/user template.minijinja" New named : toml functions."my fn" schemas.user query.path = "functions/my fn/user query schema.json" functions."my fn".variants.initial templates.user query.path = "functions/my fn/initial/user query.minijinja" Canonical content block for a templated message both styles : json "content": { "type": "template", "name": "<template name ", "arguments": { / object matching the schema / } } For legacy, "name" is the role "user" / "system" / "assistant" . For new, it's the key under schemas. / templates. . For a role with no schema: "content": "Hello" or {"type":"text","text":"Hello"} . Example — τ-retail user schema.json and the matching curl body: json // user schema.json { "properties": { "observation": { "type": "string" } }, "required": "observation" , "type": "object" } // curl body { "function name": "tau bench retail v0::act", "variant name": "your new variant", "input": { "messages": { "role": "user", "content": { "type": "template", "name": "user", "arguments": { "observation": "Hello, I need to cancel my order." } } } } } Methodology The core loop is: survey the baseline → add variants → test one → iterate. The decisions worth getting right: - Metric direction defines "failure." Don't assume value:0 is bad; read the metric's optimize field. - Judge manual variant tests by the curl /inference output itself — right tool call, right JSON, right content. - Multi-turn agentic envs customer service, business management, coding need real conversational state to be representative. Pick a real episode from inferences.jsonl , copy its first 2–3 messages into your curl body, check how the variant continues. A turn-0 probe alone tells you little. 
- When done, leave the best config in place with the experimentation section below, and exit. Routing: Experimentation Config After creating new variants, add an experimentation section — otherwise the gateway round-robins and wastes test episodes on bad variants. Keep candidates to your best ~3–4, including initial as a baseline. toml functions."{function name}".experimentation type = "track and stop" metric = "{metric name}" candidate variants = "initial", "your new variant 1", "your new variant 2" fallback variants = min samples per variant = 5 delta = 0.1 epsilon = 0.0 update period s = 5 min prob = 0.0 max samples per variant = 10000

I report the gap in success rate between two runs, score with data − score without data, on the y-axis of the chart above. If data helps, the gap is positive. If data is unnecessary, the gap is around zero.

Evaluation of optimized variants

- The per-application metric is binary (success / no success), measured on a held-out test set of up to 100 episodes per (seed, variant). What counts as success depends on the application:
  - NER: exact match. The agent correctly identifies and classifies every named entity in the input sentence.
  - NDA: exact match. The agent correctly extracts the four target fields (effective date, jurisdiction, party, term) from the document.
  - Wordle: the 5-letter target word is guessed within six attempts.
  - Customer service (τ-bench retail): the agent completes the user’s request with the correct database changes (verified by state comparison) and communicates all required information.
  - Software engineering (terminal-bench): every test case in the task suite passes.
  - Business management (YC bench): the company survives to the simulation horizon without bankruptcy.
  - Scientific paper reproduction (replicationbench): every per-paper verifier reward component evaluates to ≥ 1.0 (Harbor’s resolved metric).
- Five seeds per (application, condition).
- The y-axis of the chart is the per-application gap in success rate: score with data − score without data, with the median across seeds plotted as the point and the inter-quartile range as the vertical whisker.

How to estimate novelty?

Claude Code constructed examples to test the prompts it wrote. It did this several times per run by making inferences on example user inputs and checking the response. With access to real data, it copied those examples from the trace dump. Without access, I was surprised to see that it generated synthetic examples without any additional prompting. Immediately, I wanted to know how different the synthetic examples were from the real ones.

Japan July refined zinc imports off 47.5 pct yr / yr .
{ "person": [], "location": ["Japan"], "organization": [], "miscellaneous": ["July", "refined zinc", "refined zinc imports", "47.5 pct yr / yr"] }

The COP30 climate summit in Belém, Brazil drew delegates from 190 nations. UN Climate Chief Simon Stiell praised pledges from China, India, and the European Union. US Climate Envoy John Podesta and Brazilian President Luiz Inácio Lula da Silva co-chaired the closing session.
{ "person": ["Simon Stiell", "John Podesta", "Luiz In\u00e1cio Lula da Silva"], "organization": ["UN", "European Union"], "location": ["Bel\u00e9m", "Brazil", "China", "India"], "miscellaneous": ["COP30"] }

To investigate the difference, I devised a dataset-synthesis pipeline and ran it on the seven applications. Given just the application agent’s config and the dataset schema, Claude Code was instructed to generate 20–40 example conversations. Across five independent seeds per application, I compute the maximum mean discrepancy (MMD²) in embedding space (Voyage voyage-3-large) between the synthetic corpus and the real-trace corpus, and report the median across seeds as the per-application novelty score.
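The novelty score described above can be sketched in a few lines of Node. This is only an illustration under stated assumptions, not the pipeline used in the post: the RBF kernel, the bandwidth `sigma = 1`, the choice of the unbiased estimator, and the helper names (`rbf`, `meanKernel`, `mmd2`, `median`) are all hypothetical, and the toy 2-d vectors stand in for real embedding vectors.

```javascript
// Squared Euclidean distance between two equal-length vectors.
function sqDist(a, b) {
  let s = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    s += d * d;
  }
  return s;
}

// RBF (Gaussian) kernel. ASSUMPTION: the source does not state which kernel
// or bandwidth was used; sigma is a free hyperparameter here.
function rbf(a, b, sigma) {
  return Math.exp(-sqDist(a, b) / (2 * sigma * sigma));
}

// Mean kernel value over all (i, j) pairs. skipDiagonal drops the i === j
// terms and is only meaningful when xs and ys are the same sample.
function meanKernel(xs, ys, sigma, skipDiagonal) {
  let total = 0;
  let count = 0;
  for (let i = 0; i < xs.length; i++) {
    for (let j = 0; j < ys.length; j++) {
      if (skipDiagonal && i === j) continue;
      total += rbf(xs[i], ys[j], sigma);
      count++;
    }
  }
  return total / count;
}

// Unbiased MMD^2 estimate between two samples of embedding vectors.
// It can be slightly negative when the two samples are very similar.
function mmd2(real, synthetic, sigma = 1.0) {
  return (
    meanKernel(real, real, sigma, true) +
    meanKernel(synthetic, synthetic, sigma, true) -
    2 * meanKernel(real, synthetic, sigma, false)
  );
}

// Median of a list of numbers, used to aggregate across seeds.
function median(values) {
  const s = [...values].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Toy data: one "real" corpus and three synthetic corpora (one per seed).
const real = [[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1]];
const seeds = [
  [[0.05, 0.05], [0, 0.1], [0.1, 0]],   // close to the real corpus
  [[1, 1], [1.1, 1], [1, 1.1]],         // far from the real corpus
  [[0.5, 0.5], [0.6, 0.5], [0.5, 0.6]], // in between
];

const perSeed = seeds.map((synthetic) => mmd2(real, synthetic));
const noveltyScore = median(perSeed);
console.log({ perSeed, noveltyScore });
```

In this toy setup the far corpus gets the largest MMD², the close one lands near zero, and the median picks the in-between seed. Taking the median means a single unusually good or bad synthesis run does not dominate the per-application score.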
In other words, novelty measures how different the real traces are from what Claude Code guesses the data should look like without access to those traces. Across the seven applications this score tracks the data-ablation gap at Spearman ρ = +0.79 (exact two-sided p = 0.040; at n = 7 the asymptotic approximation is unreliable, so I report the exact permutation value). MMD² was the first and only drift estimator I tried: a standard non-parametric two-sample distance, fixed before I looked at the gaps. So this is a single pre-chosen statistic, not the best of a search over estimators.

Dataset-novelty estimator (MMD²)

The goal is to estimate how surprising a dataset is to a coding agent like Claude Code or Codex that is instructed to be an agent engineer. To do this, I compare a real dataset to a dataset generated by the coding agent.

Each application is an LLM function with a defined input/output contract, like answering a customer-service ticket, extracting entities from a sentence, or playing a turn of Wordle. An inference is one call to that function: the input it received plus the output it returned, recorded as one row in a JSONL file. An episode is one logical interaction with the function, identified by a shared episode id. A single-turn application like NER has exactly one inference per episode. A multi-turn application like Wordle chains several inferences into one episode (one inference per turn of the game).

For each application I assume two corpora of such rows:

| corpus | source | size |
|---|---|---|
| Real baseline | actual rows logged from prior runs of the function on real users / tasks | hundreds to ~20 k rows |
| Synthetic | rows invented by an agent given only the function’s spec (no real data seen) | 25–170 rows per seed |

Because the coding agent is not conditioned on real data to generate the synthetic dataset, the divergence between its distribution over datasets given the task and the distribution over real datasets is an indicator of novelty.
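For reference, the squared maximum mean discrepancy named in the heading above has a standard textbook form. The notation is an assumption of this note rather than the post's own: P is the distribution of embedded real rows, Q the distribution of embedded synthetic rows, k a positive-definite kernel (the source does not say which kernel was used), and x_1..x_m, y_1..y_n are the two finite samples.

```latex
% Population quantity
\mathrm{MMD}^2(P, Q)
  = \mathbb{E}_{x, x' \sim P}\,[k(x, x')]
  - 2\,\mathbb{E}_{x \sim P,\; y \sim Q}\,[k(x, y)]
  + \mathbb{E}_{y, y' \sim Q}\,[k(y, y')]

% Unbiased finite-sample estimate
\widehat{\mathrm{MMD}}^2_u
  = \frac{1}{m(m-1)} \sum_{i \ne j} k(x_i, x_j)
  + \frac{1}{n(n-1)} \sum_{i \ne j} k(y_i, y_j)
  - \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k(x_i, y_j)
```

For a characteristic kernel such as the Gaussian kernel, the population quantity is zero exactly when P and Q coincide, which is what makes it usable as a drift measure. Whether the post uses this unbiased form or the biased variant that keeps the i = j terms is not stated.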
Therefore, I want a scalar that measures the divergence between the distribution of rows in the real baseline corpus and the distribution of rows in the synthetic corpus. I chose Maximum Mean Discrepancy [5], which is a standard non-parametric estimator. It compares the kernel-induced means of two finite samples, and its estimate converges to zero as the samples grow when both are drawn from the same underlying distribution. A larger MMD² means the coding agent’s knowledge of the application, given the config, covers less of the actual deployment.

Generating the synthetic corpus

The synthetic corpus is produced by a coding agent (Claude Code or Codex) given only:
- the function’s machine-readable specification: input schema, output schema, system prompt, available tools, and the set of defined evaluation metrics;
- the schemas of the two output files (inferences.jsonl row schema and feedback.jsonl row schema).

The agent has no access to real data during synthesis. The procedure has five steps (read the spec, plan input coverage, generate inputs and outputs, calibrate periodically with a few live probe calls, then emit feedback values), reproduced verbatim in the SKILL.md and methodology.md instruction files below. The output is two files: inferences.jsonl (one row per inference) and feedback.jsonl (one row per metric value, linked to the inference or episode it scores). Both are schema-validated before the run exits. The episode budget is a parameter set per run; in this analysis it was set to 20–40 episodes per application. I run K independent agent seeds per application (K = 5 in this analysis), so the dataset-novelty estimator can be aggregated across runs.

The instruction files the synthesis agent reads are reproduced below.

SKILL.md

The top-level instruction the synthesis agent receives:

---
name: dataset-synthesis
description: Synthesize representative inferences and feedback for an LLM application described by a TensorZero configuration. Use when a plausible baseline corpus is needed for a function that has not yet collected real data.
--- TensorZero Dataset Synthesis You are synthesizing a plausible dataset for a TensorZero function. You will produce two JSONL files that look like what inferences.jsonl and feedback.jsonl would contain after the function had run live for a while. Crucially, you do not have any real baseline data to draw from, but the configuration files should provide you with enough information about the application to generate sensible examples. Environment - T0 config files: {config dir}/ - Gateway URL: {gateway url} you may POST to /inference to spot-check your understanding of the input structure - Output directory: {output dir}/ — write inferences.jsonl and feedback.jsonl here - Isolated container. No Python or pip ; node and curl are on $PATH ; jq is not installed. Use node -e "..." for JSONL parsing. - Emit rows with variant name: "initial" only. Task - Function: {function name} - Metrics defined for this function: {metric name list} read their kind , level , and optimize fields from tensorzero.toml - Budget: at least {min episodes} , but no more than {max episodes} episodes. An episode is one logical interaction with the function — a single inference for single-turn functions, a chain of inferences sharing one episode id for multi-turn. - Output files: - {output dir}/inferences.jsonl — one row per inference call see reference/inferences schema.md - {output dir}/feedback.jsonl — one row per metric value, with target id referring to the inference id or episode id of a row in the inferences file see reference/feedback schema.md Workflow Five steps. See reference/methodology.md for the long form; the short version: 1. Read the spec. Open {config dir}/tensorzero.toml and the linked schema / template files. Note: input schema, output schema, function type chat vs tool , defined metrics, and whether the function is one-shot or part of a multi-turn episode. 2. Hypothesize the input distribution. What kinds of users / states does this function see in deployment? 
Sketch a coverage plan: how many length buckets, which schema slots vary, which edge cases matter. Aim for diversity, not just a single canonical mode. 3. Generate inputs. Plan out at least {min episodes} but no more than {max episodes} distinct episodes. For multi-turn functions, decide each episode's length up front based on what's realistic for the task a 4-turn episode contributes 4 rows sharing one episode id . 4. Spot-check via the gateway. Periodically POST a synthetic input to {gateway url}/inference to confirm your understanding of the input structure is correct and to see what the initial variant's output actually looks like. Gateway calls are expensive — treat this as a calibration step, not as the way to generate every row. Generate outputs yourself in between checks. 5. Generate feedback rows. For each metric in {metric name list} , emit one feedback row per appropriate target per-inference or per-episode based on the metric's level . After generating, validate: bash node /skill/scripts/validate.js {output dir} \ --config {config dir}/tensorzero.toml \ --min-episodes {min episodes} --max-episodes {max episodes} The validator checks schema compliance, referential integrity, budget, and the variant name == "initial" invariant. Fix any errors it reports before exiting. Output contract When you exit, {output dir}/ must contain exactly: - inferences.jsonl — every row conforms to the schema in reference/inferences schema.md ; the rows span at least {min episodes} and at most {max episodes} distinct episode id s - feedback.jsonl — one or more rows per metric, with every target id referring to an id for inference-level metrics or episode id for episode-level metrics that exists in inferences.jsonl Do not write any other files in {output dir}/ . Do not modify {config dir}/ . Stay within budget — don't issue gateway calls indefinitely. Principles - Quality of coverage beats quantity of duplicates. - Use the gateway as a calibration tool. 
A periodic /inference call confirms your understanding of the input structure and shows you what the initial variant actually emits. It's not a way to generate every row — gateway calls are expensive, and it's fine to generate outputs yourself between checks. - Don't peek. You don't have baseline data. If you find yourself wanting to "look at a real example," that's the signal to make a better-reasoned guess from the spec instead. - Plausibility includes failure. Some inferences will fail their metric. Your feedback distribution should reflect a realistic failure rate for the task — not 100% success. reference/methodology.md The longer methodology the agent can consult: Synthesis methodology The recipe for producing a faithful inferences.jsonl + feedback.jsonl from spec alone. Five steps, each with concrete things to look for. 1. Read the spec carefully Open {config dir}/tensorzero.toml . For the target function, capture: - Type : chat vs tool / json . This determines output shape. - Schemas : input per-role or named , output for tool-call functions . Read every referenced .json and every .minijinja template. - Metrics : which are defined, their type , level , optimize . These dictate the feedback.jsonl rows you'll write. - System prompt : usually inside the variant's template. Read it — this is the strongest signal about what the function is for. - Tool list for tool functions : names, descriptions, argument schemas. Two patterns to check early: bash What kind of function? grep -A2 "^\ functions\.\"{function name}\"\ " {config dir}/tensorzero.toml Which metrics are defined? grep -E "^\ metrics\." {config dir}/tensorzero.toml 2. Hypothesize the input distribution Before generating anything, sketch a plan. For each schema slot in the input: - What value ranges / shapes does it plausibly take in deployment? - Are there subpopulations long vs short, simple vs nested, single vs multi-entity ? - What's the realistic length / complexity distribution? 
Write the plan as a comment-level outline before the first row. Something like: Plan for {function name} budget: at least {min episodes}, at most {max episodes} episodes - Target ~N episodes × K turns each - Mix of the major user intents the function supports - Vary user tone / register across episodes - Cover authentication / setup steps the function expects before the main action Don't skip this step. Generating without a plan reliably produces a stack of near-duplicates of the same canonical input. 3. Generate inputs For each row in your plan: - Construct the input.messages array per the schema rules in inferences schema.md. - For multi-turn: episode by episode. Within one episode, mint a fresh episode id , then chain inferences — each turn's input.messages is the previous turns' input plus assistant reply plus next user turn. Tools you'll use: - node -e for any structured generation writing JSON bodies, looping, minting UUIDs . - A working directory in /tmp for intermediate files probe bodies, response captures . - curl to call the gateway. 4. Spot-check via the gateway Periodically — not on every row — POST a synthetic input to the gateway and look at the response. The purpose is calibration, not generation: - Confirm the input shape you've been building actually parses template name correct, schema arguments well-formed . - See what the initial variant's output structure looks like for that input, so the outputs you generate yourself stay faithful to it. - Catch drift early — if the first spot-check shows your arguments object missing a required field, fix the generator before producing more rows. bash node -e " const body = { / function name, variant name: 'initial', input: ... 
/ }; process.stdout.write JSON.stringify body ; " /tmp/req.json curl -sf {gateway url}/inference \ -H 'Content-Type: application/json' \ --data @/tmp/req.json /tmp/resp.json A reasonable cadence: one spot-check before you start generating, one after the first episode, and one every ~5 episodes thereafter. Cheaper than per-row, sufficient to catch most schema mistakes. Assemble each inference row from: - Your minted id and episode id - The current created at - "initial" as variant name - Your input from step 3 - An output you write yourself, matching the structure you saw in the spot-checks gateway response → guide for your own generation For multi-turn: within one episode, each turn's input.messages extends the previous turn's by appending the assistant reply and the next user message. Keep the chain coherent across turns of the same episode id . Write each row to {output dir}/inferences.jsonl immediately — don't batch, so a crash mid-run preserves progress. 5. Generate feedback rows For each metric in {metric name list} : - Determine its level inference vs episode from the TZ config. - For inference-level: walk every inference row and emit one feedback row per inference, metric . - For episode-level: walk every distinct episode id and emit one feedback row per episode, metric . For the value : - If the metric is verifiable from the row alone e.g. exact match against a known gold answer, or a length / format check , compute it programmatically. - Otherwise , predict the value from input + output using your understanding of the task. Stay calibrated — see "realistic value distributions" in feedback schema.md. Write to {output dir}/feedback.jsonl . 
Iteration / self-audit After roughly a third of the planned episodes, stop and inspect what you've produced: bash Count rows and unique episodes wc -l {output dir}/inferences.jsonl grep -o '"episode id":" ^" "' {output dir}/inferences.jsonl | sort -u | wc -l Variety of input templates / first 80 chars node -e " require 'readline' .createInterface {input: require 'fs' .createReadStream '{output dir}/inferences.jsonl' } .on 'line', l = { const r = JSON.parse l ; const m = r.input.messages 0 ; const c = Array.isArray m.content ? m.content 0 : m.content; const s = typeof c === 'string' ? c : JSON.stringify c.arguments || c ; console.log s.slice 0, 80 ; } ;" | sort -u | head -20 Ask: - Am I converging on one mode? lots of near-identical first lines - Did I cover all the schema slots I planned for? - Does my feedback distribution look reasonable? If yes to mode collapse — diversify the remaining episodes by deliberately picking cases that look different from what's there. You have room to add more episodes up to {max episodes} ; you do not have to stop at the planned count if your coverage feels thin. When to stop and validate Once you've reached at least {min episodes} episodes and no more than {max episodes} with each metric covered, run: bash node /skill/scripts/validate.js {output dir} Read its output and fix any errors. Then exit. Anti-patterns - Skipping step 2. "I'll just start generating" gives mode collapse 100% of the time. - Skipping step 4 entirely. Without any spot-checks you have no signal that your input shape parses or that your outputs resemble what the model actually emits. - Treating all episodes as length 1. For multi-turn functions, single-turn episodes are unrepresentative . - Generating one giant batch and writing at the end. Write incrementally so a crash doesn't lose work. - Ignoring the metric level . Inference-level vs episode-level changes which target id you reference. 
reference/inferences_schema.md

The schema for inferences.jsonl rows:

inferences.jsonl row schema

One JSON object per line. Every row represents one call to `/inference` against the function. Multiple rows can share an `episode_id` (multi-turn episodes).

Fields

| field | type | required | notes |
| --- | --- | --- | --- |
| `id` | string (UUID v7) | yes | Unique per inference. UUID v7 sorts by timestamp; see "Minting UUIDs" below. |
| `episode_id` | string (UUID v7) | yes | One UUID per logical episode. For single-turn functions, this is fresh per row. For multi-turn, all rows in the same episode share it. |
| `created_at` | string (ISO 8601, UTC) | yes | E.g. `"2026-05-15T18:42:11.123Z"`. Should be monotonic within an episode. |
| `variant_name` | string | yes | Always `"initial"` for this skill. |
| `input` | object | yes | `{"messages": [...]}`: the request body's `input` field. See "Input shape" below. |
| `output` | array | yes | The gateway's response content blocks. Shape depends on function type. See "Output shape" below. |

Minting UUIDs

UUID v7 is required because TensorZero uses the embedded timestamp to order rows.

```js
// Node-only UUID v7 minter (no external deps)
const crypto = require("crypto"); // built-in module; provides randomBytes

function uuidv7() {
  const ts = BigInt(Date.now());
  const tsHex = ts.toString(16).padStart(12, "0");
  const rand = crypto.randomBytes(10);
  rand[0] = (rand[0] & 0x0f) | 0x70; // version 7
  rand[2] = (rand[2] & 0x3f) | 0x80; // RFC 4122 variant
  const r = rand.toString("hex");
  return `${tsHex.slice(0, 8)}-${tsHex.slice(8, 12)}-${r.slice(0, 4)}-${r.slice(4, 8)}-${r.slice(8, 20)}`;
}
```

For an episode of N turns, mint one `episode_id`, then mint N `id`s, advancing `created_at` by ~1s between them.

Input shape

The `input.messages` field follows the standard chat-message format. Each message is `{role, content}` where:

- `role`: `"system"` | `"user"` | `"assistant"`
- `content`: either a string (rare) or an array of content blocks

The most common content block for a templated function is:

```json
{ "type": "template", "name": "<template_name>", "arguments": { /* object matching the schema */ } }
```

`<template_name>` and the `arguments` shape come from the TZ config. Two co-existing styles:

Legacy (per-role schemas):

```toml
[functions."my_fn"]
user_schema = "functions/my_fn/user_schema.json"

[functions."my_fn".variants.initial]
user_template = "functions/my_fn/initial/user_template.minijinja"
```

`<template_name>` is the role name (`"user"`, `"system"`, `"assistant"`).

New (named schemas):

```toml
[functions."my_fn"]
schemas.user_query.path = "functions/my_fn/user_query_schema.json"

[functions."my_fn".variants.initial]
templates.user_query.path = "functions/my_fn/initial/user_query.minijinja"
```

`<template_name>` is the key under `schemas.*` / `templates.*` (e.g. `"user_query"`).

For roles that have no schema, use either `"content": "Hello"` or `"content": [{"type":"text","text":"Hello"}]`.

Filesystem path mangling: function and tool names containing `::` (e.g. `"my_function::act"`) appear in the TZ config as `[functions."my_function::act"]`, but on disk the corresponding directory is `functions/my_function____act/` (four underscores). When reading template / schema files, translate `::` → `____` in the path. A quick `find /config -type f` confirms the actual layout if you're unsure.

Example: a templated user input

```json
"input": {
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "template",
          "name": "user",
          "arguments": { "observation": "Hello, this is a sample user message." }
        }
      ]
    }
  ]
}
```

For a multi-turn episode, append assistant + tool-result messages between user turns. The third turn's `input.messages` will hold 5 entries (system?, user₀, assistant₀, user₁, assistant₁).

Output shape

Depends on the function's `type` in the TZ config; there are three forms.

`type = "chat"`: list of content blocks:

```json
"output": [{ "type": "text", "text": "The model's reply." }]
```

Tools and text can mix in the same list (a `text` block followed by a `tool_call`, or several `tool_call`s).

`type = "chat"` with tools: same list, with `tool_call` blocks:

```json
"output": [{ "type": "tool_call", "name": "<tool_name>", "arguments": { /* matching tool schema */ } }]
```

Real rows often include extra fields like `id`, `raw_name`, `raw_arguments` carried back from the underlying model API. Reproduce only `type` + `name` + `arguments` unless you also call the gateway; the extras are post-hoc.

`type = "json"`: a single object with `raw` (the unparsed string) and `parsed` (the matched JSON):

```json
"output": { "raw": "{\"person\": [], \"location\": [\"Japan\"] }", "parsed": { "person": [], "location": ["Japan"] } }
```

`parsed` must conform to the function's `output_schema`. `raw` is the literal string the model emitted; usually it's just `JSON.stringify(parsed)` with whatever whitespace the model used.

If you're not sure which form applies, look at `[functions."<fn>"]` in `tensorzero.toml`; the `type` field tells you.

Common mistakes

- `id == episode_id`. They must be distinct UUIDs even for single-turn functions.
- String content where the schema expects template. If the function has a `user_schema.json`, the user message MUST use `{"type":"template", "name":"user", "arguments":{...}}`; a plain string will be rejected.
- `variant_name` set to something other than `"initial"`. This skill only emits `initial`-variant rows; we're characterizing the baseline distribution.
- Outputs invented by hand. Always ground via the gateway (see `methodology.md`). A hand-written tool-call argument is very likely to drift from how the model actually phrases things.
- `created_at` in the wrong format. ISO 8601 UTC, either with the `Z` suffix (e.g. `"2026-05-15T18:42:11.123Z"`) or the explicit `+00:00` offset. Non-UTC timezone offsets are rejected.
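As a rough illustration of the minting rule in the schema above (one `episode_id`, N `id`s, `created_at` advancing by about a second per turn), here is a minimal Node sketch. `buildEpisodeRows`, its `turns` argument, and the `ms` parameter on the minter are hypothetical names introduced for this example; they are not part of the skill or of TensorZero. The outputs are hand-written placeholders only so the block runs on its own; the skill itself asks for outputs grounded via the gateway.

```javascript
const crypto = require("crypto");

// UUID v7 minter: 48-bit millisecond timestamp followed by random bits.
// Taking `ms` as a parameter is an assumption of this sketch, so each row's
// id can embed the same instant as its created_at.
function uuidv7(ms = Date.now()) {
  const tsHex = BigInt(ms).toString(16).padStart(12, "0");
  const rand = crypto.randomBytes(10);
  rand[0] = (rand[0] & 0x0f) | 0x70; // version 7
  rand[2] = (rand[2] & 0x3f) | 0x80; // RFC 4122 variant
  const r = rand.toString("hex");
  return `${tsHex.slice(0, 8)}-${tsHex.slice(8, 12)}-${r.slice(0, 4)}-${r.slice(4, 8)}-${r.slice(8, 20)}`;
}

// Hypothetical helper: turn a list of {input, output} turns into
// inferences.jsonl rows that all share one episode_id.
function buildEpisodeRows(turns, startMs = Date.now()) {
  const episodeId = uuidv7(startMs);
  return turns.map((turn, i) => {
    const ms = startMs + (i + 1) * 1000; // advance ~1s per turn
    return {
      id: uuidv7(ms),
      episode_id: episodeId,
      created_at: new Date(ms).toISOString(), // ISO 8601 UTC with Z suffix
      variant_name: "initial",
      input: turn.input,
      output: turn.output,
    };
  });
}

// Placeholder content: a templated user message and a text reply.
const userMsg = (text) => ({
  role: "user",
  content: [{ type: "template", name: "user", arguments: { observation: text } }],
});
const rows = buildEpisodeRows([
  { input: { messages: [userMsg("first observation")] }, output: [{ type: "text", text: "placeholder reply 1" }] },
  { input: { messages: [userMsg("second observation")] }, output: [{ type: "text", text: "placeholder reply 2" }] },
]);

// One JSON object per line, as the schema requires.
const jsonl = rows.map((row) => JSON.stringify(row)).join("\n");
console.log(jsonl);
```

Keeping the timestamp as an explicit argument makes the ordering deterministic in tests; a real run would more likely call the minter at request time.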
reference/feedback_schema.md

The schema for feedback.jsonl rows:

feedback.jsonl row schema

One JSON object per line. Every row represents one piece of feedback associated with either a single inference or a whole episode.

Fields

| field | type | required | notes |
| --- | --- | --- | --- |
| `kind` | string (enum) | yes | One of `"boolean"`, `"float"`, `"comment"`, `"demonstration"`. Determines the `value` type. |
| `metric_name` | string | yes | Must match a metric defined under `[metrics.<name>]` in `tensorzero.toml`. |
| `target_id` | string (UUID) | yes | Resolves to an `inferences.id` (for inference-level metrics) or `inferences.episode_id` (for episode-level metrics). |
| `value` | varies | yes | Type depends on `kind`. See below. |

Reading the metric definition

For each metric you emit feedback for, locate its definition in the TZ config:

```toml
[metrics.exact_match]
type = "boolean"     # → kind in feedback row
level = "inference"  # → target_id resolves to inferences.id
optimize = "max"     # informational; bigger value is better

[metrics.cost]
type = "float"
level = "episode"    # → target_id resolves to inferences.episode_id
optimize = "min"
```

Three rules that drop out of this:

- `kind` in the feedback row matches `type` in the metric definition.
- `level = "inference"` ⇒ `target_id` is one of the `id`s in inferences.jsonl. One feedback row per (inference, metric) pair.
- `level = "episode"` ⇒ `target_id` is one of the `episode_id`s. One feedback row per (episode, metric) pair.

value shape by kind

| kind | type | example | notes |
| --- | --- | --- | --- |
| `boolean` | bool or 0/1 | `true`, `false`, `1`, `0` | Both forms are accepted; prefer `true` / `false`. |
| `float` | number | `0.73`, `12.4` | Range is metric-defined; read its bounds from the TZ config if any. |
| `comment` | string | `"Failed: incorrect output"` | Natural-language feedback from users or developers. |
| `demonstration` | object | `{ "output": ... }` | Edited drafts, labels, human-generated content. |

For this skill, focus on boolean and float; they're the metrics that drive optimization.

Examples

Inference-level boolean:

```json
{ "kind": "boolean", "metric_name": "exact_match", "target_id": "<inference_id>", "value": false }
```

Episode-level float:

```json
{ "kind": "float", "metric_name": "reward", "target_id": "<episode_id>", "value": 0.42 }
```

Realistic value distributions

You don't have ground-truth labels, but you should produce a feedback distribution that's plausible for the task: not 100% success and not 100% failure.

For a boolean metric:

- A 100% success rate is a red flag; it suggests you tilted your synthetic inputs toward easy cases. Re-balance.

For a float metric:

- Bound by the metric's natural range (often [0, 1] for accuracy-style, or unbounded for cost / reward).
- Distribute across the range; don't pile everything at the mean.
- If you don't know what the natural range is, generate a few real outputs first via the gateway and inspect them.

The point of this corpus is to be a prior over what the function's baseline behavior looks like; it does not need to be correct, but it must be plausible. The downstream measurement (input/output/feedback novelty against the real baseline) will surface where the prior was wrong.

Common mistakes

- `target_id` points at an `episode_id` for an inference-level metric (or vice versa). Read the metric's `level` first.
- `kind` mismatched with `metric.type`. A float metric must receive `kind: "float"` feedback rows, even if the values look 0/1.
- `metric_name` not in the TZ config. Emitting feedback for a metric the function doesn't define will fail validation.
- Missing rows.
Every inference should be covered by at least one feedback row from an inference-level metric, and every episode by at least one episode-level metric (if any are defined). The validator counts coverage.

scripts/validate.js

The validator the agent runs before exiting:

```js
#!/usr/bin/env node
/**
 * Validate a dataset-synthesis run's output. Mirrors the contract from the
 * skill's reference docs.
 *
 * Usage:
 *   node validate.js <output_dir> [--config <tensorzero.toml>]
 *                    [--min-episodes <N>] [--max-episodes <N>]
 *
 * Exits 0 on success, 1 on any error. Errors go to stderr; the summary line
 * ("PASS" or "FAIL — N error(s):") and per-file counts go to stdout so the
 * agent can `> validate.log 2>&1` for a single file.
 */
"use strict";

const fs = require("fs");
const path = require("path");

const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const ALLOWED_FEEDBACK_KINDS = new Set(["boolean", "float", "comment", "demonstration"]);
const ALLOWED_OUTPUT_TYPES = new Set(["text", "tool_call", "raw_text", "thought"]);
const REQUIRED_INF_FIELDS = ["id", "episode_id", "created_at", "variant_name", "input", "output"];
const REQUIRED_FB_FIELDS = ["kind", "metric_name", "target_id", "value"];

// ── Arg parsing ──────────────────────────────────────────────────────────
function parseArgs(argv) {
  const args = { outputDir: null, config: null, minEpisodes: null, maxEpisodes: null };
  for (let i = 0; i < argv.length; i++) {
    const a = argv[i];
    if (a === "--config") args.config = argv[++i];
    else if (a === "--min-episodes") args.minEpisodes = Number(argv[++i]);
    else if (a === "--max-episodes") args.maxEpisodes = Number(argv[++i]);
    else if (a === "-h" || a === "--help") {
      console.log(
        "usage: validate.js <output_dir> [--config <toml>] [--min-episodes N] [--max-episodes N]",
      );
      process.exit(0);
    } else if (!args.outputDir) args.outputDir = a;
    else {
      console.error(`unexpected arg: ${a}`);
      process.exit(2);
    }
  }
  if (!args.outputDir) {
    console.error("output_dir is required");
    process.exit(2);
  }
  return args;
}

// ── JSONL loader ─────────────────────────────────────────────────────────
function loadJsonl(filePath, errors) {
  if (!fs.existsSync(filePath)) {
    errors.push(`missing file: ${filePath}`);
    return [];
  }
  const raw = fs.readFileSync(filePath, "utf8");
  const rows = [];
  raw.split("\n").forEach((line, idx) => {
    if (!line.trim()) return;
    try {
      const obj = JSON.parse(line);
      if (typeof obj !== "object" || obj === null || Array.isArray(obj)) {
        errors.push(
          `${path.basename(filePath)}:${idx + 1}: top-level value must be an object`,
        );
        return;
      }
      rows.push(obj);
    } catch (e) {
      errors.push(
        `${path.basename(filePath)}:${idx + 1}: bad JSON (${e.message})`,
      );
    }
  });
  return rows;
}

// ── Inferences ───────────────────────────────────────────────────────────
function validateInferences(rows, errors) {
  if (rows.length === 0) {
    errors.push("inferences.jsonl is empty");
    return;
  }
  const idsSeen = new Set();
  rows.forEach((r, i) => {
    const tag = `inferences.jsonl:${i + 1}`;
    for (const k of REQUIRED_INF_FIELDS) {
      if (!(k in r)) errors.push(`${tag}: missing required field '${k}'`);
    }
    const rid = r.id;
    const eid = r.episode_id;
    if (typeof rid === "string") {
      if (!UUID_RE.test(rid)) errors.push(`${tag}: id is not a valid UUID: '${rid}'`);
      if (idsSeen.has(rid)) errors.push(`${tag}: duplicate id '${rid}'`);
      idsSeen.add(rid);
    }
    if (typeof eid === "string" && !UUID_RE.test(eid)) {
      errors.push(`${tag}: episode_id is not a valid UUID: '${eid}'`);
    }
    if (typeof rid === "string" && rid === eid) {
      errors.push(
        `${tag}: id and episode_id are identical (must be distinct UUIDs)`,
      );
    }
    if (r.variant_name !== "initial") {
      errors.push(
        `${tag}: variant_name must be 'initial', got ${JSON.stringify(r.variant_name)}`,
      );
    }
    const inp = r.input;
    if (typeof inp !== "object" || inp === null || !("messages" in inp)) {
      errors.push(`${tag}: input must be an object with a 'messages' array`);
    } else {
      const msgs = inp.messages;
      if (!Array.isArray(msgs) || msgs.length === 0) {
        errors.push(`${tag}: input.messages must be a non-empty array`);
      }
    }
    const out = r.output;
    if (Array.isArray(out)) {
      // chat / tool function: list of content blocks
      out.forEach((blk, j) => {
        if (typeof blk !== "object" || blk === null) {
          errors.push(`${tag}: output[${j}] must be an object`);
          return;
        }
        const t = blk.type;
        if (!ALLOWED_OUTPUT_TYPES.has(t)) {
          const allowed = [...ALLOWED_OUTPUT_TYPES].sort();
          errors.push(
            `${tag}: output[${j}].type '${t}' not in [${allowed.map((x) => `'${x}'`).join(", ")}]`,
          );
        }
      });
    } else if (typeof out === "object" && out !== null) {
      // json function: {raw, parsed}
      if (!("raw" in out) && !("parsed" in out)) {
        errors.push(
          `${tag}: output is an object but has neither 'raw' nor 'parsed' ` +
            `(json-function output expects both)`,
        );
      }
    } else {
      errors.push(
        `${tag}: output must be a list of content blocks (chat/tool) ` +
          `or an object with 'raw'+'parsed' (json), got ${typeof out}`,
      );
    }
  });
}

// ── Feedback ─────────────────────────────────────────────────────────────
function validateFeedback(rows, errors) {
  rows.forEach((r, i) => {
    const tag = `feedback.jsonl:${i + 1}`;
    for (const k of REQUIRED_FB_FIELDS) {
      if (!(k in r)) errors.push(`${tag}: missing required field '${k}'`);
    }
    const kind = r.kind;
    if (!ALLOWED_FEEDBACK_KINDS.has(kind)) {
      const allowed = [...ALLOWED_FEEDBACK_KINDS].sort();
      errors.push(
        `${tag}: kind '${kind}' not in [${allowed.map((x) => `'${x}'`).join(", ")}]`,
      );
    }
    const tid = r.target_id;
    if (typeof tid === "string" && !UUID_RE.test(tid)) {
      errors.push(`${tag}: target_id is not a valid UUID: '${tid}'`);
    }
    const v = r.value;
    if (kind === "boolean" && !(typeof v === "boolean" || typeof v === "number")) {
      errors.push(
        `${tag}: boolean feedback value must be bool or 0/1, got ${typeof v}`,
      );
    }
    if (kind === "float" && typeof v !== "number") {
      errors.push(
        `${tag}: float feedback value must be a number, got ${typeof v}`,
      );
    }
  });
}

// ── Cross-validation (referential integrity, metric resolution) ──────────
function validateCross(inferences, feedback, metricDefs, errors, warnings) {
  const inferenceIds = new Set(
    inferences.filter((r) => typeof r.id === "string").map((r) => r.id),
  );
  const episodeIds = new Set(
    inferences
      .filter((r) => typeof r.episode_id === "string")
      .map((r) => r.episode_id),
  );
  const targetsInference = new Map(); // inference id → Set<metric name>
  const targetsEpisode = new Map(); // episode id → Set<metric name>

  feedback.forEach((r, i) => {
    const tag = `feedback.jsonl:${i + 1}`;
    const mname = r.metric_name;
    const tid = r.target_id;
    const kind = r.kind;
    if (metricDefs && !(mname in metricDefs)) {
      const defined = Object.keys(metricDefs).sort().join(", ") || "(none)";
      errors.push(
        `${tag}: metric_name '${mname}' not defined in tensorzero.toml ` +
          `(defined metrics: ${defined})`,
      );
      return;
    }
    const mdef = metricDefs ? metricDefs[mname] : null;
    if (mdef) {
      if (kind && mdef.type && kind !== mdef.type) {
        errors.push(
          `${tag}: kind '${kind}' mismatches metric.type '${mdef.type}' ` +
            `for metric '${mname}'`,
        );
      }
      const level = mdef.level;
      if (level === "inference") {
        if (!inferenceIds.has(tid)) {
          const hint = episodeIds.has(tid)
            ? "this might be an episode id — try matching against inferences.id instead"
            : "the value does not appear as any row's id in inferences.jsonl";
          errors.push(
            `${tag}: target_id '${tid}' does not match any inference id ` +
              `(metric '${mname}' is inference-level; ${hint})`,
          );
        } else {
          if (!targetsInference.has(tid)) targetsInference.set(tid, new Set());
          targetsInference.get(tid).add(mname);
        }
      } else if (level === "episode") {
        if (!episodeIds.has(tid)) {
          const hint = inferenceIds.has(tid)
            ? "this looks like an inference id — try matching against episode_id instead"
            : "the value does not appear as any row's episode_id in inferences.jsonl";
          errors.push(
            `${tag}: target_id '${tid}' does not match any episode id ` +
              `(metric '${mname}' is episode-level; ${hint})`,
          );
        } else {
          if (!targetsEpisode.has(tid)) targetsEpisode.set(tid, new Set());
          targetsEpisode.get(tid).add(mname);
        }
      }
    } else {
      // No metric defs → just verify target_id exists somewhere
      if (!inferenceIds.has(tid) && !episodeIds.has(tid)) {
        errors.push(
          `${tag}: target_id '${tid}' does not match any inference id or ` +
            `episode_id in inferences.jsonl`,
        );
      }
    }
  });

  if (metricDefs) {
    const infMetrics = Object.entries(metricDefs)
      .filter(([, d]) => d.level === "inference")
      .map(([n]) => n);
    const epMetrics = Object.entries(metricDefs)
      .filter(([, d]) => d.level === "episode")
      .map(([n]) => n);
    if (infMetrics.length) {
      const uncovered = [...inferenceIds].filter((id) => !targetsInference.has(id));
      if (uncovered.length) {
        warnings.push(
          `${uncovered.length}/${inferenceIds.size} inferences have no inference-level feedback`,
        );
      }
    }
    if (epMetrics.length) {
      const uncovered = [...episodeIds].filter((id) => !targetsEpisode.has(id));
      if (uncovered.length) {
        warnings.push(
          `${uncovered.length}/${episodeIds.size} episodes have no episode-level feedback`,
        );
      }
    }
  }
}

// ── Budget ───────────────────────────────────────────────────────────────
function validateBudget(inferences, args, errors) {
  const nEpisodes = new Set(
    inferences
      .filter((r) => typeof r.episode_id === "string")
      .map((r) => r.episode_id),
  ).size;
  if (args.minEpisodes != null && nEpisodes < args.minEpisodes) {
    errors.push(
      `episode count ${nEpisodes} is below the minimum ${args.minEpisodes}`,
    );
  }
  if (args.maxEpisodes != null && nEpisodes > args.maxEpisodes) {
    errors.push(
      `episode count ${nEpisodes} exceeds the maximum ${args.maxEpisodes}`,
    );
  }
}

// ── Minimal TOML parser for [metrics.*] blocks ───────────────────────────
function parseMetricDefs(configPath) {
  const metrics = {};
  if (!fs.existsSync(configPath)) return metrics;
  const blockRe = /^\s*\[metrics\.["']?([^"'\]]+?)["']?\]\s*$/;
  const kvRe = /^\s*(type|level|optimize)\s*=\s*["']?([^"'\s]+)/;
  const sectionRe = /^\s*\[.+\]\s*$/;
  let current = null;
  for (const line of fs.readFileSync(configPath, "utf8").split("\n")) {
    const m = line.match(blockRe);
    if (m) {
      current = m[1];
      metrics[current] = {};
      continue;
    }
    if (sectionRe.test(line) && !blockRe.test(line)) {
      current = null;
      continue;
    }
    if (current) {
      const kv = line.match(kvRe);
      if (kv) metrics[current][kv[1]] = kv[2];
    }
  }
  return metrics;
}

// ── Driver ───────────────────────────────────────────────────────────────
function main() {
  const args = parseArgs(process.argv.slice(2));
  if (!fs.existsSync(args.outputDir) || !fs.statSync(args.outputDir).isDirectory()) {
    console.error(`output_dir does not exist: ${args.outputDir}`);
    process.exit(1);
  }
  const errors = [];
  const warnings = [];
  const inferences = loadJsonl(path.join(args.outputDir, "inferences.jsonl"), errors);
  const feedback = loadJsonl(path.join(args.outputDir, "feedback.jsonl"), errors);

  // Extras: ignore subdirectories (orchestrator's meta/ sits there).
  const extras = fs.readdirSync(args.outputDir).filter((name) => {
    if (name === "inferences.jsonl" || name === "feedback.jsonl") return false;
    return fs.statSync(path.join(args.outputDir, name)).isFile();
  });
  if (extras.length)
    warnings.push(`unexpected files in ${args.outputDir}: ${extras.join(", ")}`);

  validateInferences(inferences, errors);
  validateFeedback(feedback, errors);
  const metricDefs = args.config ? parseMetricDefs(args.config) : {};
  validateCross(
    inferences,
    feedback,
    Object.keys(metricDefs).length ? metricDefs : null,
    errors,
    warnings,
  );
  validateBudget(inferences, args, errors);

  const nEpisodes = new Set(inferences.map((r) => r.episode_id)).size;
  const fbByMetric = {};
  for (const r of feedback)
    fbByMetric[r.metric_name] = (fbByMetric[r.metric_name] || 0) + 1;

  console.log(`inferences: ${inferences.length} rows (${nEpisodes} unique episodes)`);
  console.log(`feedback: ${feedback.length} rows`);
  if (Object.keys(fbByMetric).length) {
    console.log(`per metric: ${JSON.stringify(fbByMetric)}`);
  }
  if (warnings.length) {
    console.log("\nWARNINGS:");
    for (const w of warnings) console.log(`  · ${w}`);
  }
  if (errors.length) {
    console.error(`\nFAIL — ${errors.length} error(s):`);
    for (const e of errors) console.error(`  ✗ ${e}`);
    process.exit(1);
  }
  console.log("\nPASS");
  process.exit(0);
}

main();
```

Embedding step

The MMD² analysis runs offline on the eval host (outside the Claude Code sandbox), where Python is available. Each row is rendered to a single string, which keeps the model’s actual outputs alongside the inputs and drops per-row bookkeeping (`id`, `episode_id`, `created_at`, `variant_name`). That string is then passed through a text-embedding model to produce a fixed-dimensional vector. The same embedding function is applied to both corpora and the output vectors are L2-normalized so that pairwise squared distances fall in $[0, 4]$. Truncation cap per input depends on the embedder’s context window (Voyage and ZeroEntropy at 32 k tokens, OpenAI and Gemini at 8 k tokens); the same cap is applied symmetrically to synth and baseline so their inputs see the same content.

The MMD² estimator

Maximum Mean Discrepancy (MMD) is a kernel-based two-sample distance for testing whether two finite samples $X$ and $Y$ come from the same underlying distribution. Fix a positive-definite kernel $k$ with associated reproducing-kernel Hilbert space $\mathcal{H}$ and feature map $\varphi(x) = k(x, \cdot)$. Provided the kernel is measurable and satisfies the moment condition $\mathbb{E}_{x \sim P}\sqrt{k(x, x)} < \infty$ for the distributions $P$ and $Q$ being compared, the mean embeddings $\mu_P$ and $\mu_Q$ are well-defined elements of $\mathcal{H}$.
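Bounded kernels such as the Gaussian RBF below, where $k(x, x) = 1$ for all $x$, automatically satisfy this condition for every $P$. The population MMD² is then defined as the squared RKHS distance between the two mean embeddings:

$$\mathrm{MMD}^2(P, Q) = \lVert \mu_P - \mu_Q \rVert_{\mathcal{H}}^2, \qquad \mu_P = \mathbb{E}_{x \sim P}[\varphi(x)], \quad \mu_Q = \mathbb{E}_{y \sim Q}[\varphi(y)]$$

Given finite samples $X = \{x_i\}_{i=1}^{n}$ and $Y = \{y_j\}_{j=1}^{m}$ from $P$ and $Q$, plug in the empirical mean embeddings $\hat{\mu}_X = \frac{1}{n}\sum_{i} \varphi(x_i)$ and $\hat{\mu}_Y = \frac{1}{m}\sum_{j} \varphi(y_j)$, expand the squared norm, and apply the reproducing-kernel identity $\langle \varphi(x), \varphi(y) \rangle_{\mathcal{H}} = k(x, y)$ to get an estimator written purely in terms of pairwise kernel evaluations:

$$\widehat{\mathrm{MMD}}^2_b(X, Y) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{i'=1}^{n} k(x_i, x_{i'}) + \frac{1}{m^2}\sum_{j=1}^{m}\sum_{j'=1}^{m} k(y_j, y_{j'}) - \frac{2}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} k(x_i, y_j)$$

With a characteristic kernel (such as the Gaussian RBF used below), $\mathrm{MMD}^2(P, Q) = 0$ if and only if $P = Q$. The metric is then a faithful distributional distance on the space of probability measures, not just a moment comparison. For my use case (single-sample novelty against a fixed baseline), I treat the synthetic corpus as $X$ and the deployment baseline as $Y$, and report the resulting $\widehat{\mathrm{MMD}}^2$ per (env, seed) as the novelty score.

Kernel choice

I use the Gaussian radial basis function (RBF) kernel:

$$k(x, y) = \exp\!\left(-\frac{\lVert x - y \rVert^2}{2\sigma^2}\right)$$

The median heuristic sets $\sigma^2$ per (env, seed) to the median squared pairwise distance over a random 500-row subsample $S$ of the aggregate sample $X \cup Y$:

$$\sigma^2 = \operatorname{median}\left\{ \lVert z - z' \rVert^2 : z, z' \in S,\ z \neq z' \right\}, \qquad S \subseteq X \cup Y,\ |S| \le 500$$

U-statistic MMD²

For finite samples $X$ and $Y$ of any sizes $n$ and $m$, I use the unbiased U-statistic estimator:

$$\widehat{\mathrm{MMD}}^2_u(X, Y) = \frac{1}{n(n-1)}\sum_{i \neq i'} k(x_i, x_{i'}) + \frac{1}{m(m-1)}\sum_{j \neq j'} k(y_j, y_{j'}) - \frac{2}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} k(x_i, y_j)$$

Being unbiased matters in this setup specifically because $n$ varies substantially across envs (25–174 synthetic rows depending on env and seed): a biased estimator would introduce a per-env offset that contaminates cross-env comparisons. The unbiased estimator can return slightly negative values when the two distributions are nearly identical, which is sample variance around a true MMD² of zero, not an error. I report this estimator as the per-(env, seed) novelty score:

$$\mathrm{novelty}(\mathrm{env}, \mathrm{seed}) = \widehat{\mathrm{MMD}}^2_u\!\left(X_{\mathrm{env},\mathrm{seed}},\ Y_{\mathrm{env}}\right)$$

Aggregating across seeds

For each (env, embedder) there are $K$ MMD² values, one per synthesis seed. I report the median across the $K$ seeds as the per-env point estimate, with the inter-quartile range as the seed-spread error bar:

$$\mathrm{novelty}(\mathrm{env}) = \operatorname{median}_{s = 1, \dots, K}\ \mathrm{novelty}(\mathrm{env}, s), \qquad \mathrm{IQR}(\mathrm{env}) = Q_{0.75} - Q_{0.25}$$

The IQR captures variation across agent runs. Median + IQR is robust to a single anomalous seed, e.g.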
one Wordle run that happens to land in a less typical region of the distribution. In the chart, the median is the X-axis point position and the IQR is the horizontal whisker. The same convention is used for the Y-axis (Δ success rate across eval seeds) so both axes display the same kind of error bar.

Inside the chart

Two axes organize the chart. The first is visibility: can Claude Code see (or correctly guess) the real input distribution? That is what drift measures, so low drift means high visibility. The second is tolerance: even when Claude Code guesses wrong, does the task care? Data helps only when an application scores low on both: Claude Code is flying blind and the metric punishes it for it.

Read application by application, the seven sort into a few groups. NER is the clean no-effect pole: the optimization model has the corpus memorized and its broader knowledge of NER covers the rest, so visibility is high and data has nothing to add. YC bench is the data-helps pole: the simulator postdates the model’s training cutoff and the harness underspecifies what the agent observes, so visibility is low, and the metric is unforgiving. NDA is the outlier that forces the second axis onto the page: its visibility is just as low as YC bench’s, but the task tolerates the gap, so data barely moves the needle. Two more no-ops fall out for a reason the two axes don’t cover, which I flag below.

| Application | Sees the real distribution? | Does the gap hurt? | Data helps? |
|---|---|---|---|
| NER | Yes: corpus memorized, generic task shape | — | No |
| Customer service | Yes: model already recognizes τ-bench | — | No |
| Software Eng. | — (capability ceiling no prompt could move it) | — | No |
| NDA | No: invents the wrong document genre | No: extraction is genre-agnostic | No |
| Wordle | Partly | Somewhat | A little |
| Science | No | Yes | Yes |
| Business mgmt (YC bench) | No | Yes | Yes (largest gap) |

The first three rows are no-ops because Claude Code is not flying blind, or (for software engineering) because the agent model cannot improve no matter what the prompt says. The bottom three are the cases where data earns its keep. NDA is the row that does the conceptual work, sitting between them: blind, but on a task that does not punish blindness.

Entity extraction (NER)

With or without 100 real traces, Claude Code improved the NER agent by around 60 percentage points. The data made no measurable difference, and once I started looking at why, I could see that NER is cooked into Claude Code at two levels: corpus memorization and task-shape knowledge. Either alone would have been enough to make data ablation a no-op.

Claude Code has the specific corpus memorized. In the without-data condition, when Claude Code constructed probes to test its prompt edits, I noticed something striking: 24 of 31 probes across seeds were character-for-character copies of CoNLL++ rows (three examples below). Claude Code was pulling these sentences straight from its training distribution and using them to check the prompts it was writing. 100 real traces have nothing to add.

Three verbatim probes vs. their CoNLL++ matches

These are the top-three highest-similarity probes from the without-data run, paired with the baseline row each probe’s nearest-neighbor search points at. Cosine similarity tops out at 0.95 rather than 1.00 only because the role prefix differs (`user` vs `user:text`); the body text is character-identical.

Probe (seed 1): West Indian all-rounder Phil Simmons took four for 38 on Friday as Leicestershire beat Somerset by an innings and 39 runs in two days to take over at the head of the county championship .
CoNLL++ match (validation split): identical.

Probe (seed 2): Germany 's representative to the European Union 's veterinary committee Werner Zwingmann said on Wednesday consumers should buy sheepmeat from countries other than Britain until the scientific advice was clearer .
CoNLL++ match (training split): identical.

Probe (seed 1): The European Commission said on Thursday it disagreed with German advice to consumers to shun British lamb until scientists determine whether mad cow disease can be transmitted to sheep .
CoNLL++ match (training split): identical.

Even without that memorization, Claude Code’s knowledge of what NER data looks like is good enough. When I asked Claude Code to synthesize an NER corpus from the application spec alone, it produced 0 verbatim CoNLL++ rows out of 130. Yet the synthetic and real CoNLL++ corpora still land in the same embedding-space neighborhood, which is what produces the low novelty score. The synthetic rows are modern, naturally-punctuated prose (“The COP30 climate summit in Belém, Brazil drew delegates from 190 nations.”); CoNLL++ is Reuters-style August-1996 news with Penn-tokenization (“BRUSSELS 1996-08-22 . EU rejects German call to boycott British lamb .”). The two corpora share zero sentences and zero persons; overlap concentrates in geopolitical place names and perennial organizations. Claude Code’s knowledge covers the shape of NER data, namely entity-rich short news prose in the four CoNLL categories, even when it does not reproduce the specific corpus.

Business management (YC bench)

YC bench sits at the opposite extreme. Without 100 real traces, the optimized CEO agent averaged 0.6 successful tasks per episode (a task is a contract the CEO accepted, assigned to an employee, and completed before its deadline); with the traces, it averaged 8. This operational lift translated to a 20 percentage point increase in survival rate, the simulation’s primary metric.
The data was doing essentially all the work, and when I dug in, none of the things that made NER a no-op were in place: Claude Code did not memorize the benchmark, the knowledge it inherits from the configuration only covers half the application, and the optimization-time instruction barely elicits even that half.

Claude Code almost certainly did not see this benchmark. YC bench was published April 6, 2026 (https://arxiv.org/abs/2604.02378), three months after Sonnet 4.6’s training data cutoff of January 2026 (https://platform.claude.com/docs/en/about-claude/models/overview). Barring some pre-release artifact that found its way into training, the CLI grammar, the structured opener, and the state schema are not in Claude Code’s training corpus.

The knowledge Claude Code inherits from the configuration covers actions, not observations. When I asked Claude Code to synthesize a YC-bench corpus from the application spec alone, it used the correct yc-bench CLI vocabulary on its output side. Every major subcommand (`market browse`, `task accept`, `task assign`, `task dispatch`, `sim resume`) appeared within ten percentage points of the real distribution, because the application’s system prompt lists every command and flag verbatim. The user-side observation schema, however, is just `observation: string`, a pass-through with no structure documented anywhere, so Claude Code had to guess. It guessed a plausible JSON-event format:

{ "event": "simulation_started", "funds_cents": ..., "employee_count": ... }

The format is internally consistent with the spec’s hints (“All commands return JSON”, “Funds are in cents”) but disjoint from the real Markdown opener (“Simulation Start — Take Immediate Action”). 0 of 636 synth rows reproduced that canonical header. Claude Code knew what to do; it did not know what the environment would show it.

The optimization-time instruction elicited even less. In the without-data condition, Claude Code constructed probes to test the prompts it wrote, but its instruction did not ask for full episode simulation. 0 of 11 probes across seeds contained any yc-bench-specific token (see below); the 11 collapsed to 5 generic “Simulation started. You are the CEO” strings. Even the action-side knowledge, which is right there in the system template, never surfaced. With access to real traces, Claude Code copied the structured opener nearly verbatim (top NN cosine = 0.95) and wrote a prompt that handled the actual CLI workflow. The data closes a gap the harness underspecifies and the optimization instruction cannot bridge.

A real opener vs. the entire without-data probe set

The with-data run copies real baseline rows nearly verbatim (top NN cosine = 0.95). The without-data run fabricates generic CEO roleplay.

Real opener (also reproduced by the with-data run):

Simulation Start — Take Immediate Action

- current_time: 2025-01-01T00:00:00
- horizon_end: 2026-01-01T00:00:00
- funds: $250,000.00
- monthly_payroll: $22,340.00
- runway: ~11.2 months
- employees: 3
- active_tasks: 0
- planned_tasks: 0

Your immediate priority: generate revenue before payroll drains your runway.

You MUST complete these steps now:

1. `yc-bench market browse --required-prestige-lte 1` — find tasks you can accept
2. `yc-bench task accept --task-id <UUID>` — accept 2-3 suitable tasks
3. `yc-bench employee list` — get employee IDs
4. `yc-bench task assign --task-id <UUID> --employee-id <UUID>` — assign employees
5. `yc-bench task dispatch --task-id <UUID>` — dispatch tasks
6. `yc-bench sim resume` — advance simulation

Synthetic openers (without-data run), all 11 probes collapsing to 5 distinct strings:

- Simulation started. You are the CEO. What is your first action?
- Simulation started. Company initialized with $50,000 funds. You have 3 employees.
- Simulation started. You are the CEO. Begin by checking company status.
- Simulation started. What is your first action?
- Simulation started.
No yc-bench CLI, no structured state fields, no immediate-action list. The one number that does appear ($50,000) is off by 5x from the real $250,000 initial funds.

Contract extraction (NDA)

NDA caught my eye as a clear outlier on the chart. Its novelty score is high, comparable to YC bench, which predicts a large data-ablation gap. But the actual gap was small. On F1, the optimized extraction agent reached 66% with 100 real traces and 64% with none, about two percentage points apart. On strict exact-match, the chart’s primary metric, the gap is essentially zero. That broke the trend the other six applications followed and warranted a closer look.

High novelty: Claude Code invents the wrong document genre. When I asked Claude Code to synthesize an NDA corpus from the application spec alone, it produced clean, short, contemporary template-style NDAs (“This Non-Disclosure Agreement is entered into as of March 5, 2024, by and between…”), averaging 443 characters per document. The real Kleister-NDA corpus is SEC-EDGAR filings, averaging 19,328 characters, about forty-four times longer, with multi-section legalese (`WHEREAS`, `IN WITNESS WHEREOF`, `NOW, THEREFORE`), full confidentiality clauses, and OCR provenance markers from their EX-10.x exhibit form (`Exhibit`, `dex*.htm`, page-number artifacts). Those markers appear in 39% to 80% of real rows and in zero synth rows. The cause is again a harness underspecification: the system template says only “Given the OCR text of an NDA, extract the following fields”, and the user-side schema does not constrain length, provenance, or structure. Claude Code extrapolates from the words “NDA” and “OCR text” and writes a perfectly reasonable contemporary NDA template, which happens not to be what Kleister-NDA contains. The without-data optimization probes shared the same template register: 18 probes across seeds collapsed to 11 unique openers, three of them repetitions of the same “This Non-Disclosure Agreement is entered into as of…” phrase.
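The genre gap described above is the kind of thing a few lines of code can surface once both corpora are in hand. The sketch below is illustrative only: `corpusStats`, the marker list (taken from the markers named in the text), and the two toy corpora are assumptions made for this example, not the analysis code or data behind the reported numbers.

```javascript
// Markers the text names as typical of the SEC-EDGAR exhibits.
const MARKERS = ["WHEREAS", "IN WITNESS WHEREOF", "NOW, THEREFORE", "Exhibit"];

// Hypothetical helper: mean document length plus, for each marker,
// the share of documents that contain it at least once.
function corpusStats(docs, markers = MARKERS) {
  const meanChars = docs.reduce((sum, d) => sum + d.length, 0) / docs.length;
  const markerRate = {};
  for (const m of markers) {
    markerRate[m] = docs.filter((d) => d.includes(m)).length / docs.length;
  }
  return { meanChars, markerRate };
}

// Toy stand-ins (invented for this sketch, far shorter than real filings).
const synth = [
  "This Non-Disclosure Agreement is entered into as of March 5, 2024, by and between Acme Corp and Beta LLC.",
  "This Non-Disclosure Agreement is entered into as of June 1, 2023, by and between Delta Inc and Echo Ltd.",
];
const real = [
  "Exhibit 10.1 NON-DISCLOSURE AGREEMENT WHEREAS the parties wish to share information; NOW, THEREFORE, the parties agree as follows. IN WITNESS WHEREOF the parties have signed.",
  "CONFIDENTIALITY AGREEMENT WHEREAS the Company possesses certain information. IN WITNESS WHEREOF the parties have signed below.",
];

const synthStats = corpusStats(synth);
const realStats = corpusStats(real);
console.log("synth:", synthStats);
console.log("real:", realStats);
```

A marker that shows up in a large share of real rows and in none of the synthetic ones is a cheap signal that the optimizer practiced on the wrong genre, though, as the NDA case shows, not that the metric will suffer for it.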
Small gap: the extraction task is genre-agnostic. The output side stayed faithful in both runs. 100% of synth outputs parsed, all four fields (`effective_date`, `jurisdiction`, `party`, `term`) were always populated, and null rates per field landed within fifteen percentage points of the real corpus. The application agent’s prior knowledge of how to read an NDA and pull out four fields generalizes across genres. It handles the contemporary templates Claude Code practiced against and the SEC-EDGAR filings the test set actually contains. The data adds two F1 points and roughly zero exact-match points, not twenty, because the extraction skill is already in the application agent’s knowledge, whichever corpus Claude Code practiced on. The novelty score measures a real distributional gap on NDA. For this task, that gap turns out to be orthogonal to the metric.

The remaining applications

The remaining four split along the same two axes. Scientific paper reproduction and Wordle both sit in the data-helps quadrant (Claude Code is at least partly flying blind and the metric cares), which is why they land on the positive side of the chart. Science behaves like a milder YC bench: novelty is high, the metric is unforgiving, and the data does real work. Wordle is milder still, worth a few percentage points.

Software engineering and customer service are the two no-ops the visibility/tolerance axes do not explain: both lose the data dependence for reasons upstream of prior knowledge. Software engineering lands near zero because gpt-5.4-mini hits a performance ceiling on terminal-bench that no prompt proposed by Claude Code could move, with or without data: the agent model, not its visibility into the data, is the binding constraint. Customer service (τ-bench retail) lands near zero for an adjacent reason: gpt-5.4-mini has likely been trained on enough τ-bench traces that it recognizes the task from the user turn alone, so prompt optimization makes no difference either way.
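For readers who want the drift score from the MMD² section above in executable form, here is a minimal sketch of the unbiased estimate with a Gaussian RBF kernel and a median-heuristic bandwidth. It rests on several assumptions: the analysis itself ran in Python on embedding vectors, whereas this uses plain JavaScript arrays with toy 1-D points; the median is taken over all pairs, not a 500-row subsample; the kernel is assumed to have the form exp(-d²/(2σ²)); and every function name is hypothetical.

```javascript
// Squared Euclidean distance between two equal-length vectors.
function sqDist(a, b) {
  let s = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    s += d * d;
  }
  return s;
}

// Median heuristic: median squared distance over all distinct pairs.
function medianSqDist(points) {
  const d = [];
  for (let i = 0; i < points.length; i++) {
    for (let j = i + 1; j < points.length; j++) d.push(sqDist(points[i], points[j]));
  }
  d.sort((p, q) => p - q);
  const mid = Math.floor(d.length / 2);
  return d.length % 2 ? d[mid] : (d[mid - 1] + d[mid]) / 2;
}

// Unbiased (U-statistic) MMD^2: within-sample sums skip the diagonal.
function mmd2Unbiased(X, Y) {
  const sigma2 = medianSqDist([...X, ...Y]) || 1; // guard: all points identical
  const k = (a, b) => Math.exp(-sqDist(a, b) / (2 * sigma2));
  const n = X.length;
  const m = Y.length;
  let xx = 0;
  let yy = 0;
  let xy = 0;
  for (let i = 0; i < n; i++) for (let j = 0; j < n; j++) if (i !== j) xx += k(X[i], X[j]);
  for (let i = 0; i < m; i++) for (let j = 0; j < m; j++) if (i !== j) yy += k(Y[i], Y[j]);
  for (let i = 0; i < n; i++) for (let j = 0; j < m; j++) xy += k(X[i], Y[j]);
  return xx / (n * (n - 1)) + yy / (m * (m - 1)) - (2 * xy) / (n * m);
}

// Toy data standing in for embeddings: one sample interleaved with the
// baseline, one sample far away from it.
const baseline = [[0], [1], [2]];
const nearSample = [[0.5], [1.5], [2.5]];
const farSample = [[10], [11], [12]];

const near = mmd2Unbiased(nearSample, baseline);
const far = mmd2Unbiased(farSample, baseline);
console.log({ near, far });
```

On these toy points the interleaved sample scores below zero, which is the sampling noise around a true MMD² of zero that the unbiased estimator allows, while the distant sample scores clearly higher.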
What I take away

Data matters where the agent engineer’s prior knowledge runs out. NER works without traces because the corpus is in Claude Code’s training data and the task shape is generic enough that its invented probes still land in the right neighborhood. YC bench falls apart without traces because the simulator postdates the training cutoff and the harness does not tell Claude Code enough to fill the gap. Embedding-space drift between Claude Code’s guesses and the real data tracks that pattern across all seven applications (Spearman ρ = +0.79, exact two-sided p = 0.040). But with n = 7 and one deliberate exception, I read it as evidence for the mechanism, not a law.

That exception is the second half of the lesson. NDA’s drift is high but its data-ablation gap is small: the task only requires reading each document and extracting four fields, which the application agent does on any reasonable NDA whether or not Claude Code practiced on the right genre. Drift tells you whether Claude Code is flying blind, not whether the task punishes it for that. Visibility and tolerance are two different axes, and only their conjunction means data will help.

One caveat for anyone hoping to use this as a pre-flight check: the drift estimator needs the real corpus to measure against, so it explains when data helped after the fact; it cannot, on its own, tell you whether to collect data before you have any. The mechanism still hands you two questions you can answer with no corpus at all. Is the task newer than your optimizer model’s training cutoff? And does the harness leave the input the agent will see underspecified (a bare observation: string, an OCR blob with no stated genre)? Those two questions flagged YC bench and NDA on their own. When both answers are no, reach for prompt optimization first, and save the trace collection for when they are not.

References

- @Vtrivedy10. X post. https://x.com/Vtrivedy10/status/2031408954517971368
- Osmani, A. (April 19, 2026). Agent Harness Engineering. https://addyosmani.com/blog/agent-harness-engineering/
- Mehta, V., & Bianconi, G. (March 23, 2026). We’re building an automated AI engineer, and it works. TensorZero blog. https://www.tensorzero.com/blog/automated-ai-engineer/
- Lee, Y., Nair, R., Zhang, Q., Lee, K., Khattab, O., & Finn, C. (March 30, 2026). Meta-Harness: End-to-End Optimization of Model Harnesses. arXiv preprint arXiv:2603.28052. https://arxiv.org/abs/2603.28052
- Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., & Smola, A. (2012). A Kernel Two-Sample Test. Journal of Machine Learning Research, 13, 723–773. https://www.jmlr.org/papers/volume13/gretton12a/gretton12a.pdf

Citation

@misc{jesson2026whendoesdatahelp,
  title = {When does data help automated agent engineering?},
  author = {Jesson, Andrew},
  year = {2026},
  month = may,
  howpublished = {andrewjesson.com},
  url = {https://andrewjesson.com/blog/when-does-data-help-automated-agent-engineering/},
}