{"slug": "when-does-data-help-automated-context-engineering", "title": "When Does Data Help Automated Context Engineering?", "summary": "Claude Code can improve other AI agents without training data in four of seven tested applications, performing as well as with data. Data helps only where Claude Code's prior knowledge of the task runs out, and the drift between its self-generated inputs and real data predicts this gap with a Spearman correlation of +0.79. The finding suggests that automated agent engineering can often skip data collection when the model already knows the task domain.", "body_md": "# When does data help automated agent engineering?\n\n*Claude Code can often improve another agent with no training data at all. Across seven applications, data helps only where Claude Code’s own prior knowledge of the task runs out, and how far its guesses drift from the real data mostly tells you which cases those are. The catch is that drift reveals when Claude Code is flying blind, not whether flying blind actually costs anything.*\n\nAn agent is more than a model.\nIt is also the prompts, tools, context management, guardrails, orchestration, and compute infrastructure designed around the model. 
[1, 2]\nWhen the agent does something wrong, an agent engineer tunes one or more of these knobs so that the error never happens again.\nAutomated agent engineering goes meta by putting an AI agent in the role of agent engineer. [3](https://www.tensorzero.com/blog/automated-ai-engineer/), [4](https://arxiv.org/abs/2603.28052)\n\nClaude Code and Codex [can improve agent prompts](/blog/the-engineering-practices-claude-code-and-codex-use-to-improve-ai-agents/) and perform engineering practices like iteratively evaluating their changes.\nBut what happens without training data?\nSurprisingly, Claude Code’s improvements performed as well without training data as with it on several of the applications I tested.\nOn three (Wordle, scientific paper reproduction, business management) data did improve task success rate by between five and twenty percentage points, but on the other four (entity extraction, contract extraction, customer service, software engineering) the without-data version performed roughly the same.\n\nThe conversation histories explained why. 
Even without training data, Claude Code still ran ad hoc evaluations: it made inferences on inputs it generated itself and judged whether the output matched its prompt edit.\nSo the operative question is not whether Claude Code *has* data; it is whether Claude Code already knows what the data looks like.\nData helps exactly when Claude Code’s prior knowledge of the application runs out, and that prior runs out in specific, identifiable ways: the corpus is not memorized, the task shape is unfamiliar, or the harness never says what the agent will actually be shown.\n\nThat suggests a way to measure the missing prior from the outside: how far Claude Code’s self-generated inputs drift from the real data.\nAcross the seven applications, this drift tracks the data-ablation gap (the test-set success rate with data minus the rate without it) at a Spearman rank correlation of +0.79 (exact two-sided *p* = 0.040; Pearson *r* = +0.96).\nThe lone clean exception is the instructive part: drift measures whether Claude Code is flying blind, not whether flying blind hurts the metric. 
Those are two different axes, and one task scores high on the first while shrugging off the second.\n\n## The experiment\n\nThe seven applications I explored are: [named entity extraction](https://arxiv.org/abs/1909.01441v1) (NER), [NDA clause extraction](https://github.com/applicaai/kleister-nda), Wordle, [customer service](https://taubench.com/#home), [software engineering](https://www.tbench.ai/), [business management](https://www.ycbench.com/), and [scientific paper reproduction](https://arxiv.org/abs/2510.24591) (Science).\n\nFor each application agent I ran the same experiment under two conditions, with data and without, using five independent seeds per condition.\nClaude Code was given configuration files for the agent (prompts, models, tool lists, …).\nIt was instructed to improve the initial prompts.\nThe new prompts were scored on a held-out set of test tasks.\n**The only difference was whether Claude Code was given 100 real traces as training data.**\nIn the without-data condition, the same baseline-data paths existed but the trace files were empty.\n\n## Claude Code as Agent Harness Engineer\n\n- The optimizer is Claude Code running against an internal harness that lets it edit the application agent configuration file. Tool access is `Read`, `Bash`, `Edit`, `Write`. No external MCP servers.\n- The optimizer runs inside an isolated Docker container (base image `node:24-slim`) containing only the Claude Code CLI, `curl`, and `git`. There is no Python and no eval source code on the filesystem. The container shares a Docker network with the gateway, so `Bash` can `curl http://gateway:3000/inference ...` to test prompts but has no other route to the application code.\n- Claude Code is running `claude-sonnet-4-6`. 
The application agent model is `gpt-5.4-mini` across all seven applications.\n\nClaude Code is given the following instruction at the start of every run, with placeholders like `{config_dir}` and `{function_name}` resolved per application before the run begins. The contents are held constant across all conditions.\n\n```\n# TensorZero Function Optimizer\n\nYou are optimizing a TensorZero function to improve its performance metric.\n\n## Environment\n\n- T0 config files: {config_dir}/ (only these and the baseline data below are relevant — don't explore elsewhere)\n- Gateway URL: {gateway_url}\n- Pre-dumped baseline data: {baseline_data_dir}/ (read-only; direct DB access is not available)\n- Restart after config edits: `curl -sf -X POST http://eval:5111/restart-gateway`\n- Isolated container. No Python or `pip`; `node` and `curl` are on `$PATH`; `jq` is not installed. Use `node -e \"...\"` for JSONL parsing (`readline` + `JSON.parse` + project to stdout) — prefer it over shell pipelines when you need fields per row.\n- Don't set `temperature` on any variant (some models reject non-default values). Keep an `initial` variant as a baseline reference.\n- Don't run evaluation episodes yourself — the harness does that after you exit.\n\n## Task\n\n- Function: `{function_name}`\n- Metric: `{metric_name}`. Check the metric's `optimize` field in `tensorzero.toml` for direction (boolean and float metrics may minimize or maximize).\n- Baseline performance: {baseline_metrics}\n\n## Available Models\n\n{model_list}\n\n## Baseline data\n\n- `{baseline_data_dir}/inferences.jsonl` — one row per inference (what the model said per task).\n- `{baseline_data_dir}/feedback.jsonl` — one row per metric value.\n- `{baseline_data_dir}/initial_config/` — read-only copy of the starting T0 config tree.\n\nFiles are often 20+ MB. Don't `cat` them whole. 
Start by `head -3` on each to learn the row shape (field names and nesting vary by env), then project out the fields you need.\n\n### The projection pattern\n\n`grep` first to narrow, then `node -e` to project:\n\n``` bash\ngrep $TARGET_ID {baseline_data_dir}/inferences.jsonl \\\n  | node -e \"\n      require('readline').createInterface({input: process.stdin}).on('line', l => {\n        const r = JSON.parse(l);\n        console.log(r.id, r.variant_name, JSON.stringify(r.output).slice(0,200));\n      });\"\n```\n\n`cat inferences.jsonl | ...` loads the whole file; `grep`-first keeps the pipeline cheap.\n\n### Cross-record one-liners\n\nAdapt the failure predicate to your metric — boolean uses `\"value\":0` / `\"value\":1`; float values depend on `optimize` direction.\n\n``` bash\n# Inferences per episode\ngrep -o '\"episode_id\":\"[^\"]*\"' {baseline_data_dir}/inferences.jsonl | sort | uniq -c | sort -rn | head\n\n# Last inference of a failing episode\ngrep $FAIL_ID {baseline_data_dir}/inferences.jsonl | tail -1\n\n# Which metrics are present\ngrep -o '\"metric_name\":\"[^\"]*\"' {baseline_data_dir}/feedback.jsonl | sort | uniq -c\n\n# target_ids of failures (boolean example — adapt the predicate for float metrics)\ngrep '\"metric_name\":\"{metric_name}\"' {baseline_data_dir}/feedback.jsonl \\\n  | node -e \"\n      require('readline').createInterface({input: process.stdin}).on('line', l => {\n        const r = JSON.parse(l);\n        if (r.value === 0 || r.value === false) console.log(r.target_id);\n      });\" > /tmp/failed.txt\nhead -5 /tmp/failed.txt | while read id; do grep \"$id\" {baseline_data_dir}/inferences.jsonl | head -1; done\n```\n\n### Templates, schemas, and the required `content` shape\n\nTensorZero has two co-existing config styles. 
Check which one the function uses in `tensorzero.toml`:\n\n**Legacy** (per-role):\n\n``` toml\n[functions.\"my_fn\"]\nuser_schema = \"functions/my_fn/user_schema.json\"   # and system_schema, assistant_schema\n\n[functions.\"my_fn\".variants.initial]\nuser_template = \"functions/my_fn/initial/user_template.minijinja\"\n```\n\n**New** (named):\n\n``` toml\n[functions.\"my_fn\"]\nschemas.user_query.path = \"functions/my_fn/user_query_schema.json\"\n\n[functions.\"my_fn\".variants.initial]\ntemplates.user_query.path = \"functions/my_fn/initial/user_query.minijinja\"\n```\n\n**Canonical `content` block for a templated message** (both styles):\n\n``` json\n\"content\": [{\n  \"type\": \"template\",\n  \"name\": \"<template_name>\",\n  \"arguments\": { /* object matching the schema */ }\n}]\n```\n\nFor legacy, `\"name\"` is the role (`\"user\"` / `\"system\"` / `\"assistant\"`). For new, it's the key under `schemas.` / `templates.`.\n\nFor a role with no schema: `\"content\": \"Hello\"` or `[{\"type\":\"text\",\"text\":\"Hello\"}]`.\n\n**Example** — τ-retail `user_schema.json` and the matching curl body:\n\n``` json\n// user_schema.json\n{ \"properties\": { \"observation\": { \"type\": \"string\" } },\n   \"required\": [\"observation\"], \"type\": \"object\" }\n\n// curl body\n{ \"function_name\": \"tau_bench_retail_v0::act\",\n   \"variant_name\": \"your_new_variant\",\n   \"input\": { \"messages\": [{ \"role\": \"user\", \"content\": [{\n     \"type\": \"template\", \"name\": \"user\",\n     \"arguments\": { \"observation\": \"Hello, I need to cancel my order.\" }\n   }] }] } }\n```\n\n## Methodology\n\nThe core loop is: survey the baseline → add variants → test one → iterate. 
The decisions worth getting right:\n\n- **Metric direction defines \"failure.\"** Don't assume `value:0` is bad; read the metric's `optimize` field.\n- **Judge manual variant tests by the `curl /inference` output itself** — right tool call, right JSON, right content.\n- **Multi-turn agentic envs** (customer service, business management, coding) need real conversational state to be representative. Pick a real episode from `inferences.jsonl`, copy its first 2–3 messages into your curl body, check how the variant continues. A turn-0 probe alone tells you little.\n- **When done, leave the best config in place** with the experimentation section below, and exit.\n\n## Routing: Experimentation Config\n\nAfter creating new variants, add an experimentation section — otherwise the gateway round-robins and wastes test episodes on bad variants. Keep candidates to your best ~3–4, including `initial` as a baseline.\n\n``` toml\n[functions.\"{function_name}\".experimentation]\ntype = \"track_and_stop\"\nmetric = \"{metric_name}\"\ncandidate_variants = [\"initial\", \"your_new_variant_1\", \"your_new_variant_2\"]\nfallback_variants = []\nmin_samples_per_variant = 5\ndelta = 0.1\nepsilon = 0.0\nupdate_period_s = 5\nmin_prob = 0.0\nmax_samples_per_variant = 10000\n```\n```\n\nI report the **gap in success rate** between two runs, `score(with data) − score(without data)`, on the y-axis of the chart above.\nIf data helps, the gap is positive.\nIf data is unnecessary, the gap is around zero.\n\n## Evaluation of optimized variants\n\n- The per-application metric is binary (success / no success), measured on a held-out test set of up to 100 episodes per (seed, variant). What counts as success depends on the application:\n  - **NER**: exact match. The agent correctly identifies and classifies every named entity in the input sentence.\n  - **NDA**: exact match. The agent correctly extracts the four target fields (`effective_date`, `jurisdiction`, `party`, `term`) from the document.\n  - **Wordle**: the 5-letter target word is guessed within six attempts.\n  - **Customer service (τ-bench retail)**: the agent completes the user’s request with the correct database changes (verified by state comparison) and communicates all required information.\n  - **Software engineering (terminal-bench)**: every test case in the task suite passes.\n  - **Business management (YC bench)**: the company survives to the simulation horizon without bankruptcy.\n  - **Scientific paper reproduction (replicationbench)**: every per-paper verifier reward component evaluates to ≥ 1.0 (Harbor’s `resolved` metric).\n- Five seeds per (application, condition).\n- The y-axis of the chart is the per-application gap in success rate, `score(with data) − score(without data)`, with the median across seeds plotted as the point and the interquartile range as the vertical whisker.\n\n## How to estimate novelty?\n\nClaude Code constructed examples to test the prompts it wrote. It did this several times per run by making inferences on example user inputs and checking the response. With access to real data, it copied those examples from the trace dump. Without access, I was surprised to see that it generated synthetic examples without any additional prompting. Immediately, I wanted to know how different the synthetic examples were from the real ones.\n\n```\nJapan July refined zinc imports off 47.5 pct yr / yr .\n{\n  \"person\": [],\n  \"location\": [\n    \"Japan\"\n  ],\n  \"organization\": [],\n  \"miscellaneous\": [\n    \"July\",\n    \"refined zinc\",\n    \"refined zinc imports\",\n    \"47.5 pct yr / yr\"\n  ]\n}\nThe COP30 climate summit in Belém, Brazil drew delegates from 190 nations. UN Climate Chief Simon Stiell praised pledges from China, India, and the European Union. 
US Climate Envoy John Podesta and Brazilian President Luiz Inácio Lula da Silva co-chaired the closing session.\n{\n  \"person\": [\n    \"Simon Stiell\",\n    \"John Podesta\",\n    \"Luiz In\\u00e1cio Lula da Silva\"\n  ],\n  \"organization\": [\n    \"UN\",\n    \"European Union\"\n  ],\n  \"location\": [\n    \"Bel\\u00e9m\",\n    \"Brazil\",\n    \"China\",\n    \"India\"\n  ],\n  \"miscellaneous\": [\n    \"COP30\"\n  ]\n}\n```\n\nTo investigate the difference, I devised a dataset-synthesis pipeline and ran it on the seven applications. Given just the application agent’s config and the dataset schema, Claude Code was instructed to generate 20–40 example conversations.\nAcross five independent seeds per application, I compute the squared maximum mean discrepancy (MMD²) in embedding space (Voyage `voyage-3-large`) between the synthetic corpus and the real-trace corpus, and report the median across seeds as the per-application novelty score.\nIn other words, novelty measures how different the real traces are from what Claude Code guesses the data should look like without access to those traces.\nAcross the seven applications this score tracks the data-ablation gap at Spearman ρ = +0.79 (exact two-sided *p* = 0.040; at n = 7 the asymptotic approximation is unreliable, so I report the exact permutation value).\nMMD² was the first and only drift estimator I tried: a standard non-parametric two-sample distance, fixed before I looked at the gaps. So this is a single pre-chosen statistic, not the best of a search over estimators.\n\n## Dataset-novelty estimator (MMD²)\n\nThe goal is to estimate how surprising a dataset is to a coding agent like Claude Code or Codex that is instructed to be an agent engineer. 
To do this, I compare a real dataset to a dataset generated by the coding agent.\n\nEach application is an LLM function with a defined input/output contract, like answering a customer-service ticket, extracting entities from a sentence, or playing a turn of Wordle.\nAn *inference* is one call to that function: the input it received plus the output it returned, recorded as one row in a JSONL file.\nAn *episode* is one logical interaction with the function, identified by a shared `episode_id`.\nA single-turn application like NER has exactly one inference per episode.\nA multi-turn application like Wordle chains several inferences into one episode (one inference per turn of the game).\nFor each application I assume two corpora of such rows:\n\n| corpus | source | size |\n|---|---|---|\n| Real baseline | actual rows logged from prior runs of the function on real users / tasks | hundreds to ~20k rows |\n| Synthetic | rows invented by an agent given only the function’s spec (no real data seen) | 25–170 rows per seed |\n\nBecause the coding agent is not conditioned on real data to generate the synthetic dataset, the divergence between its distribution over datasets given the task and the distribution over real datasets is an indicator of novelty.\nTherefore, I want a scalar that measures the divergence between the distribution of rows in the real baseline corpus and the distribution of rows in the synthetic corpus.\nI chose Maximum Mean Discrepancy (MMD²) [5], which is a standard non-parametric estimator.\nIt compares the kernel mean embeddings of two finite samples and is close to zero when the two samples are drawn from the same underlying distribution.\nA larger MMD² means the coding agent’s knowledge of the application, given the config, covers less of the actual deployment.\n\n### Generating the synthetic corpus\n\nThe synthetic corpus is produced by a coding agent (Claude Code or Codex) given only:\n\n- the function’s machine-readable specification: input schema, output schema, system prompt, available tools, and the set 
of defined evaluation metrics;\n- the schemas of the two output files (`inferences.jsonl` row schema and `feedback.jsonl` row schema).\n\n**The agent has no access to real data during synthesis.**\n\nThe procedure has five steps (read the spec, plan input coverage, generate inputs and outputs, calibrate periodically with a few live probe calls, then emit feedback values), reproduced verbatim in the `SKILL.md` and `methodology.md` instruction files below.\n\nThe output is two files: `inferences.jsonl` (one row per inference) and `feedback.jsonl` (one row per metric value, linked to the inference or episode it scores).\nBoth are schema-validated before the run exits.\nThe episode budget is a parameter set per run; in this analysis it was set to 20–40 episodes per application.\n\nI run **K independent agent seeds per application** (K = 5 in this analysis), so the dataset-novelty estimator can be aggregated across runs.\n\nThe instruction files the synthesis agent reads are reproduced below.\n\n## SKILL.md\n\nThe top-level instruction the synthesis agent receives:\n\n```\n---\nname: dataset-synthesis\ndescription: Synthesize representative inferences and feedback for an LLM application described by a TensorZero configuration. Use when a plausible baseline corpus is needed for a function that has not yet collected real data.\n---\n\n# TensorZero Dataset Synthesis\n\nYou are synthesizing a _plausible_ dataset for a TensorZero function. You will produce two JSONL files that look like what `inferences.jsonl` and `feedback.jsonl` _would_ contain after the function had run live for a while. 
Crucially, you do **not** have any real baseline data to draw from, but the configuration files should provide you with enough information about the application to generate sensible examples.\n\n## Environment\n\n- T0 config files: `{config_dir}/`\n- Gateway URL: `{gateway_url}` (you may POST to `/inference` to spot-check your understanding of the input structure)\n- Output directory: `{output_dir}/` — write `inferences.jsonl` and `feedback.jsonl` here\n- Isolated container. No Python or `pip`; `node` and `curl` are on `$PATH`; `jq` is not installed. Use `node -e \"...\"` for JSONL parsing.\n- Emit rows with `variant_name: \"initial\"` only.\n\n## Task\n\n- Function: `{function_name}`\n- Metrics defined for this function: `{metric_name_list}` (read their `kind`, `level`, and `optimize` fields from `tensorzero.toml`)\n- Budget: at least `{min_episodes}`, but no more than `{max_episodes}` episodes. An episode is one logical interaction with the function — a single inference for single-turn functions, a chain of inferences sharing one `episode_id` for multi-turn.\n- Output files:\n  - `{output_dir}/inferences.jsonl` — one row per inference call (see reference/inferences_schema.md)\n  - `{output_dir}/feedback.jsonl` — one row per metric value, with `target_id` referring to the `inference_id` or `episode_id` of a row in the inferences file (see reference/feedback_schema.md)\n\n## Workflow\n\nFive steps. See reference/methodology.md for the long form; the short version:\n\n1. **Read the spec.** Open `{config_dir}/tensorzero.toml` and the linked schema / template files. Note: input schema, output schema, function type (chat vs tool), defined metrics, and whether the function is one-shot or part of a multi-turn episode.\n2. **Hypothesize the input distribution.** What kinds of users / states does this function see in deployment? Sketch a coverage plan: how many length buckets, which schema slots vary, which edge cases matter. 
Aim for diversity, not just a single canonical mode.\n3. **Generate inputs.** Plan out at least `{min_episodes}` but no more than `{max_episodes}` distinct episodes. For multi-turn functions, decide each episode's length up front based on what's realistic for the task (a 4-turn episode contributes 4 rows sharing one `episode_id`).\n4. **Spot-check via the gateway.** Periodically POST a synthetic input to `{gateway_url}/inference` to confirm your understanding of the input structure is correct and to see what the `initial` variant's output actually looks like. Gateway calls are expensive — treat this as a calibration step, not as the way to generate every row. Generate outputs yourself in between checks.\n5. **Generate feedback rows.** For each metric in `{metric_name_list}`, emit one feedback row per appropriate target (per-inference or per-episode based on the metric's `level`).\n\nAfter generating, validate:\n\n``` bash\nnode /skill/scripts/validate.js {output_dir} \\\n  --config {config_dir}/tensorzero.toml \\\n  --min-episodes {min_episodes} --max-episodes {max_episodes}\n```\n\nThe validator checks schema compliance, referential integrity, budget, and the `variant_name == \"initial\"` invariant. Fix any errors it reports before exiting.\n\n## Output contract\n\nWhen you exit, `{output_dir}/` must contain exactly:\n\n- `inferences.jsonl` — every row conforms to the schema in `reference/inferences_schema.md`; the rows span at least `{min_episodes}` and at most `{max_episodes}` distinct `episode_id` s\n- `feedback.jsonl` — one or more rows per metric, with every `target_id` referring to an `id` (for inference-level metrics) or `episode_id` (for episode-level metrics) that exists in `inferences.jsonl`\n\nDo not write any other files in `{output_dir}/`. Do not modify `{config_dir}/`. 
Stay within budget — don't issue gateway calls indefinitely.\n\n## Principles\n\n- **Quality of coverage beats quantity of duplicates.**\n- **Use the gateway as a calibration tool.** A periodic `/inference` call confirms your understanding of the input structure and shows you what the `initial` variant actually emits. It's not a way to generate every row — gateway calls are expensive, and it's fine to generate outputs yourself between checks.\n- **Don't peek.** You don't have baseline data. If you find yourself wanting to \"look at a real example,\" that's the signal to make a better-reasoned guess from the spec instead.\n- **Plausibility includes failure.** Some inferences will fail their metric. Your feedback distribution should reflect a realistic failure rate for the task — not 100% success.\n```\n\n## reference/methodology.md\n\nThe longer methodology the agent can consult:\n\n```\n# Synthesis methodology\n\nThe recipe for producing a faithful `inferences.jsonl` + `feedback.jsonl` from spec alone. Five steps, each with concrete things to look for.\n\n## 1. Read the spec carefully\n\nOpen `{config_dir}/tensorzero.toml`. For the target function, capture:\n\n- **Type**: `chat` vs `tool` / `json`. This determines `output` shape.\n- **Schemas**: input (per-role or named), output (for tool-call functions). Read every referenced `.json` and every `.minijinja` template.\n- **Metrics**: which are defined, their `type`, `level`, `optimize`. These dictate the `feedback.jsonl` rows you'll write.\n- **System prompt**: usually inside the variant's template. Read it — this is the strongest signal about what the function is for.\n- **Tool list** (for tool functions): names, descriptions, argument schemas.\n\nTwo patterns to check early:\n\n``` bash\n# What kind of function?\ngrep -A2 \"^\\[functions\\.\\\"{function_name}\\\"\\]\" {config_dir}/tensorzero.toml\n\n# Which metrics are defined?\ngrep -E \"^\\[metrics\\.\" {config_dir}/tensorzero.toml\n```\n\n## 2. 
Hypothesize the input distribution\n\nBefore generating anything, sketch a plan. For each schema slot in the input:\n\n- What value ranges / shapes does it plausibly take in deployment?\n- Are there subpopulations (long vs short, simple vs nested, single vs multi-entity)?\n- What's the realistic length / complexity distribution?\n\nWrite the plan as a comment-level outline before the first row. Something like:\n\n```\nPlan for {function_name}\n(budget: at least {min_episodes}, at most {max_episodes} episodes)\n  - Target ~N episodes × K turns each\n  - Mix of the major user intents the function supports\n  - Vary user tone / register across episodes\n  - Cover authentication / setup steps the function expects before the main action\n```\n\nDon't skip this step. Generating without a plan reliably produces a stack of near-duplicates of the same canonical input.\n\n## 3. Generate inputs\n\nFor each row in your plan:\n\n- Construct the `input.messages` array per the schema rules in inferences_schema.md.\n- For multi-turn: episode by episode. Within one episode, mint a fresh `episode_id`, then chain inferences — each turn's `input.messages` is the previous turns' `input` plus `assistant` reply plus next user turn.\n\nTools you'll use:\n\n- `node -e` for any structured generation (writing JSON bodies, looping, minting UUIDs).\n- A working directory in `/tmp` for intermediate files (probe bodies, response captures).\n- `curl` to call the gateway.\n\n## 4. Spot-check via the gateway\n\nPeriodically — not on every row — POST a synthetic input to the gateway and look at the response. 
The purpose is calibration, not generation:\n\n- Confirm the input shape you've been building actually parses (template name correct, schema arguments well-formed).\n- See what the `initial` variant's output structure looks like for that input, so the outputs you generate yourself stay faithful to it.\n- Catch drift early — if the first spot-check shows your `arguments` object missing a required field, fix the generator before producing more rows.\n\n``` bash\nnode -e \"\n  const body = { /* function_name, variant_name: 'initial', input: ... */ };\n  process.stdout.write(JSON.stringify(body));\n\" > /tmp/req.json\n\ncurl -sf {gateway_url}/inference \\\n     -H 'Content-Type: application/json' \\\n     --data @/tmp/req.json > /tmp/resp.json\n```\n\nA reasonable cadence: one spot-check before you start generating, one after the first episode, and one every ~5 episodes thereafter. Cheaper than per-row, sufficient to catch most schema mistakes.\n\nAssemble each inference row from:\n\n- Your minted `id` and `episode_id`\n- The current `created_at`\n- `\"initial\"` as `variant_name`\n- Your `input` from step 3\n- An `output` you write yourself, matching the structure you saw in the spot-checks (gateway response → guide for your own generation)\n\nFor multi-turn: within one episode, each turn's `input.messages` extends the previous turn's by appending the assistant reply and the next user message. Keep the chain coherent across turns of the same `episode_id`.\n\nWrite each row to `{output_dir}/inferences.jsonl` immediately — don't batch, so a crash mid-run preserves progress.\n\n## 5. 
Generate feedback rows\n\nFor each metric in `{metric_name_list}`:\n\n- Determine its `level` (inference vs episode) from the TZ config.\n- For inference-level: walk every inference row and emit one feedback row per (inference, metric).\n- For episode-level: walk every distinct `episode_id` and emit one feedback row per (episode, metric).\n\nFor the `value`:\n\n- **If the metric is verifiable from the row alone** (e.g. exact_match against a known gold answer, or a length / format check), compute it programmatically.\n- **Otherwise**, predict the value from input + output using your understanding of the task. Stay calibrated — see \"realistic value distributions\" in feedback_schema.md.\n\nWrite to `{output_dir}/feedback.jsonl`.\n\n## Iteration / self-audit\n\nAfter roughly a third of the planned episodes, stop and inspect what you've produced:\n\n``` bash\n# Count rows and unique episodes\nwc -l {output_dir}/inferences.jsonl\ngrep -o '\"episode_id\":\"[^\"]*\"' {output_dir}/inferences.jsonl | sort -u | wc -l\n\n# Variety of input templates / first 80 chars\nnode -e \"\n  require('readline').createInterface({input: require('fs').createReadStream('{output_dir}/inferences.jsonl')}).on('line', l => {\n    const r = JSON.parse(l);\n    const m = r.input.messages[0];\n    const c = Array.isArray(m.content) ? m.content[0] : m.content;\n    const s = typeof c === 'string' ? c : JSON.stringify(c.arguments || c);\n    console.log(s.slice(0, 80));\n  });\" | sort -u | head -20\n```\n\nAsk:\n\n- Am I converging on one mode? (lots of near-identical first lines)\n- Did I cover all the schema slots I planned for?\n- Does my feedback distribution look reasonable?\n\nIf yes to mode collapse — diversify the remaining episodes by deliberately picking cases that look different from what's there. 
You have room to add more episodes up to `{max_episodes}`; you do not have to stop at the planned count if your coverage feels thin.\n\n## When to stop and validate\n\nOnce you've reached at least `{min_episodes}` episodes (and no more than `{max_episodes}`) with each metric covered, run:\n\n``` bash\nnode /skill/scripts/validate.js {output_dir}\n```\n\nRead its output and fix any errors. Then exit.\n\n## Anti-patterns\n\n- **Skipping step 2.** \"I'll just start generating\" gives mode collapse 100% of the time.\n- **Skipping step 4 entirely.** Without any spot-checks you have no signal that your input shape parses or that your outputs resemble what the model actually emits.\n- **Treating all episodes as length 1.** For multi-turn functions, single-turn episodes are _unrepresentative_.\n- **Generating one giant batch and writing at the end.** Write incrementally so a crash doesn't lose work.\n- **Ignoring the metric `level`.** Inference-level vs episode-level changes which `target_id` you reference.\n```\n\n## reference/inferences_schema.md\n\nThe schema for `inferences.jsonl`\n\nrows:\n\n```\n# `inferences.jsonl` row schema\n\nOne JSON object per line. Every row represents one call to `/inference` against the function. Multiple rows can share an `episode_id` (multi-turn episodes).\n\n## Fields\n\n| field          | type                   | required | notes                                                                                                                                  |\n| -------------- | ---------------------- | -------- | -------------------------------------------------------------------------------------------------------------------------------------- |\n| `id`           | string (UUID v7)       | yes      | Unique per inference. UUID v7 sorts by timestamp — see \"minting UUIDs\" below.                                                          |\n| `episode_id`   | string (UUID v7)       | yes      | One UUID per logical episode. 
For single-turn functions, this is fresh per row. For multi-turn, all rows in the same episode share it. |\n| `created_at`   | string (ISO 8601, UTC) | yes      | E.g. `\"2026-05-15T18:42:11.123Z\"`. Should be monotonic within an episode.                                                              |\n| `variant_name` | string                 | yes      | Always `\"initial\"` for this skill.                                                                                                     |\n| `input`        | object                 | yes      | `{\"messages\": [...]}` — the request body's `input` field. See \"input shape\" below.                                                     |\n| `output`       | array                  | yes      | The gateway's response content blocks. Shape depends on function type. See \"output shape\" below.                                       |\n\n## Minting UUIDs\n\nUUID v7 is required because TensorZero uses the embedded timestamp to order rows.\n\n``` js\n// Node-only UUID v7 minter (no external deps)\nconst crypto = require(\"crypto\"); // built-in module; provides randomBytes\n\nfunction uuidv7() {\n  const ts = BigInt(Date.now());\n  const tsHex = ts.toString(16).padStart(12, \"0\"); // 48-bit millisecond timestamp\n  const rand = crypto.randomBytes(10);\n  rand[0] = (rand[0] & 0x0f) | 0x70; // version 7\n  rand[2] = (rand[2] & 0x3f) | 0x80; // RFC 4122 variant\n  const r = rand.toString(\"hex\");\n  return `${tsHex.slice(0, 8)}-${tsHex.slice(8, 12)}-${r.slice(0, 4)}-${r.slice(4, 8)}-${r.slice(8, 20)}`;\n}\n```\n\nFor an episode of N turns, mint one `episode_id`, then mint N `id`s, advancing `created_at` by ~1s between them.\n\n## Input shape\n\nThe `input.messages` field follows the standard chat-message format. 
Each message is `{role, content}` where:\n\n- `role`: `\"system\" | \"user\" | \"assistant\"`\n- `content`: either a string (rare) or an array of content blocks\n\nThe most common content block for a templated function is:\n\n``` json\n{\n  \"type\": \"template\",\n  \"name\": \"<template_name>\",\n  \"arguments\": {\n    /* object matching the schema */\n  }\n}\n```\n\n`<template_name>` and the `arguments` shape come from the TZ config. Two co-existing styles:\n\n**Legacy (per-role schemas):**\n\n``` toml\n[functions.\"my_fn\"]\nuser_schema = \"functions/my_fn/user_schema.json\"\n\n[functions.\"my_fn\".variants.initial]\nuser_template = \"functions/my_fn/initial/user_template.minijinja\"\n```\n\n`<template_name>` is the role name (`\"user\"`, `\"system\"`, `\"assistant\"`).\n\n**New (named schemas):**\n\n``` toml\n[functions.\"my_fn\"]\nschemas.user_query.path = \"functions/my_fn/user_query_schema.json\"\n\n[functions.\"my_fn\".variants.initial]\ntemplates.user_query.path = \"functions/my_fn/initial/user_query.minijinja\"\n```\n\n`<template_name>` is the key under `schemas.` / `templates.` (e.g. `\"user_query\"`).\n\nFor roles that have no schema, use either `\"content\": \"Hello\"` or `[{\"type\":\"text\",\"text\":\"Hello\"}]`.\n\n> **Filesystem path mangling**: function and tool names containing `::` (e.g. `\"my_function::act\"`) appear in the TZ config as `[functions.\"my_function::act\"]`, but on disk the corresponding directory is `functions/my_function____act/` (four underscores). When reading template / schema files, translate `::` → `____` in the path. 
A quick `find /config -type f` confirms the actual layout if you're unsure.\n\n### Example: a templated user input\n\n``` json\n\"input\": {\n  \"messages\": [{\n    \"role\": \"user\",\n    \"content\": [{\n      \"type\": \"template\",\n      \"name\": \"user\",\n      \"arguments\": { \"observation\": \"Hello, this is a sample user message.\" }\n    }]\n  }]\n}\n```\n\nFor a multi-turn episode, append assistant + tool result messages between user turns. The third turn's `input.messages` will hold 5 entries after any system message (system?, user₀, assistant₀, user₁, assistant₁, user₂).\n\n## Output shape\n\nDepends on the function's `type` in the TZ config — there are three forms.\n\n**`type = \"chat\"`** — list of content blocks:\n\n``` json\n\"output\": [\n  { \"type\": \"text\", \"text\": \"The model's reply.\" }\n]\n```\n\nTools and text can mix in the same list (a text block followed by a `tool_call`, or several `tool_call`s).\n\n**`type = \"chat\"` with tools** — same list, with `tool_call` blocks:\n\n``` json\n\"output\": [\n  {\n    \"type\": \"tool_call\",\n    \"name\": \"<tool_name>\",\n    \"arguments\": { /* matching tool schema */ }\n  }\n]\n```\n\nReal rows often include extra fields like `id`, `raw_name`, `raw_arguments` carried back from the underlying model API. Reproduce only `type` + `name` + `arguments` unless you also call the gateway; the extras are post-hoc.\n\n**`type = \"json\"`** — a single object with `raw` (the unparsed string) and `parsed` (the matched JSON):\n\n``` json\n\"output\": {\n  \"raw\": \"{\\\"person\\\": [], \\\"location\\\": [\\\"Japan\\\"]}\",\n  \"parsed\": {\n    \"person\": [],\n    \"location\": [\"Japan\"]\n  }\n}\n```\n\n`parsed` must conform to the function's `output_schema`. 
`raw` is the literal string the model emitted; usually it's just `JSON.stringify(parsed)` with whatever whitespace the model used.\n\nIf you're not sure which form applies, look at `[functions.\"<fn>\"]` in `tensorzero.toml` — the `type` field tells you.\n\n## Common mistakes\n\n- **`id == episode_id`.** They must be distinct UUIDs even for single-turn functions.\n- **String content where the schema expects template.** If the function has a `user_schema.json`, the user message MUST use `{\"type\":\"template\", \"name\":\"user\", \"arguments\":{...}}` — a plain string will be rejected.\n- **`variant_name` set to something other than `\"initial\"`.** This skill only emits `initial`-variant rows; we're characterizing the baseline distribution.\n- **Outputs invented by hand.** Always ground via the gateway (see methodology.md). A hand-written `tool_call` argument is very likely to drift from how the model actually phrases things.\n- **`created_at` in the wrong format.** ISO 8601 UTC, either with the `Z` suffix (e.g. `\"2026-05-15T18:42:11.123Z\"`) or the explicit `+00:00` offset. Non-UTC timezone offsets are rejected.\n```\n\n## reference/feedback_schema.md\n\nThe schema for `feedback.jsonl` rows:\n\n```\n# `feedback.jsonl` row schema\n\nOne JSON object per line. Every row represents one piece of feedback associated with either a single inference or a whole episode.\n\n## Fields\n\n| field         | type          | required | notes                                                                                                                |\n| ------------- | ------------- | -------- | -------------------------------------------------------------------------------------------------------------------- |\n| `kind`        | string enum   | yes      | One of `\"boolean\"`, `\"float\"`, `\"comment\"`, `\"demonstration\"`. Determines the `value` type.                          
|\n| `metric_name` | string        | yes      | Must match a metric defined under `[metrics.<name>]` in `tensorzero.toml`.                                           |\n| `target_id`   | string (UUID) | yes      | Resolves to an `inferences.id` (for inference-level metrics) or `inferences.episode_id` (for episode-level metrics). |\n| `value`       | varies        | yes      | Type depends on `kind`. See below.                                                                                   |\n\n## Reading the metric definition\n\nFor each metric you emit feedback for, locate its definition in the TZ config:\n\n``` toml\n[metrics.exact_match]\ntype     = \"boolean\"        # → kind in feedback row\nlevel    = \"inference\"      # → target_id resolves to inferences.id\noptimize = \"max\"            # informational; bigger value is better\n\n[metrics.cost]\ntype     = \"float\"\nlevel    = \"episode\"        # → target_id resolves to inferences.episode_id\noptimize = \"min\"\n```\n\nThree rules that drop out of this:\n\n- **`kind`** in the feedback row matches **`type`** in the metric definition.\n- **`level = \"inference\"`** ⇒ `target_id` is one of the `id`s in `inferences.jsonl`. One feedback row per (inference, metric) pair.\n- **`level = \"episode\"`** ⇒ `target_id` is one of the `episode_id`s. One feedback row per (episode, metric) pair.\n\n## `value` shape by `kind`\n\n| kind            | type        | example                      | notes                                                                |\n| --------------- | ----------- | ---------------------------- | -------------------------------------------------------------------- |\n| `boolean`       | bool or 0/1 | `true`, `false`, `1`, `0`    | Both forms are accepted; prefer `true` / `false`.                    |\n| `float`         | number      | `0.73`, `12.4`               | Range is metric-defined — read its bounds from the TZ config if any. 
|\n| `comment`       | string      | `\"Failed: incorrect output\"` | Natural-language feedback from users or developers.                  |\n| `demonstration` | object      | `{ \"output\": [...] }`        | Edited drafts, labels, human-generated content.                      |\n\nFor this skill, focus on `boolean` and `float` — they're the metrics that drive optimization.\n\n## Examples\n\n**Inference-level boolean:**\n\n``` json\n{\n  \"kind\": \"boolean\",\n  \"metric_name\": \"exact_match\",\n  \"target_id\": \"<inference_id>\",\n  \"value\": false\n}\n```\n\n**Episode-level float:**\n\n``` json\n{\n  \"kind\": \"float\",\n  \"metric_name\": \"reward\",\n  \"target_id\": \"<episode_id>\",\n  \"value\": 0.42\n}\n```\n\n## Realistic value distributions\n\nYou don't have ground-truth labels, but you should produce a feedback distribution that's _plausible_ for the task — not 100% success and not 100% failure.\n\nFor a boolean metric:\n\n- A 100% success rate is a red flag — it suggests you tilted your synthetic inputs toward easy cases. Re-balance.\n\nFor a float metric:\n\n- Bound by the metric's natural range (often [0, 1] for accuracy-style or unbounded for cost / reward).\n- Distribute across the range — don't pile everything at the mean.\n- If you don't know what the natural range is, generate a few real outputs first via the gateway and inspect them.\n\nThe point of this corpus is to be a _prior_ over what the function's baseline behavior looks like — it does not need to be correct, but it must be plausible. 
The downstream measurement (input/output/feedback novelty against the real baseline) will surface where the prior was wrong.\n\n## Common mistakes\n\n- **`target_id` points at an `episode_id` for an inference-level metric (or vice versa).** Read the metric's `level` first.\n- **`kind` mismatched with `metric.type`.** A `float` metric must receive `kind: \"float\"` feedback rows, even if the values look 0/1.\n- **`metric_name` not in the TZ config.** Emitting feedback for a metric the function doesn't define will fail validation.\n- **Missing rows.** Every inference should be covered by at least one feedback row from an inference-level metric, and every episode by at least one episode-level metric (if any are defined). The validator counts coverage.\n```\n\n## scripts/validate.js\n\nThe validator the agent runs before exiting:\n\n``` js\n#!/usr/bin/env node\n/**\n * Validate a dataset-synthesis run's output. Mirrors the contract from the\n * skill's reference docs.\n *\n * Usage:\n *   node validate.js <output_dir> [--config <tensorzero.toml>]\n *                                 [--min-episodes <N>] [--max-episodes <N>]\n *\n * Exits 0 on success, 1 on any error. 
Errors go to stderr; the summary line\n * (\"PASS\" or \"FAIL — N error(s):\") and per-file counts go to stdout so the\n * agent can `> validate.log 2>&1` for a single file.\n */\n\"use strict\";\n\nconst fs = require(\"fs\");\nconst path = require(\"path\");\n\nconst UUID_RE =\n  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;\nconst ALLOWED_FEEDBACK_KINDS = new Set([\n  \"boolean\",\n  \"float\",\n  \"comment\",\n  \"demonstration\",\n]);\nconst ALLOWED_OUTPUT_TYPES = new Set([\n  \"text\",\n  \"tool_call\",\n  \"raw_text\",\n  \"thought\",\n]);\nconst REQUIRED_INF_FIELDS = [\n  \"id\",\n  \"episode_id\",\n  \"created_at\",\n  \"variant_name\",\n  \"input\",\n  \"output\",\n];\nconst REQUIRED_FB_FIELDS = [\"kind\", \"metric_name\", \"target_id\", \"value\"];\n\n// ── Arg parsing ──────────────────────────────────────────────────────────\n\nfunction parseArgs(argv) {\n  const args = {\n    outputDir: null,\n    config: null,\n    minEpisodes: null,\n    maxEpisodes: null,\n  };\n  for (let i = 0; i < argv.length; i++) {\n    const a = argv[i];\n    if (a === \"--config\") args.config = argv[++i];\n    else if (a === \"--min-episodes\") args.minEpisodes = Number(argv[++i]);\n    else if (a === \"--max-episodes\") args.maxEpisodes = Number(argv[++i]);\n    else if (a === \"-h\" || a === \"--help\") {\n      console.log(\n        \"usage: validate.js <output_dir> [--config <toml>] [--min-episodes N] [--max-episodes N]\",\n      );\n      process.exit(0);\n    } else if (!args.outputDir) args.outputDir = a;\n    else {\n      console.error(`unexpected arg: ${a}`);\n      process.exit(2);\n    }\n  }\n  if (!args.outputDir) {\n    console.error(\"output_dir is required\");\n    process.exit(2);\n  }\n  return args;\n}\n\n// ── JSONL loader ─────────────────────────────────────────────────────────\n\nfunction loadJsonl(filePath, errors) {\n  if (!fs.existsSync(filePath)) {\n    errors.push(`missing file: ${filePath}`);\n    return [];\n  }\n  const 
raw = fs.readFileSync(filePath, \"utf8\");\n  const rows = [];\n  raw.split(\"\\n\").forEach((line, idx) => {\n    if (!line.trim()) return;\n    try {\n      const obj = JSON.parse(line);\n      if (typeof obj !== \"object\" || obj === null || Array.isArray(obj)) {\n        errors.push(\n          `${path.basename(filePath)}:${idx + 1}: top-level value must be an object`,\n        );\n        return;\n      }\n      rows.push(obj);\n    } catch (e) {\n      errors.push(\n        `${path.basename(filePath)}:${idx + 1}: bad JSON (${e.message})`,\n      );\n    }\n  });\n  return rows;\n}\n\n// ── Inferences ───────────────────────────────────────────────────────────\n\nfunction validateInferences(rows, errors) {\n  if (rows.length === 0) {\n    errors.push(\"inferences.jsonl is empty\");\n    return;\n  }\n  const idsSeen = new Set();\n  rows.forEach((r, i) => {\n    const tag = `inferences.jsonl:${i + 1}`;\n    for (const k of REQUIRED_INF_FIELDS) {\n      if (!(k in r)) errors.push(`${tag}: missing required field '${k}'`);\n    }\n    const rid = r.id;\n    const eid = r.episode_id;\n    if (typeof rid === \"string\") {\n      if (!UUID_RE.test(rid))\n        errors.push(`${tag}: id is not a valid UUID: '${rid}'`);\n      if (idsSeen.has(rid)) errors.push(`${tag}: duplicate id '${rid}'`);\n      idsSeen.add(rid);\n    }\n    if (typeof eid === \"string\" && !UUID_RE.test(eid)) {\n      errors.push(`${tag}: episode_id is not a valid UUID: '${eid}'`);\n    }\n    if (typeof rid === \"string\" && rid === eid) {\n      errors.push(\n        `${tag}: id and episode_id are identical (must be distinct UUIDs)`,\n      );\n    }\n\n    if (r.variant_name !== \"initial\") {\n      errors.push(\n        `${tag}: variant_name must be 'initial', got ${JSON.stringify(r.variant_name)}`,\n      );\n    }\n\n    const inp = r.input;\n    if (typeof inp !== \"object\" || inp === null || !(\"messages\" in inp)) {\n      errors.push(`${tag}: input must be an object with a 'messages' 
array`);\n    } else {\n      const msgs = inp.messages;\n      if (!Array.isArray(msgs) || msgs.length === 0) {\n        errors.push(`${tag}: input.messages must be a non-empty array`);\n      }\n    }\n\n    const out = r.output;\n    if (Array.isArray(out)) {\n      // chat / tool function: list of content blocks\n      out.forEach((blk, j) => {\n        if (typeof blk !== \"object\" || blk === null) {\n          errors.push(`${tag}: output[${j}] must be an object`);\n          return;\n        }\n        const t = blk.type;\n        if (!ALLOWED_OUTPUT_TYPES.has(t)) {\n          const allowed = [...ALLOWED_OUTPUT_TYPES].sort();\n          errors.push(\n            `${tag}: output[${j}].type '${t}' not in [${allowed.map((x) => `'${x}'`).join(\", \")}]`,\n          );\n        }\n      });\n    } else if (typeof out === \"object\" && out !== null) {\n      // json function: {raw, parsed}\n      if (!(\"raw\" in out) && !(\"parsed\" in out)) {\n        errors.push(\n          `${tag}: output is an object but has neither 'raw' nor 'parsed' ` +\n            `(json-function output expects both)`,\n        );\n      }\n    } else {\n      errors.push(\n        `${tag}: output must be a list of content blocks (chat/tool) ` +\n          `or an object with 'raw'+'parsed' (json), got ${typeof out}`,\n      );\n    }\n  });\n}\n\n// ── Feedback ─────────────────────────────────────────────────────────────\n\nfunction validateFeedback(rows, errors) {\n  rows.forEach((r, i) => {\n    const tag = `feedback.jsonl:${i + 1}`;\n    for (const k of REQUIRED_FB_FIELDS) {\n      if (!(k in r)) errors.push(`${tag}: missing required field '${k}'`);\n    }\n    const kind = r.kind;\n    if (!ALLOWED_FEEDBACK_KINDS.has(kind)) {\n      const allowed = [...ALLOWED_FEEDBACK_KINDS].sort();\n      errors.push(\n        `${tag}: kind '${kind}' not in [${allowed.map((x) => `'${x}'`).join(\", \")}]`,\n      );\n    }\n    const tid = r.target_id;\n    if (typeof tid === \"string\" && 
!UUID_RE.test(tid)) {\n      errors.push(`${tag}: target_id is not a valid UUID: '${tid}'`);\n    }\n    const v = r.value;\n    if (\n      kind === \"boolean\" &&\n      !(typeof v === \"boolean\" || typeof v === \"number\")\n    ) {\n      errors.push(\n        `${tag}: boolean feedback value must be bool or 0/1, got ${typeof v}`,\n      );\n    }\n    if (kind === \"float\" && typeof v !== \"number\") {\n      errors.push(\n        `${tag}: float feedback value must be a number, got ${typeof v}`,\n      );\n    }\n  });\n}\n\n// ── Cross-validation (referential integrity, metric resolution) ──────────\n\nfunction validateCross(inferences, feedback, metricDefs, errors, warnings) {\n  const inferenceIds = new Set(\n    inferences.filter((r) => typeof r.id === \"string\").map((r) => r.id),\n  );\n  const episodeIds = new Set(\n    inferences\n      .filter((r) => typeof r.episode_id === \"string\")\n      .map((r) => r.episode_id),\n  );\n\n  const targetsInference = new Map(); // inference_id → Set<metric_name>\n  const targetsEpisode = new Map(); // episode_id   → Set<metric_name>\n\n  feedback.forEach((r, i) => {\n    const tag = `feedback.jsonl:${i + 1}`;\n    const mname = r.metric_name;\n    const tid = r.target_id;\n    const kind = r.kind;\n\n    if (metricDefs && !(mname in metricDefs)) {\n      const defined = Object.keys(metricDefs).sort().join(\", \") || \"(none)\";\n      errors.push(\n        `${tag}: metric_name '${mname}' not defined in tensorzero.toml ` +\n          `(defined metrics: ${defined})`,\n      );\n      return;\n    }\n    const mdef = metricDefs ? 
metricDefs[mname] : null;\n\n    if (mdef) {\n      if (kind && mdef.type && kind !== mdef.type) {\n        errors.push(\n          `${tag}: kind '${kind}' mismatches metric.type '${mdef.type}' ` +\n            `for metric '${mname}'`,\n        );\n      }\n      const level = mdef.level;\n      if (level === \"inference\") {\n        if (!inferenceIds.has(tid)) {\n          const hint = episodeIds.has(tid)\n            ? \"this might be an episode_id — try matching against inferences.id instead\"\n            : \"the value does not appear as any row's id in inferences.jsonl\";\n          errors.push(\n            `${tag}: target_id '${tid}' does not match any inference id ` +\n              `(metric '${mname}' is inference-level; ${hint})`,\n          );\n        } else {\n          if (!targetsInference.has(tid)) targetsInference.set(tid, new Set());\n          targetsInference.get(tid).add(mname);\n        }\n      } else if (level === \"episode\") {\n        if (!episodeIds.has(tid)) {\n          const hint = inferenceIds.has(tid)\n            ? 
\"this looks like an inference id — try matching against episode_id instead\"\n            : \"the value does not appear as any row's episode_id in inferences.jsonl\";\n          errors.push(\n            `${tag}: target_id '${tid}' does not match any episode_id ` +\n              `(metric '${mname}' is episode-level; ${hint})`,\n          );\n        } else {\n          if (!targetsEpisode.has(tid)) targetsEpisode.set(tid, new Set());\n          targetsEpisode.get(tid).add(mname);\n        }\n      }\n    } else {\n      // No metric defs → just verify target_id exists somewhere\n      if (!inferenceIds.has(tid) && !episodeIds.has(tid)) {\n        errors.push(\n          `${tag}: target_id '${tid}' does not match any inference id or ` +\n            `episode_id in inferences.jsonl`,\n        );\n      }\n    }\n  });\n\n  if (metricDefs) {\n    const infMetrics = Object.entries(metricDefs)\n      .filter(([, d]) => d.level === \"inference\")\n      .map(([n]) => n);\n    const epMetrics = Object.entries(metricDefs)\n      .filter(([, d]) => d.level === \"episode\")\n      .map(([n]) => n);\n    if (infMetrics.length) {\n      const uncovered = [...inferenceIds].filter(\n        (id) => !targetsInference.has(id),\n      );\n      if (uncovered.length) {\n        warnings.push(\n          `${uncovered.length}/${inferenceIds.size} inferences have no inference-level feedback`,\n        );\n      }\n    }\n    if (epMetrics.length) {\n      const uncovered = [...episodeIds].filter((id) => !targetsEpisode.has(id));\n      if (uncovered.length) {\n        warnings.push(\n          `${uncovered.length}/${episodeIds.size} episodes have no episode-level feedback`,\n        );\n      }\n    }\n  }\n}\n\n// ── Budget ───────────────────────────────────────────────────────────────\n\nfunction validateBudget(inferences, args, errors) {\n  const nEpisodes = new Set(\n    inferences\n      .filter((r) => typeof r.episode_id === \"string\")\n      .map((r) => r.episode_id),\n  
).size;\n  if (args.minEpisodes !== null && nEpisodes < args.minEpisodes) {\n    errors.push(\n      `episode count ${nEpisodes} is below the minimum ${args.minEpisodes}`,\n    );\n  }\n  if (args.maxEpisodes !== null && nEpisodes > args.maxEpisodes) {\n    errors.push(\n      `episode count ${nEpisodes} exceeds the maximum ${args.maxEpisodes}`,\n    );\n  }\n}\n\n// ── Minimal TOML parser for [metrics.*] blocks ───────────────────────────\n\nfunction parseMetricDefs(configPath) {\n  const metrics = {};\n  if (!fs.existsSync(configPath)) return metrics;\n  const blockRe = /^\\s*\\[metrics\\.[\"']?([^\"'\\]]+?)[\"']?\\]\\s*$/;\n  const kvRe = /^\\s*(type|level|optimize)\\s*=\\s*[\"']?([^\"'\\s#]+)/;\n  const sectionRe = /^\\s*\\[.+\\]\\s*$/;\n  let current = null;\n  for (const line of fs.readFileSync(configPath, \"utf8\").split(\"\\n\")) {\n    const m = line.match(blockRe);\n    if (m) {\n      current = m[1];\n      metrics[current] = {};\n      continue;\n    }\n    if (sectionRe.test(line) && !blockRe.test(line)) {\n      current = null;\n      continue;\n    }\n    if (current) {\n      const kv = line.match(kvRe);\n      if (kv) metrics[current][kv[1]] = kv[2];\n    }\n  }\n  return metrics;\n}\n\n// ── Driver ───────────────────────────────────────────────────────────────\n\nfunction main() {\n  const args = parseArgs(process.argv.slice(2));\n  if (\n    !fs.existsSync(args.outputDir) ||\n    !fs.statSync(args.outputDir).isDirectory()\n  ) {\n    console.error(`output_dir does not exist: ${args.outputDir}`);\n    process.exit(1);\n  }\n\n  const errors = [];\n  const warnings = [];\n  const inferences = loadJsonl(\n    path.join(args.outputDir, \"inferences.jsonl\"),\n    errors,\n  );\n  const feedback = loadJsonl(\n    path.join(args.outputDir, \"feedback.jsonl\"),\n    errors,\n  );\n\n  // Extras: ignore subdirectories (orchestrator's _meta/ sits there).\n  const extras = fs.readdirSync(args.outputDir).filter((name) => {\n    if (name === 
\"inferences.jsonl\" || name === \"feedback.jsonl\") return false;\n    return fs.statSync(path.join(args.outputDir, name)).isFile();\n  });\n  if (extras.length)\n    warnings.push(\n      `unexpected files in ${args.outputDir}: [${extras.join(\", \")}]`,\n    );\n\n  validateInferences(inferences, errors);\n  validateFeedback(feedback, errors);\n\n  const metricDefs = args.config ? parseMetricDefs(args.config) : {};\n  validateCross(\n    inferences,\n    feedback,\n    Object.keys(metricDefs).length ? metricDefs : null,\n    errors,\n    warnings,\n  );\n  validateBudget(inferences, args, errors);\n\n  const nEpisodes = new Set(inferences.map((r) => r.episode_id)).size;\n  const fbByMetric = {};\n  for (const r of feedback)\n    fbByMetric[r.metric_name] = (fbByMetric[r.metric_name] || 0) + 1;\n\n  console.log(\n    `inferences:        ${inferences.length} rows  (${nEpisodes} unique episodes)`,\n  );\n  console.log(`feedback:          ${feedback.length} rows`);\n  if (Object.keys(fbByMetric).length) {\n    console.log(`  per metric:      ${JSON.stringify(fbByMetric)}`);\n  }\n  if (warnings.length) {\n    console.log(\"\\nWARNINGS:\");\n    for (const w of warnings) console.log(`  · ${w}`);\n  }\n  if (errors.length) {\n    console.error(`\\nFAIL — ${errors.length} error(s):`);\n    for (const e of errors) console.error(`  ✗ ${e}`);\n    process.exit(1);\n  }\n  console.log(\"\\nPASS\");\n  process.exit(0);\n}\n\nmain();\n```\n\n### Embedding step\n\nThe MMD² analysis runs offline on the eval host (outside the Claude Code sandbox), where Python is available. 
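Each row is rendered to a single string that keeps the model’s actual outputs alongside the inputs and drops per-row bookkeeping (`id`, `episode_id`, `created_at`, `variant_name`).\nThat string is then passed through a text-embedding model to produce a fixed-dimensional vector.\n\nThe same embedding function is applied to both corpora and the output vectors are L2-normalized so that pairwise squared distances fall in $[0, 4]$. The truncation cap per input depends on the embedder’s context window (Voyage and ZeroEntropy at 32 k tokens, OpenAI and Gemini at 8 k tokens); the same cap is applied symmetrically to synth and baseline so that both corpora are truncated identically.\n\n### The MMD² estimator\n\n**Maximum Mean Discrepancy** (MMD) is a kernel-based two-sample distance for testing whether two finite samples $X$ and $Y$ come from the same underlying distribution.\n\nFix a positive-definite kernel $k : \\mathcal{X} \\times \\mathcal{X} \\to \\mathbb{R}$ with associated reproducing-kernel Hilbert space $\\mathcal{H}$ and feature map $\\varphi(x) = k(x, \\cdot)$.\nProvided the kernel is **measurable** and satisfies the moment condition $\\mathbb{E}_{x \\sim P}\\big[\\sqrt{k(x, x)}\\big] < \\infty$ for the distributions $P$ and $Q$ being compared, the **mean embeddings**\n\n$$\\mu_P = \\mathbb{E}_{x \\sim P}[\\varphi(x)], \\qquad \\mu_Q = \\mathbb{E}_{y \\sim Q}[\\varphi(y)]$$\n\nare well-defined elements of $\\mathcal{H}$.\n**Bounded** kernels (such as the Gaussian RBF below, where $0 < k(x, y) \\le 1$) automatically satisfy this condition for every $P$.\nThe population MMD² is then defined as the squared RKHS distance between the two mean embeddings:\n\n$$\\mathrm{MMD}^2(P, Q) = \\lVert \\mu_P - \\mu_Q \\rVert_{\\mathcal{H}}^2$$\n\nGiven finite samples $X = \\{x_i\\}_{i=1}^{m}$ and $Y = \\{y_j\\}_{j=1}^{n}$ from $P$ and $Q$, plug in the empirical mean embeddings $\\hat{\\mu}_P = \\frac{1}{m} \\sum_{i=1}^{m} \\varphi(x_i)$ and $\\hat{\\mu}_Q = \\frac{1}{n} \\sum_{j=1}^{n} \\varphi(y_j)$, expand the squared norm, and apply the reproducing-kernel identity $\\langle \\varphi(x), \\varphi(y) \\rangle_{\\mathcal{H}} = k(x, y)$ to get an estimator written purely in terms of pairwise kernel evaluations:\n\n$$\\widehat{\\mathrm{MMD}}^2_b(X, Y) = \\frac{1}{m^2} \\sum_{i=1}^{m} \\sum_{i'=1}^{m} k(x_i, x_{i'}) + \\frac{1}{n^2} \\sum_{j=1}^{n} \\sum_{j'=1}^{n} k(y_j, y_{j'}) - \\frac{2}{mn} \\sum_{i=1}^{m} \\sum_{j=1}^{n} k(x_i, y_j)$$\n\nWith a **characteristic** kernel (such as the Gaussian RBF used below), $\\mathrm{MMD}^2(P, Q) = 0$ if and only if $P = Q$.\nThe metric is then a faithful distributional distance on the space of probability measures, not just a moment comparison.\n\nFor my use case (single-sample novelty against a fixed baseline), I treat the synthetic corpus as $X$ and the deployment baseline as $Y$, and report the resulting per-(env, seed) $\\widehat{\\mathrm{MMD}}^2_u(X, Y)$ (the unbiased variant from the U-statistic section below) as the novelty 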
score.\n\n#### Kernel choice\n\nI use the **Gaussian radial basis function (RBF) kernel**:\n\n$$k(x, y) = \\exp\\left(-\\frac{\\lVert x - y \\rVert^2}{2\\sigma^2}\\right)$$\n\nThe **median heuristic** sets $\\sigma^2$ per (env, seed) to the median squared pairwise distance over a random 500-row subsample of the aggregate sample $Z = X \\cup Y$:\n\n$$\\sigma^2 = \\operatorname{median}\\big\\{ \\lVert z_a - z_b \\rVert^2 : a < b \\big\\}$$\n\n#### U-statistic MMD²\n\nFor finite samples $X$ and $Y$ of any sizes $m$ and $n$, I use the **unbiased U-statistic** estimator:\n\n$$\\widehat{\\mathrm{MMD}}^2_u(X, Y) = \\frac{1}{m(m-1)} \\sum_{i ≠ i'} k(x_i, x_{i'}) + \\frac{1}{n(n-1)} \\sum_{j ≠ j'} k(y_j, y_{j'}) - \\frac{2}{mn} \\sum_{i=1}^{m} \\sum_{j=1}^{n} k(x_i, y_j)$$\n\nBeing unbiased matters in this setup specifically because $m$ varies substantially across envs (25–174 synthetic rows depending on env and seed): a biased estimator would introduce a per-env offset that contaminates cross-env comparisons. The unbiased estimator can return slightly negative values when the two distributions are nearly identical, which is sample variance around a true MMD² of zero, not an error.\n\nI report this estimator as the per-(env, seed) novelty score:\n\n$$\\mathrm{novelty}(\\mathrm{env}, s) = \\widehat{\\mathrm{MMD}}^2_u\\big(X_{\\mathrm{env}, s},\\, Y_{\\mathrm{env}}\\big)$$\n\n### Aggregating across seeds\n\nFor each (env, embedder) there are K MMD² values, one per synthesis seed.\nI report the **median across the K seeds** as the per-env point estimate, with the inter-quartile range as the seed-spread error bar:\n\n$$\\mathrm{novelty}(\\mathrm{env}) = \\operatorname{median}_{s = 1, \\dots, K}\\, \\mathrm{novelty}(\\mathrm{env}, s), \\qquad \\mathrm{IQR}(\\mathrm{env}) = Q_{0.75} - Q_{0.25}$$\n\nwhere $Q_{0.25}$ and $Q_{0.75}$ are the quartiles of the same K per-seed values.\n\nThe IQR captures variation across agent runs. Median + IQR is robust to a single anomalous seed (e.g. one Wordle run that happens to land in a less typical region of the distribution).\n\nIn the chart, the median is the X-axis point position and the IQR is the horizontal whisker. The same convention is used for the Y-axis (Δ success rate across eval seeds) so both axes display the same kind of error bar.\n\n## Inside the chart\n\nTwo axes organize the chart. The first is **visibility**: can Claude Code see (or correctly guess) the real input distribution? That is what drift measures, so low drift means high visibility. The second is **tolerance**: even when Claude Code guesses wrong, does the task care? Data helps only when an application scores low on *both*: Claude Code is flying blind *and* the metric punishes it for it.\n\nRead application by application, the seven sort into a few groups. 
NER is the clean no-effect pole: the optimization model has the corpus memorized and its broader knowledge of NER covers the rest, so visibility is high and data has nothing to add. YC bench is the data-helps pole: the simulator postdates the model’s training cutoff and the harness underspecifies what the agent observes, so visibility is low, and the metric is unforgiving. NDA is the outlier that forces the second axis onto the page: its visibility is just as low as YC bench’s, but the task tolerates the gap, so data barely moves the needle. Two more no-ops fall out for a reason the two axes don’t cover, which I flag below.\n\n| Application | Sees the real distribution? | Does the gap hurt? | Data helps? |\n|---|---|---|---|\n| NER | Yes: corpus memorized, generic task shape | — | No |\n| Customer service | Yes: model already recognizes τ-bench | — | No |\n| Software Eng. | — capability ceiling (no prompt could move it) | — | No |\n| NDA | No: invents the wrong document genre | No: extraction is genre-agnostic | No |\n| Wordle | Partly | Somewhat | A little |\n| Science | No | Yes | Yes |\n| Business mgmt (YC bench) | No | Yes | Yes (largest gap) |\n\nThe first three rows are no-ops because Claude Code is not flying blind, or (for software engineering) because the agent model cannot improve no matter what the prompt says. The bottom three are the cases where data earns its keep. NDA is the row that does the conceptual work, sitting between them: blind, but on a task that does not punish blindness.\n\n### Entity extraction (NER)\n\nWith or without 100 real traces, Claude Code improved the NER agent by around 60 percentage points. The data made no measurable difference, and once I started looking at why, I could see that NER is cooked into Claude Code at two levels: corpus memorization and task-shape knowledge. 
Either alone would have been enough to make data ablation a no-op.\n\n**Claude Code has the specific corpus memorized.**\nIn the without-data condition, when Claude Code constructed probes to test its prompt edits, I noticed something striking: **24 of 31 probes across seeds were character-for-character copies of CoNLL++ rows** (three examples below).\nClaude Code was pulling these sentences straight from its training distribution and using them to check the prompts it was writing.\n100 real traces have nothing to add.\n\n## Three verbatim probes vs. their CoNLL++ matches\n\nThese are the three highest-similarity probes from the without-data run, each paired with the baseline row that its nearest-neighbor search points at. Cosine similarity tops out at 0.95 (rather than 1.00) only because the role prefix differs (`[user]` vs `[user:text]`); the body text is character-identical.\n\n**Probe (seed 1):**\n\nWest Indian all-rounder Phil Simmons took four for 38 on Friday as Leicestershire beat Somerset by an innings and 39 runs in two days to take over at the head of the county championship .\n\n**CoNLL++ match (validation split):** identical.\n\n**Probe (seed 2):**\n\nGermany 's representative to the European Union 's veterinary committee Werner Zwingmann said on Wednesday consumers should buy sheepmeat from countries other than Britain until the scientific advice was clearer .\n\n**CoNLL++ match (training split):** identical.\n\n**Probe (seed 1):**\n\nThe European Commission said on Thursday it disagreed with German advice to consumers to shun British lamb until scientists determine whether mad cow disease can be transmitted to sheep .\n\n**CoNLL++ match (training split):** identical.\n\n**Even without that memorization, Claude Code’s knowledge of what NER data looks like is good enough.**\nWhen I asked Claude Code to synthesize an NER corpus from the application spec alone, it produced **0 verbatim CoNLL++ rows out of 130**.\nYet the synthetic and real 
CoNLL++ corpora still land in the same embedding-space neighborhood, which is what produces the low novelty score.\nThe synthetic rows are modern, naturally-punctuated prose (“The COP30 climate summit in Belém, Brazil drew delegates from 190 nations.”); CoNLL++ is Reuters-style August-1996 news with Penn-tokenization (“BRUSSELS 1996-08-22 . EU rejects German call to boycott British lamb .”).\nThe two corpora share zero sentences and zero persons; overlap concentrates in geopolitical place names and perennial organizations.\nClaude Code’s knowledge covers the *shape* of NER data, namely entity-rich short news prose in the four CoNLL categories, even when it does not reproduce the specific corpus.\n\n### Business management (YC bench)\n\nYC bench sits at the opposite extreme.\nWithout 100 real traces, the optimized CEO agent averaged **0.6 successful tasks per episode** (a task is a contract the CEO accepted, assigned to an employee, and completed before its deadline); with the traces, it averaged **8**.\nThis operational lift translated to a **20 percentage point** increase in survival rate, the simulation’s primary metric.\nThe data was doing essentially all the work, and when I dug in, none of the things that made NER a no-op were in place: Claude Code did not memorize the benchmark, the knowledge it inherits from the configuration only covers half the application, and the optimization-time instruction barely elicits even that half.\n\n**Claude Code almost certainly did not see this benchmark.**\nYC bench was [published April 6, 2026](https://arxiv.org/abs/2604.02378), three months after Sonnet 4.6’s [training data cutoff of January 2026](https://platform.claude.com/docs/en/about-claude/models/overview).\nBarring some pre-release artifact that found its way into training, the CLI grammar, the structured opener, and the state schema are not in Claude Code’s training corpus.\n\n**The knowledge Claude Code inherits from the configuration covers actions, not 
observations.**\nWhen I asked Claude Code to synthesize a YC-bench corpus from the application spec alone, it used the correct `yc-bench` CLI vocabulary on its output side.\nEvery major subcommand (`market browse`, `task accept`, `task assign`, `task dispatch`, `sim resume`) appeared at a rate within ten percentage points of the real distribution, because the application’s system prompt lists every command and flag verbatim.\nThe user-side observation schema, however, is just `observation: string`, a pass-through with no structure documented anywhere, so Claude Code had to guess.\nIt guessed a plausible JSON-event format:\n\n```\n{\n  \"event\": \"simulation_started\",\n  \"funds_cents\": ...,\n  \"employee_count\": ...\n}\n```\n\nThe format is internally consistent with the spec’s hints (“All commands return JSON”, “Funds are in cents”) but disjoint from the real Markdown opener (`## Simulation Start — Take Immediate Action`).\n**0 of 636 synth rows reproduced that canonical header.**\nClaude Code knew what to *do*; it did not know what the environment would show it.\n\n**The optimization-time instruction elicited even less.**\nIn the without-data condition, Claude Code constructed probes to test the prompts it wrote, but its instruction did not ask for full episode simulation.\n**0 of 11 probes across seeds contained any yc-bench-specific token** (see below); the 11 collapsed to 5 generic “Simulation started. You are the CEO” strings.\nEven the action-side knowledge, which is right there in the system template, never surfaced.\nWith access to real traces, Claude Code copied the structured opener nearly verbatim (top NN cosine = 0.95) and wrote a prompt that handled the actual CLI workflow.\nThe data closes a gap the harness underspecifies and the optimization instruction cannot bridge.\n\n## A real opener vs. the entire without-data probe set\n\nThe with-data run copies real baseline rows nearly verbatim (top NN cosine = 0.95). 
The without-data run fabricates generic CEO roleplay.\n\n**Real opener (also reproduced by the with-data run):**\n\n```\n## Simulation Start — Take Immediate Action\n- current_time: 2025-01-01T00:00:00\n- horizon_end: 2026-01-01T00:00:00\n- funds: $250,000.00\n- monthly_payroll: $22,340.00\n- runway: ~11.2 months\n- employees: 3\n- active_tasks: 0\n- planned_tasks: 0\n\n**Your immediate priority**: generate revenue before payroll drains your runway.\nYou MUST complete these steps now:\n1. `yc-bench market browse --required-prestige-lte 1` — find tasks you can accept\n2. `yc-bench task accept --task-id <UUID>` — accept 2-3 suitable tasks\n3. `yc-bench employee list` — get employee IDs\n4. `yc-bench task assign --task-id <UUID> --employee-id <UUID>` — assign employees\n5. `yc-bench task dispatch --task-id <UUID>` — dispatch tasks\n6. `yc-bench sim resume` — advance simulation\n```\n\n**Synthetic openers (without-data run), all 11 probes collapsing to 5 distinct strings:**\n\nSimulation started. You are the CEO. What is your first action?\n\nSimulation started. Company initialized with $50,000 funds. You have 3 employees.\n\nSimulation started. You are the CEO. Begin by checking company status.\n\nSimulation started. What is your first action?\n\nSimulation started.\n\nNo `yc-bench` CLI, no structured state fields, no immediate-action list. The one number that does appear (`$50,000`) is off by 5x from the real `$250,000` initial funds.\n\n### Contract extraction (NDA)\n\nNDA caught my eye as a clear outlier on the chart.\nIts novelty score is high, comparable to YC bench, which predicts a large data-ablation gap.\nBut the actual gap was small. On F1, the optimized extraction agent reached **66% with 100 real traces** and **64% with none**, about two percentage points apart. 
On strict exact-match, the chart’s primary metric, the gap is essentially zero.\nThat broke the trend the other six applications followed and warranted a closer look.\n\n**High novelty: Claude Code invents the wrong document genre.**\nWhen I asked Claude Code to synthesize an NDA corpus from the application spec alone, it produced clean, short, contemporary template-style NDAs (“This Non-Disclosure Agreement is entered into as of March 5, 2024, by and between…”), averaging **443 characters** per document.\nThe real Kleister-NDA corpus consists of SEC-EDGAR filings, averaging **19,328 characters**, about forty-four times longer, with multi-section legalese (`WHEREAS`, `IN WITNESS WHEREOF`, `NOW, THEREFORE`), full confidentiality clauses, and OCR provenance markers from their `EX-10.x` exhibit form (`Exhibit`, `dex##.htm`, page-number artifacts).\nThose markers appear in 39% to 80% of real rows and in **zero** synth rows.\nThe cause is again a harness underspecification: the system template says only “Given the OCR text of an NDA, extract the following fields”, and the user-side schema does not constrain length, provenance, or structure.\nClaude Code extrapolates from the words “NDA” and “OCR text” and writes a perfectly reasonable contemporary NDA template, which happens not to be what Kleister-NDA contains.\nThe without-data optimization probes shared the same template register: 18 probes across seeds collapsed to 11 unique openers, three of them repetitions of the same “This Non-Disclosure Agreement is entered into as of…” phrase.\n\n**Small gap: the extraction task is genre-agnostic.**\nThe output side stayed faithful in both runs.\n100% of synth outputs parsed, all four fields (`effective_date`, `jurisdiction`, `party`, `term`) were always populated, and null rates per field landed within fifteen percentage points of the real corpus.\nThe application agent’s prior knowledge of how to read an NDA and pull out four fields generalizes 
across genres.\nIt handles the contemporary templates Claude Code practiced against and the SEC-EDGAR filings the test set actually contains.\nThe data adds two F1 points and roughly zero exact-match points, not twenty, because the extraction skill is already in the application agent’s knowledge, whichever corpus Claude Code practiced on.\n\nThe novelty score measures a real distributional gap on NDA. For this task, that gap turns out to be orthogonal to the metric.\n\n### The remaining applications\n\nOf the remaining four, two split along the same two axes. Scientific paper reproduction and Wordle both sit in the data-helps quadrant (Claude Code is at least partly flying blind and the metric cares), which is why they land on the positive side of the chart. Science behaves like a milder YC bench: novelty is high, the metric is unforgiving, and the data does real work. Wordle is milder still, worth a few percentage points.\n\nSoftware engineering and customer service are the two no-ops the visibility/tolerance axes do not explain: both lose the data dependence for reasons upstream of prior knowledge.\nSoftware engineering lands near zero because `gpt-5.4-mini` hits a performance ceiling on terminal-bench that no prompt proposed by Claude Code could move, with or without data: the agent model, not Claude Code’s visibility into the data, is the binding constraint.\nCustomer service (τ-bench retail) lands near zero for an adjacent reason: `gpt-5.4-mini` has likely been trained on enough τ-bench traces that it recognizes the task from the user turn alone, so prompt optimization makes no difference either way.\n\n## What I take away\n\nData matters when the agent engineer’s prior knowledge runs out. NER works without traces because the corpus is in Claude Code’s training data and the task shape is generic enough that its invented probes still land in the right neighborhood. 
YC bench falls apart without traces because the simulator postdates the training cutoff and the harness does not tell Claude Code enough to fill the gap. Embedding-space drift between Claude Code’s guesses and the real data tracks that pattern across all seven applications (Spearman ρ = +0.79, exact two-sided *p* = 0.040). But with n = 7 and one deliberate exception, I read it as evidence for the mechanism, not a law.\n\nThat exception is the second half of the lesson. NDA’s drift is high but its data-ablation gap is small: the task only requires reading each document and extracting four fields, which the application agent does on any reasonable NDA whether or not Claude Code practiced on the right genre. Drift tells you whether Claude Code is flying blind, not whether the task punishes it for that. **Visibility** and **tolerance** are two different axes, and data helps only when visibility is low and the task does not tolerate the gap.\n\nOne caveat for anyone hoping to use this as a pre-flight check: the drift estimator needs the real corpus to measure against, so it explains when data helped *after* the fact; it cannot, on its own, tell you whether to collect data before you have any. The mechanism still hands you two questions you can answer with no corpus at all. Is the task newer than your optimizer model’s training cutoff? And does the harness leave the input the agent will see underspecified (a bare `observation: string`, an OCR blob with no stated genre)? Those two questions flagged YC bench and NDA on their own. When both answers are no, reach for prompt optimization first, and save the trace collection for when they are not.\n\n## References\n\n- @Vtrivedy10. [X post](https://x.com/Vtrivedy10/status/2031408954517971368).\n- Osmani, A. (April 19, 2026). [Agent Harness Engineering](https://addyosmani.com/blog/agent-harness-engineering/).\n- Mehta, V., & Bianconi, G. 
(March 23, 2026). [We’re building an automated AI engineer, and it works](https://www.tensorzero.com/blog/automated-ai-engineer/). *TensorZero blog*.\n- Lee, Y., Nair, R., Zhang, Q., Lee, K., Khattab, O., & Finn, C. (March 30, 2026). [Meta-Harness: End-to-End Optimization of Model Harnesses](https://arxiv.org/abs/2603.28052). *arXiv preprint arXiv:2603.28052*.\n- Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., & Smola, A. (2012). [A Kernel Two-Sample Test](https://www.jmlr.org/papers/volume13/gretton12a/gretton12a.pdf). *Journal of Machine Learning Research*, 13, 723–773.\n\n## Citation\n\n```\n@misc{jesson2026whendoesdatahelp,\n  title        = {When does data help automated agent engineering?},\n  author       = {Jesson, Andrew},\n  year         = {2026},\n  month        = may,\n  howpublished = {andrewjesson.com},\n  url          = {https://andrewjesson.com/blog/when-does-data-help-automated-agent-engineering/},\n}\n```\n\n", "url": "https://wpnews.pro/news/when-does-data-help-automated-context-engineering", "canonical_source": "http://www.andrewjesson.com/blog/when-does-data-help-automated-agent-engineering/", "published_at": "2026-06-24 12:30:05+00:00", "updated_at": "2026-06-24 12:40:18.226950+00:00", "lang": "en", "topics": ["ai-agents", "large-language-models", "machine-learning", "ai-research", "ai-tools"], "entities": ["Claude Code", "Codex", "TensorZero", "Wordle", "YCBench", "TAUBench", "SWE-bench", "Kleister-NDA"], "alternates": {"html": "https://wpnews.pro/news/when-does-data-help-automated-context-engineering", "markdown": "https://wpnews.pro/news/when-does-data-help-automated-context-engineering.md", "text": "https://wpnews.pro/news/when-does-data-help-automated-context-engineering.txt", "jsonld": "https://wpnews.pro/news/when-does-data-help-automated-context-engineering.jsonld"}}