When Does Data Help Automated Context Engineering?

Claude Code can improve other AI agents without training data in four of seven tested applications, performing as well as with data. Data helps only where Claude Code's prior knowledge of the task runs out, and the drift between its self-generated inputs and real data predicts this gap with a Spearman correlation of +0.79. The finding suggests that automated agent engineering can often skip data collection when the model already knows the task domain.

When does data help automated agent engineering? Claude Code can often improve another agent with no training data at all. Across seven applications, data helps only where Claude Code’s own prior knowledge of the task runs out, and how far its guesses drift from the real data mostly tells you which cases those are. The catch is that drift reveals when Claude Code is flying blind, not whether flying blind actually costs anything.

An agent is more than a model. It is also the prompts, tools, context management, guardrails, orchestration, and compute infrastructure designed around the model [1, 2]. When the agent does something wrong, an agent engineer tunes one or more of these knobs so that the error never happens again.

Automated agent engineering goes meta by putting an AI agent in the role of agent engineer [3: https://www.tensorzero.com/blog/automated-ai-engineer/, 4: https://arxiv.org/abs/2603.28052]. Claude Code and Codex can improve agent prompts (/blog/the-engineering-practices-claude-code-and-codex-use-to-improve-ai-agents/) and perform engineering practices like iteratively evaluating their changes.

But what happens without training data? Surprisingly, Claude Code’s improvements performed as well without training data as with it on several of the applications I tested.
On three (Wordle, scientific paper reproduction, business simulation), data did improve task success rate by between five and twenty percentage points, but on the other four (entity extraction, contract extraction, customer service, software engineering) the without-data version performed roughly the same.

Looking into the conversation histories explained why. Even without training data, Claude Code still ran ad hoc evaluations: it made inferences on inputs it generated itself and judged whether the output matched its prompt edit. So the operative question is not whether Claude Code has data; it is whether Claude Code already knows what the data looks like. Data helps exactly when Claude Code’s prior knowledge of the application runs out, and that prior runs out in specific, identifiable ways: the corpus is not memorized, the task shape is unfamiliar, or the harness never says what the agent will actually be shown.

That suggests a way to measure the missing prior from the outside: how far Claude Code’s self-generated inputs drift from the real data. Across the seven applications, this drift tracks the data-ablation gap (the test-set success rate with data minus the rate without it) at a Spearman rank correlation of +0.79 (exact two-sided p = 0.040; Pearson r = +0.96). The lone clean exception is the instructive part: drift measures whether Claude Code is flying blind, not whether flying blind hurts the metric. Those are two different axes, and one task scores high on the first while shrugging off the second.

The experiment

The seven applications I explored are: named entity extraction (NER, https://arxiv.org/abs/1909.01441v1), NDA clause extraction (https://github.com/applicaai/kleister-nda), Wordle, customer service (https://taubench.com/#home), software engineering (https://www.tbench.ai/), business management (https://www.ycbench.com/), and scientific paper reproduction (Science, https://arxiv.org/abs/2510.24591).
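As a rough sketch of how a drift-versus-gap rank correlation and its exact two-sided p-value could be computed for seven applications, the Node snippet below ranks both series, takes Pearson's r on the ranks, and enumerates every permutation (7! = 5040). The `drift` and `gap` numbers are invented placeholders, not the measurements behind the +0.79 figure. The function names are hypothetical, and this is not necessarily how the statistics quoted above were produced.

```javascript
// Average ranks (1-based); tied values share the mean of their positions.
function ranks(values) {
  const order = values.map((v, i) => [v, i]).sort((a, b) => a[0] - b[0]);
  const out = new Array(values.length);
  let i = 0;
  while (i < order.length) {
    let j = i;
    while (j + 1 < order.length && order[j + 1][0] === order[i][0]) j++;
    const avg = (i + j) / 2 + 1;
    for (let k = i; k <= j; k++) out[order[k][1]] = avg;
    i = j + 1;
  }
  return out;
}

// Plain Pearson correlation coefficient.
function pearson(x, y) {
  const n = x.length;
  const mx = x.reduce((a, b) => a + b, 0) / n;
  const my = y.reduce((a, b) => a + b, 0) / n;
  let sxy = 0, sxx = 0, syy = 0;
  for (let i = 0; i < n; i++) {
    sxy += (x[i] - mx) * (y[i] - my);
    sxx += (x[i] - mx) ** 2;
    syy += (y[i] - my) ** 2;
  }
  return sxy / Math.sqrt(sxx * syy);
}

// Spearman's rho is Pearson's r computed on the ranks.
function spearman(x, y) {
  return pearson(ranks(x), ranks(y));
}

// Exact two-sided p-value: the share of all permutations of y whose |rho|
// is at least as large as the observed |rho|. Only practical for small n.
function exactTwoSidedP(x, y) {
  const rx = ranks(x);
  const observed = Math.abs(pearson(rx, ranks(y)));
  let extreme = 0;
  let total = 0;
  const visit = (arr, k) => {
    if (k === arr.length) {
      total++;
      if (Math.abs(pearson(rx, arr)) >= observed - 1e-12) extreme++;
      return;
    }
    for (let i = k; i < arr.length; i++) {
      [arr[k], arr[i]] = [arr[i], arr[k]];
      visit(arr, k + 1);
      [arr[k], arr[i]] = [arr[i], arr[k]];
    }
  };
  visit(ranks(y), 0);
  return extreme / total;
}

// Placeholder inputs (NOT the article's data): one value per application.
const drift = [0.05, 0.10, 0.12, 0.20, 0.45, 0.60, 0.80];
const gap = [0, 1, -1, 0.5, 5, 12, 20]; // with-data minus without-data, in points

console.log('spearman rho:', spearman(drift, gap).toFixed(2));
console.log('exact two-sided p:', exactTwoSidedP(drift, gap).toFixed(3));
```

With only seven points, enumerating all permutations is cheap and avoids relying on a large-sample approximation for the p-value.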
For each application agent I ran the same experiment under two conditions, with data and without, using five independent seeds per condition. Claude Code was given configuration files for the agent (prompts, models, tool lists, …). It was instructed to improve the initial prompts. The new prompts were scored on a held-out set of test tasks. The only difference was whether Claude Code was given 100 real traces as training data. In the without-data condition, the same baseline-data paths existed but the trace files were empty.

Claude Code as Agent Harness Engineer

- The optimizer is Claude Code running against an internal harness that lets it edit the application agent configuration file. Tool access is `Read`, `Bash`, `Edit`, `Write`. No external MCP servers.
- The optimizer runs inside an isolated Docker container (base image `node:24-slim`) containing only the Claude Code CLI, `curl`, and `git`. There is no Python and no eval source code on the filesystem. The container shares a Docker network with the gateway, so Bash can `curl http://gateway:3000/inference ...` to test prompts but has no other route to the application code.
- Claude Code is running `claude-sonnet-4-6`. The application agent model is `gpt-5.4-mini` across all seven applications.

Claude Code is given the following instruction at the start of every run, with placeholders like `{config_dir}` and `{function_name}` resolved per application before the run begins. The contents are held constant across all conditions.

TensorZero Function Optimizer

You are optimizing a TensorZero function to improve its performance metric.

Environment

- T0 config files: `{config_dir}/` (only these and the baseline data below are relevant — don't explore elsewhere)
- Gateway URL: `{gateway_url}`
- Pre-dumped baseline data: `{baseline_data_dir}/` (read-only; direct DB access is not available)
- Restart after config edits: `curl -sf -X POST http://eval:5111/restart-gateway`
- Isolated container.
No Python or `pip`; `node` and `curl` are on `$PATH`; `jq` is not installed. Use `node -e "..."` for JSONL parsing (`readline` + `JSON.parse` + project to stdout) — prefer it over shell pipelines when you need fields per row.
- Don't set `temperature` on any variant (some models reject non-default values). Keep an initial variant as a baseline reference.
- Don't run evaluation episodes yourself — the harness does that after you exit.

Task

- Function: `{function_name}`
- Metric: `{metric_name}`. Check the metric's `optimize` field in `tensorzero.toml` for direction (boolean and float metrics may minimize or maximize).
- Baseline performance: `{baseline_metrics}`

Available Models

`{model_list}`

Baseline data

- `{baseline_data_dir}/inferences.jsonl` — one row per inference (what the model said per task).
- `{baseline_data_dir}/feedback.jsonl` — one row per metric value.
- `{baseline_data_dir}/initial_config/` — read-only copy of the starting T0 config tree.

Files are often 20+ MB. Don't `cat` them whole. Start by `head -3` on each to learn the row shape (field names and nesting vary by env), then project out the fields you need.

The projection pattern (`grep` first to narrow, then `node -e` to project):

```bash
grep $TARGET_ID {baseline_data_dir}/inferences.jsonl \
  | node -e "
    require('readline').createInterface({input: process.stdin}).on('line', l => {
      const r = JSON.parse(l);
      console.log(r.id, r.variant_name, JSON.stringify(r.output).slice(0,200));
    });"
```

`cat inferences.jsonl | ...` loads the whole file; `grep`-first keeps the pipeline cheap.

Cross-record one-liners

Adapt the failure predicate to your metric — boolean uses `"value":0` / `"value":1`; float values depend on `optimize` direction.
```bash
# Inferences per episode
grep -o '"episode_id":"[^"]*"' {baseline_data_dir}/inferences.jsonl | sort | uniq -c | sort -rn | head

# Last inference of a failing episode
grep $FAIL_ID {baseline_data_dir}/inferences.jsonl | tail -1

# Which metrics are present
grep -o '"metric_name":"[^"]*"' {baseline_data_dir}/feedback.jsonl | sort | uniq -c

# target_ids of failures (boolean example — adapt the predicate for float metrics)
grep '"metric_name":"{metric_name}"' {baseline_data_dir}/feedback.jsonl \
  | node -e "
    require('readline').createInterface({input: process.stdin}).on('line', l => {
      const r = JSON.parse(l);
      if (r.value === 0 || r.value === false) console.log(r.target_id);
    });" > /tmp/failed.txt
head -5 /tmp/failed.txt | while read id; do grep "$id" {baseline_data_dir}/inferences.jsonl | head -1; done
```

Templates, schemas, and the required content shape

TensorZero has two co-existing config styles. Check which one the function uses in `tensorzero.toml`:

Legacy (per-role):

```toml
[functions."my_fn"]
user_schema = "functions/my_fn/user_schema.json"  # and system_schema, assistant_schema

[functions."my_fn".variants.initial]
user_template = "functions/my_fn/initial/user_template.minijinja"
```

New (named):

```toml
[functions."my_fn"]
schemas.user_query.path = "functions/my_fn/user_query_schema.json"

[functions."my_fn".variants.initial]
templates.user_query.path = "functions/my_fn/initial/user_query.minijinja"
```

Canonical content block for a templated message (both styles):

json "content": { "type": "template", "name": "