Coding agents perform common engineering practices when asked to improve AI agents. Will they subsume specialized tools for failure-mode analysis, evaluations, and prompt optimization?
Give a coding agent a simulated agent application, a hundred baseline traces, and a metric to optimize, and it will ship an improvement. Both Claude Code and Codex do this. I wanted to see how they go about it.
I’m checking the current TensorZero config and the baseline traces for yc_bench_tutorial_v0::yc_bench_act so I can identify failure patterns before editing variants.
I prompted Claude Code and Codex to optimize five simulated agent applications, varying only which agent CLI was in the container. I was surprised, though maybe I should not have been, to find that both used practices I had not prompted for, like clustering and summarizing failure patterns. They also ran ad-hoc evaluations to refine and debug their proposed changes to the model or prompt. By performing these common engineering practices, they shipped improvements without calling any specialized tooling for failure-mode analysis, evaluations, or prompt optimization. These observations led me to reconsider the role and shape of such tooling as agent optimization becomes more automated. They are also why I started a project I call harness attribution; this post is its first probe.
## Setup
For each of the following applications, I ran a baseline agent with an initial prompt and model (`gpt-5.4-mini`) on up to 100 different tasks. The resulting traces were scored with application-specific feedback.
| Application | Description | Metric |
|---|---|---|
| Software Engineering | Long-horizon Linux agent solving coding tasks through `execute_command` / `submit_solution` | `reward` (verifier score, 0–1) |
| Business Management | Multi-turn CEO agent driving a business simulation through a single `run_command` tool | `tasks_succeeded` (number of tasks delivered on or before deadline) |
| Data Extraction: NER | Single-shot: a sentence → four entity lists (`person`, `organization`, `location`, `miscellaneous`) | `exact_match` on entity sets |
| Data Extraction: NDA | Single-shot: OCR’d NDA text → `effective_date`, `jurisdiction`, `party` (list), `term` | `f1` over fields |
| Science | Long-horizon agent reproducing a published astrophysics paper from a sandboxed dataset and a masked PDF via `execute_command` / `submit_solution` | `reward` (binary match against paper’s value) |

The optimization task was to propose improvements to the application by modifying the baseline agent prompt and/or switching to a different model at a similar price point.
The optimizer agent (Claude Code on `claude-sonnet-4-6` or Codex on `gpt-5.4`) was then dropped into a container with access to those traces, feedback, a copy of the baseline agent config, and a markdown skill file describing the task. It analyzed the traces and feedback, wrote one or more new model-prompt variants into the agent config, and exited. Validation of the proposed improvements revealed that both coding agents shipped new variants that matched or beat the baseline on every application: decisively on NER, Business Management, and Software Engineering; within one standard error on NDA and Science.
Held-out test scores by application. Error bars are mean ± SE across 5 seeds for the optimized variants; the baseline was run with a single seed for budget reasons, so its seed variance is unmeasured.
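
For readers who want to reproduce the error bars from per-seed scores, here is a minimal sketch in Node. It assumes the standard error is the sample standard deviation (n − 1 denominator) divided by √n; the post does not say which estimator the plot uses. `meanAndStandardError` and `withinOneSe` are hypothetical helper names.

```javascript
// Mean and standard error of the mean across per-seed scores.
// Assumption: SE = sample standard deviation (n - 1 denominator) / sqrt(n).
function meanAndStandardError(scores) {
  const n = scores.length;
  if (n < 2) {
    // A single seed (like the baseline here) gives no variance estimate.
    throw new Error('need at least two seeds to estimate a standard error');
  }
  const mean = scores.reduce((sum, x) => sum + x, 0) / n;
  const variance = scores.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (n - 1);
  return { mean, se: Math.sqrt(variance / n) };
}

// "Within one standard error" check against a single-seed baseline score.
// Only the variant's SE is used because the baseline's is unmeasured.
function withinOneSe(baselineScore, variant) {
  return Math.abs(variant.mean - baselineScore) <= variant.se;
}
```

Because the baseline has one seed, a comparison like `withinOneSe` understates the total uncertainty: it accounts for the variant's seed variance only.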
## What engineering practices do the agents use?
Both coding agents use the same skill file. It includes the application name, metric, available models, data layout, some recipes for efficiency, and a four-bullet methodology that says survey → add variants → test → iterate.
### The skill
Placeholders like `{config_dir}`, `{function_name}`, `{baseline_metrics}`, and `{model_list}` are substituted per-run by the harness.
You are optimizing a TensorZero function to improve its performance metric.
## Environment
- T0 config files: {config_dir}/ (only these and the baseline data below are relevant — don't explore elsewhere)
- Gateway URL: {gateway_url}
- Pre-dumped baseline data: {baseline_data_dir}/ (read-only; direct DB access is not available)
- Restart after config edits: `curl -sf -X POST http://eval:5111/restart-gateway`
- Isolated container. No Python or `pip`; `node` and `curl` are on `$PATH`; `jq` is not installed. Use `node -e "..."` for JSONL parsing (`readline` + `JSON.parse` + project to stdout) — prefer it over shell pipelines when you need fields per row.
- Don't set `temperature` on any variant (some models reject non-default values). Keep an `initial` variant as a baseline reference.
- Don't run evaluation episodes yourself — the harness does that after you exit.
## Task
- Function: `{function_name}`
- Metric: `{metric_name}`. Check the metric's `optimize` field in `tensorzero.toml` for direction (boolean and float metrics may minimize or maximize).
- Baseline performance: {baseline_metrics}
## Available Models
{model_list}
## Baseline data
- `{baseline_data_dir}/inferences.jsonl` — one row per inference (what the model said per task).
- `{baseline_data_dir}/feedback.jsonl` — one row per metric value.
- `{baseline_data_dir}/initial_config/` — read-only copy of the starting T0 config tree.
Files are often 20+ MB. Don't `cat` them whole. Start by `head -3` on each to learn the row shape (field names and nesting vary by env), then project out the fields you need.
### The projection pattern
`grep` first to narrow, then `node -e` to project:
```bash
grep $TARGET_ID {baseline_data_dir}/inferences.jsonl \
  | node -e "
    require('readline').createInterface({input: process.stdin}).on('line', l => {
      const r = JSON.parse(l);
      console.log(r.id, r.variant_name, JSON.stringify(r.output).slice(0,200));
    });"
```
`cat inferences.jsonl | ...` loads the whole file; grep-first keeps the pipeline cheap.
### Cross-record one-liners
Adapt the failure predicate to your metric: boolean uses `"value":0` / `"value":1`; float values depend on `optimize` direction.
```bash
# Inferences per episode
grep -o '"episode_id":"[^"]*"' {baseline_data_dir}/inferences.jsonl | sort | uniq -c | sort -rn | head
# Last inference of a failing episode
grep $FAIL_ID {baseline_data_dir}/inferences.jsonl | tail -1
# Which metrics are present
grep -o '"metric_name":"[^"]*"' {baseline_data_dir}/feedback.jsonl | sort | uniq -c
# Failing target_ids for the metric
grep '"metric_name":"{metric_name}"' {baseline_data_dir}/feedback.jsonl \
  | node -e "
    require('readline').createInterface({input: process.stdin}).on('line', l => {
      const r = JSON.parse(l);
      if (r.value === 0 || r.value === false) console.log(r.target_id);
    });" > /tmp/failed.txt
# First inference row for each of the first five failures
head -5 /tmp/failed.txt | while read id; do grep "$id" {baseline_data_dir}/inferences.jsonl | head -1; done
```
### Templates, schemas, and the required content shape
TensorZero has two co-existing config styles. Check which one the function uses in `tensorzero.toml`:

Legacy (per-role):
```toml
[functions."my_fn"]
user_schema = "functions/my_fn/user_schema.json"  # and system_schema, assistant_schema
[functions."my_fn".variants.initial]
user_template = "functions/my_fn/initial/user_template.minijinja"
```
New (named):
```toml
[functions."my_fn"]
schemas.user_query.path = "functions/my_fn/user_query_schema.json"
[functions."my_fn".variants.initial]
templates.user_query.path = "functions/my_fn/initial/user_query.minijinja"
```
Canonical content block for a templated message (both styles):
```json
"content": [{
  "type": "template",
  "name": "<template_name>",
  "arguments": { /* object matching the schema */ }
}]
```
For legacy, `"name"` is the role (`"user"` / `"system"` / `"assistant"`). For new, it's the key under `schemas.` / `templates.`.
For a role with no schema: `"content": "Hello"` or `[{"type":"text","text":"Hello"}]`.
## Methodology
The core loop is: survey the baseline → add variants → test one → iterate. The decisions worth getting right:
- Metric direction defines "failure." Don't assume `value:0` is bad; read the metric's `optimize` field.
- Judge manual variant tests by the `curl /inference` output itself: right tool call, right JSON, right content.
- Multi-turn agentic envs (customer service, business management, coding) need real conversational state to be representative. Pick a real episode from `inferences.jsonl`, copy its first 2–3 messages into your curl body, check how the variant continues. A turn-0 probe alone tells you little.
- When done, leave the best config in place with the experimentation section below, and exit.
## Routing: Experimentation Config
After creating new variants, add an experimentation section; otherwise the gateway round-robins and wastes test episodes on bad variants. Keep candidates to your best ~3–4, including `initial` as a baseline.
```toml
[functions."{function_name}".experimentation]
type = "track_and_stop"
metric = "{metric_name}"
candidate_variants = ["initial", "your_new_variant_1", "your_new_variant_2"]
fallback_variants = []
min_samples_per_variant = 5
delta = 0.1
epsilon = 0.0
update_period_s = 5
min_prob = 0.0
max_samples_per_variant = 10000
```
The skill stays silent on *how* to abstract failure patterns, or how to validate an improvement beyond probing it.
Both agents fill that gap.
Each reads the baseline traces and feedback, abstracts a handful of failure modes from the raw rows, writes two to four prompt variants, runs a few inferences, analyzes the new outputs, and exits.
What they do in those gaps, and what each agent reaches for differently, is below.
### They perform failure mode analysis
Failure mode analysis here means going from a dataset of inferences and feedback to a claim like “the model over-extracts `miscellaneous` because it treats it as a catch-all”.
The skill leaves both steps up to the agent: projecting the failed rows out of JSONL, then abstracting them into a named pattern.
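
One cheap way to check a hypothesis like the `miscellaneous` over-extraction example before naming it a failure mode is to compare per-field entity counts between failed and passed rows. The sketch below is illustrative and not taken from either agent's transcript. It assumes each row has already been projected to a hypothetical `{ passed, entities }` shape, with the four entity lists from the NER application; `meanEntityCounts` and `compareFailedToPassed` are hypothetical names.

```javascript
// Entity list names from the NER application's output.
const FIELDS = ['person', 'organization', 'location', 'miscellaneous'];

// Average number of extracted entities per field over a set of rows.
// Assumed row shape (hypothetical): { passed: boolean, entities: { person: [...], ... } }
function meanEntityCounts(rows) {
  const totals = Object.fromEntries(FIELDS.map(f => [f, 0]));
  for (const row of rows) {
    for (const f of FIELDS) totals[f] += (row.entities[f] || []).length;
  }
  return Object.fromEntries(
    FIELDS.map(f => [f, rows.length ? totals[f] / rows.length : 0])
  );
}

// Rank fields by how many more entities failed rows extract than passed rows.
// A large positive gap is a hint of over-extraction, not proof of it.
function compareFailedToPassed(rows) {
  const failed = meanEntityCounts(rows.filter(r => !r.passed));
  const passed = meanEntityCounts(rows.filter(r => r.passed));
  return FIELDS
    .map(f => ({ field: f, failed: failed[f], passed: passed[f], gap: failed[f] - passed[f] }))
    .sort((a, b) => b.gap - a.gap);
}
```

A gap only says where to look; reading a handful of the failing rows is still what turns it into a named pattern.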
On the projection step, the data is split across two files: `feedback.jsonl` says which `target_id`s failed, `inferences.jsonl` says what the model actually said for each one.
The original skill described the join in prose (*pull failing target_ids, then look up the corresponding inference rows*) but did not say how.
Both agents converged on the same recipe: grep the failing `target_id`s out of feedback, then grep each one back into inferences and tail to the last row.
I folded that recipe back into the skill, alongside a few related cross-record one-liners (inferences-per-episode, which-metrics-are-present, last-inference-of-a-failing-episode), because re-discovering them cost three to six turns at the start of every session.
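
The same join can be written as a single in-memory pass in Node. This is a minimal sketch, not the agents' code. The field names (`target_id`, `metric_name`, `value`, `episode_id`, `id`) come from the skill's one-liners; matching `target_id` against either `episode_id` or `id` is an assumption, as are the helper names `parseJsonl` and `lastInferencePerFailure`. The failure predicate is passed in because metric direction decides what counts as a failure.

```javascript
// Parse JSONL text into an array of objects, skipping blank lines.
function parseJsonl(text) {
  return text
    .split('\n')
    .filter(line => line.trim() !== '')
    .map(line => JSON.parse(line));
}

// Map each failing target_id to the last inference row that belongs to it.
// Assumption: target_id matches a row's episode_id or, failing that, its id.
function lastInferencePerFailure(feedbackText, inferencesText, metricName, isFailure) {
  const failing = new Set(
    parseJsonl(feedbackText)
      .filter(r => r.metric_name === metricName && isFailure(r.value))
      .map(r => r.target_id)
      .filter(id => id !== undefined)
  );
  const last = new Map();
  for (const row of parseJsonl(inferencesText)) {
    const key = failing.has(row.episode_id) ? row.episode_id
      : failing.has(row.id) ? row.id
      : null;
    if (key !== null) last.set(key, row); // later rows overwrite earlier ones
  }
  return last;
}
```

The sketch takes whole strings to stay self-contained. For the 20+ MB files the skill describes, the same logic would more likely run line by line over a `readline` stream.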
With the failed rows projected, both agents abstract failure modes across multiple traces, often surfacing bugs not mentioned in the skill or the function’s documentation. Toggle the optimizer and environment below to land on the moment each agent enumerates the failure modes it just abstracted from the baseline traces. Use the arrow keys to step through the surrounding turns.