The engineering practices Claude Code and Codex use to improve AI agents

Claude Code and Codex, two coding agents, autonomously improved AI agent applications by performing engineering practices such as clustering failure patterns and running ad-hoc evaluations, without relying on specialized tooling. In tests across five simulated applications, both agents shipped improvements that matched or exceeded baselines, prompting reconsideration of the role of dedicated failure-mode analysis and prompt optimization tools.

Coding agents perform common engineering practices when asked to improve AI agents. Will they subsume specialized tools for failure-mode analysis, evaluations, and prompt optimization?

Give a coding agent a simulated agent application, a hundred baseline traces, and a metric to optimize, and it will ship an improvement. Both Claude Code and Codex do this. I wanted to see what they do along the way.

> I'm checking the current TensorZero config and the baseline traces for `yc_bench_tutorial_v0::yc_bench_act` so I can identify failure patterns before editing variants.

I prompted Claude Code and Codex to optimize five simulated agent applications, varying only which agent CLI was in the container. I was surprised, though maybe I should not have been, to find that both adopted practices I had not prompted for, such as clustering and summarizing failure patterns. They also ran ad-hoc evaluations to refine and debug their proposed changes to the model or prompt. By performing these common engineering practices, they shipped improvements without calling any specialized tooling for failure-mode analysis, evaluations, or prompt optimization.

These observations made me reconsider the role and shape of such tooling as agent optimization becomes more automated. They are also why I started a project I call harness attribution; this post is its first probe.

## Setup

For each of the following applications, I ran a baseline agent with an initial prompt and the model `gpt-5.4-mini` on up to 100 different tasks. The resulting traces were scored with application-specific feedback.

| Application | Description | Metric |
|---|---|---|
| Software Engineering | Long-horizon Linux agent solving coding tasks through `execute_command` / `submit_solution` | `reward` (verifier score, 0–1) |
| Business Management | Multi-turn CEO agent driving a business simulation through a single `run_command` tool | `tasks_succeeded` (number of tasks delivered on or before deadline) |
| Data Extraction: NER | Single-shot: a sentence → four entity lists (`person`, `organization`, `location`, `miscellaneous`) | exact match on entity sets |
| Data Extraction: NDA | Single-shot: OCR'd NDA text → `effective_date`, `jurisdiction`, `party_list`, `term` | f1 over fields |
| Science | Long-horizon agent reproducing a published astrophysics paper from a sandboxed dataset and a masked PDF via `execute_command` / `submit_solution` | `reward` (binary match against paper's value) |

The optimization task was to propose improvements to the application by modifying the baseline agent prompt and/or choosing a different model at a similar price point. The optimizer agent (Claude Code on `claude-sonnet-4-6` or Codex on `gpt-5.4`) was then dropped into a container with access to those traces, feedback, a copy of the baseline agent config, and a markdown skill file describing the task. It analyzed the traces and feedback, wrote one or more new model-prompt variants into the agent config, and exited.

Validation of the proposed improvements showed that both coding agents shipped new variants that matched or beat the baseline on every application: decisively on NER, Business Management, and Software Engineering; within one standard error on NDA and Science.

Figure: Held-out test scores by application. Error bars are mean ± SE across 5 seeds for the optimized variants; the baseline was run with a single seed for budget reasons, so its seed variance is unmeasured.

## What engineering practices do the agents use?

Both coding agents use the same skill file.
It includes the application name, metric, available models, data layout, some recipes for efficiency, and a four-bullet methodology that says survey → add variants → test → iterate.

### The skill

Placeholders like `{config_dir}`, `{function_name}`, `{baseline_metrics}`, and `{model_list}` are substituted per-run by the harness.

#### TensorZero Function Optimizer

You are optimizing a TensorZero function to improve its performance metric.

##### Environment

- T0 config files: `{config_dir}/` (only these and the baseline data below are relevant; don't explore elsewhere)
- Gateway URL: `{gateway_url}`
- Pre-dumped baseline data: `{baseline_data_dir}/` (read-only; direct DB access is not available)
- Restart after config edits: `curl -sf -X POST http://eval:5111/restart-gateway`
- Isolated container. No Python or `pip`; `node` and `curl` are on `$PATH`; `jq` is not installed. Use `node -e "..."` for JSONL parsing (`readline` + `JSON.parse` + project to stdout); prefer it over shell pipelines when you need fields per row.
- Don't set `temperature` on any variant (some models reject non-default values). Keep an `initial` variant as a baseline reference.
- Don't run evaluation episodes yourself; the harness does that after you exit.

##### Task

- Function: `{function_name}`
- Metric: `{metric_name}`. Check the metric's `optimize` field in `tensorzero.toml` for direction (boolean and float metrics may minimize or maximize).
- Baseline performance: {baseline_metrics}

##### Available Models

{model_list}

##### Baseline data

- `{baseline_data_dir}/inferences.jsonl`: one row per inference (what the model said per task).
- `{baseline_data_dir}/feedback.jsonl`: one row per metric value.
- `{baseline_data_dir}/initial_config/`: read-only copy of the starting T0 config tree.

Files are often 20+ MB. Don't `cat` them whole. Start by `head -3` on each to learn the row shape (field names and nesting vary by env), then project out the fields you need.

##### The projection pattern

`grep` first to narrow, then `node -e` to project:

```bash
grep "$TARGET_ID" {baseline_data_dir}/inferences.jsonl \
  | node -e "
require('readline').createInterface({input: process.stdin}).on('line', l => {
  const r = JSON.parse(l);
  console.log(r.id, r.variant_name, JSON.stringify(r.output).slice(0,200));
});"
```

`cat inferences.jsonl | ...` loads the whole file; `grep`-first keeps the pipeline cheap.

##### Cross-record one-liners

Adapt the failure predicate to your metric: boolean uses `"value":0` / `"value":1`; float values depend on `optimize` direction.

```bash
# Inferences per episode
grep -o '"episode_id":"[^"]*"' {baseline_data_dir}/inferences.jsonl | sort | uniq -c | sort -rn | head

# Last inference of a failing episode
grep "$FAIL_ID" {baseline_data_dir}/inferences.jsonl | tail -1

# Which metrics are present
grep -o '"metric_name":"[^"]*"' {baseline_data_dir}/feedback.jsonl | sort | uniq -c

# target_ids of failures (boolean example; adapt the predicate for float metrics)
grep '"metric_name":"{metric_name}"' {baseline_data_dir}/feedback.jsonl \
  | node -e "
require('readline').createInterface({input: process.stdin}).on('line', l => {
  const r = JSON.parse(l);
  if (r.value === 0 || r.value === false) console.log(r.target_id);
});" > /tmp/failed.txt

# First inference row for each of the first five failing target_ids
head -5 /tmp/failed.txt | while read -r id; do grep "$id" {baseline_data_dir}/inferences.jsonl | head -1; done
```

##### Templates, schemas, and the required content shape

TensorZero has two co-existing config styles.
Check which one the function uses in `tensorzero.toml`:

Legacy (per-role):

```toml
[functions."my_fn"]
user_schema = "functions/my_fn/user_schema.json"  # and system_schema, assistant_schema

[functions."my_fn".variants.initial]
user_template = "functions/my_fn/initial/user_template.minijinja"
```

New (named):

```toml
[functions."my_fn"]
schemas.user_query.path = "functions/my_fn/user_query_schema.json"

[functions."my_fn".variants.initial]
templates.user_query.path = "functions/my_fn/initial/user_query.minijinja"
```

Canonical content block for a templated message (both styles):

```json
"content": [{
  "type": "template",
  "name": "<template_name>",
  "arguments": { /* object matching the schema */ }
}]
```

For legacy, `"name"` is the role (`"user"` / `"system"` / `"assistant"`). For new, it's the key under `schemas.*` / `templates.*`. For a role with no schema: `"content": "Hello"` or `[{"type":"text","text":"Hello"}]`.

##### Methodology

The core loop is: survey the baseline → add variants → test one → iterate. The decisions worth getting right:

- Metric direction defines "failure." Don't assume `value:0` is bad; read the metric's `optimize` field.
- Judge manual variant tests by the `curl /inference` output itself: right tool call, right JSON, right content.
- Multi-turn agentic envs (customer service, business management, coding) need real conversational state to be representative. Pick a real episode from `inferences.jsonl`, copy its first 2–3 messages into your `curl` body, check how the variant continues. A turn-0 probe alone tells you little.
- When done, leave the best config in place with the experimentation section below, and exit.

##### Routing: Experimentation Config

After creating new variants, add an experimentation section; otherwise the gateway round-robins and wastes test episodes on bad variants. Keep candidates to your best ~3–4, including `initial` as a baseline.

```toml
[functions."{function_name}".experimentation]
type = "track_and_stop"
metric = "{metric_name}"
candidate_variants = ["initial", "your_new_variant_1", "your_new_variant_2"]
fallback_variants = []
min_samples_per_variant = 5
delta = 0.1
epsilon = 0.0
update_period_s = 5
min_prob = 0.0
max_samples_per_variant = 10000
```

The skill stays silent on how to abstract failure patterns, or how to validate an improvement beyond probing it. Both agents fill that gap. Each reads the baseline traces and feedback, abstracts a handful of failure modes from the raw rows, writes two to four prompt variants, runs a few inferences, analyzes the new outputs, and exits. What they do in those gaps, and what each agent reaches for differently, is below.

## They perform failure mode analysis

Failure mode analysis here means going from a dataset of inferences and feedback to a statement like “the model over-extracts `miscellaneous` because it treats it as a catch-all.” The skill leaves both steps up to the agent: projecting the failed rows out of JSONL, then abstracting them into a named pattern.

On the projection step, the data is split across two files: `feedback.jsonl` says which `target_id`s failed, and `inferences.jsonl` says what the model actually said for each one. The original skill described the join in prose (“pull failing target ids, then look up the corresponding inference rows”) but did not say how. Both agents converged on the same recipe: `grep` the failing `target_id`s out of feedback, then `grep` each one back into inferences and `tail` to the last row. I folded that recipe back into the skill, alongside a few related cross-record one-liners (inferences-per-episode, which-metrics-are-present, last-inference-of-a-failing-episode), because re-discovering them cost three to six turns at the start of every session.

With the failed rows projected, both agents can then abstract patterns across multiple traces, often surfacing bugs not mentioned in the skill or the function’s documentation.
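
The feedback-to-inferences join described above can also be sketched in plain Node rather than `grep` and `tail`. This is a minimal illustration under stated assumptions, not the recipe either agent actually wrote: the helper names (`parseJsonl`, `isFailure`, `failingTargetIds`, `lastInferencePerFailure`), the 0.5 threshold for float metrics, and the sample rows are all hypothetical. It also assumes that `optimize` is either `"max"` or `"min"`, that a `target_id` can match either an inference `id` or an `episode_id`, and that rows appear in chronological order (the same assumption `tail -1` makes).

```javascript
// Hypothetical helpers; field names follow the skill (metric_name, value,
// target_id in feedback; id, episode_id, output in inferences).

// Parse JSONL text into an array of row objects, skipping blank lines.
function parseJsonl(text) {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}

// Metric direction defines "failure". Booleans are treated as 0/1.
// The threshold is an assumption for illustration, not part of the skill.
function isFailure(value, optimize = "max", threshold = 0.5) {
  const score = typeof value === "boolean" ? Number(value) : value;
  return optimize === "max" ? score < threshold : score > threshold;
}

// Step 1: collect the target_ids that failed on the chosen metric.
function failingTargetIds(feedbackRows, metricName, optimize) {
  const ids = new Set();
  for (const row of feedbackRows) {
    if (row.metric_name === metricName && isFailure(row.value, optimize)) {
      ids.add(row.target_id);
    }
  }
  return ids;
}

// Step 2: for each failing target, keep the last matching inference row.
// Later rows overwrite earlier ones, mirroring `grep ... | tail -1`.
function lastInferencePerFailure(inferenceRows, failedIds) {
  const last = new Map();
  for (const row of inferenceRows) {
    for (const key of [row.id, row.episode_id]) {
      if (failedIds.has(key)) last.set(key, row);
    }
  }
  return last;
}

// Made-up sample rows standing in for feedback.jsonl and inferences.jsonl.
const feedback = parseJsonl(`
{"metric_name":"exact_match","target_id":"ep-1","value":true}
{"metric_name":"exact_match","target_id":"ep-2","value":false}
{"metric_name":"other_metric","target_id":"ep-1","value":false}
`);
const inferences = parseJsonl(`
{"id":"inf-1","episode_id":"ep-1","output":"A"}
{"id":"inf-2","episode_id":"ep-2","output":"B"}
{"id":"inf-3","episode_id":"ep-2","output":"C"}
`);

const failed = failingTargetIds(feedback, "exact_match", "max");
const lastRows = lastInferencePerFailure(inferences, failed);
for (const [target, row] of lastRows) {
  console.log(target, row.id, JSON.stringify(row.output).slice(0, 200));
}
```

For the real 20+ MB files, the same two passes would more sensibly stream rows through `readline`, as the skill's snippets do, instead of holding the whole file in a string.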