# The engineering practices Claude Code and Codex use to improve AI agents

Claude Code and Codex, two coding agents, autonomously improved AI agent applications by performing engineering practices such as clustering failure patterns and running ad-hoc evaluations, without relying on specialized tooling. In tests across five simulated applications, both agents shipped improvements that matched or exceeded baselines, prompting reconsideration of the role of dedicated failure-mode analysis and prompt optimization tools.

Coding agents perform common engineering practices when asked to improve AI agents. Will they subsume specialized tools for failure-mode analysis, evaluations, and prompt optimization?

Give a coding agent a simulated agent application, a hundred baseline traces, and a metric to optimize, and it will ship an improvement. Both Claude Code and Codex do this. I wanted to see what they do along the way.

> I’m checking the current TensorZero config and the baseline traces for `yc_bench_tutorial_v0::yc_bench_act` so I can identify failure patterns before editing variants.

I prompted Claude Code and Codex to optimize five simulated agent applications, varying only which agent CLI was in the container. I was surprised, though maybe I should not have been, to find that both adopted practices I had not prompted for, such as clustering and summarizing failure patterns. They also ran ad-hoc evaluations to refine and debug their proposed changes to the model or prompt. By performing these common engineering practices, they shipped improvements without calling any specialized tooling for failure-mode analysis, evaluations, or prompt optimization.

These observations made me reconsider the role and shape of such tooling as agent optimization becomes more automated. They are also why I started a project I call harness attribution; this post is its first probe.
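To make "clustering failure patterns" concrete, here is a minimal Node.js sketch of the idea. It is not what either agent actually ran. The row shapes (`episode_id`, `value`), the sample outputs, and the keyword-based `signature` function are all assumptions invented for this illustration; real traces are richer and the grouping an agent applies is usually less mechanical.

```javascript
// Hypothetical feedback rows: value 0 marks a failed episode (assumed shape).
const feedback = [
  { episode_id: "e1", value: 0 },
  { episode_id: "e2", value: 1 },
  { episode_id: "e3", value: 0 },
  { episode_id: "e4", value: 0 },
];

// Hypothetical final model output per episode (made-up text).
const lastOutput = {
  e1: "Error: command timed out after 60s",
  e2: "All tests passed",
  e3: "submitted without running tests",
  e4: "Error: command timed out after 60s",
};

// Crude hand-written signature; a stand-in for whatever summarization
// an agent or a person would actually apply to a failing trace.
function signature(text) {
  if (/timed out/i.test(text)) return "timeout";
  if (/without running tests/i.test(text)) return "skipped-verification";
  return "other";
}

// Group failing episodes by signature, largest group first.
function clusterFailures(feedbackRows, outputs) {
  const clusters = new Map();
  for (const row of feedbackRows) {
    if (row.value !== 0) continue; // keep failures only
    const key = signature(outputs[row.episode_id] ?? "");
    const ids = clusters.get(key) ?? [];
    ids.push(row.episode_id);
    clusters.set(key, ids);
  }
  return [...clusters.entries()].sort((a, b) => b[1].length - a[1].length);
}

console.log(clusterFailures(feedback, lastOutput));
```

Sorting by cluster size puts the most common failure mode first, which is typically where a prompt or model change pays off most.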
## Setup

For each of the following applications, I ran a baseline agent with an initial prompt and model (`gpt-5.4-mini`) on up to 100 different tasks. The resulting traces were scored with application-specific feedback.

| Application | Description | Metric |
|---|---|---|
| Software Engineering | Long-horizon Linux agent solving coding tasks through `execute_command` / `submit_solution` | `reward` (verifier score, 0–1) |
| Business Management | Multi-turn CEO agent driving a business simulation through a single `run_command` tool | `tasks_succeeded` (number of tasks delivered on or before deadline) |
| Data Extraction: NER | Single-shot: a sentence → four entity lists (`person`, `organization`, `location`, `miscellaneous`) | exact match on entity sets |
| Data Extraction: NDA | Single-shot: OCR’d NDA text → `effective_date`, `jurisdiction`, `party_list`, `term` | F1 over fields |
| Science | Long-horizon agent reproducing a published astrophysics paper from a sandboxed dataset and a masked PDF via `execute_command` / `submit_solution` | `reward` (binary match against paper’s value) |

The optimization task was to propose improvements to the application by modifying the baseline agent prompt and/or choosing a different model at a similar price point.

The optimizer agent (Claude Code on `claude-sonnet-4-6` or Codex on `gpt-5.4`) was then dropped into a container with access to those traces, feedback, a copy of the baseline agent config, and a markdown skill file describing the task. It analyzed the traces and feedback, wrote one or more new model-prompt variants into the agent config, and exited.

Validation of the proposed improvements revealed that both coding agents shipped new variants that matched or beat the baseline on every application: decisively on NER, Business Management, and Software Engineering; within one standard error on NDA and Science.

Held-out test scores by application.
Error bars are mean ± SE across 5 seeds for the optimized variants; the baseline was run with a single seed for budget reasons, so its seed variance is unmeasured.

## What engineering practices do the agents use?

Both coding agents use the same skill file. It includes the application name, metric, available models, data layout, some recipes for efficiency, and a four-bullet methodology that says survey → add variants → test → iterate.

**The skill**

Placeholders like `{config_dir}`, `{function_name}`, `{baseline_metrics}`, and `{model_list}` are substituted per-run by the harness.

### TensorZero Function Optimizer

You are optimizing a TensorZero function to improve its performance metric.

#### Environment

- T0 config files: `{config_dir}/` (only these and the baseline data below are relevant; don't explore elsewhere)
- Gateway URL: `{gateway_url}`
- Pre-dumped baseline data: `{baseline_data_dir}/` (read-only; direct DB access is not available)
- Restart after config edits: `curl -sf -X POST http://eval:5111/restart-gateway`
- Isolated container. No Python or `pip`; `node` and `curl` are on `$PATH`; `jq` is not installed. Use `node -e "..."` for JSONL parsing (`readline` + `JSON.parse` + project to stdout); prefer it over shell pipelines when you need fields per row.
- Don't set `temperature` on any variant (some models reject non-default values). Keep an `initial` variant as a baseline reference.
- Don't run evaluation episodes yourself; the harness does that after you exit.

#### Task

- Function: `{function_name}`
- Metric: `{metric_name}`. Check the metric's `optimize` field in `tensorzero.toml` for direction (boolean and float metrics may minimize or maximize).
- Baseline performance: {baseline_metrics}

#### Available Models

{model_list}

#### Baseline data

- `{baseline_data_dir}/inferences.jsonl`: one row per inference (what the model said per task).
- `{baseline_data_dir}/feedback.jsonl`: one row per metric value.
- `{baseline_data_dir}/initial_config/`: read-only copy of the starting T0 config tree.

Files are often 20+ MB.
Don't `cat` them whole. Start by `head -3` on each to learn the row shape (field names and nesting vary by env), then project out the fields you need.

The projection pattern (`grep` first to narrow, then `node -e` to project):

```bash
grep $TARGET_ID {baseline_data_dir}/inferences.jsonl \
  | node -e "
require('readline').createInterface({input: process.stdin}).on('line', l => {
  const r = JSON.parse(l);
  console.log(r.id, r.variant_name, JSON.stringify(r.output).slice(0,200));
});"
```

`cat inferences.jsonl | ...` loads the whole file; `grep`-first keeps the pipeline cheap.

#### Cross-record one-liners

Adapt the failure predicate to your metric: boolean uses `"value":0` / `"value":1`; float values depend on `optimize` direction.

```bash
# Inferences per episode
grep -o '"episode_id":"[^"]*"' {baseline_data_dir}/inferences.jsonl | sort | uniq -c | sort -rn | head

# Last inference of a failing episode
grep $FAIL_ID {baseline_data_dir}/inferences.jsonl | tail -1

# Which metrics are present
grep -o '"metric_name":"[^"]*"' {baseline_data_dir}/feedback.jsonl | sort | uniq -c

# target_ids of failures (boolean example; adapt the predicate for float metrics)
grep '"metric_name":"{metric_name}"' {baseline_data_dir}/feedback.jsonl \
  | node -e "
require('readline').createInterface({input: process.stdin}).on('line', l => {
  const r = JSON.parse(l);
  if (r.value === 0 || r.value === false) console.log(r.target_id);
});" > /tmp/failed.txt
head -5 /tmp/failed.txt | while read id; do grep "$id" {baseline_data_dir}/inferences.jsonl | head -1; done
```

#### Templates, schemas, and the required content shape

TensorZero has two co-existing config styles.
Check which one the function uses in `tensorzero.toml`:

Legacy (per-role):

```toml
[functions."my_fn"]
user_schema = "functions/my_fn/user_schema.json"  # and system_schema, assistant_schema

[functions."my_fn".variants.initial]
user_template = "functions/my_fn/initial/user_template.minijinja"
```

New (named):

```toml
[functions."my_fn"]
schemas.user_query.path = "functions/my_fn/user_query_schema.json"

[functions."my_fn".variants.initial]
templates.user_query.path = "functions/my_fn/initial/user_query.minijinja"
```

Canonical content block for a templated message (both styles):

json "content": { "type": "template", "name": "