{"slug": "the-engineering-practices-claude-code-and-codex-use-to-improve-ai-agents", "title": "The engineering practices Claude Code and Codex use to improve AI agents", "summary": "Claude Code and Codex, two coding agents, autonomously improved AI agent applications by performing engineering practices such as clustering failure patterns and running ad-hoc evaluations, without relying on specialized tooling. In tests across five simulated applications, both agents shipped improvements that matched or exceeded baselines, prompting reconsideration of the role of dedicated failure-mode analysis and prompt optimization tools.", "body_md": "# The engineering practices Claude Code and Codex use to improve AI agents\n\n*Coding agents perform common engineering practices when asked to improve AI agents. Will they subsume specialized tools for failure-mode analysis, evaluations, and prompt optimization?*\n\nGive a coding agent a simulated agent application, a hundred baseline traces, and a metric to optimize, and it will ship an improvement.\nBoth Claude Code and Codex do this.\nI was interested in seeing *what* they do *while* doing it.\n\nI’m checking the current TensorZero config and the baseline traces for `yc_bench_tutorial_v0::yc_bench_act` so I can identify failure patterns before editing variants.\n\nI prompted Claude Code and Codex to optimize five simulated agent applications, varying only which agent CLI was in the container.\nI was surprised, though maybe I should not have been, to find that both adopted, unprompted, practices like clustering and summarizing failure patterns.\nThey also ran ad-hoc evaluations to refine and debug their proposed changes to the model or prompt.\nBy performing these common engineering practices, they shipped improvements without calling any specialized tooling for failure mode analysis, evaluations, or prompt optimization.\nThese observations made me reconsider the role and shape of such tooling as agent optimization 
becomes more automated.\nThey are also why I started a project I call **harness attribution**; this post is its first probe.\n\n## Setup\n\nFor each of the following applications, I ran a baseline agent with an initial prompt and model (`gpt-5.4-mini`) on up to 100 different tasks.\nThe resulting traces were scored with application-specific feedback.\n\n| Application | Description | Metric |\n|---|---|---|\n| Software Engineering | Long-horizon Linux agent solving coding tasks through `execute_command` / `submit_solution` | `reward` (verifier score, 0–1) |\n| Business Management | Multi-turn CEO agent driving a business simulation through a single `run_command` tool | `tasks_succeeded` (number of tasks delivered on or before deadline) |\n| Data Extraction: NER | Single-shot: a sentence → four entity lists (`person`, `organization`, `location`, `miscellaneous`) | `exact_match` on entity sets |\n| Data Extraction: NDA | Single-shot: OCR’d NDA text → `effective_date`, `jurisdiction`, `party` (list), `term` | `f1` over fields |\n| Science | Long-horizon agent reproducing a published astrophysics paper from a sandboxed dataset and a masked PDF via `execute_command` / `submit_solution` | `reward` (binary match against paper’s value) |\n\nThe optimization task was to propose improvements to the application by modifying the baseline agent prompt and/or switching to a different model at a similar price point.\nThe optimizer agent (Claude Code on `claude-sonnet-4-6` or Codex on `gpt-5.4`) was then dropped into a container with access to those traces, feedback, a copy of the baseline agent config, and a markdown skill file describing the task.\nIt analyzed the traces and feedback, wrote one or more new model-prompt variants into the agent config, and exited.\nValidating the proposed improvements showed that both coding agents shipped new variants that matched or beat the baseline on every application: decisively on NER, Business Management, and Software 
Engineering; within one standard error on NDA and Science.\n\n*Held-out test scores by application.*\n*Error bars are mean ± SE across 5 seeds for the optimized variants; the baseline was run with a single seed for budget reasons, so its seed variance is unmeasured.*\n\n## What engineering practices do the agents use?\n\nBoth coding agents use the same skill file.\nIt includes the application name, metric, available models, data layout, some recipes for efficiency, and a four-bullet methodology that says *survey → add variants → test → iterate*.\n\n## The skill\n\nPlaceholders like `{config_dir}`, `{function_name}`, `{baseline_metrics}`, and `{model_list}` are substituted per-run by the harness.\n\n```\n# TensorZero Function Optimizer\n\nYou are optimizing a TensorZero function to improve its performance metric.\n\n## Environment\n\n- T0 config files: {config_dir}/ (only these and the baseline data below are relevant — don't explore elsewhere)\n- Gateway URL: {gateway_url}\n- Pre-dumped baseline data: {baseline_data_dir}/ (read-only; direct DB access is not available)\n- Restart after config edits: `curl -sf -X POST http://eval:5111/restart-gateway`\n- Isolated container. No Python or `pip`; `node` and `curl` are on `$PATH`; `jq` is not installed. Use `node -e \"...\"` for JSONL parsing (`readline` + `JSON.parse` + project to stdout) — prefer it over shell pipelines when you need fields per row.\n- Don't set `temperature` on any variant (some models reject non-default values). Keep an `initial` variant as a baseline reference.\n- Don't run evaluation episodes yourself — the harness does that after you exit.\n\n## Task\n\n- Function: `{function_name}`\n- Metric: `{metric_name}`. 
Check the metric's `optimize` field in `tensorzero.toml` for direction (boolean and float metrics may minimize or maximize).\n- Baseline performance: {baseline_metrics}\n\n## Available Models\n\n{model_list}\n\n## Baseline data\n\n- `{baseline_data_dir}/inferences.jsonl` — one row per inference (what the model said per task).\n- `{baseline_data_dir}/feedback.jsonl` — one row per metric value.\n- `{baseline_data_dir}/initial_config/` — read-only copy of the starting T0 config tree.\n\nFiles are often 20+ MB. Don't `cat` them whole. Start by `head -3` on each to learn the row shape (field names and nesting vary by env), then project out the fields you need.\n\n### The projection pattern\n\n`grep` first to narrow, then `node -e` to project:\n\n``` bash\ngrep $TARGET_ID {baseline_data_dir}/inferences.jsonl \\\n  | node -e \"\n      require('readline').createInterface({input: process.stdin}).on('line', l => {\n        const r = JSON.parse(l);\n        console.log(r.id, r.variant_name, JSON.stringify(r.output).slice(0,200));\n      });\"\n```\n\n`cat inferences.jsonl | ...` loads the whole file; `grep`-first keeps the pipeline cheap.\n\n### Cross-record one-liners\n\nAdapt the failure predicate to your metric — boolean uses `\"value\":0` / `\"value\":1`; float values depend on `optimize` direction.\n\n``` bash\n# Inferences per episode\ngrep -o '\"episode_id\":\"[^\"]*\"' {baseline_data_dir}/inferences.jsonl | sort | uniq -c | sort -rn | head\n\n# Last inference of a failing episode\ngrep $FAIL_ID {baseline_data_dir}/inferences.jsonl | tail -1\n\n# Which metrics are present\ngrep -o '\"metric_name\":\"[^\"]*\"' {baseline_data_dir}/feedback.jsonl | sort | uniq -c\n\n# target_ids of failures (boolean example — adapt the predicate for float metrics)\ngrep '\"metric_name\":\"{metric_name}\"' {baseline_data_dir}/feedback.jsonl \\\n  | node -e \"\n      require('readline').createInterface({input: process.stdin}).on('line', l => {\n        const r = JSON.parse(l);\n        if 
(r.value === 0 || r.value === false) console.log(r.target_id);\n      });\" > /tmp/failed.txt\nhead -5 /tmp/failed.txt | while read id; do grep \"$id\" {baseline_data_dir}/inferences.jsonl | head -1; done\n```\n\n### Templates, schemas, and the required `content` shape\n\nTensorZero has two co-existing config styles. Check which one the function uses in `tensorzero.toml`:\n\n**Legacy** (per-role):\n\n``` toml\n[functions.\"my_fn\"]\nuser_schema = \"functions/my_fn/user_schema.json\"   # and system_schema, assistant_schema\n\n[functions.\"my_fn\".variants.initial]\nuser_template = \"functions/my_fn/initial/user_template.minijinja\"\n```\n\n**New** (named):\n\n``` toml\n[functions.\"my_fn\"]\nschemas.user_query.path = \"functions/my_fn/user_query_schema.json\"\n\n[functions.\"my_fn\".variants.initial]\ntemplates.user_query.path = \"functions/my_fn/initial/user_query.minijinja\"\n```\n\n**Canonical `content` block for a templated message** (both styles):\n\n``` json\n\"content\": [{\n  \"type\": \"template\",\n  \"name\": \"<template_name>\",\n  \"arguments\": { /* object matching the schema */ }\n}]\n```\n\nFor legacy, `\"name\"` is the role (`\"user\"` / `\"system\"` / `\"assistant\"`). For new, it's the key under `schemas.` / `templates.`.\n\nFor a role with no schema: `\"content\": \"Hello\"` or `[{\"type\":\"text\",\"text\":\"Hello\"}]`.\n\n## Methodology\n\nThe core loop is: survey the baseline → add variants → test one → iterate. The decisions worth getting right:\n\n- **Metric direction defines \"failure.\"** Don't assume `value:0` is bad; read the metric's `optimize` field.\n- **Judge manual variant tests by the `curl /inference` output itself** — right tool call, right JSON, right content.\n- **Multi-turn agentic envs** (customer service, business management, coding) need real conversational state to be representative. Pick a real episode from `inferences.jsonl`, copy its first 2–3 messages into your curl body, check how the variant continues. 
A turn-0 probe alone tells you little.\n- **When done, leave the best config in place** with the experimentation section below, and exit.\n\n## Routing: Experimentation Config\n\nAfter creating new variants, add an experimentation section — otherwise the gateway round-robins and wastes test episodes on bad variants. Keep candidates to your best ~3–4, including `initial` as a baseline.\n\n``` toml\n[functions.\"{function_name}\".experimentation]\ntype = \"track_and_stop\"\nmetric = \"{metric_name}\"\ncandidate_variants = [\"initial\", \"your_new_variant_1\", \"your_new_variant_2\"]\nfallback_variants = []\nmin_samples_per_variant = 5\ndelta = 0.1\nepsilon = 0.0\nupdate_period_s = 5\nmin_prob = 0.0\nmax_samples_per_variant = 10000\n```\n\nThe skill stays silent on *how* to abstract failure patterns, or how to validate an improvement beyond probing it.\nBoth agents fill that gap.\nEach reads the baseline traces and feedback, abstracts a handful of failure modes from the raw rows, writes two to four prompt variants, runs a few inferences, analyzes the new outputs, and exits.\nWhat they do in those gaps, and what each agent reaches for differently, is below.\n\n### They perform failure mode analysis\n\nFailure mode analysis here means going from a dataset of inferences and feedback to a statement like “the model over-extracts `miscellaneous` because it treats it as a catch-all”.\nThe skill leaves both prerequisites up to the agent: projecting the failed rows out of JSONL, then abstracting them into a named pattern.\n\nOn the projection step, the data is split across two files: `feedback.jsonl` says which `target_id`s failed, `inferences.jsonl` says what the model actually said for each one.\nThe original skill described the join in prose (*pull failing target_ids, then look up the corresponding inference rows*) but did not say how.\nBoth agents converged on the same recipe: grep the failing `target_id`s out of feedback, then grep each one back into inferences and tail to the last 
row.\nI folded that recipe back into the skill, alongside a few related cross-record one-liners (inferences-per-episode, which-metrics-are-present, last-inference-of-a-failing-episode), because re-discovering them cost three to six turns at the start of every session.\n\nWith the failed rows projected, both agents can do the abstraction across multiple traces, often including bugs not mentioned in the skill or the function’s documentation.", "url": "https://wpnews.pro/news/the-engineering-practices-claude-code-and-codex-use-to-improve-ai-agents", "canonical_source": "https://www.andrewjesson.com/blog/the-engineering-practices-claude-code-and-codex-use-to-improve-ai-agents/", "published_at": "2026-06-17 12:28:55+00:00", "updated_at": "2026-06-17 12:53:16.262520+00:00", "lang": "en", "topics": ["ai-agents", "large-language-models", "generative-ai", "ai-tools", "mlops"], "entities": ["Claude Code", "Codex", "TensorZero", "OpenAI", "Anthropic", "GPT-5.4-mini", "Claude Sonnet 4-6"], "alternates": {"html": "https://wpnews.pro/news/the-engineering-practices-claude-code-and-codex-use-to-improve-ai-agents", "markdown": "https://wpnews.pro/news/the-engineering-practices-claude-code-and-codex-use-to-improve-ai-agents.md", "text": "https://wpnews.pro/news/the-engineering-practices-claude-code-and-codex-use-to-improve-ai-agents.txt", "jsonld": "https://wpnews.pro/news/the-engineering-practices-claude-code-and-codex-use-to-improve-ai-agents.jsonld"}}