{"slug": "auto-itera-autonomous-experimentation-engine-for-ai-engineering-decisions", "title": "Auto-itera – autonomous experimentation engine for AI engineering decisions", "summary": "Auto-itera, an autonomous experimentation engine, now automates AI engineering decisions by running controlled experiments on real production data. Users define a goal, provide candidate arms, and set pre-registered thresholds, and the system returns a defensible ship-or-kill verdict with per-slice scores and publication-quality charts within hours. The tool aims to replace subjective \"vibes\" and notebook-based evaluations with rigorous, reproducible testing that prevents overfitting and maintains trustworthiness through pre-registered metrics and a sealed test set.", "body_md": "**Autonomous experimentation engine for AI engineering decisions.**\n\n*Define a goal. Give it the candidates. Get back a defensible ship-or-kill verdict in hours — sourced from real production data, scored across arms in parallel, sprint-iterated with discipline, and signed off on a sealed test set.*\n\nEvery team shipping an LLM product has decisions like these on the table:\n\n**Prompt optimization**— does the new system prompt actually beat the current one?** Model selection**— Sonnet, Haiku, or Opus for this hop?** Retrieval strategies**— BM25, dense, or hybrid on real customer queries?** Workflow tuning**— single-call vs two-call orchestration; sync vs queued?** Architecture experiments**— does adding a router LLM help or just add latency?\n\nToday most teams answer these with vibes, eyeballed diffs, or notebooks they've quietly tuned the system against. `auto-itera`\n\nautomates the rigorous version of this work: you state the goal, hand over the candidates, and the loop runs to a verdict you can defend in a code review.\n\n**You provide (3 inputs):**\n\n**Goal**— the question you want answered (\"does prompt-v2 beat v1 on classification?\")** Candidates**— the concrete arms to compare (a baseline + 1–3 alternatives)** Threshold**— pre-registered effect size + per-slice loss floor (\"ship if ≥5pp aggregate AND no slice regresses >2pp\")\n\n`auto-itera`\n\nruns (5 autonomous stages):\n\n**Source**— sample real production data, stratify by tenant/class, split into train / dev / sealed test** Score**— run baseline + every arm in parallel on dev, with variance baseline (≥3 trials) + cross-judge sanity check** Diagnose**— per-row diffs against baseline, identify wins/losses by cluster, write a hypothesis for the next change** Iterate**— sprint of up to 3 hypothesis-driven iterations, then a** generalization gate**that strips out dev-set memorization. Continue or lock.** Verdict**— ONE pass on the sealed test set. Per-slice scores. Ship / scope narrowly / kill — with a conclusion doc and three publication-quality charts.\n\nThe output: a one-page conclusion doc embedding `arm-bar`\n\n, `forest-plot`\n\n, and `cost-vs-accuracy`\n\nfigures, plus a discipline self-audit checklist. Code is throw-away; the conclusion is what compounds.\n\n`auto-itera`\n\nautomates the experimentalexecution + iteration. The evaluationcriteriastay with you, by design.\n\n**You provide (3 pre-registered inputs):**\n\n- Candidate arms — the concrete prompts / models / strategies to compare\n- Metric + judge — what counts as \"better\" and who scores it\n- Threshold + per-slice loss floor — what counts as \"ship-worthy\"\n\n`auto-itera`\n\nauto-designs and runs (everything else):\n\n- Sampling strategy + train/dev/test splits sized to your effect threshold\n- Parallel scoring with variance baseline + cross-judge sanity check\n- Per-row diagnosis, hypothesis-driven sprints, generalization gate\n- Held-out test pass + per-slice verdict + conclusion doc\n\nThe split is deliberate, not a capability gap. **A metric the system picks for itself is a metric the system can drift toward** — letting the evaluator design its own grading rubric is how teams accidentally ship +12% benchmark wins that regress 8% in production. Pre-registration is the discipline that keeps the verdict trustworthy.\n\nThink of it as **an autonomous experiment runner, not an autonomous AI scientist that invents the hypothesis AND grades it**.\n\nThe most interesting design choice is in stage 7. Naive \"iterate until it looks better\" optimizes the dev set — every refinement that doesn't survive the held-out test is overfitting in a lab coat. A flat \"stop after 3 iterations\" rule prevents that, but it also blocks legitimate deeper exploration.\n\n`auto-itera`\n\nsplits the difference:\n\n```\niterate × up to 3\n   ↓\ngeneralization gate\n   ↓\n   ├──  every change is a universal mechanism (or got promoted to one) → start next sprint\n   ├──  dev signal saturated → lock and run the test pass\n   └──  changes were mostly \"if input X return Y\" hardcodes → kill this arm\n```\n\nThe 3 inside a sprint is a working-memory cap (humans can't reliably attribute outcomes across more than ~3 simultaneous hypothesis edits). The gate between sprints separates **principled iteration** (finding deeper mechanisms) from **dev-set memorization** (adding rules that win specific rows but won't survive the test).\n\nMost decisions converge in 1–2 sprints. Past 3 sprints, the prior shifts toward \"the gate is failing to catch dev-memorization\" — the right move is to audit the gate, not to keep iterating.\n\nAutonomy without honesty is a worse outcome than vibes-based evals — at least vibes don't pretend to be science. `auto-itera`\n\nships with 22 explicit safeguards that block the specific moves that look reasonable but contaminate the verdict.\n\nFive real failure modes, drawn from actual shipped-and-regressed AI products:\n\n| What the team would have shipped | What `auto-itera` caught |\n|---|---|\n\"Prompt v3 is +12% on eval. Ship it.\" |\nThe \"improvement\" came from rows the engineer read during debugging. Test set was contaminated. Production regressed 8%. |\n\"GPT-4o beat Claude on our 50-row eval.\" |\nEval too small. Gap was 5pp, within-arm noise was 4pp. Not a real signal. |\n\"Aggregate accuracy up 6pp across all customers.\" |\nOne major tenant slice regressed -8pp. Aggregate winners are not winners when a major slice loses. |\n\"Best-of-5 trial: 91% accuracy.\" |\nMean was 84% ± 4pp. Best-of-N is biased high by ~√log N. |\n\"New rubric finally captures what matters.\" |\nRubric was rewritten after seeing scores. It happened to favor the arm they wanted to win. |\n\nEach one is a specific anti-pattern with a specific safeguard. Other guards include held-out test sealing, pre-registered metrics, variance-floor noise checks, the generalization gate, cross-judge sanity checks, and a per-slice loss floor. The full list — plus a \"Common Rationalizations\" table cataloging the excuses engineers reach for in the moment — lives in [SKILL.md](/clfhaha1234/auto-itera/blob/main/SKILL.md).\n\nSeveral solid tools exist for *running* evals — [Promptfoo](https://promptfoo.dev), [Inspect](https://inspect.aisi.org.uk), [LangSmith Evals](https://docs.smith.langchain.com), [OpenAI Evals](https://github.com/openai/evals). They give you the numbers. `auto-itera`\n\nis built for what comes *after* the numbers — the discipline that prevents the numbers from lying to you.\n\nWhat's unique to `auto-itera` |\nWhy it matters |\n|---|---|\nHeld-out test sealed + metric pre-registered by default |\n\"saw the score, edited the metric, re-ran\" is structurally forbidden, not just discouraged |\nSprint + generalization gate between iteration rounds |\nstrips dev-set memorization before each new sprint; iter-3 hardcodes get rejected before they reach the test set |\nPer-slice loss floor |\naggregate winner that regresses a major tenant slice is rejected automatically — no aggregate-winner-ships-and-quietly-breaks-SMB story |\nOne-shot test pass |\nthe sealed test set opens ONCE; conclusion doc + 3 charts + discipline self-audit is the output |\nRuns inside Claude Code |\nno separate CLI / dashboard to maintain; `git clone` and ask a question |\n\n**The other tools all support pieces of this opt-in.** `auto-itera`\n\n's value is making the discipline the default — and refusing to let you skip it mid-flight when the dev-set scores look exciting.\n\nIf you need a hosted eval dashboard with prompts in a UI, use LangSmith. If you need pre-built safety benchmarks, use Inspect. **If you need a defensible ship-or-kill verdict on real production data in the next afternoon, that's auto-itera's lane.**\n\nA complete worked example lives at [ examples/prompt-tuning-classifier/](/clfhaha1234/auto-itera/blob/main/examples/prompt-tuning-classifier) — three figures rendered from one\n\n`data.json`\n\n, telling one coherent teaching story.\n\nStage 5— both candidate arms beat baseline by +8.9pp, well above the 2× variance noise floor.\n\nStage 8— both arms clear the aggregate threshold. But`v3`\n\nregresses SMB tenants by -3.3pp, crossing the pre-registered loss floor.Aggregate winner ≠ winner.\n\nCost view—`v2`\n\nis the Pareto move: +8.9pp at +$0.60 / 1k rows. Ship`v2`\n\n. Kill`v3`\n\n.\n\nTeams **shipping LLM products** who need decisions that survive production:\n\n- Choosing between models when cost and accuracy both matter\n- Validating a prompt change actually helps before deploying it\n- Comparing retrieval strategies on real customer queries\n- Auditing whether your eval methodology is biased\n\nIf you're a solo dev experimenting in a notebook, you can skip this. If you have customers depending on whether your AI decisions are right, you can't.\n\n```\ngit clone https://github.com/clfhaha1234/auto-itera.git ~/.claude/skills/auto-itera\n```\n\nThat's it. `auto-itera`\n\nis now a Claude Code skill. No other dependencies until you want to render charts standalone.\n\nThe format that triggers the skill cleanly: state the **goal**, spell out the **candidates** (baseline + 1–3 alternatives), and **pre-register the threshold** before any data is sampled.\n\nI want to evaluate two versions of my classifier system prompt.\n\nBaseline (`prompt-v1`\n\n): current production prompt at`src/classify/processor.ts:42`\n\n.Candidate (`prompt-v2`\n\n): same prompt + an extra sentence: \"When the input mentions a payment processor (Stripe, PayPal, Square), classify as 'fees' not 'transfer'.\"\n\nI have ~200 real production rows in`data/classifier-eval.csv`\n\nwith human-labeled ground truth in the`expected_label`\n\ncolumn. About 60% are enterprise tenants and 40% SMB.\n\nShip criterion: ≥+5pp aggregate accuracy AND no per-tenant slice regresses more than -2pp.\n\n`auto-itera`\n\ndoes the rest:\n\n**Writes the Phase 0 frame** as a markdown table — your question, the two arms, the metric, the threshold. Asks ONCE if anything's wrong; otherwise locks it.**Splits your 200 rows** stratified by tenant tier — ~60 train / ~100 dev / ~40 sealed test. Prints a distribution audit so the split matches your prod traffic mix.**Runs a 1-row pilot** to catch instrumentation bugs before the full run wastes 5 minutes.**Scores baseline + v2 in parallel** on dev. Reports aggregate + per-slice + variance (≥3 trials per arm). Cross-judge sanity check on 5 rows with a 2nd-family model.**Sprint, if needed**— up to 3 iterations of \"diagnose per-row → hypothesis → tweak the arm → re-score\". Generalization gate strips dev-set memorization. Continue or lock.**One pass on the sealed test set.** Per-slice scores. Verdict: ship`v2`\n\n, ship narrowly to one slice, or kill.**Writes the conclusion doc** to`docs/experiments/YYYY-MM-DD-<topic>/`\n\n— three publication-quality charts embedded, full discipline self-audit checklist signed off.\n\nWall clock on a 200-row dataset: typically **5–15 minutes**. Most of that is LLM scoring; Claude's thinking is a tiny fraction.\n\n- No API keys to configure (uses whatever Claude Code already has)\n- No separate dashboard / hosted service\n- No YAML config files\n- No Python (unless you want to render charts standalone — see below)\n\nIf you want to regenerate the three figures from an already-completed experiment's `data.json`\n\n:\n\n```\nuv run scripts/chart.py arm-bar          --data data.json --out charts/arm-bar.png\nuv run scripts/chart.py forest-plot      --data data.json --out charts/forest-plot.png\nuv run scripts/chart.py cost-vs-accuracy --data data.json --out charts/cost-vs-accuracy.png\n```\n\nRe-render the loop diagram in this README (if you fork and want a customized version):\n\n```\nuv run scripts/render-loop-diagram.py --out docs/images/autonomous-loop.png\n```\n\nPEP 723 inline deps — `uv run`\n\nprovisions matplotlib automatically. `python3 scripts/*.py …`\n\nalso works if matplotlib is already installed.\n\n**What is auto-itera actually doing?**\nRunning the experimental discipline that AI teams know they should follow but skip because it's tedious. You hand it a goal + candidate arms + success threshold; it sources real production data, scores baseline + arms in parallel, diagnoses per-row, sprints with a generalization gate between rounds, picks the latest gate-passed iter, and runs the held-out test pass. The 22 safeguards make the discipline mechanical; the charts are a side effect; the verdict is the product.\n\n**Why \"autonomous experimentation\" and not \"AI scientist\"?**\nBecause the skill autonomously runs experiments — it does not autonomously invent hypotheses or brainstorm what to test. The human (or Claude in conversation with the human) supplies the candidate arms; `auto-itera`\n\nruns the loop from there to verdict. Calling it an \"AI scientist\" would oversell what it does and undersell what it does well: turning a vague \"should we use X or Y?\" question into a defensible verdict in hours.\n\n**Why doesn't auto-itera pick its own metric?**\nBy design. A metric the system picks for itself is a metric the system can drift toward — and \"the model that scored highest on the metric the model designed\" is exactly how teams accidentally ship +12% benchmark wins that regress 8% in production. Pre-registering the metric, threshold, and judge rubric BEFORE any data is sampled is the discipline that keeps the verdict trustworthy. The skill auto-designs the *experimental execution* (sampling, splits, variance trials, slicing, gating); you lock the *evaluation criteria* upfront. Splitting these responsibilities is what separates rigorous evaluation from automated self-confirmation.\n\n**Does this work with non-Claude models?**\nYes. The skill is model-agnostic — it tells Claude Code what discipline to enforce, but the arms you compare can be any models, any providers, any prompt / retrieval / architecture variants.\n\n**Why a 3-iteration sprint cap, not a hard \"stop at 3\" rule?**\n3 is a working-memory cap — humans can't reliably attribute outcomes across more than ~3 simultaneous hypothesis edits inside a single sprint. But iteration itself isn't the enemy; *un-audited* iteration is. After each sprint, the generalization gate strips out dev-set memorization. If the gate passes and dev signal isn't saturated, you start a fresh sprint. Most decisions converge in 1–2 sprints. Past 3 sprints, the prior shifts toward \"the gate is failing to catch dev-memorization\" — and the right move is to audit the gate, not to keep iterating.\n\n**How is this different from a Jupyter notebook with matplotlib?**\nA notebook lets you do anything — including all the things that quietly bias your conclusion. `auto-itera`\n\nmakes the discipline mechanical: forbids peeking at test, forbids picking best-of-N on test, forbids moving the metric after the score, forbids skipping the generalization gate between sprints. The discipline is what you can't get from a notebook.\n\n**Can I use this on experiments I've already run?**\nUse it on the *next* decision. For already-run experiments where the test set was peeked at during debugging, the test set is contaminated for that question — reseal from fresh production data before running `auto-itera`\n\non it.\n\n**Does the chart helper require Python?**\nYes. `scripts/chart.py`\n\nand `scripts/render-loop-diagram.py`\n\nuse matplotlib via a PEP 723 inline-deps header — `uv run`\n\nprovisions an isolated env automatically; `python3`\n\nworks as fallback if matplotlib is already installed. The skill itself (the discipline + conclusion doc) works without Python.\n\nIssues and PRs welcome. The highest-value contribution is a new entry to the *Common Rationalizations* table in [SKILL.md](/clfhaha1234/auto-itera/blob/main/SKILL.md) from a real experience — the kind that ends *\"…and we shipped it, and then production regressed.\"* Those are the stories the skill exists to prevent.", "url": "https://wpnews.pro/news/auto-itera-autonomous-experimentation-engine-for-ai-engineering-decisions", "canonical_source": "https://github.com/clfhaha1234/auto-itera", "published_at": "2026-05-27 17:04:00+00:00", "updated_at": "2026-05-27 17:16:25.700879+00:00", "lang": "en", "topics": ["ai-products", "ai-tools", "machine-learning", "large-language-models", "mlops"], "entities": ["auto-itera", "Sonnet", "Haiku", "Opus", "BM25"], "alternates": {"html": "https://wpnews.pro/news/auto-itera-autonomous-experimentation-engine-for-ai-engineering-decisions", "markdown": "https://wpnews.pro/news/auto-itera-autonomous-experimentation-engine-for-ai-engineering-decisions.md", "text": "https://wpnews.pro/news/auto-itera-autonomous-experimentation-engine-for-ai-engineering-decisions.txt", "jsonld": "https://wpnews.pro/news/auto-itera-autonomous-experimentation-engine-for-ai-engineering-decisions.jsonld"}}