Why I built an AI repair loop that stops after one fix on purpose

wpnews.pro

A few weeks ago I watched my AI coding agent successfully fix a bug.

The tests passed. The patch looked clean. The agent reported success. I shipped it.

Three hours later I noticed the agent had also "improved" four unrelated files in the same session. One of those improvements quietly removed a guard clause that I had spent a full evening writing two months earlier. There was no test covering that guard clause because it was the kind of thing you write defensively — the bug it prevents doesn't show up in the suite, it shows up in production at 4am.

The patch passed every check. The patch was also wrong.

This is the part of AI coding that nobody is talking about clearly. The model isn't dumb. The tests aren't broken. The CI didn't fail. The agent didn't lie. Everything in the loop reported green. And the repo was a little bit worse than before.

I've now lost count of how many times this exact pattern has played out across my projects — a multi-agent build engine, an e-commerce site, a trading bot fleet. Different stacks, different agents (Claude CLI, Codex CLI, both), same failure mode. The agent makes a "successful" patch that quietly breaks something three files over, and the only reason I catch it is because I happen to read the diff before bed.

So I built AutoMaxFix. And the core design choice is one most people would call a step backwards: it stops after one ticket.

Every AI coding tool on the market right now is racing toward more autonomy. Multi-step agents. Long-running loops. "Just describe the bug and walk away." The marketing screenshots show 50 commits done overnight while you slept.

I think this is the wrong direction. Not because autonomy is bad in principle, but because we don't have the reliability floor to justify it yet.

Here's the math that nobody runs:

If each step in an autonomous loop has a 90% chance of being correct (which is generous), then a 10-step autonomous run has a 35% chance of being fully correct. A 20-step run has a 12% chance. The error doesn't compound linearly — it compounds geometrically. And the agent doesn't know which step was wrong. It just reports green on the last one. Worse: the failures that slip through aren't random. They're systematically the ones that don't have test coverage, because if they did the agent would have caught them. So autonomous loops are biased toward producing exactly the bugs your test suite can't see.

AutoMaxFix is a Python CLI. You point it at a failing pytest run (or jest, vitest, mocha, go test, cargo test). It parses the failure output into a structured ticket. It generates a reproduction brief. It hands one ticket — one — to your local Claude CLI or Codex CLI. It validates the diff against a strict safety contract. It applies the patch only after you approve. It runs the targeted tests plus a regression sweep. It writes a report. Then it stops.

Not "stops until tomorrow." Stops. The loop is over. If there are 14 more failing tickets, that's 14 more runs you initiate when you're ready.

The reason is simple: every additional ticket inside the loop is another place where the agent can make a "successful" patch that quietly degrades the codebase. Stopping after one ticket means every patch has your eyes on it before the next one lands. It means the regression of one fix can't hide inside the success of another.

It also means AutoMaxFix is slower than every competitor. Intentionally. That's the trade.

The other choice that matters: every safety rule in AutoMaxFix is enforced before the agent sees a prompt, not asked of the agent inside one.

Patches that touch .git

, .env*

, secrets*

, .venv

, or node_modules

are rejected at validation. Diffs containing rm -rf

, sudo

, curl | bash

, wget | bash

, or package install commands are rejected. Binary patches are rejected. Mode-change-only patches are rejected. Anything exceeding the configured max_files_changed

cap is rejected. The workspace must be clean (no uncommitted changes) before apply. Ticket files carry an integrity hash verified on load. Credential-shaped strings in ticket content are redacted before write.

None of that is in a prompt. The agent literally cannot do those things because the validation layer rejects the diff before apply. Prompts get ignored. Validation doesn't.

This is the part I learned the hard way. I spent weeks early on writing increasingly elaborate prompts trying to make the agent "be careful." The agent was careful 95% of the time. The 5% was unbounded. The only fix that actually held was moving safety out of the prompt entirely and into the infrastructure.

AutoMaxFix does not call any hosted API. It drives whichever local agent CLI you already have configured. If you have a Claude or Codex subscription with CLI access, you have everything you need — no API costs, no separate billing.

It does not run unattended. There's a --yes

flag for CI use, but the default everywhere else is human approval.

It does not chain tickets. I cannot say this often enough. If you want multi-ticket autonomy, AutoMaxFix is the wrong tool. Go use a different one and good luck.

It does not install packages. It does not run shell commands. It does not touch your secrets. It is the boring, paranoid version of an AI coding tool, and I'm comfortable with that.

I'm releasing AutoMaxFix as MIT because I want the safety floor to be a thing that exists in the world, not a thing locked behind my paywall. The model where "be careful about your repo" is a premium feature is, frankly, dystopian. Every developer using AI agents should have these guardrails. Mine just happens to be the version I built for myself.

If it's useful, fork it. If it's broken, file an issue. If the design choice of "one ticket per run" sounds wrong to you, I'd genuinely like to hear why — I've been wrong before and will be again.

[https://github.com/Noumenon-ai/AutoMaxFix](https://github.com/Noumenon-ai/AutoMaxFix)

GitHub Actions composite action included. Works with pytest, jest, vitest, mocha, go test, cargo test, and a generic format for anything else.

Built because I needed it. Shipped because maybe you do too.

source & further reading

dev.to — original article How I Cut My OpenAI Bill by 97% — The Full Migration Guide My extension "worked" — it just quietly saved half of every conversation Automate AI Workflows with Qualcomm AI Runtime

Why I built an AI repair loop that stops after one fix on purpose

Run your AI side-project on zahid.host