Self improving code using the agentic evaluator workflow

A developer built a multi-agent AI system where one agent writes code, a second scores it, and a third refines it in an automated loop. The pipeline uses Claude models to generate, evaluate, and improve Python scripts until they reach a minimum score of 9.6 out of 10. The system passes full history to prevent score regression and uses structured diff feedback for deterministic refinements.

I've been messing around with multi-agent AI systems recently. I had a crazy idea - what if I could get one AI agent to write code, have another score it, and a third refine it based on that score? All automatically. All in a loop. That's what I'm going to walk through here. The things I wanted to explore were: TLDR - if you just want the code it's here: https://github.com/codecowboydotio/ai-self-propagate-experiment https://github.com/codecowboydotio/ai-self-propagate-experiment I built a pipeline where Agent 1 generates a Python script, a scorer evaluates it, and a refiner improves it - round and round until the score is good enought. Once the code passes the threshold, Agent 1 writes it to a temp file and executes it as a child process. There are a few configurable constants that control the loop: MAX REFINEMENTS = 3 MIN SCORE = 9.6 If the code scores 9.6 or above out of 10, it gets accepted. Otherwise we refine, up to three times. If it still hasn't hit the bar, the script exits with a non-zero code. Agent 1 uses claude-opus-4-8 with a tight system prompt that tells it to respond only with source code - no markdown, no commentary, no backticks. response = client.messages.create model="claude-opus-4-8", max tokens=1024, system= "You are a coding agent that responds only with source code. " "Do not include any commentary, markdown, or backticks. " "Respond only with valid, self-contained Python code." , messages= {"role": "user", "content": ORIGINAL PROMPT} , agent2 code = response.content 0 .text The task I gave it is simple - write a Python script that calls Claude and asks it "What is 2 + 2?". The point isn't the task, it's the pattern. Info The generated code becomes Agent 2 - not a model, but an actual Python script that gets executed as a subprocess later. Agent 1 literally creates Agent 2. The scorer uses claude-haiku-4-5-20251001 - faster and cheaper. It receives the original prompt, the code to evaluate, and the full history of previous attempts. The history part is important. Without it, scores regress. The scorer forgets what it already rewarded and starts penalising things it previously accepted. I learned this the hard way - early runs would score something highly, then the next iteration would penalise the same thing again. Passing the full history fixes this. The scorer returns a structured diff format: Score: 8.5/10 Reason: The code is functional but missing error handling. Diff: - REMOVE: response = client.messages.create ... ADD: try:\n response = client.messages.create ... \nexcept anthropic.APIError as e:\n print f"API error: {e}" \n sys.exit 1 +1.5 Forcing exact REMOVE/ADD pairs rather than vague feedback makes the refiner's job much more deterministic. "Improve error handling" is useless. "Replace this exact line with this exact block" is not. php def score code code: str, history: list dict - tuple float, str : history text = "" for i, entry in enumerate history, 1 : history text += f"--- Attempt {i} ---\n" f"Code:\n{entry 'code' }\n" f"Your previous score and feedback:\n{entry 'feedback' }\n\n" ... When the score comes back below the threshold, the refiner kicks in - also claude-opus-4-8 . It gets the full history plus the latest structured diff and applies the changes. php def refine code history: list dict - str: refine response = client.messages.create model="claude-opus-4-8", max tokens=1024, system="You are a coding agent that responds only with source code...", messages= { "role": "user", "content": f"Original prompt:\n{ORIGINAL PROMPT}\n\n" f"History of previous attempts:\n{history text}" f"Apply the structured diff from the latest feedback to produce an improved version." , } , return refine response.content 0 .text Without the history injection, the refiner might fix one thing and accidentally break someting it doesn't know was already fixed in a previous pass. The refinement loop itself is pretty clean: refinement history: list dict = refinements = 0 while True: score, scorer text = score code agent2 code, refinement history if score = MIN SCORE: log f"Score {score}/10 — accepted." break if refinements = MAX REFINEMENTS: log f"Score {score}/10 — maximum refinements reached. Stopping." sys.exit 1 refinement history.append {"code": agent2 code, "feedback": scorer text} refinements += 1 agent2 code = refine code refinement history Each cycle: score → check threshold → refine → score again. The history list grows with every pass. Once the loop exits with an accepted score, Agent 1 writes the final code to a temp file and runs it: with tempfile.NamedTemporaryFile mode="w", suffix=".py", delete=False, dir=os.path.dirname file as f: f.write agent2 code agent2 path = f.name proc = subprocess.Popen sys.executable, agent2 path , stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, stdout, stderr = proc.communicate timeout=60 The temp file gets cleaned up in a finally block no matter what happens. stdout and stderr are both captured - if Agent 2 blows up you'll see why. Info This is what makes it genuinely agentic. Agent 1 isn't just generating code for a human to run - it's generating, scoring, refining, and executing the result itself. I used different models for different roles and I did this deliberately. The generator and refiner both use claude-opus-4-8 because they need the reasoning capacity to either produce or correctly apply a structured diff. The scorer uses claude-haiku-4-5-20251001 because scoring is cheaper work - fast and sufficient. You could swap haiku for sonnet if you want richer feedback, but I haven't found it necessary. Drop the script in a directory with a .env file containing your ANTHROPIC API KEY : pip install anthropic python-dotenv python agent1.py normal output python agent1.py --debug full verbose output The output looks something like this: === Agent 1 PID 12345 : generating Agent 2 === Score 9.8/10 — accepted. === Agent 1 PID 12345 : running Agent 2 === === Agent 2 PID 12346 === Agent 2 output: 2 + 2 equals 4. This is a simple but flexible pattern for self-improving code generation. The structured diff format, full history passing, and different model tiers for different roles are the things that make it actually work. The next step is more than likely going to be introducing a truly distributed message bus similar to the article here https://codecowboy.io/ai/autonomous-ai-agents/ https://codecowboy.io/ai/autonomous-ai-agents/ . This way, each of the agents could refine others within the network and work together toward a shared goal. I'm already thinking about implementing a shared goal deconstructor.