{"slug": "self-improving-code-using-the-agentic-evaluator-workflow", "title": "Self improving code using the agentic evaluator workflow", "summary": "A developer built a multi-agent AI system where one agent writes code, a second scores it, and a third refines it in an automated loop. The pipeline uses Claude models to generate, evaluate, and improve Python scripts until they reach a minimum score of 9.6 out of 10. The system passes full history to prevent score regression and uses structured diff feedback for deterministic refinements.", "body_md": "I've been messing around with multi-agent AI systems recently. I had a crazy idea - what if I could get one AI agent to write code, have another score it, and a third refine it based on that score? All automatically. All in a loop.\n\nThat's what I'm going to walk through here.\n\nThe things I wanted to explore were:\n\nTLDR - if you just want the code it's here: [https://github.com/codecowboydotio/ai-self-propagate-experiment](https://github.com/codecowboydotio/ai-self-propagate-experiment)\n\nI built a pipeline where Agent 1 generates a Python script, a scorer evaluates it, and a refiner improves it - round and round until the score is good enough. Once the code passes the threshold, Agent 1 writes it to a temp file and executes it as a child process.\n\nThere are a few configurable constants that control the loop:\n\n```\nMAX_REFINEMENTS = 3\nMIN_SCORE = 9.6\n```\n\nIf the code scores 9.6 or above out of 10, it gets accepted. Otherwise we refine, up to three times. If it still hasn't hit the bar, the script exits with a non-zero exit code.\n\nAgent 1 uses `claude-opus-4-8` with a tight system prompt that tells it to respond only with source code - no markdown, no commentary, no backticks.\n\n```\nresponse = client.messages.create(\n    model=\"claude-opus-4-8\",\n    max_tokens=1024,\n    system=(\n        \"You are a coding agent that responds only with source code. 
\"\n        \"Do not include any commentary, markdown, or backticks. \"\n        \"Respond only with valid, self-contained Python code.\"\n    ),\n    messages=[{\"role\": \"user\", \"content\": ORIGINAL_PROMPT}],\n)\nagent2_code = response.content[0].text\n```\n\nThe task I gave it is simple - write a Python script that calls Claude and asks it \"What is 2 + 2?\". The point isn't the task, it's the pattern.\n\nInfo\n\nThe generated code becomes Agent 2 - not a model, but an actual Python script that gets executed as a subprocess later. Agent 1 literally creates Agent 2.\n\nThe scorer uses `claude-haiku-4-5-20251001` - faster and cheaper. It receives the original prompt, the code to evaluate, and the full history of previous attempts.\n\nThe history part is important. Without it, scores regress. The scorer forgets what it already rewarded and starts penalising things it previously accepted. I learned this the hard way - early runs would score something highly, then the next iteration would penalise the same thing again. Passing the full history fixes this.\n\nThe scorer returns a structured diff format:\n\n```\nScore: 8.5/10\nReason: The code is functional but missing error handling.\nDiff:\n- REMOVE: response = client.messages.create(...)\n  ADD: try:\\n    response = client.messages.create(...)\\nexcept anthropic.APIError as e:\\n    print(f\"API error: {e}\")\\n    sys.exit(1) (+1.5)\n```\n\nForcing exact `REMOVE/ADD` pairs rather than vague feedback makes the refiner's job much more deterministic. \"Improve error handling\" is useless. 
\"Replace this exact line with this exact block\" is not.\n\n```python\ndef score_code(code: str, history: list[dict]) -> tuple[float, str]:\n    history_text = \"\"\n    for i, entry in enumerate(history, 1):\n        history_text += (\n            f\"--- Attempt {i} ---\\n\"\n            f\"Code:\\n{entry['code']}\\n\"\n            f\"Your previous score and feedback:\\n{entry['feedback']}\\n\\n\"\n        )\n    # ...\n```\n\nWhen the score comes back below the threshold, the refiner kicks in - also `claude-opus-4-8`. It gets the full history plus the latest structured diff and applies the changes.\n\n```python\ndef refine_code(history: list[dict]) -> str:\n    # Rebuild the attempt history as text (assumed here to mirror score_code)\n    history_text = \"\"\n    for i, entry in enumerate(history, 1):\n        history_text += (\n            f\"--- Attempt {i} ---\\n\"\n            f\"Code:\\n{entry['code']}\\n\"\n            f\"Scorer feedback:\\n{entry['feedback']}\\n\\n\"\n        )\n    refine_response = client.messages.create(\n        model=\"claude-opus-4-8\",\n        max_tokens=1024,\n        system=\"You are a coding agent that responds only with source code...\",\n        messages=[{\n            \"role\": \"user\",\n            \"content\": (\n                f\"Original prompt:\\n{ORIGINAL_PROMPT}\\n\\n\"\n                f\"History of previous attempts:\\n{history_text}\"\n                \"Apply the structured diff from the latest feedback to produce an improved version.\"\n            ),\n        }],\n    )\n    return refine_response.content[0].text\n```\n\nWithout the history injection, the refiner might fix one thing and accidentally break something it doesn't know was already fixed in a previous pass.\n\nThe refinement loop itself is pretty clean:\n\n```\nrefinement_history: list[dict] = []\nrefinements = 0\n\nwhile True:\n    score, scorer_text = score_code(agent2_code, refinement_history)\n\n    if score >= MIN_SCORE:\n        log(f\"Score {score}/10 — accepted.\")\n        break\n\n    if refinements >= MAX_REFINEMENTS:\n        log(f\"Score {score}/10 — maximum refinements reached. 
Stopping.\")\n        sys.exit(1)\n\n    refinement_history.append({\"code\": agent2_code, \"feedback\": scorer_text})\n    refinements += 1\n    agent2_code = refine_code(refinement_history)\n```\n\nEach cycle: score → check threshold → refine → score again. The history list grows with every pass.\n\nOnce the loop exits with an accepted score, Agent 1 writes the final code to a temp file and runs it:\n\n```\nwith tempfile.NamedTemporaryFile(\n    mode=\"w\", suffix=\".py\", delete=False, dir=os.path.dirname(__file__)\n) as f:\n    f.write(agent2_code)\n    agent2_path = f.name\n\nproc = subprocess.Popen(\n    [sys.executable, agent2_path],\n    stdout=subprocess.PIPE,\n    stderr=subprocess.PIPE,\n    text=True,\n)\nstdout, stderr = proc.communicate(timeout=60)\n```\n\nThe temp file gets cleaned up in a `finally` block no matter what happens. Both stdout and stderr are captured - if Agent 2 blows up you'll see why.\n\nInfo\n\nThis is what makes it genuinely agentic. Agent 1 isn't just generating code for a human to run - it's generating, scoring, refining, and executing the result itself.\n\nI deliberately used different models for different roles. The generator and refiner both use `claude-opus-4-8` because they need the reasoning capacity to either produce the code or correctly apply a structured diff. The scorer uses `claude-haiku-4-5-20251001` because scoring is cheaper work - fast and sufficient. 
You could swap Haiku for Sonnet if you want richer feedback, but I haven't found it necessary.\n\nDrop the script in a directory with a `.env` file containing your `ANTHROPIC_API_KEY`:\n\n```\npip install anthropic python-dotenv\npython agent1.py          # normal output\npython agent1.py --debug  # full verbose output\n```\n\nThe output looks something like this:\n\n```\n=== Agent 1 (PID 12345): generating Agent 2 ===\nScore 9.8/10 — accepted.\n=== Agent 1 (PID 12345): running Agent 2 ===\n=== Agent 2 (PID 12346) ===\nAgent 2 output: 2 + 2 equals 4.\n```\n\nThis is a simple but flexible pattern for self-improving code generation. The structured diff format, full history passing, and different model tiers for different roles are the things that make it actually work.\n\nThe next step will most likely be introducing a truly distributed message bus, similar to the one described in this article: [https://codecowboy.io/ai/autonomous-ai-agents/](https://codecowboy.io/ai/autonomous-ai-agents/). 
This way, each of the agents could refine others within the network and work together toward a shared goal.\n\nI'm already thinking about implementing a shared goal deconstructor.", "url": "https://wpnews.pro/news/self-improving-code-using-the-agentic-evaluator-workflow", "canonical_source": "https://dev.to/codecowboydotio/self-improving-code-using-the-agentic-evaluator-workflow-1i3i", "published_at": "2026-06-30 01:18:41+00:00", "updated_at": "2026-06-30 01:48:35.489238+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-agents", "generative-ai", "developer-tools"], "entities": ["Claude", "Anthropic", "codecowboydotio", "claude-opus-4-8", "claude-haiku-4-5-20251001"], "alternates": {"html": "https://wpnews.pro/news/self-improving-code-using-the-agentic-evaluator-workflow", "markdown": "https://wpnews.pro/news/self-improving-code-using-the-agentic-evaluator-workflow.md", "text": "https://wpnews.pro/news/self-improving-code-using-the-agentic-evaluator-workflow.txt", "jsonld": "https://wpnews.pro/news/self-improving-code-using-the-agentic-evaluator-workflow.jsonld"}}