Scoring AI Agents: Deterministic Metrics + an LLM Judge

A developer built an evaluation framework for AI agents that combines deterministic metrics with an optional LLM judge. The framework runs agents as isolated subprocesses, scores them on five deterministic metrics like accuracy and reproducibility, and can use Claude to provide structured verdicts with actionable prompt fixes. The system also includes an automated improvement loop that mutates prompts based on identified weaknesses.

I run a lot of small autonomous agents: backend, frontend, mobile, devops, monitoring tiers, each one a prompt with a job. The moment you have more than a handful, a question gets uncomfortable: are they actually any good, and did my last prompt edit make them better or worse? "It looked fine when I tried it" doesn't scale. So I built a small evaluation framework that answers it with numbers, and then closes the loop by improving the prompts automatically. Here's how it's put together.

The core principle: measure what you can measure deterministically, and only reach for an LLM judge where you must. Deterministic metrics are free, instant, and reproducible. An LLM judge is none of those things, so it's opt-in and purely additive.

The harness runs each agent as an isolated subprocess, feeds it a fixed fixture on stdin, captures stdout, and scores the result against expected outputs. No shared state, no network, no flakiness.

    python3 harness/evaluate.py \
      --agents-dir ./agents \
      --out-dir ./out \
      --seed 42 \
      --timeout 10

That single command produces report.json, a human-readable report.txt / .html, a failures.json, and appends to history.jsonl so you can track drift over time. No SDK, no API key required.

Every agent is just a program that reads a task from stdin and writes an answer to stdout. That's the whole interface, which is exactly why subprocess isolation works.

    # agents/sample_agent/agent.py
    import sys

    def main():
        task = sys.stdin.read().strip()
        # ... the agent's real logic ...
        print(answer)

    if __name__ == "__main__":
        main()

Because the contract is a process boundary, an "agent" can be Python, a shell script, or anything that respects stdin/stdout. The harness doesn't care.
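To make the process-boundary contract concrete, here is a minimal sketch of how a harness might invoke one agent. It is not the framework's actual code: `run_agent` and `ECHO_AGENT` are hypothetical names, and the toy agent simply upper-cases its input. The only real API used is the standard library's `subprocess.run`.

```python
import subprocess
import sys


def run_agent(cmd, fixture, timeout=10.0):
    """Run one agent process and return (stdout, timed_out).

    Hypothetical helper: the fixture goes in on stdin, the answer comes
    back on stdout, and a run that exceeds `timeout` seconds is killed
    and reported as timed out.
    """
    try:
        proc = subprocess.run(
            cmd,
            input=fixture,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "", True
    return proc.stdout.strip(), False


# Toy stand-in for a real agent: reads the task, prints it upper-cased.
ECHO_AGENT = [
    sys.executable,
    "-c",
    "import sys; print(sys.stdin.read().strip().upper())",
]
```

Calling `run_agent(ECHO_AGENT, "hello\n")` returns the agent's answer and a timeout flag. Because `cmd` is just an argument list, the same helper would work for a shell script or any other executable that honors stdin/stdout.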
Each run is scored on five deterministic metrics, checked against thresholds declared in metrics.yaml:

    thresholds:
      accuracy: 0.8                   # exact normalized matches
      fuzzy_score: 0.7                # average sequence similarity (0-1)
      timeout_rate: 0.1               # fraction of runs that timed out
      safety_violations: 0            # outputs matching unsafe patterns
      reproducibility_variance: 0.05  # std-dev across repeated runs

reproducibility_variance is the one people forget. Running an agent once tells you what it did; running it several times and measuring the spread tells you whether you can trust what it did. A correct-but-nondeterministic agent is a latent bug.

Some qualities aren't string-comparable: did the agent stay in role? Did it respect its constraints? Is the output well-formed and complete? For those, an opt-in judge sends the rubric, the task, and the agent's real output to Claude and gets back a structured verdict, validated against a JSON schema so a malformed judgment can't poison the report.

    {
      "overall": 7.5,
      "dimensions": {
        "contract_adherence": 8,
        "role_fidelity": 9,
        "constraint_safety": 7,
        "output_format": 6,
        "completeness": 8
      },
      "verdict": "needs_improvement",
      "weaknesses": [
        {
          "dimension": "output_format",
          "prompt_fix": "Require a fenced JSON block in the system prompt."
        }
      ]
    }

The judge runs three ways depending on what you have: the Anthropic API (--llm-judge), the headless Claude Code CLI for subscription-only setups (--llm-judge-cli), or pre-computed verdicts from any source (--llm-verdicts). Same report either way. Identical outputs are judged once to bound cost.

The important detail: every weakness must map to a fixable line in the agent's prompt. The judge isn't there to vibe-check; it produces edits.

This is where it gets fun.
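Before moving on to the loop, the five deterministic metrics described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the framework's scoring code: `normalize`, `score`, `failed_metrics`, and the single entry in `UNSAFE_PATTERNS` are hypothetical, fuzzy similarity is assumed to be `difflib.SequenceMatcher.ratio()`, and the reproducibility spread is assumed to be the standard deviation of the per-run similarity scores.

```python
import re
from difflib import SequenceMatcher
from statistics import pstdev

# Hypothetical unsafe pattern, for illustration only.
UNSAFE_PATTERNS = [re.compile(r"rm\s+-rf\s+/")]

# Thresholds mirroring the metrics.yaml values quoted in the article.
THRESHOLDS = {
    "accuracy": 0.8,
    "fuzzy_score": 0.7,
    "timeout_rate": 0.1,
    "safety_violations": 0,
    "reproducibility_variance": 0.05,
}
# These two are minimums; every other threshold is a maximum.
HIGHER_IS_BETTER = {"accuracy", "fuzzy_score"}


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())


def score(runs, expected):
    """Score repeated runs of one fixture.

    `runs` is a list of (output, timed_out) pairs.
    """
    target = normalize(expected)
    outputs = [normalize(out) for out, _ in runs]
    sims = [SequenceMatcher(None, out, target).ratio() for out in outputs]
    n = len(runs)
    return {
        "accuracy": sum(out == target for out in outputs) / n,
        "fuzzy_score": sum(sims) / n,
        "timeout_rate": sum(timed_out for _, timed_out in runs) / n,
        "safety_violations": sum(
            any(p.search(out) for p in UNSAFE_PATTERNS) for out, _ in runs
        ),
        # Assumption: spread is measured over per-run similarity scores.
        "reproducibility_variance": pstdev(sims),
    }


def failed_metrics(metrics, thresholds=THRESHOLDS):
    """Return the names of metrics that miss their threshold."""
    failed = []
    for name, limit in thresholds.items():
        value = metrics[name]
        ok = value >= limit if name in HIGHER_IS_BETTER else value <= limit
        if not ok:
            failed.append(name)
    return failed
```

Feeding `score` three runs where one timed out with empty output shows why the spread matters: two correct answers out of three fail the accuracy threshold, and they also produce a large standard deviation, which flags the agent as untrustworthy even on the runs where it was right.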
A fail verdict and its prompt fixes land in failures.json, which feeds a GEPA-style improve loop: judge each candidate prompt per dimension, mutate the frontier candidate that owns the weakest dimension, keep a pool of candidates rather than greedily chasing one best, and write back only the best pool member. Scores and mutations are persisted to repo memory so the next run starts informed, and a nightly job commits improvements.

The diagram above shows the whole flow: inputs → harness → metrics + judge → reports → improve loop, with a feedback edge carrying mutated prompts back to re-evaluation. history.jsonl is the trend that tells you whether you're actually getting better.

The payoff is a system where I can change a prompt, run one command, and know, numerically, whether I helped or hurt, with the loop quietly fixing the easy regressions for me.
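For readers who want to see the shape of the improve loop, here is a rough sketch of one way to read it. Everything in it is an assumption rather than the framework's implementation: `Candidate`, `weakest_dimension`, and `improve` are hypothetical names, `judge` and `mutate` are caller-supplied callables standing in for the LLM judge and the prompt mutator, "the weakest dimension" is taken to be the dimension whose best score across the pool is lowest, and the pool is trimmed by mean score for simplicity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Scores = Dict[str, float]


@dataclass
class Candidate:
    prompt: str
    scores: Scores  # per-dimension judge scores

    @property
    def overall(self) -> float:
        return sum(self.scores.values()) / len(self.scores)


def weakest_dimension(pool: List[Candidate]) -> str:
    """Dimension whose best score across the whole pool is lowest."""
    dims = pool[0].scores.keys()
    return min(dims, key=lambda d: max(c.scores[d] for c in pool))


def improve(
    seed_prompt: str,
    judge: Callable[[str], Scores],
    mutate: Callable[[str, str], str],
    rounds: int = 3,
    pool_size: int = 4,
) -> Candidate:
    """Grow a pool of prompt candidates and return the best member."""
    pool = [Candidate(seed_prompt, judge(seed_prompt))]
    for _ in range(rounds):
        dim = weakest_dimension(pool)
        # Frontier owner: the candidate currently scoring best on `dim`.
        parent = max(pool, key=lambda c: c.scores[dim])
        child_prompt = mutate(parent.prompt, dim)
        pool.append(Candidate(child_prompt, judge(child_prompt)))
        # Keep several candidates instead of greedily tracking one best.
        pool = sorted(pool, key=lambda c: c.overall, reverse=True)[:pool_size]
    # Only the best pool member would be written back to the agent.
    return max(pool, key=lambda c: c.overall)
```

Trimming by mean score is the simplest policy but can evict a candidate that leads on a single dimension; a closer GEPA-style variant would keep every per-dimension leader in the pool. Persisting scores and mutations between runs is left out of the sketch.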