[ARTICLE · art-32172] src=dev.to topic=ai-agents

Scoring AI Agents: Deterministic Metrics + an LLM Judge

A developer built an evaluation framework for AI agents that combines deterministic metrics with an optional LLM judge. The framework runs agents as isolated subprocesses, scores them on five deterministic metrics like accuracy and reproducibility, and can use Claude to provide structured verdicts with actionable prompt fixes. The system also includes an automated improvement loop that mutates prompts based on identified weaknesses.

3 min read · published Jun 18, 2026

I run a lot of small autonomous agents: backend, frontend, mobile, devops, and monitoring tiers, each one a prompt with a job. The moment you have more than a handful, a question gets uncomfortable: are they actually any good, and did my last prompt edit make them better or worse? "It looked fine when I tried it" doesn't scale. So I built a small evaluation framework that answers it with numbers, and then closes the loop by improving the prompts automatically.

Here's how it's put together.

The core principle: measure what you can measure deterministically, and only reach for an LLM judge where you must. Deterministic metrics are free, instant, and reproducible. An LLM judge is none of those things, so it's opt-in and purely additive.

The harness runs each agent as an isolated subprocess, feeds it a fixed fixture on stdin, captures stdout, and scores the result against expected outputs. No shared state, no network, no flakiness.

python3 harness/evaluate.py \
  --agents-dir ./agents \
  --out-dir ./out \
  --seed 42 \
  --timeout 10

That single command produces report.json, a human-readable report.txt/.html, a failures.json, and appends to history.jsonl so you can track drift over time. No SDK, no API key required.
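To make the drift tracking concrete, here is a minimal sketch of appending one run to history.jsonl and reading back the change in a metric. The record shape (one JSON object per line, keyed by metric name) and the helper names `append_history` and `metric_delta` are assumptions for illustration; the article doesn't show the real file layout.

```python
import json

def append_history(path, record):
    """Append one run's metrics as a single JSON line (assumed record shape)."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")

def metric_delta(path, metric):
    """Return latest minus previous value of a metric, or None with fewer than two runs."""
    with open(path, encoding="utf-8") as fh:
        values = [json.loads(line)[metric] for line in fh if line.strip()]
    if len(values) < 2:
        return None
    return values[-1] - values[-2]
```

A positive delta on accuracy means the last prompt edit helped on that metric; a negative one is a regression worth looking at.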

Every agent is just a program that reads a task from stdin and writes an answer to stdout. That's the whole interface, which is exactly why subprocess isolation works.

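import sys

def solve(task):
    # Placeholder for the agent's real logic; echoing the task is only a stand-in.
    return task

def main():
    task = sys.stdin.read().strip()  # the whole task arrives on stdin
    answer = solve(task)
    print(answer)  # the answer is the only thing written to stdout

if __name__ == "__main__":
    main()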

Because the contract is a process boundary, an "agent" can be Python, a shell script, or anything that respects stdin/stdout. The harness doesn't care.
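The harness side of that contract can be sketched with the standard library alone. This is not the framework's actual code: `run_agent` and the demo agent are hypothetical names, and the sketch only shows the stdin/stdout/timeout mechanics, not network or filesystem sandboxing.

```python
import subprocess
import sys

def run_agent(cmd, fixture, timeout=10.0):
    """Run one agent command, feed the fixture on stdin, return (stdout, timed_out)."""
    try:
        proc = subprocess.run(
            cmd,
            input=fixture,        # fixture text goes to the child's stdin
            capture_output=True,  # collect stdout and stderr
            text=True,
            timeout=timeout,      # the child is killed if it runs too long
        )
    except subprocess.TimeoutExpired:
        return "", True
    return proc.stdout.strip(), False

# A stand-in agent: a one-line Python program that upper-cases its stdin.
demo_agent = [sys.executable, "-c", "import sys; print(sys.stdin.read().strip().upper())"]
output, timed_out = run_agent(demo_agent, "hello\n")
```

Because `cmd` is just an argument list, the same function works for a shell script or any other executable that honors the stdin/stdout contract.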

Each run is scored on five deterministic metrics, checked against thresholds declared in metrics.yaml:

thresholds:
  accuracy: 0.8                  # exact normalized matches
  fuzzy_score: 0.7               # average sequence similarity 0-1
  timeout_rate: 0.1              # fraction of runs that timed out
  safety_violations: 0           # outputs matching unsafe patterns
  reproducibility_variance: 0.05 # std-dev across repeated runs
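As a rough illustration of how three of these numbers could be computed, here is a sketch using `difflib.SequenceMatcher` for the similarity score. The normalization rule, the choice of `SequenceMatcher`, and the reading of accuracy and fuzzy_score as floors and timeout_rate as a ceiling are all assumptions; the article doesn't spell them out.

```python
from difflib import SequenceMatcher

def normalize(text):
    """Lower-case and collapse whitespace so trivial differences don't count as misses."""
    return " ".join(text.lower().split())

def score_run(outputs, expected, timed_out):
    """Compute accuracy, fuzzy_score and timeout_rate for one batch of fixtures."""
    pairs = [(normalize(o), normalize(e)) for o, e in zip(outputs, expected)]
    n = len(pairs)
    return {
        "accuracy": sum(o == e for o, e in pairs) / n,
        "fuzzy_score": sum(SequenceMatcher(None, o, e).ratio() for o, e in pairs) / n,
        "timeout_rate": sum(timed_out) / n,
    }

# Assumed direction of each threshold: floors must be met, ceilings must not be exceeded.
FLOORS = {"accuracy": 0.8, "fuzzy_score": 0.7}
CEILINGS = {"timeout_rate": 0.1}

def failed_thresholds(metrics):
    """Return the names of metrics that violate their threshold."""
    low = [name for name, floor in FLOORS.items() if metrics[name] < floor]
    high = [name for name, ceiling in CEILINGS.items() if metrics[name] > ceiling]
    return low + high

metrics = score_run(["Hello  World", "foo"], ["hello world", "bar"], [False, False])
```

In this toy batch one answer matches after normalization and one shares nothing with its expected output, so both accuracy and fuzzy_score fall below their floors.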

reproducibility_variance is the one people forget. Running an agent once tells you what it did; running it several times and measuring the spread tells you whether you can trust what it did. A correct-but-nondeterministic agent is a latent bug.
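One way to put a number on that spread, as a sketch: score each repeated run against the expected output and take the standard deviation of the scores. Scoring runs by sequence similarity and using the population standard deviation are assumptions here, and `reproducibility_spread` is a hypothetical name.

```python
from difflib import SequenceMatcher
from statistics import pstdev

def reproducibility_spread(outputs, expected):
    """Std-dev of per-run similarity scores for repeated runs of one fixture."""
    scores = [SequenceMatcher(None, out, expected).ratio() for out in outputs]
    return pstdev(scores)
```

An agent that returns the same answer every time scores 0.0, regardless of whether that answer is right, which is why this metric sits alongside accuracy rather than replacing it.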

Some qualities aren't string-comparable: did the agent stay in role? Did it respect its constraints? Is the output well-formed and complete? For those, an opt-in judge sends the rubric, the task, and the agent's real output to Claude and gets back a structured verdict, validated against a JSON schema so a malformed judgment can't poison the report.

{
  "overall": 7.5,
  "dimensions": {
    "contract_adherence": 8,
    "role_fidelity": 9,
    "constraint_safety": 7,
    "output_format": 6,
    "completeness": 8
  },
  "verdict": "needs_improvement",
  "weaknesses": [
    { "dimension": "output_format", "prompt_fix": "Require a fenced JSON block in the system prompt." }
  ]
}
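A hand-rolled check of that shape might look like the sketch below. The framework validates against a JSON schema; this stand-in uses plain Python to stay dependency-free, and the 0 to 10 score range and the rule that every weakness names a known dimension are assumptions inferred from the example verdict.

```python
import json

DIMENSIONS = (
    "contract_adherence", "role_fidelity", "constraint_safety",
    "output_format", "completeness",
)

def _is_score(value):
    # bool is a subclass of int, so exclude it; the 0-10 range is an assumption.
    return (isinstance(value, (int, float)) and not isinstance(value, bool)
            and 0 <= value <= 10)

def validate_verdict(raw):
    """Parse a judge reply; raise ValueError if it lacks the expected shape."""
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {exc}") from exc
    if not isinstance(verdict, dict):
        raise ValueError("verdict must be a JSON object")
    if not _is_score(verdict.get("overall")):
        raise ValueError("overall must be a number between 0 and 10")
    dims = verdict.get("dimensions")
    if not isinstance(dims, dict):
        raise ValueError("dimensions must be an object")
    for name in DIMENSIONS:
        if not _is_score(dims.get(name)):
            raise ValueError(f"missing or invalid dimension: {name}")
    if not isinstance(verdict.get("verdict"), str):
        raise ValueError("verdict must be a string")
    weaknesses = verdict.get("weaknesses", [])
    if not isinstance(weaknesses, list):
        raise ValueError("weaknesses must be a list")
    for item in weaknesses:
        if (not isinstance(item, dict) or item.get("dimension") not in DIMENSIONS
                or not item.get("prompt_fix")):
            raise ValueError("each weakness needs a known dimension and a prompt_fix")
    return verdict
```

Rejecting a bad verdict outright, rather than patching it, keeps a confused judge reply from ever reaching the report.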

The judge runs three ways depending on what you have: the Anthropic API (--llm-judge), the headless Claude Code CLI for subscription-only setups (--llm-judge-cli), or pre-computed verdicts from any source (--llm-verdicts). Same report either way. Identical outputs are judged once to bound cost.
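Judging identical outputs once can be sketched as a small cache keyed by a hash of the task and output. The key choice and the `judge_once` helper are assumptions for illustration; `judge` stands in for whichever of the three backends is in use.

```python
import hashlib

def judge_once(items, judge):
    """Call judge(task, output) once per distinct (task, output) pair.

    items is a list of (task, output) tuples; the result has one verdict per item.
    """
    cache = {}
    verdicts = []
    for task, output in items:
        # A NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
        key = hashlib.sha256(f"{task}\x00{output}".encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = judge(task, output)
        verdicts.append(cache[key])
    return verdicts
```

With repeated runs of a deterministic agent, this collapses many judge calls into one per fixture.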

The important detail: every weakness must map to a fixable line in the agent's prompt. The judge isn't there to vibe-check; it produces edits.

This is where it gets fun. A fail verdict and its prompt fixes land in failures.json, which feeds a GEPA-style improve loop: judge each candidate prompt per dimension, mutate the frontier candidate that owns the weakest dimension, keep a pool of candidates rather than greedily chasing one best, and write back only the best pool member. Scores and mutations are persisted to repo memory so the next run starts informed, and a nightly job commits improvements.
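Here is one possible reading of that loop as a sketch, not the framework's implementation. It assumes the "frontier candidate" for a dimension is the pool member scoring highest on it, that the "weakest dimension" is the one where even the frontier scores lowest, and that the best pool member is the one with the highest mean score. `judge` and `mutate` are caller-supplied stand-ins; the toy versions below exist only so the example runs.

```python
from statistics import mean

def improve(seed_prompt, judge, mutate, rounds=3):
    """Pool-based prompt search.

    judge(prompt) returns {dimension: score}; mutate(prompt, dimension)
    returns a new prompt aimed at that dimension.
    """
    pool = {seed_prompt: judge(seed_prompt)}
    for _ in range(rounds):
        dims = list(next(iter(pool.values())))
        # Frontier per dimension: the pool member with the best score on it.
        frontier = {d: max(pool, key=lambda p: pool[p][d]) for d in dims}
        # Weakest dimension: where even the frontier candidate scores lowest.
        weakest = min(dims, key=lambda d: pool[frontier[d]][d])
        child = mutate(frontier[weakest], weakest)
        if child not in pool:
            pool[child] = judge(child)  # keep the whole pool, not just one best
    # Write back only the best pool member, ranked here by mean score.
    return max(pool, key=lambda p: mean(pool[p].values()))

# Toy stand-ins: each "[dim]" tag in the prompt adds one point on that dimension.
def toy_judge(prompt):
    return {"format": 5 + prompt.count("[format]"), "safety": 7 + prompt.count("[safety]")}

def toy_mutate(prompt, dimension):
    return f"{prompt} [{dimension}]"

best = improve("base", toy_judge, toy_mutate, rounds=3)
```

Keeping the pool means a candidate that is strong on one dimension survives even when its overall score lags, so later mutations can build on it.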

The diagram above shows the whole flow: inputs → harness → (metrics + judge) → reports → improve loop, with a feedback edge carrying mutated prompts back to re-evaluation.

history.jsonl is the trend that tells you whether you're actually getting better.

The payoff is a system where I can change a prompt, run one command, and know, numerically, whether I helped or hurt, with the loop quietly fixing the easy regressions for me.
