Scoring AI Agents: Deterministic Metrics + an LLM Judge




A developer built an evaluation framework for AI agents that combines deterministic metrics with an optional LLM judge. The framework runs agents as isolated subprocesses, scores them on five deterministic metrics like accuracy and reproducibility, and can use Claude to provide structured verdicts with actionable prompt fixes. The system also includes an automated improvement loop that mutates prompts based on identified weaknesses.

3 min read · published Jun 18, 2026

I run a lot of small autonomous agents — backend, frontend, mobile, devops, monitoring tiers, each one a prompt with a job. The moment you have more than a handful, a question gets uncomfortable: are they actually any good, and did my last prompt edit make them better or worse? "It looked fine when I tried it" doesn't scale. So I built a small evaluation framework that answers it with numbers, and then closes the loop by improving the prompts automatically.

Here's how it's put together.

The core principle: measure what you can measure deterministically, and only reach for an LLM judge where you must. Deterministic metrics are free, instant, and reproducible. An LLM judge is none of those things — so it's opt-in and purely additive.

The harness runs each agent as an isolated subprocess, feeds it a fixed fixture on stdin, captures stdout, and scores the result against expected outputs. No shared state, no network, no flakiness.

python3 harness/evaluate.py \
  --agents-dir ./agents \
  --out-dir ./out \
  --seed 42 \
  --timeout 10

That single command produces report.json, a human-readable report.txt/.html, a failures.json, and appends to history.jsonl so you can track drift over time. No SDK, no API key required.
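To make the drift tracking concrete, here is a minimal sketch of an append-only history file. The record fields (`timestamp`, `accuracy`) and the helper names `append_run` and `accuracy_delta` are assumptions for illustration; the article does not show the actual layout of history.jsonl.

```python
import json
import time
from pathlib import Path


def append_run(history_path, metrics):
    """Append one run's metrics as a single JSON line (hypothetical record shape)."""
    record = {"timestamp": time.time(), **metrics}
    with Path(history_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def accuracy_delta(history_path):
    """Return the latest accuracy minus the previous run's, or None with fewer than 2 runs."""
    lines = Path(history_path).read_text(encoding="utf-8").splitlines()
    runs = [json.loads(line) for line in lines if line.strip()]
    if len(runs) < 2:
        return None
    return runs[-1]["accuracy"] - runs[-2]["accuracy"]
```

One JSON object per line keeps every write a pure append, so a nightly job can add a record without rewriting earlier ones.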

Every agent is just a program that reads a task from stdin and writes an answer to stdout. That's the whole interface — which is exactly why subprocess isolation works.

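import sys

def main():
    # Read the whole task from stdin; the harness supplies the fixture here.
    task = sys.stdin.read().strip()
    # Placeholder logic (the original elides this step): echo the task back.
    answer = task
    # Write the answer to stdout, where the harness captures it.
    print(answer)

if __name__ == "__main__":
    main()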

Because the contract is a process boundary, an "agent" can be Python, a shell script, or anything that respects stdin/stdout. The harness doesn't care.
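The harness side of that contract might look roughly like the sketch below. The function name `run_agent` and its return shape are assumptions, not the author's code; it only shows the stdin-in, stdout-out, timeout-bounded call that the process boundary makes possible.

```python
import subprocess


def run_agent(cmd, task, timeout=10.0):
    """Run one agent process and return (stdout, timed_out).

    cmd is an argv list, so the agent can be any executable.
    """
    try:
        proc = subprocess.run(
            cmd,
            input=task,           # the fixture goes in on stdin
            capture_output=True,  # stdout and stderr are captured, not inherited
            text=True,
            timeout=timeout,      # the child is killed if it overruns
        )
    except subprocess.TimeoutExpired:
        return "", True
    return proc.stdout.strip(), False
```

A call such as `run_agent(["python3", "agents/backend.py"], "some task")` (the path is hypothetical) would return the agent's answer and a flag that can feed a timeout metric.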

Each run is scored on five deterministic metrics, checked against thresholds declared in metrics.yaml:

thresholds:
  accuracy: 0.8                  # exact normalized matches
  fuzzy_score: 0.7               # average sequence similarity 0-1
  timeout_rate: 0.1              # fraction of runs that timed out
  safety_violations: 0           # outputs matching unsafe patterns
  reproducibility_variance: 0.05 # std-dev across repeated runs
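As a rough illustration of how four of these numbers could be computed and checked, here is a sketch built on the standard library. The normalization rule, the unsafe patterns, and the assumption that accuracy and fuzzy_score are minimums while the other thresholds are maximums are all guesses; the article names the metrics but does not show their implementation.

```python
import re
from difflib import SequenceMatcher

# Hypothetical unsafe patterns; the article does not list the real ones.
UNSAFE_PATTERNS = [re.compile(r"rm\s+-rf\s+/"), re.compile(r"(?i)api[_-]?key\s*=")]

# Assumed direction: these two are floors, every other threshold is a ceiling.
FLOORS = {"accuracy", "fuzzy_score"}


def normalize(text):
    """Lowercase and collapse whitespace before comparing (assumed normalization)."""
    return " ".join(text.lower().split())


def score(results):
    """results: non-empty list of (output, expected, timed_out) tuples."""
    n = len(results)
    pairs = [(normalize(out), normalize(exp)) for out, exp, _ in results]
    return {
        "accuracy": sum(o == e for o, e in pairs) / n,
        "fuzzy_score": sum(SequenceMatcher(None, o, e).ratio() for o, e in pairs) / n,
        "timeout_rate": sum(t for _, _, t in results) / n,
        "safety_violations": sum(
            any(p.search(out) for p in UNSAFE_PATTERNS) for out, _, _ in results
        ),
    }


def failed_thresholds(metrics, thresholds):
    """Return the names of the metrics that miss their threshold."""
    failed = []
    for name, limit in thresholds.items():
        if name not in metrics:
            continue  # e.g. a metric computed elsewhere
        ok = metrics[name] >= limit if name in FLOORS else metrics[name] <= limit
        if not ok:
            failed.append(name)
    return failed
```

`difflib.SequenceMatcher.ratio()` returns a similarity between 0 and 1, which matches the range given in the YAML comment for fuzzy_score.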

reproducibility_variance is the one people forget. Running an agent once tells you what it did; running it several times and measuring the spread tells you whether you can trust what it did. A correct-but-nondeterministic agent is a latent bug.
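A minimal sketch of that spread measurement follows. It assumes the quantity being compared is one accuracy value per repeated run over the same fixtures; the article only says the metric is a standard deviation across repeated runs, so the input shape here is a guess.

```python
import statistics


def reproducibility_variance(run_accuracies):
    """Population std-dev of a per-run score across repeated runs of the same fixtures."""
    if len(run_accuracies) < 2:
        return 0.0  # a single run has no measurable spread
    return statistics.pstdev(run_accuracies)
```

A fully deterministic agent scores 0.0 here, while an agent that alternates between all-correct and all-wrong runs scores 0.5, far above the 0.05 threshold.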

Some qualities aren't string-comparable: did the agent stay in role? Did it respect its constraints? Is the output well-formed and complete? For those, an opt-in judge sends the rubric, the task, and the agent's real output to Claude and gets back a structured verdict — validated against a JSON schema so a malformed judgment can't poison the report.

{
  "overall": 7.5,
  "dimensions": {
    "contract_adherence": 8,
    "role_fidelity": 9,
    "constraint_safety": 7,
    "output_format": 6,
    "completeness": 8
  },
  "verdict": "needs_improvement",
  "weaknesses": [
    { "dimension": "output_format", "prompt_fix": "Require a fenced JSON block in the system prompt." }
  ]
}
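The validation step could be approximated with plain Python checks, as sketched below. This is a hand-rolled stand-in for a JSON Schema validator, not the author's schema: the 0 to 10 score range and the "pass" verdict value are assumptions, while the dimension names and the other two verdict values come from the article.

```python
DIMENSIONS = {
    "contract_adherence", "role_fidelity", "constraint_safety",
    "output_format", "completeness",
}
# "needs_improvement" and "fail" appear in the article; "pass" is an assumption.
VERDICTS = {"pass", "needs_improvement", "fail"}


def _is_score(value):
    """True for a real number in the assumed 0-10 range (booleans rejected)."""
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    return is_number and 0 <= value <= 10


def _is_weakness(item):
    """True if item names a known dimension and carries a non-empty prompt_fix."""
    if not isinstance(item, dict):
        return False
    dim, fix = item.get("dimension"), item.get("prompt_fix")
    return isinstance(dim, str) and dim in DIMENSIONS and isinstance(fix, str) and bool(fix.strip())


def validation_errors(verdict):
    """Return a list of problems; an empty list means the verdict is usable."""
    if not isinstance(verdict, dict):
        return ["verdict is not an object"]
    errors = []
    if not _is_score(verdict.get("overall")):
        errors.append("overall must be a number in [0, 10]")
    dims = verdict.get("dimensions")
    if not isinstance(dims, dict) or set(dims) != DIMENSIONS:
        errors.append("dimensions must contain exactly the five rubric keys")
    elif not all(_is_score(v) for v in dims.values()):
        errors.append("every dimension score must be a number in [0, 10]")
    label = verdict.get("verdict")
    if not isinstance(label, str) or label not in VERDICTS:
        errors.append("unknown verdict value")
    weaknesses = verdict.get("weaknesses", [])
    if not isinstance(weaknesses, list) or not all(_is_weakness(w) for w in weaknesses):
        errors.append("each weakness needs a known dimension and a prompt_fix")
    return errors
```

Returning a list of problems rather than raising lets the harness record why a judgment was discarded and keep the rest of the report intact.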

The judge runs three ways depending on what you have: the Anthropic API (--llm-judge), the headless Claude Code CLI for subscription-only setups (--llm-judge-cli), or pre-computed verdicts from any source (--llm-verdicts). Same report either way. Identical outputs are judged once to bound cost.
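One way to judge identical outputs only once is a small content-addressed cache, sketched here. The class name `JudgeCache`, the choice to key on rubric, task, and output together, and the injected `judge_fn` callable are assumptions; the article does not say how deduplication is implemented.

```python
import hashlib


class JudgeCache:
    """Call the judge at most once per distinct (rubric, task, output) triple."""

    def __init__(self, judge_fn):
        self._judge_fn = judge_fn  # any callable: (rubric, task, output) -> verdict
        self._verdicts = {}
        self.calls = 0             # number of real judge invocations

    def judge(self, rubric, task, output):
        # NUL-separate the fields so ("ab", "c") and ("a", "bc") hash differently.
        joined = "\x00".join((rubric, task, output))
        key = hashlib.sha256(joined.encode("utf-8")).hexdigest()
        if key not in self._verdicts:
            self._verdicts[key] = self._judge_fn(rubric, task, output)
            self.calls += 1
        return self._verdicts[key]
```

Because the judge is passed in as a callable, the same cache can wrap an API call, a CLI call, or a lookup into pre-computed verdicts.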

The important detail: every weakness must map to a fixable line in the agent's prompt. The judge isn't there to vibe-check; it produces edits.

This is where it gets fun. A fail verdict and its prompt fixes land in failures.json, which feeds a GEPA-style improve loop: judge each candidate prompt per dimension, mutate the frontier candidate that owns the weakest dimension, keep a pool of candidates rather than greedily chasing one best, and write back only the best pool member. Scores and mutations are persisted to repo memory so the next run starts informed, and a nightly job commits improvements.
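A single iteration of such a loop might be sketched as follows. This is one reading of the description above, not the GEPA algorithm itself or the author's code: `judge` and `mutate` are caller-supplied stand-ins for the LLM judge and the prompt mutator, "owns a dimension" is taken to mean "scores highest on it within the pool", and the best member is chosen by mean score.

```python
from statistics import mean


def improve_step(pool, judge, mutate):
    """Run one iteration over a pool of prompt strings.

    judge(prompt) -> {dimension: score}; mutate(prompt, dimension) -> new prompt.
    Returns the (possibly grown) pool and the dimension that was targeted.
    """
    scores = {prompt: judge(prompt) for prompt in pool}
    dimensions = list(next(iter(scores.values())))
    # The owner of a dimension is the pool member that scores best on it.
    owner = {d: max(pool, key=lambda p: scores[p][d]) for d in dimensions}
    # The weakest dimension is where even the best candidate scores lowest.
    weakest = min(dimensions, key=lambda d: scores[owner[d]][d])
    child = mutate(owner[weakest], weakest)
    # Keep every candidate instead of replacing the parent greedily.
    new_pool = pool if child in pool else pool + [child]
    return new_pool, weakest


def best_member(pool, judge):
    """Pick the candidate to write back: highest mean score across dimensions."""
    return max(pool, key=lambda p: mean(judge(p).values()))
```

Keeping the parent in the pool means a mutation that helps one dimension but hurts another does not erase the earlier candidate; only `best_member` decides what is written back.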

The diagram above shows the whole flow: inputs → harness → (metrics + judge) → reports → improve loop, with a feedback edge carrying mutated prompts back to re-evaluation.

history.jsonl is the trend that tells you whether you're actually getting better.

The payoff is a system where I can change a prompt, run one command, and know, numerically, whether I helped or hurt, with the loop quietly fixing the easy regressions for me.


Read original on dev.to → dev.to/pponali/scoring-ai-agents-deterministic-m…
