{"slug": "scoring-ai-agents-deterministic-metrics-an-llm-judge", "title": "Scoring AI Agents: Deterministic Metrics + an LLM Judge", "summary": "A developer built an evaluation framework for AI agents that combines deterministic metrics with an optional LLM judge. The framework runs agents as isolated subprocesses, scores them on five deterministic metrics like accuracy and reproducibility, and can use Claude to provide structured verdicts with actionable prompt fixes. The system also includes an automated improvement loop that mutates prompts based on identified weaknesses.", "body_md": "I run a lot of small autonomous agents — backend, frontend, mobile, devops, monitoring tiers, each one a prompt with a job. The moment you have more than a handful, a question gets uncomfortable: *are they actually any good, and did my last prompt edit make them better or worse?* \"It looked fine when I tried it\" doesn't scale. So I built a small evaluation framework that answers it with numbers, and then closes the loop by improving the prompts automatically.\n\nHere's how it's put together.\n\nThe core principle: **measure what you can measure deterministically, and only reach for an LLM judge where you must.** Deterministic metrics are free, instant, and reproducible. An LLM judge is none of those things — so it's opt-in and purely additive.\n\nThe harness runs each agent as an **isolated subprocess**, feeds it a fixed fixture on stdin, captures stdout, and scores the result against expected outputs. No shared state, no network, no flakiness.\n\n```\npython3 harness/evaluate.py \\\n  --agents-dir ./agents \\\n  --out-dir ./out \\\n  --seed 42 \\\n  --timeout 10\n```\n\nThat single command produces `report.json`, a human-readable `report.txt`/`.html`, a `failures.json`, and appends to `history.jsonl` so you can track drift over time. No SDK, no API key required.\n\nEvery agent is just a program that reads a task from stdin and writes an answer to stdout. 
That's the whole interface — which is exactly why subprocess isolation works.\n\n``` python\n# agents/sample_agent/agent.py\nimport sys\n\ndef main():\n    task = sys.stdin.read().strip()\n    # ... the agent's real logic ...\n    answer = task  # placeholder so the sketch runs; replace with the real result\n    print(answer)\n\nif __name__ == \"__main__\":\n    main()\n```\n\nBecause the contract is a process boundary, an \"agent\" can be Python, a shell script, or anything that respects stdin/stdout. The harness doesn't care.\n\nEach run is scored on five deterministic metrics, checked against thresholds declared in `metrics.yaml`:\n\n```\nthresholds:\n  accuracy: 0.8                  # exact normalized matches\n  fuzzy_score: 0.7               # average sequence similarity 0-1\n  timeout_rate: 0.1              # fraction of runs that timed out\n  safety_violations: 0           # outputs matching unsafe patterns\n  reproducibility_variance: 0.05 # std-dev across repeated runs\n```\n\n`reproducibility_variance` is the one people forget. Running an agent once tells you what it did; running it several times and measuring the spread tells you whether you can *trust* what it did. A correct-but-nondeterministic agent is a latent bug.\n\nSome qualities aren't string-comparable: did the agent stay in role? Did it respect its constraints? Is the output well-formed and complete? 
For those, an opt-in judge sends the rubric, the task, and the agent's real output to Claude and gets back a **structured verdict** — validated against a JSON schema so a malformed judgment can't poison the report.\n\n```\n{\n  \"overall\": 7.5,\n  \"dimensions\": {\n    \"contract_adherence\": 8,\n    \"role_fidelity\": 9,\n    \"constraint_safety\": 7,\n    \"output_format\": 6,\n    \"completeness\": 8\n  },\n  \"verdict\": \"needs_improvement\",\n  \"weaknesses\": [\n    { \"dimension\": \"output_format\", \"prompt_fix\": \"Require a fenced JSON block in the system prompt.\" }\n  ]\n}\n```\n\nThe judge runs three ways depending on what you have: the Anthropic API (`--llm-judge`), the headless Claude Code CLI for subscription-only setups (`--llm-judge-cli`), or pre-computed verdicts from any source (`--llm-verdicts`). Same report either way. Identical outputs are judged once to bound cost.\n\nThe important detail: every weakness must map to **a fixable line in the agent's prompt**. The judge isn't there to vibe-check; it produces edits.\n\nThis is where it gets fun. A `fail` verdict and its prompt fixes land in `failures.json`, which feeds a GEPA-style improve loop: judge each candidate prompt **per dimension**, mutate the frontier candidate that owns the *weakest* dimension, keep a pool of candidates rather than greedily chasing one best, and write back only the best pool member. 
Scores and mutations are persisted to repo memory so the next run starts informed, and a nightly job commits improvements.\n\nThe diagram above shows the whole flow: inputs → harness → (metrics + judge) → reports → improve loop, with a feedback edge carrying mutated prompts back to re-evaluation.\n\n`history.jsonl` is the trend that tells you whether you're actually getting better.\n\nThe payoff is a system where I can change a prompt, run one command, and know — numerically — whether I helped or hurt, with the loop quietly fixing the easy regressions for me.", "url": "https://wpnews.pro/news/scoring-ai-agents-deterministic-metrics-an-llm-judge", "canonical_source": "https://dev.to/pponali/scoring-ai-agents-deterministic-metrics-an-llm-judge-poj", "published_at": "2026-06-18 05:57:49+00:00", "updated_at": "2026-06-18 06:22:01.320539+00:00", "lang": "en", "topics": ["ai-agents", "developer-tools", "large-language-models", "artificial-intelligence"], "entities": ["Claude", "Anthropic", "GEPA"], "alternates": {"html": "https://wpnews.pro/news/scoring-ai-agents-deterministic-metrics-an-llm-judge", "markdown": "https://wpnews.pro/news/scoring-ai-agents-deterministic-metrics-an-llm-judge.md", "text": "https://wpnews.pro/news/scoring-ai-agents-deterministic-metrics-an-llm-judge.txt", "jsonld": "https://wpnews.pro/news/scoring-ai-agents-deterministic-metrics-an-llm-judge.jsonld"}}