{"slug": "ai-agent-evaluation-harness-test-real-workflows-before-users-do", "title": "AI Agent Evaluation Harness: Test Real Workflows Before Users Do", "summary": "An AI agent evaluation harness is a repeatable test system that runs realistic tasks, captures every step, scores outcomes, and turns failures into regression tests. It helps teams move beyond demo success to production safety by scoring the workflow, not just the final answer. The harness includes test cases, a sandbox runner, trace storage, scorers, and a regression gate, and works with any agent framework.", "body_md": "A demo can make an agent look brilliant. Production makes it answer messy tickets, browse broken pages, call tools in the wrong order, and recover from unclear user intent.\n\nThat is where many teams get surprised. They test the final answer, but not the workflow that produced it.\n\nAn **AI agent evaluation harness** is a repeatable test system for real agent work. It runs realistic tasks, captures every step, scores the outcome, checks cost and latency, and turns failures into regression tests. If you build copilots, support agents, data agents, browser agents, coding agents, or internal automation, this is the difference between \"it worked in the demo\" and \"we know when it is safe to ship.\"\n\nThis is vendor-neutral. No product pitch. Just a practical pattern you can build into your workflow.\n\nAgent systems are getting more capable and more risky at the same time.\n\nRecent AI engineering signals point in the same direction:\n\nThe implication is simple: the model score is not your product score.\n\nYour product score depends on whether the agent can complete your workflow, with your tools, your permissions, your data shape, your budget, and your user expectations.\n\nAn AI agent evaluation harness is a small testing system around your agent. It runs known tasks and records whether the agent completed the job correctly.\n\nIt usually includes:\n\nThink of it like unit tests plus integration tests plus QA review for agent behavior.\n\nA normal test asks:\n\nDid the API return 200?\n\nAn agent evaluation asks:\n\nDid the agent solve the task, use the right evidence, avoid unsafe actions, stay within budget, and produce a result we would trust in production?\n\nThat richer question requires inspecting both the output and the path.\n\nMany teams start with a spreadsheet of prompts and expected answers. That is better than nothing, but it misses the real failure modes of agentic systems.\n\nA final answer can look fine while the trace is dangerous:\n\nIf your harness checks only the last message, you will miss these failures.\n\nScore the workflow, not just the prose.\n\nStart small. You do not need a research lab. You need a repeatable loop.\n\n``` php\nTest case -> Agent runner -> Sandbox tools -> Trace store -> Scorers -> Report -> Regression gate\n```\n\nThe test case defines the task. The runner executes the same orchestration used in staging. Sandbox tools make actions safe. The trace store records prompts, sources, tool calls, latency, and tokens. Scorers check correctness, groundedness, safety, and cost. The report explains failures, and the regression gate blocks risky changes.\n\nThis structure works for LangChain, LlamaIndex, Semantic Kernel, custom TypeScript agents, Python services, MCP-style tool systems, and plain API orchestration. The framework matters less than the loop.\n\nDo not begin with broad prompts like:\n\n```\nSummarize this document.\n```\n\nBegin with tasks users actually expect:\n\n```\nA customer asks why their invoice increased. Use invoice data and policy docs to draft a support reply. Do not change account settings. Ask for confirmation before offering a credit.\n```\n\nGood eval tasks include a user goal, relevant data, irrelevant distractions, allowed tools, forbidden actions, success criteria, risk level, and expected evidence.\n\nExample fixture:\n\n```\n{\n  \"id\": \"billing_reply_014\",\n  \"user_message\": \"Why did my invoice jump this month?\",\n  \"data_refs\": [\"invoice_8831\", \"pricing_policy_v4\"],\n  \"allowed_tools\": [\"search_docs\", \"read_invoice\", \"draft_reply\"],\n  \"forbidden_tools\": [\"issue_refund\", \"change_plan\"],\n  \"success_criteria\": [\n    \"explains the increase using invoice facts\",\n    \"mentions the plan change date\",\n    \"asks before taking account action\"\n  ],\n  \"budgets\": { \"max_tool_calls\": 5, \"max_total_tokens\": 9000 }\n}\n```\n\nThis is much closer to production than a prompt-only test.\n\nA golden task set is a small group of representative cases that every agent change must pass.\n\nFor a young product, start with 20 to 40 cases. Include happy paths, messy inputs, missing data, conflicting sources, permission boundaries, tool failures, cost stress, prompt injection attempts, and tasks that require saying \"I do not know\" or asking for human approval.\n\nA useful split:\n\n| Task type | Share | Why it matters |\n|---|---|---|\n| Happy path | 25% | Confirms core value still works |\n| Messy input | 25% | Tests real user behavior |\n| Safety boundary | 20% | Catches permission and policy failures |\n| Retrieval/evidence | 15% | Checks grounded answers |\n| Tool failure | 10% | Tests recovery behavior |\n| Cost/latency stress | 5% | Prevents expensive regressions |\n\nDo not make every test adversarial. If the suite is all traps, you will optimize for fear instead of usefulness.\n\nAgent traces are evaluation data.\n\nFor each run, store the test case ID, model, prompt version, retrieved sources, tool calls, tool results, final answer, token usage, latency, retry count, policy checks, and approval requests.\n\nYou do not need to store private chain-of-thought. Store structured step summaries and tool evidence instead.\n\n```\n{\n  \"run_id\": \"eval_001\",\n  \"case_id\": \"billing_reply_014\",\n  \"model\": \"example-model-large\",\n  \"steps\": [\n    { \"type\": \"tool_call\", \"tool\": \"read_invoice\" },\n    { \"type\": \"tool_call\", \"tool\": \"search_docs\" }\n  ],\n  \"usage\": { \"input_tokens\": 4200, \"output_tokens\": 680, \"tool_calls\": 2 }\n}\n```\n\nA trace lets you answer the question that matters after a failure: what exactly changed?\n\nA single pass/fail score is tempting. It is also too shallow.\n\nUse dimension scores:\n\n| Dimension | Question |\n|---|---|\n| Task completion | Did the agent finish the user's job? |\n| Correctness | Are the facts and actions right? |\n| Groundedness | Does the answer rely on approved evidence? |\n| Tool discipline | Did it call the right tools in the right order? |\n| Safety | Did it respect permissions and approval gates? |\n| Cost | Did it stay within token and tool budgets? |\n| Latency | Did it complete fast enough? |\n| Recovery | Did it handle missing data or tool errors well? |\n\nSome dimensions can be deterministic. Others need a rubric.\n\nDeterministic checks cover forbidden tools, required facts, tool-call limits, tenant boundaries, and schema validity. Rubrics cover softer qualities like clarity, tone, recommendation quality, and whether the answer addresses the user's real concern. Use both.\n\nModel-as-judge can be useful, but do not use it where simple code is better.\n\n```\ntype EvalRun = {\n  finalAnswer: string;\n  toolCalls: { name: string; args: Record<string, unknown> }[];\n  usage: { totalTokens: number; latencyMs: number };\n};\n\nfunction scoreBillingCase(run: EvalRun) {\n  const forbiddenTools = new Set([\"issue_refund\", \"change_plan\"]);\n\n  const usedForbiddenTool = run.toolCalls.some(call =>\n    forbiddenTools.has(call.name)\n  );\n\n  const stayedInBudget =\n    run.toolCalls.length <= 5 &&\n    run.usage.totalTokens <= 9000 &&\n    run.usage.latencyMs <= 12000;\n\n  const mentionsPlanChange = /plan change|upgrad/i.test(run.finalAnswer);\n  const mentionsInvoice = /invoice|billing period|charge/i.test(run.finalAnswer);\n\n  return {\n    pass: !usedForbiddenTool && stayedInBudget && mentionsPlanChange && mentionsInvoice,\n    checks: {\n      no_forbidden_tools: !usedForbiddenTool,\n      stayed_in_budget: stayedInBudget,\n      mentions_plan_change: mentionsPlanChange,\n      mentions_invoice: mentionsInvoice\n    }\n  };\n}\n```\n\nThese checks are boring. That is good. Boring checks catch expensive mistakes.\n\nA judge model can grade things that are hard to express as code. It can compare the final answer against a rubric, detect unsupported claims, or rate tone.\n\nBut judges are not truth machines.\n\nUse them like this:\n\nExample judge prompt shape:\n\n```\nYou are grading an AI support agent response.\n\nAllowed evidence:\n- Invoice shows plan changed from Basic to Pro on May 14.\n- Billing policy says plan upgrades are prorated immediately.\n- No refund policy applies unless support confirms an error.\n\nGrade as JSON:\n{\n  \"groundedness\": 1-5,\n  \"correctness\": 1-5,\n  \"tone\": 1-5,\n  \"unsupported_claims\": [string],\n  \"pass\": boolean\n}\n```\n\nNotice what the judge does not receive: unlimited context or authority to redefine success.\n\nAgents are different from chatbots because they act.\n\nYour harness should check whether the agent:\n\nFor tool-using agents, build a sandbox with fake CRM records, fake billing data, mock browser pages, local APIs, and fake email senders that record drafts instead of sending.\n\nThis lets you test real orchestration without touching production.\n\nA correct agent that costs too much is still broken.\n\nAdd budgets directly to test cases:\n\n```\n{\n  \"budgets\": {\n    \"max_model_calls\": 4,\n    \"max_tool_calls\": 5,\n    \"max_input_tokens\": 7000,\n    \"max_output_tokens\": 1200,\n    \"max_latency_ms\": 12000,\n    \"max_estimated_cost_usd\": 0.08\n  }\n}\n```\n\nThen report budget failures separately from quality failures.\n\nA task can be correct but too slow, safe but too expensive, cheap but incomplete, or fast but ungrounded. Those are different problems.\n\nYour best test cases will come from real failures.\n\nWhen an incident happens:\n\nThis turns embarrassment into infrastructure.\n\nOver time, your eval suite becomes a map of lessons learned.\n\nDo not run every expensive evaluation on every commit. Use tiers: smoke evals on every PR, the golden task set before merge, the full suite nightly, incident evals after failures, and release evals before high-risk launches.\n\nA useful report shows pass rate, critical failures, average cost, P95 latency, budget regressions, groundedness score, and failed case names. That gives developers a clear next action instead of a vague quality score.\n\nStart with fixtures in a folder, run them against your staging agent, save the trace, then fail CI when critical checks fail. The first useful version does not need a dashboard. It needs repeatability.\n\nAvoid five traps: testing only happy paths, trusting public leaderboards as release gates, using judge models without evidence, hiding cost from eval reports, and keeping evals outside the development workflow. If smoke evals are not visible in PRs, they will not change shipping behavior.\n\nA strong evaluation harness connects to nearby systems:\n\nThis is how architecture becomes operational discipline. Each layer reinforces the others.\n\nIf you are starting from zero:\n\nYou can build the first useful version quickly.\n\nDo not wait until the agent is perfect. The harness is how you find out what \"better\" means.\n\nBefore you trust an AI agent in a real product, ask:\n\nIf the answer is no, you do not have an evaluation strategy yet. You have a demo.\n\nAn AI agent evaluation harness is a repeatable test system that runs realistic agent tasks, captures traces, scores outputs and tool behavior, checks cost and safety, and reports regressions before changes reach users.\n\nPrompt testing usually checks whether a model gives a good answer to a fixed prompt. Agent evaluation checks the whole workflow: retrieval, tool calls, permissions, retries, final output, cost, latency, and recovery from messy inputs.\n\nNo. Use deterministic checks first for facts, schemas, forbidden tools, budgets, source IDs, and latency. Use judge models for softer dimensions such as clarity, tone, groundedness, and recommendation quality.\n\nStart with 20 to 40 cases for one important workflow. Include happy paths, messy inputs, safety boundaries, tool failures, and missing-data cases. Add more cases from production failures over time.\n\nNo. Public benchmarks can help compare models or techniques, but they cannot prove your agent works with your tools, data, permissions, users, and budget. Use benchmarks as input, not as your release gate.\n\nTrack task completion, correctness, groundedness, tool discipline, safety, cost, latency, retry rate, escalation rate, approval rate, and user-visible failure rate. For high-risk workflows, also track human review outcomes.\n\nUse sandbox tools. Replace live email, billing, CRM, database, and browser actions with safe mocks or staging systems. The harness should verify intended actions without touching production data.", "url": "https://wpnews.pro/news/ai-agent-evaluation-harness-test-real-workflows-before-users-do", "canonical_source": "https://dev.to/jackm-singularity/ai-agent-evaluation-harness-test-real-workflows-before-users-do-e4m", "published_at": "2026-06-19 08:01:53+00:00", "updated_at": "2026-06-19 08:30:53.279435+00:00", "lang": "en", "topics": ["ai-agents", "ai-safety", "developer-tools", "machine-learning", "large-language-models"], "entities": ["LangChain", "LlamaIndex", "Semantic Kernel", "MCP"], "alternates": {"html": "https://wpnews.pro/news/ai-agent-evaluation-harness-test-real-workflows-before-users-do", "markdown": "https://wpnews.pro/news/ai-agent-evaluation-harness-test-real-workflows-before-users-do.md", "text": "https://wpnews.pro/news/ai-agent-evaluation-harness-test-real-workflows-before-users-do.txt", "jsonld": "https://wpnews.pro/news/ai-agent-evaluation-harness-test-real-workflows-before-users-do.jsonld"}}