{"slug": "stop-asserting-equality-how-to-test-agents-when-every-run-is-different", "title": "Stop Asserting Equality: How to Test Agents When Every Run Is Different", "summary": "A developer has proposed replacing equality-based assertions with invariant checks for testing AI agents, arguing that non-deterministic outputs make traditional tests flaky or fake. The approach uses structural and factual invariants—such as verifying a customer ID exists or output length is shorter than input—that remain true regardless of wording, alongside semantic similarity thresholds calibrated from known-good outputs. This method eliminates the need for fragile normalization functions and regex chains that often lead teams to delete tests entirely.", "body_md": "Here is the test that quietly destroys most agent codebases:\n\n```\nexpect(await agent.run(\"summarize this ticket\")).toBe(EXPECTED_SUMMARY);\n```\n\nIt passes on Tuesday. It fails on Wednesday because the model reworded one sentence. So someone adds a `.trim()`\n\n, then a lowercase, then a regex, and three weeks later the assertion is a 40-line normalization function that still flakes twice a week. Eventually the team does the only rational thing left: they delete the test. Now the agent has no tests at all, and everyone agrees \"agents are just hard to test.\"\n\nAgents are not hard to test. **Asserting equality on non-deterministic output is hard, because it's impossible.** Those are different problems, and conflating them is why so much agent CI is either flaky or fake.\n\nThe same prompt, same model, same temperature can produce different tokens on different runs. Provider-side routing, batching, and floating-point nondeterminism mean even `temperature: 0`\n\nis not a guarantee. If your test strategy assumes a fixed output, you are testing the wrong layer of reality. Here's what to do instead.\n\nThe core move: stop asking \"did the agent produce exactly this?\" and start asking \"is every property that must be true, true?\"\n\nAn invariant is something that holds across *all* valid outputs, regardless of wording. For a ticket summarizer, the exact prose is irrelevant. What must be true is structural and factual:\n\nNone of those care about phrasing. All of them are deterministic, free, and unfakeable.\n\n```\ntype Invariant = {\n  name: string;\n  check: (output: string, input: TicketInput) => boolean;\n  severity: \"block\" | \"warn\";\n};\n\nconst summarizerInvariants: Invariant[] = [\n  {\n    name: \"references real customer id\",\n    check: (out, input) => out.includes(input.customerId),\n    severity: \"block\",\n  },\n  {\n    name: \"no fabricated ids\",\n    check: (out, input) => {\n      const idsInOutput = [...out.matchAll(/CUST-\\d{5}/g)].map((m) => m[0]);\n      return idsInOutput.every((id) => input.knownIds.includes(id));\n    },\n    severity: \"block\",\n  },\n  {\n    name: \"actually summarizes (shorter than source)\",\n    check: (out, input) => out.length < input.body.length,\n    severity: \"block\",\n  },\n  {\n    name: \"has recommendation\",\n    check: (out) => /recommend|next step|action/i.test(out),\n    severity: \"warn\",\n  },\n];\n\nfunction assertInvariants(output: string, input: TicketInput) {\n  const failures = summarizerInvariants.filter((inv) => !inv.check(output, input));\n  const blockers = failures.filter((f) => f.severity === \"block\");\n  if (blockers.length > 0) {\n    throw new Error(\n      `Invariant violations: ${blockers.map((b) => b.name).join(\", \")}`,\n    );\n  }\n  return failures; // warnings bubble up as metadata, not failures\n}\n```\n\nThis test will pass for *any* correctly-behaving output and fail for *any* broken one — which is the entire point of a test. The reworded sentence on Wednesday no longer breaks CI. A fabricated customer ID does.\n\nSometimes an invariant isn't enough — you genuinely need \"is this answer close to the right answer?\" The mistake is reaching for string equality. Reach for semantic similarity instead, with an explicit threshold.\n\n``` js\nimport { cosineSimilarity, embed } from \"./embeddings\";\n\nasync function assertSemanticMatch(\n  output: string,\n  reference: string,\n  threshold = 0.82,\n) {\n  const [a, b] = await Promise.all([embed(output), embed(reference)]);\n  const score = cosineSimilarity(a, b);\n  if (score < threshold) {\n    throw new Error(\n      `Semantic drift: similarity ${score.toFixed(3)} < ${threshold}`,\n    );\n  }\n  return score;\n}\n```\n\nThe threshold is a real engineering decision, not a magic number. Calibrate it: run 50 known-good outputs against the reference, look at the distribution of scores, and set the threshold a couple of standard deviations below the mean. Now you have a test that tolerates rewording but catches an answer that has wandered off topic.\n\nHere is the part most teams skip. Because output is a distribution, a single test run samples that distribution exactly once. If your agent is correct 90% of the time, a one-shot test is a coin that lands green 9 times out of 10 — and you will spend the tenth morning debugging a \"flaky test\" that is actually telling you the truth.\n\nThe fix is to treat agent tests like the statistical experiments they are: run the case N times and assert on the *pass rate*, not on a single result.\n\n``` js\nasync function assertReliability(\n  scenario: () => Promise<boolean>,\n  { runs = 10, minPassRate = 0.9 }: { runs?: number; minPassRate?: number },\n) {\n  const results = await Promise.all(\n    Array.from({ length: runs }, () => scenario()),\n  );\n  const passed = results.filter(Boolean).length;\n  const rate = passed / runs;\n  if (rate < minPassRate) {\n    throw new Error(\n      `Reliability ${(rate * 100).toFixed(0)}% < required ${minPassRate * 100}% (${passed}/${runs})`,\n    );\n  }\n  return rate;\n}\n```\n\nThis reframes the question from \"did it work?\" to \"how often does it work, and is that often enough?\" That is the only question that matters in production, where the agent runs ten thousand times a day. It also turns flakiness from a nuisance into a signal: a pass rate that drifts from 95% to 78% across builds is a regression, even though no single run \"failed.\"\n\nThe cost objection is real — N model calls per scenario adds up. Two mitigations: only multi-sample the scenarios that gate a deploy, and run the expensive suite nightly rather than on every commit. Determinism-friendly checks (invariants) run on every push; statistical checks run on a schedule.\n\nPutting it together, a sane agent test pyramid looks like this, cheapest first:\n\nThe ordering matters because most regressions are caught at layer 1 for free. You never want to pay for an LLM judge call to discover the output was empty, or that the agent fabricated an ID — a string check already knew that.\n\nDeterministic software has a true/false oracle: the output is either right or it isn't. Agents don't have that oracle, and pretending they do produces tests that are simultaneously flaky and uninformative.\n\nThe shift is to stop testing for *equality* and start testing for *correctness properties* — invariants that must always hold, distributions that must stay within bounds, and meaning that must stay close to intent. Those are all measurable. They just aren't `toBe`\n\n.\n\nOnce you internalize that, the flaky-test death spiral disappears, because you're no longer fighting nondeterminism — you're describing it.\n\nIf you want a harness that already encodes this layering, [agent-eval](https://github.com/sauravbhattacharya001/agent-eval) gives you tiered invariant, statistical, and judge assertions so a prompt change either holds your properties or it doesn't — and a sliding pass rate across builds shows up as a regression instead of a \"flaky test.\" Test the distribution, not the string — your CI will finally start telling you the truth.", "url": "https://wpnews.pro/news/stop-asserting-equality-how-to-test-agents-when-every-run-is-different", "canonical_source": "https://dev.to/saurav_bhattacharya/stop-asserting-equality-how-to-test-agents-when-every-run-is-different-3024", "published_at": "2026-06-12 01:02:09+00:00", "updated_at": "2026-06-12 01:42:45.587003+00:00", "lang": "en", "topics": ["ai-agents", "large-language-models", "mlops", "generative-ai"], "entities": [], "alternates": {"html": "https://wpnews.pro/news/stop-asserting-equality-how-to-test-agents-when-every-run-is-different", "markdown": "https://wpnews.pro/news/stop-asserting-equality-how-to-test-agents-when-every-run-is-different.md", "text": "https://wpnews.pro/news/stop-asserting-equality-how-to-test-agents-when-every-run-is-different.txt", "jsonld": "https://wpnews.pro/news/stop-asserting-equality-how-to-test-agents-when-every-run-is-different.jsonld"}}