Stop Asserting Equality: How to Test Agents When Every Run Is Different

A developer has proposed replacing equality-based assertions with invariant checks for testing AI agents, arguing that non-deterministic outputs make traditional tests flaky or fake. The approach uses structural and factual invariants—such as verifying a customer ID exists or output length is shorter than input—that remain true regardless of wording, alongside semantic similarity thresholds calibrated from known-good outputs. This method eliminates the need for fragile normalization functions and regex chains that often lead teams to delete tests entirely.

Here is the test that quietly destroys most agent codebases: expect await agent.run "summarize this ticket" .toBe EXPECTED SUMMARY ; It passes on Tuesday. It fails on Wednesday because the model reworded one sentence. So someone adds a .trim , then a lowercase, then a regex, and three weeks later the assertion is a 40-line normalization function that still flakes twice a week. Eventually the team does the only rational thing left: they delete the test. Now the agent has no tests at all, and everyone agrees "agents are just hard to test." Agents are not hard to test. Asserting equality on non-deterministic output is hard, because it's impossible. Those are different problems, and conflating them is why so much agent CI is either flaky or fake. The same prompt, same model, same temperature can produce different tokens on different runs. Provider-side routing, batching, and floating-point nondeterminism mean even temperature: 0 is not a guarantee. If your test strategy assumes a fixed output, you are testing the wrong layer of reality. Here's what to do instead. The core move: stop asking "did the agent produce exactly this?" and start asking "is every property that must be true, true?" An invariant is something that holds across all valid outputs, regardless of wording. For a ticket summarizer, the exact prose is irrelevant. What must be true is structural and factual: None of those care about phrasing. All of them are deterministic, free, and unfakeable. type Invariant = { name: string; check: output: string, input: TicketInput = boolean; severity: "block" | "warn"; }; const summarizerInvariants: Invariant = { name: "references real customer id", check: out, input = out.includes input.customerId , severity: "block", }, { name: "no fabricated ids", check: out, input = { const idsInOutput = ...out.matchAll /CUST-\d{5}/g .map m = m 0 ; return idsInOutput.every id = input.knownIds.includes id ; }, severity: "block", }, { name: "actually summarizes shorter than source ", check: out, input = out.length < input.body.length, severity: "block", }, { name: "has recommendation", check: out = /recommend|next step|action/i.test out , severity: "warn", }, ; function assertInvariants output: string, input: TicketInput { const failures = summarizerInvariants.filter inv = inv.check output, input ; const blockers = failures.filter f = f.severity === "block" ; if blockers.length 0 { throw new Error Invariant violations: ${blockers.map b = b.name .join ", " } , ; } return failures; // warnings bubble up as metadata, not failures } This test will pass for any correctly-behaving output and fail for any broken one — which is the entire point of a test. The reworded sentence on Wednesday no longer breaks CI. A fabricated customer ID does. Sometimes an invariant isn't enough — you genuinely need "is this answer close to the right answer?" The mistake is reaching for string equality. Reach for semantic similarity instead, with an explicit threshold. js import { cosineSimilarity, embed } from "./embeddings"; async function assertSemanticMatch output: string, reference: string, threshold = 0.82, { const a, b = await Promise.all embed output , embed reference ; const score = cosineSimilarity a, b ; if score < threshold { throw new Error Semantic drift: similarity ${score.toFixed 3 } < ${threshold} , ; } return score; } The threshold is a real engineering decision, not a magic number. Calibrate it: run 50 known-good outputs against the reference, look at the distribution of scores, and set the threshold a couple of standard deviations below the mean. Now you have a test that tolerates rewording but catches an answer that has wandered off topic. Here is the part most teams skip. Because output is a distribution, a single test run samples that distribution exactly once. If your agent is correct 90% of the time, a one-shot test is a coin that lands green 9 times out of 10 — and you will spend the tenth morning debugging a "flaky test" that is actually telling you the truth. The fix is to treat agent tests like the statistical experiments they are: run the case N times and assert on the pass rate , not on a single result. js async function assertReliability scenario: = Promise<boolean , { runs = 10, minPassRate = 0.9 }: { runs?: number; minPassRate?: number }, { const results = await Promise.all Array.from { length: runs }, = scenario , ; const passed = results.filter Boolean .length; const rate = passed / runs; if rate < minPassRate { throw new Error Reliability ${ rate 100 .toFixed 0 }% < required ${minPassRate 100}% ${passed}/${runs} , ; } return rate; } This reframes the question from "did it work?" to "how often does it work, and is that often enough?" That is the only question that matters in production, where the agent runs ten thousand times a day. It also turns flakiness from a nuisance into a signal: a pass rate that drifts from 95% to 78% across builds is a regression, even though no single run "failed." The cost objection is real — N model calls per scenario adds up. Two mitigations: only multi-sample the scenarios that gate a deploy, and run the expensive suite nightly rather than on every commit. Determinism-friendly checks invariants run on every push; statistical checks run on a schedule. Putting it together, a sane agent test pyramid looks like this, cheapest first: The ordering matters because most regressions are caught at layer 1 for free. You never want to pay for an LLM judge call to discover the output was empty, or that the agent fabricated an ID — a string check already knew that. Deterministic software has a true/false oracle: the output is either right or it isn't. Agents don't have that oracle, and pretending they do produces tests that are simultaneously flaky and uninformative. The shift is to stop testing for equality and start testing for correctness properties — invariants that must always hold, distributions that must stay within bounds, and meaning that must stay close to intent. Those are all measurable. They just aren't toBe . Once you internalize that, the flaky-test death spiral disappears, because you're no longer fighting nondeterminism — you're describing it. If you want a harness that already encodes this layering, agent-eval https://github.com/sauravbhattacharya001/agent-eval gives you tiered invariant, statistical, and judge assertions so a prompt change either holds your properties or it doesn't — and a sliding pass rate across builds shows up as a regression instead of a "flaky test." Test the distribution, not the string — your CI will finally start telling you the truth.