Stop Asserting Equality: How to Test Agents When Every Run Is Different

A developer has proposed replacing equality-based assertions with invariant checks for testing AI agents, arguing that non-deterministic outputs make traditional tests flaky or fake. The approach uses structural and factual invariants (such as verifying that a customer ID exists or that the output is shorter than the input) that remain true regardless of wording, alongside semantic similarity thresholds calibrated from known-good outputs. This method eliminates the need for fragile normalization functions and regex chains that often lead teams to delete tests entirely.

Here is the test that quietly destroys most agent codebases:

```ts
expect(await agent.run("summarize this ticket")).toBe(EXPECTED_SUMMARY);
```

It passes on Tuesday. It fails on Wednesday because the model reworded one sentence. So someone adds a `.trim()`, then a lowercase, then a regex, and three weeks later the assertion is a 40-line normalization function that still flakes twice a week. Eventually the team does the only rational thing left: they delete the test. Now the agent has no tests at all, and everyone agrees "agents are just hard to test."

Agents are not hard to test. Asserting equality on non-deterministic output is hard, because it's impossible. Those are different problems, and conflating them is why so much agent CI is either flaky or fake.

The same prompt, same model, and same temperature can produce different tokens on different runs. Provider-side routing, batching, and floating-point nondeterminism mean even `temperature: 0` is not a guarantee. If your test strategy assumes a fixed output, you are testing the wrong layer of reality. Here's what to do instead.

The core move: stop asking "did the agent produce exactly this?" and start asking "is every property that must be true, true?" An invariant is something that holds across all valid outputs, regardless of wording. For a ticket summarizer, the exact prose is irrelevant.
What must be true is structural and factual: the summary references the real customer ID, contains no fabricated IDs, is shorter than the source ticket, and ideally includes a recommendation. None of those care about phrasing. All of them are deterministic, free, and unfakeable.

```ts
// Assumed shape of the input, inferred from the fields the checks below use.
type TicketInput = {
  customerId: string;
  knownIds: string[];
  body: string;
};

type Invariant = {
  name: string;
  check: (output: string, input: TicketInput) => boolean;
  severity: "block" | "warn";
};

const summarizerInvariants: Invariant[] = [
  {
    name: "references real customer id",
    check: (out, input) => out.includes(input.customerId),
    severity: "block",
  },
  {
    name: "no fabricated ids",
    check: (out, input) => {
      // Every ID-shaped token in the output must be one we already know about.
      const idsInOutput = [...out.matchAll(/CUST-\d{5}/g)].map((m) => m[0]);
      return idsInOutput.every((id) => input.knownIds.includes(id));
    },
    severity: "block",
  },
  {
    name: "actually summarizes (shorter than source)",
    check: (out, input) => out.length < input.body.length,
    severity: "block",
  },
  {
    name: "has recommendation",
    check: (out) => /recommend|next step|action/i.test(out),
    severity: "warn",
  },
];

function assertInvariants(output: string, input: TicketInput) {
  // Collect every invariant whose check does not hold for this output.
  const failures = summarizerInvariants.filter(
    (inv) => !inv.check(output, input),
  );
  const blockers = failures.filter((f) => f.severity === "block");
  if (blockers.length > 0) {
    throw new Error(
      `Invariant violations: ${blockers.map((b) => b.name).join(", ")}`,
    );
  }
  return failures; // warnings bubble up as metadata, not failures
}
```

This test will pass for any correctly-behaving output and fail for any broken one, which is the entire point of a test. The reworded sentence on Wednesday no longer breaks CI. A fabricated customer ID does.

Sometimes an invariant isn't enough: you genuinely need "is this answer close to the right answer?" The mistake is reaching for string equality. Reach for semantic similarity instead, with an explicit threshold.
```ts
import { cosineSimilarity, embed } from "./embeddings";

async function assertSemanticMatch(
  output: string,
  reference: string,
  threshold = 0.82,
) {
  // Embed both texts in parallel, then compare them.
  const [a, b] = await Promise.all([embed(output), embed(reference)]);
  const score = cosineSimilarity(a, b);
  if (score < threshold) {
    throw new Error(
      `Semantic drift: similarity ${score.toFixed(3)} < ${threshold}`,
    );
  }
  return score;
}
```

The threshold is a real engineering decision, not a magic number. Calibrate it: run 50 known-good outputs against the reference, look at the distribution of scores, and set the threshold a couple of standard deviations below the mean. Now you have a test that tolerates rewording but catches an answer that has wandered off topic.

Here is the part most teams skip. Because output is a distribution, a single test run samples that distribution exactly once. If your agent is correct 90% of the time, a one-shot test is a coin that lands green 9 times out of 10, and you will spend the tenth morning debugging a "flaky test" that is actually telling you the truth.

The fix is to treat agent tests like the statistical experiments they are: run the case N times and assert on the pass rate, not on a single result.

js async function assertReliability scenario: = Promise