AI Evals, Part 5: From a Number to a Gate: Evals in CI and Production

TextStack, a .NET-based AI product, has implemented a two-part evaluation system: an offline gate that runs golden dataset evals as opt-in dotnet tests to catch regressions before shipping, and production monitoring that tags every AI call with feature metadata and records cost, latency, and errors. The system uses a custom IEvaluator on Microsoft.Extensions.AI.Evaluation to produce numeric metrics and a Pass/Fail interpretation against a quality floor, with plans to add automatic baseline-versus-regression gating. This approach turns AI quality into an automated engineering process, similar to unit testing, and creates a continuous-improvement flywheel where production failures become new golden test cases.

Part 5, the finale, of a series on building production AI on .NET. We've built the pieces: what evals are, error analysis, golden datasets, and a trustworthy judge. Now we make them earn their keep.

By now you can produce a defensible quality score for an AI feature. But a score you only look at is a vanity metric. The entire point of all that work is to make quality something your engineering process acts on automatically, the same way a failing unit test stops a bad commit. That means two homes for your evals: a gate before you ship, and monitoring after.

Because TextStack's judge is a custom IEvaluator on Microsoft.Extensions.AI.Evaluation, an eval is just a dotnet test. The MEAI evaluator emits the rubric's axes plus an overall score as numeric metrics, and a quality floor is expressed as a Pass/Fail interpretation on the overall:

    // In the evaluator: the overall metric is interpreted Pass/Fail against a floor.
    if (overallFloor is { } floor)
        overall.Interpretation = new EvaluationMetricInterpretation(
            RatingFor(score.Mean),
            failed: score.Mean < floor,
            reason: $"floor {floor:0.0} (mean {score.Mean:0.00})");

That catches gross breakage: "something is badly wrong." But the more valuable gate is relative: store a baseline score per feature, and fail the build when a change drops quality by more than a threshold versus that baseline. That turns "did this prompt change help?" into a red/green answer and makes improving a prompt a tight loop: change, run, compare, keep or revert. It's the AI equivalent of TDD.

Honest status from our codebase: the floor and on-demand runs exist today; the automatic baseline-versus-regression gate is the next step. I'm flagging that deliberately, because plenty of "we do eval-driven development" claims are really "we have a number nobody gates on." The hard 80% (the measuring instrument) is built; wiring the ratchet is the lighter remaining 20%.

Every eval case is a real generation plus a real judge call.
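
The comparison step, by contrast, is just arithmetic on stored scores. As a rough illustration of the baseline-versus-regression gate described above (which the article says is not built yet), here is a minimal sketch. Everything in it is an assumption made for illustration: the RegressionGate and GateResult names, the in-memory dictionary standing in for wherever baselines would be stored, and the choice to let a feature with no baseline pass.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only: RegressionGate, GateResult, and every value
// used with them are hypothetical names, not TextStack's actual code.
public sealed record GateResult(
    string Feature, double? Baseline, double Current, bool Failed, string Reason);

public static class RegressionGate
{
    // Fails when the current mean falls more than maxDrop below the
    // stored baseline for the feature. A feature without a baseline
    // passes, so its first run can be used to seed one.
    public static GateResult Check(
        IReadOnlyDictionary<string, double> baselines,
        string feature,
        double currentMean,
        double maxDrop)
    {
        if (!baselines.TryGetValue(feature, out var baseline))
        {
            return new GateResult(feature, null, currentMean, false,
                "no baseline recorded yet");
        }

        var drop = baseline - currentMean;
        var reason = FormattableString.Invariant(
            $"baseline {baseline:0.00}, current {currentMean:0.00}, allowed drop {maxDrop:0.00}");
        return new GateResult(feature, baseline, currentMean, drop > maxDrop, reason);
    }
}
```

With a stored baseline of 4.2 for a hypothetical "summary" feature and a maxDrop of 0.3, a run scoring 3.8 would fail the gate and a run scoring 4.1 would pass. In an xUnit-based eval test, the result could feed an assertion such as Assert.False(result.Failed, result.Reason), which is what would turn the comparison into a red/green build signal.
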
Running the full suite on every commit is slow and expensive, so evals have to be deliberate. TextStack's are opt-in: tagged so default CI skips them, and they self-skip when the provider isn't configured.

    OPENAI_API_KEY=… dotnet test tests/TextStack.AiEvals --filter Category=Eval

Default CI stays green and free; the expensive truth runs on purpose. The pragmatic pattern: a small, cheap subset on pull requests for a fast signal, and the full suite nightly or pre-release. Treat eval spend like any cloud cost: budget it, and don't let it run unbounded.

A curated golden set, however good, is a snapshot of inputs you imagined. Production sends inputs you didn't. So the offline gate is only half the system; the other half runs against live traffic.

This is where evals and observability become one thing. Every AI call in TextStack is tagged with its feature and recorded (cost, latency, tokens, errors), and runs persist to an eval runs table surfaced on an internal /ai-quality dashboard (Traces and Evals tabs), with an admin "Run evals" button to trigger the suite on demand. Because the judge is the …

Put the two homes together and you get a loop that compounds. Production surfaces a new failure mode → you do error analysis on it → it becomes a new golden case → your gate now defends against it → quality climbs → cleaner output produces cleaner traffic. Each turn makes the next regression harder to ship. That continuous-improvement flywheel, not any single dashboard, is the real product of an eval system.

That's the whole discipline, start to finish: what evals are, error analysis, golden datasets, a trustworthy judge, and a gate in CI with monitoring in production. None of it requires Python or a heavyweight platform. On .NET it's an ILlmService seam, a golden dataset in JSON, a custom IEvaluator on Microsoft.Extensions.AI.Evaluation, and an opt-in test category, built on a real product, in production.

Done right, evals turn "I think this AI feature is fine" into "I can prove it, and I'll know the moment it stops being true."
That's the difference between shipping AI and gambling with it.

TextStack is a reader that helps you finish the dense technical book you keep quitting. It builds every modern AI primitive (observability, evals, RAG, agents) as a real production feature on .NET. Try it at textstack.app, or read the code at github.com/mrviduus/textstack.