# AI Evals, Part 5: From a Number to a Gate: Evals in CI and Production

> Source: <https://dev.to/mrviduus/ai-evals-part-5-from-a-number-to-a-gate-evals-in-ci-and-production-1j33>
> Published: 2026-06-17 17:43:25+00:00

*Part 5, the finale, of a series on building production AI on .NET. We've built the pieces — what evals are, error analysis, golden datasets, and a trustworthy judge. Now we make them earn their keep.*

By now you can produce a defensible quality score for an AI feature. But a score you only *look at* is a vanity metric. The entire point of all that work is to make quality something your engineering process **acts on automatically** — the same way a failing unit test stops a bad commit. That means two homes for your evals: a **gate** before you ship, and **monitoring** after.

Because TextStack's judge is a custom `IEvaluator` on Microsoft.Extensions.AI.Evaluation, an eval is just a `dotnet test`. The MEAI evaluator emits the rubric's axes plus an overall as numeric metrics, and a quality *floor* is expressed as a Pass/Fail interpretation on the overall:

```
// In the evaluator: the overall metric is interpreted Pass/Fail against a floor.
if (overallFloor is { } floor)
    overall.Interpretation = new EvaluationMetricInterpretation(
        RatingFor(score.Mean),
        failed: score.Mean < floor,
        reason: $"floor {floor:0.0} (mean {score.Mean:0.00})");
```

That catches *gross* breakage — "something is badly wrong." But the more valuable gate is **relative**: store a baseline score per feature, and fail the build when a change drops quality by more than a threshold versus that baseline. That turns "did this prompt change help?" into a red/green answer and makes improving a prompt a tight loop — change, run, compare, keep or revert. It's the AI equivalent of TDD.

Honest status from our codebase: the floor and on-demand runs exist today; the automatic *baseline-versus-regression* gate is the next step. I'm flagging that deliberately, because plenty of "we do eval-driven development" claims are really "we have a number nobody gates on." The hard 80% — the measuring instrument — is built; wiring the ratchet is the lighter remaining 20%.
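Since the relative gate is not wired up yet, the following is only a sketch of how such a ratchet could look, not TextStack's implementation. The `RegressionGate` class, the `GateResult` enum, and the `maxDrop` default of 0.2 are all hypothetical; the scores are assumed to be the same overall mean that the floor check uses.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: type names and the threshold are illustrative assumptions.
public enum GateResult { Pass, Regression, NoBaseline }

public static class RegressionGate
{
    // Compares the current overall mean with the stored baseline for a feature.
    // Returns Regression when quality drops by more than maxDrop.
    public static GateResult Check(
        IReadOnlyDictionary<string, double> baselines,
        string feature,
        double currentMean,
        double maxDrop = 0.2)
    {
        // A feature with no stored baseline cannot regress; report it separately
        // so the caller can decide whether that should fail the build.
        if (!baselines.TryGetValue(feature, out var baseline))
            return GateResult.NoBaseline;

        return baseline - currentMean > maxDrop
            ? GateResult.Regression
            : GateResult.Pass;
    }

    // Ratchet: the baseline only ever moves up, so an accepted improvement
    // becomes the new bar for the next change.
    public static double Ratchet(double baseline, double currentMean) =>
        Math.Max(baseline, currentMean);
}
```

A test could assert that `Check` returns `GateResult.Pass`, which is what turns the comparison into a red/green build result. Where the baselines live (a JSON file in the repo, or a database table) is left open here.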

Every eval case is a real generation **plus** a real judge call. Running the full suite on every commit is slow and expensive, so evals have to be deliberate. TextStack's are **opt-in**: tagged so default CI skips them, and they self-skip when the provider isn't configured.

```
OPENAI_API_KEY=… dotnet test tests/TextStack.AiEvals --filter Category=Eval
```
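One way to get both behaviors (tagging and self-skipping) is sketched below. It assumes xUnit as the test framework, which the article does not state; `EvalFactAttribute` and the test class are hypothetical names, and the test body is a placeholder.

```csharp
using System;
using Xunit;

// Assumption: xUnit. EvalFactAttribute is a hypothetical name, not TextStack's code.
public sealed class EvalFactAttribute : FactAttribute
{
    public EvalFactAttribute()
    {
        // Self-skip: without a provider key the test is reported as skipped,
        // not failed, so an unconfigured machine stays green.
        if (string.IsNullOrWhiteSpace(Environment.GetEnvironmentVariable("OPENAI_API_KEY")))
            Skip = "OPENAI_API_KEY is not set; eval skipped.";
    }
}

public class SummaryEvals
{
    [EvalFact]
    [Trait("Category", "Eval")] // matched by: dotnet test --filter Category=Eval
    public void Summary_meets_quality_floor()
    {
        // Placeholder: a real case would generate output, call the judge,
        // and assert on the overall metric's interpretation.
    }
}
```

With the trait in place, a default pipeline could exclude these tests with `dotnet test --filter "Category!=Eval"`, while the command above selects only them.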

Default CI stays green and free; the expensive truth runs on purpose. The pragmatic pattern: a small, cheap subset on pull requests for a fast signal, and the full suite nightly or pre-release. Treat eval spend like any cloud cost — budget it, don't let it run unbounded.
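The subset-versus-full split and the budget check can be sketched in a few lines. Everything here is an illustrative assumption: the `EvalBudget` class, the case IDs, and the per-call prices are hypothetical, and the cost model is simply the "one generation plus one judge call per case" arithmetic described above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: names, sizes, and prices are illustrative only.
public static class EvalBudget
{
    // PR runs take a fixed, stable slice of the golden set so results are
    // comparable between runs; full runs (nightly, pre-release) take everything.
    public static IReadOnlyList<string> SelectCases(
        IEnumerable<string> caseIds, bool fullRun, int prSubsetSize)
    {
        var ordered = caseIds.OrderBy(id => id, StringComparer.Ordinal).ToList();
        return fullRun ? ordered : ordered.Take(prSubsetSize).ToList();
    }

    // Each eval case costs one generation plus one judge call.
    public static decimal EstimateCost(int caseCount, decimal perGeneration, decimal perJudgeCall) =>
        caseCount * (perGeneration + perJudgeCall);

    // Refuse to start a run whose estimate exceeds the budget.
    public static bool WithinBudget(decimal estimatedCost, decimal budget) =>
        estimatedCost <= budget;
}
```

Taking the first N IDs is the simplest stable choice; a hand-picked subset that covers the known failure modes would likely give a better signal for the same spend.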

A curated golden set, however good, is a snapshot of inputs you *imagined*. Production sends inputs you didn't. So the offline gate is only half the system; the other half runs against live traffic.

This is where evals and observability become one thing. Every AI call in TextStack is tagged with its feature and recorded (cost, latency, tokens, errors), and runs persist to an `eval_runs` table surfaced on an internal **/ai-quality** dashboard (Traces and Evals tabs), with an admin "Run evals" button to trigger the suite on demand. Because the judge is the

Put the two homes together and you get a loop that compounds. Production surfaces a new failure mode → you do error analysis on it → it becomes a new golden case → your gate now defends against it → quality climbs → cleaner output produces cleaner traffic. Each turn makes the next regression harder to ship. That continuous-improvement flywheel — not any single dashboard — is the real product of an eval system.

That's the whole discipline, start to finish:

None of it requires Python or a heavyweight platform. On .NET it's an `ILlmService` seam, a golden dataset in JSON, a custom `IEvaluator` on Microsoft.Extensions.AI.Evaluation, and an opt-in test category, all built on a real product, in production. Done right, evals turn *"I think this AI feature is fine"* into *"I can prove it, and I'll know the moment it stops being true."* That's the difference between shipping AI and gambling with it.

*TextStack is a reader that helps you finish the dense technical book you keep quitting — it builds every modern AI primitive (observability, evals, RAG, agents) as a real production feature on .NET. Try it at textstack.app, or read the code at github.com/mrviduus/textstack.*
