Part 5, the finale, of a series on building production AI on .NET. We've built the pieces: what evals are, error analysis, golden datasets, and a trustworthy judge. Now we make them earn their keep.
By now you can produce a defensible quality score for an AI feature. But a score you only look at is a vanity metric. The entire point of all that work is to make quality something your engineering process acts on automatically, the same way a failing unit test stops a bad commit. That means two homes for your evals: a gate before you ship, and monitoring after.
Because TextStack's judge is a custom `IEvaluator` on Microsoft.Extensions.AI.Evaluation, an eval is just a `dotnet test`. The MEAI evaluator emits the rubric's axes plus an overall as numeric metrics, and a quality floor is expressed as a Pass/Fail interpretation on the overall:
    // In the evaluator: the overall metric is interpreted Pass/Fail against a floor.
    // overallFloor is nullable; with no floor configured, no interpretation is attached.
    if (overallFloor is { } floor)
        overall.Interpretation = new EvaluationMetricInterpretation(
            RatingFor(score.Mean),
            failed: score.Mean < floor,
            reason: $"floor {floor:0.0} (mean {score.Mean:0.00})");
That catches gross breakage: "something is badly wrong." But the more valuable gate is relative: store a baseline score per feature, and fail the build when a change drops quality by more than a threshold versus that baseline. That turns "did this prompt change help?" into a red/green answer and makes improving a prompt a tight loop: change, run, compare, keep or revert. It's the AI equivalent of TDD.
Honest status from our codebase: the floor and on-demand runs exist today; the automatic baseline-versus-regression gate is the next step. I'm flagging that deliberately, because plenty of "we do eval-driven development" claims are really "we have a number nobody gates on." The hard 80% (the measuring instrument) is built; wiring the ratchet is the lighter remaining 20%.
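Since that ratchet isn't wired up yet, here is only a minimal sketch of what the relative gate could look like, not TextStack's code. The feature names, the baseline values, the `MaxDrop` threshold, and the `CheckRegression` and `Ratchet` helpers are all hypothetical; in practice the baselines would probably live in a file committed next to the golden dataset.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical baseline store: feature name -> last accepted mean score.
var baselines = new Dictionary<string, double>
{
    ["explain"] = 4.20,
    ["summarize"] = 3.90,
};

// Largest drop versus baseline tolerated as judge noise (an assumed value).
const double MaxDrop = 0.15;

// Returns null when the gate passes, otherwise a human-readable failure reason.
static string? CheckRegression(
    IReadOnlyDictionary<string, double> store, string feature, double currentMean, double maxDrop)
{
    // No baseline yet: nothing to compare against, so the relative gate passes.
    if (!store.TryGetValue(feature, out var baseline))
        return null;

    var drop = baseline - currentMean;
    return drop > maxDrop
        ? $"{feature}: mean {currentMean:0.00} is {drop:0.00} below baseline {baseline:0.00} (allowed {maxDrop:0.00})"
        : null;
}

// Ratchet: after a passing run, the stored baseline only ever moves up.
static void Ratchet(IDictionary<string, double> store, string feature, double currentMean)
{
    if (!store.TryGetValue(feature, out var baseline) || currentMean > baseline)
        store[feature] = currentMean;
}

// Two made-up runs: a small dip within tolerance, and a real regression.
foreach (var (feature, mean) in new[] { ("explain", 4.10), ("summarize", 3.60) })
{
    var failure = CheckRegression(baselines, feature, mean, MaxDrop);
    if (failure is null) Ratchet(baselines, feature, mean);
    Console.WriteLine(failure ?? $"{feature}: ok");
}
```

Inside an eval test, a non-null result from `CheckRegression` would become the assertion failure message, which is what turns the comparison into a red/green build signal. The tolerance matters: an LLM judge is noisy, so a threshold of zero would fail builds on score jitter alone.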
Every eval case is a real generation plus a real judge call. Running the full suite on every commit is slow and expensive, so evals have to be deliberate. TextStack's are opt-in: tagged so default CI skips them, and they self-skip when the provider isn't configured.
    OPENAI_API_KEY=… dotnet test tests/TextStack.AiEvals --filter Category=Eval
Default CI stays green and free; the expensive truth runs on purpose. The pragmatic pattern: a small, cheap subset on pull requests for a fast signal, and the full suite nightly or pre-release. Treat eval spend like any cloud cost: budget it, and don't let it run unbounded.
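To make the self-skip, the two tiers, and the budget concrete, here is a rough, framework-agnostic sketch. `OPENAI_API_KEY` comes from the command above; the `EVAL_TIER` variable, the tier names, and both helper functions are assumptions made for illustration, not TextStack's actual setup.

```csharp
using System;

// Decide how much of the eval suite to run: "skip", "smoke", or "full".
// EVAL_TIER is a hypothetical variable a nightly or pre-release job would set.
static string PlanEvalRun(Func<string, string?> getEnv)
{
    // Self-skip: without a provider key there is nothing to call.
    if (string.IsNullOrWhiteSpace(getEnv("OPENAI_API_KEY")))
        return "skip";

    // The full suite is opt-in; anything else gets the cheap subset.
    return string.Equals(getEnv("EVAL_TIER"), "full", StringComparison.OrdinalIgnoreCase)
        ? "full"
        : "smoke";
}

// Cap the number of cases so the estimated spend stays within a budget.
static int CasesWithinBudget(int totalCases, decimal estCostPerCase, decimal budget) =>
    estCostPerCase <= 0
        ? totalCases
        : (int)Math.Min(totalCases, Math.Floor(budget / estCostPerCase));

// Example with made-up numbers: 200 cases at an estimated $0.02 each, $1.00 budget.
Console.WriteLine(PlanEvalRun(Environment.GetEnvironmentVariable));
Console.WriteLine(CasesWithinBudget(200, 0.02m, 1.00m));
```

Passing the environment lookup in as a function keeps the decision testable without touching real environment variables. The per-case cost is an estimate you would have to derive from your own recorded token usage.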
A curated golden set, however good, is a snapshot of inputs you imagined. Production sends inputs you didn't. So the offline gate is only half the system; the other half runs against live traffic.
This is where evals and observability become one thing. Every AI call in TextStack is tagged with its feature and recorded (cost, latency, tokens, errors), and runs persist to an `eval_runs` table surfaced on an internal **/ai-quality** dashboard (Traces and Evals tabs), with an admin "Run evals" button to trigger the suite on demand. Because the judge is the
Put the two homes together and you get a loop that compounds. Production surfaces a new failure mode → you do error analysis on it → it becomes a new golden case → your gate now defends against it → quality climbs → cleaner output produces cleaner traffic. Each turn makes the next regression harder to ship. That continuous-improvement flywheel, not any single dashboard, is the real product of an eval system.
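The "failure becomes a golden case" step of that loop can be sketched in a few lines. The article only says the golden dataset is JSON, so the case shape here (`id`, `feature`, `input`, `note`), the sample inputs, and the `Promote` helper are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Hypothetical golden set, loaded here as plain dictionaries for brevity.
var golden = new List<Dictionary<string, string>>
{
    new() { ["id"] = "explain-001", ["feature"] = "explain", ["input"] = "What is a monad?" },
};

// Add a failing production input as a new case, unless that input is already covered.
// Returns true when a case was added.
static bool Promote(List<Dictionary<string, string>> cases, string feature, string input, string note)
{
    if (cases.Any(c => c["feature"] == feature && c["input"] == input))
        return false;

    cases.Add(new Dictionary<string, string>
    {
        ["id"] = $"{feature}-{cases.Count + 1:000}",
        ["feature"] = feature,
        ["input"] = input,
        ["note"] = note, // what error analysis found, kept for whoever reads the case later
    });
    return true;
}

// A made-up failure surfaced from production traffic.
Promote(golden, "explain", "Explain the CAP theorem", "answer ignored partition tolerance");

// Serialize back to JSON, ready to be committed alongside the rest of the dataset.
Console.WriteLine(JsonSerializer.Serialize(golden, new JsonSerializerOptions { WriteIndented = true }));
```

The duplicate check is what keeps the set curated rather than a dump of traffic: one case per distinct failure mode is usually enough for the gate to defend against it.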
That's the whole discipline, start to finish:
None of it requires Python or a heavyweight platform. On .NET it's an `ILlmService` seam, a golden dataset in JSON, a custom `IEvaluator` on Microsoft.Extensions.AI.Evaluation, and an opt-in test category, all built on a real product, in production. Done right, evals turn "I think this AI feature is fine" into "I can prove it, and I'll know the moment it stops being true." That's the difference between shipping AI and gambling with it.
TextStack is a reader that helps you finish the dense technical book you keep quitting. It builds every modern AI primitive (observability, evals, RAG, agents) as a real production feature on .NET. Try it at textstack.app, or read the code at github.com/mrviduus/textstack.