AI Evals, Part 5: From a Number to a Gate: Evals in CI and Production

Source: dev.to · Published: 2026-06-17T17:43Z

TextStack, a .NET-based AI product, has implemented a two-part evaluation system: an offline gate that runs golden dataset evals as opt-in dotnet tests to catch regressions before shipping, and production monitoring that tags every AI call with feature metadata and records cost, latency, and errors. The system uses a custom IEvaluator on Microsoft.Extensions.AI.Evaluation to produce numeric metrics and a Pass/Fail interpretation against a quality floor, with plans to add automatic baseline-versus-regression gating. This approach turns AI quality into an automated engineering process, similar to unit testing, and creates a continuous-improvement flywheel where production failures become new golden test cases.

Part 5, the finale, of a series on building production AI on .NET. We've built the pieces — what evals are, error analysis, golden datasets, and a trustworthy judge. Now we make them earn their keep.

By now you can produce a defensible quality score for an AI feature. But a score you only look at is a vanity metric. The entire point of all that work is to make quality something your engineering process acts on automatically — the same way a failing unit test stops a bad commit. That means two homes for your evals: a gate before you ship, and monitoring after.

Because TextStack's judge is a custom IEvaluator on Microsoft.Extensions.AI.Evaluation, an eval is just a dotnet test. The MEAI evaluator emits the rubric's axes plus an overall as numeric metrics, and a quality floor is expressed as a Pass/Fail interpretation on the overall:

// In the evaluator: the overall metric is interpreted Pass/Fail against a floor.
if (overallFloor is { } floor)
    overall.Interpretation = new EvaluationMetricInterpretation(
        RatingFor(score.Mean),
        failed: score.Mean < floor,
        reason: $"floor {floor:0.0} (mean {score.Mean:0.00})");

That catches gross breakage — "something is badly wrong." But the more valuable gate is relative: store a baseline score per feature, and fail the build when a change drops quality by more than a threshold versus that baseline. That turns "did this prompt change help?" into a red/green answer and makes improving a prompt a tight loop — change, run, compare, keep or revert. It's the AI equivalent of TDD.
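
A relative gate of this kind can be sketched in a few lines of plain C#. This is a minimal illustration under stated assumptions, not TextStack's implementation: the names RegressionGate, GateResult, and maxDrop are hypothetical, a higher mean is assumed to be better, and how the baseline is stored and loaded is left out entirely.

```csharp
using System;

// Hypothetical types for illustration; none of these names come from the
// TextStack codebase or from Microsoft.Extensions.AI.Evaluation.
public sealed record GateResult(bool Failed, double Drop, string Reason);

public static class RegressionGate
{
    // Fails when the current mean sits more than maxDrop below the baseline.
    // An improvement yields a negative drop and therefore always passes.
    public static GateResult Check(double baselineMean, double currentMean, double maxDrop)
    {
        if (maxDrop < 0)
            throw new ArgumentOutOfRangeException(nameof(maxDrop), "Threshold must be non-negative.");

        double drop = baselineMean - currentMean; // positive means quality fell
        bool failed = drop > maxDrop;
        string reason = FormattableString.Invariant(
            $"baseline {baselineMean:0.00}, current {currentMean:0.00}, drop {drop:0.00} (allowed {maxDrop:0.00})");
        return new GateResult(failed, drop, reason);
    }

    // The "ratchet": after a passing run, the baseline only ever moves up.
    public static double NextBaseline(double baselineMean, double currentMean) =>
        Math.Max(baselineMean, currentMean);
}
```

In a test, the result could feed the same Pass/Fail interpretation as the floor, for example by asserting that Check(baseline, score.Mean, 0.25).Failed is false and printing Reason when it is not. The 0.25 threshold is an arbitrary example value; a real one would need to sit above the judge's run-to-run noise.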

Honest status from our codebase: the floor and on-demand runs exist today; the automatic baseline-versus-regression gate is the next step. I'm flagging that deliberately, because plenty of "we do eval-driven development" claims are really "we have a number nobody gates on." The hard 80% — the measuring instrument — is built; wiring the ratchet is the lighter remaining 20%.

Every eval case is a real generation plus a real judge call. Running the full suite on every commit is slow and expensive, so evals have to be deliberate. TextStack's are opt-in: tagged so default CI skips them, and they self-skip when the provider isn't configured.

OPENAI_API_KEY=… dotnet test tests/TextStack.AiEvals --filter Category=Eval

Default CI stays green and free; the expensive truth runs on purpose. The pragmatic pattern: a small, cheap subset on pull requests for a fast signal, and the full suite nightly or pre-release. Treat eval spend like any cloud cost — budget it, don't let it run unbounded.
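
One way to make "a cheap subset on pull requests, the full suite nightly, within a budget" concrete is a small planning step that runs before any model call. The sketch below is an assumption-laden illustration rather than anything from TextStack: EvalCase, its Smoke flag, the per-case EstimatedCostUsd, and EvalPlanner are all hypothetical names, and the cost estimates would have to come from your own measurements.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of a golden case for planning purposes only.
public sealed record EvalCase(string Id, bool Smoke, decimal EstimatedCostUsd);

public static class EvalPlanner
{
    // Pull-request runs take only cases flagged as smoke; full runs take all.
    // Cases are visited in a stable order (by Id) and selection stops at the
    // first case that would push the estimated spend over the budget.
    public static IReadOnlyList<EvalCase> Plan(
        IEnumerable<EvalCase> cases, bool fullSuite, decimal budgetUsd)
    {
        var selected = new List<EvalCase>();
        decimal spent = 0m;

        var candidates = cases
            .Where(c => fullSuite || c.Smoke)
            .OrderBy(c => c.Id, StringComparer.Ordinal);

        foreach (var c in candidates)
        {
            if (spent + c.EstimatedCostUsd > budgetUsd)
                break; // budget reached: skip the remaining cases
            selected.Add(c);
            spent += c.EstimatedCostUsd;
        }

        return selected;
    }
}
```

Ordering by Id keeps the pull-request subset identical from run to run, so a score change reflects the code change rather than a different sample of cases.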

A curated golden set, however good, is a snapshot of inputs you imagined. Production sends inputs you didn't. So the offline gate is only half the system; the other half runs against live traffic.

This is where evals and observability become one thing. Every AI call in TextStack is tagged with its feature and recorded (cost, latency, tokens, errors), and runs persist to an eval_runs table surfaced on an internal /ai-quality dashboard (Traces and Evals tabs), with an admin "Run evals" button to trigger the suite on demand. Because the judge is the

Put the two homes together and you get a loop that compounds. Production surfaces a new failure mode → you do error analysis on it → it becomes a new golden case → your gate now defends against it → quality climbs → cleaner output produces cleaner traffic. Each turn makes the next regression harder to ship. That continuous-improvement flywheel — not any single dashboard — is the real product of an eval system.

That's the whole discipline, start to finish:

None of it requires Python or a heavyweight platform. On .NET it's an ILlmService seam, a golden dataset in JSON, a custom IEvaluator on Microsoft.Extensions.AI.Evaluation, and an opt-in test category, all built on a real product, in production. Done right, evals turn "I think this AI feature is fine" into "I can prove it, and I'll know the moment it stops being true." That's the difference between shipping AI and gambling with it.

TextStack is a reader that helps you finish the dense technical book you keep quitting — it builds every modern AI primitive (observability, evals, RAG, agents) as a real production feature on .NET. Try it at textstack.app, or read the code at github.com/mrviduus/textstack.

Read original on dev.to → dev.to/mrviduus/ai-evals-part-5-from-a-number-to…