{"slug": "ai-evals-part-5-from-a-number-to-a-gate-evals-in-ci-and-production", "title": "AI Evals, Part 5: From a Number to a Gate Evals in CI and Production", "summary": "TextStack, a .NET-based AI product, has implemented a two-part evaluation system: an offline gate that runs golden dataset evals as opt-in dotnet tests to catch regressions before shipping, and production monitoring that tags every AI call with feature metadata and records cost, latency, and errors. The system uses a custom IEvaluator on Microsoft.Extensions.AI.Evaluation to produce numeric metrics and a Pass/Fail interpretation against a quality floor, with plans to add automatic baseline-versus-regression gating. This approach turns AI quality into an automated engineering process, similar to unit testing, and creates a continuous-improvement flywheel where production failures become new golden test cases.", "body_md": "*Part 5, the finale, of a series on building production AI on .NET. We've built the pieces: what evals are, error analysis, golden datasets, and a trustworthy judge. Now we make them earn their keep.*\n\nBy now you can produce a defensible quality score for an AI feature. But a score you only *look at* is a vanity metric. The entire point of all that work is to make quality something your engineering process **acts on automatically**, the same way a failing unit test stops a bad commit. That means two homes for your evals: a **gate** before you ship, and **monitoring** after.\n\nBecause TextStack's judge is a custom `IEvaluator` on Microsoft.Extensions.AI.Evaluation, an eval is just a `dotnet test`. 
The MEAI evaluator emits the rubric's axes plus an overall as numeric metrics, and a quality *floor* is expressed as a Pass/Fail interpretation on the overall:\n\n```\n// In the evaluator: the overall metric is interpreted Pass/Fail against a floor.\nif (overallFloor is { } floor)\n    overall.Interpretation = new EvaluationMetricInterpretation(\n        RatingFor(score.Mean),\n        failed: score.Mean < floor,\n        reason: $\"floor {floor:0.0} (mean {score.Mean:0.00})\");\n```\n\nThat catches *gross* breakage — \"something is badly wrong.\" But the more valuable gate is **relative**: store a baseline score per feature, and fail the build when a change drops quality by more than a threshold versus that baseline. That turns \"did this prompt change help?\" into a red/green answer and makes improving a prompt a tight loop — change, run, compare, keep or revert. It's the AI equivalent of TDD.\n\nHonest status from our codebase: the floor and on-demand runs exist today; the automatic *baseline-versus-regression* gate is the next step. I'm flagging that deliberately, because plenty of \"we do eval-driven development\" claims are really \"we have a number nobody gates on.\" The hard 80% — the measuring instrument — is built; wiring the ratchet is the lighter remaining 20%.\n\nEvery eval case is a real generation **plus** a real judge call. Running the full suite on every commit is slow and expensive, so evals have to be deliberate. TextStack's are **opt-in**: tagged so default CI skips them, and they self-skip when the provider isn't configured.\n\n```\nOPENAI_API_KEY=… dotnet test tests/TextStack.AiEvals --filter Category=Eval\n```\n\nDefault CI stays green and free; the expensive truth runs on purpose. The pragmatic pattern: a small, cheap subset on pull requests for a fast signal, and the full suite nightly or pre-release. 
Treat eval spend like any cloud cost: budget it, don't let it run unbounded.\n\nA curated golden set, however good, is a snapshot of inputs you *imagined*. Production sends inputs you didn't. So the offline gate is only half the system; the other half runs against live traffic.\n\nThis is where evals and observability become one thing. Every AI call in TextStack is tagged with its feature and recorded (cost, latency, tokens, errors), and runs persist to an `eval_runs` table surfaced on an internal **/ai-quality** dashboard (Traces and Evals tabs), with an admin \"Run evals\" button to trigger the suite on demand. Because the judge is the\n\nPut the two homes together and you get a loop that compounds. Production surfaces a new failure mode → you do error analysis on it → it becomes a new golden case → your gate now defends against it → quality climbs → cleaner output produces cleaner traffic. Each turn makes the next regression harder to ship. That continuous-improvement flywheel, not any single dashboard, is the real product of an eval system.\n\nThat's the whole discipline, start to finish:\n\nNone of it requires Python or a heavyweight platform. On .NET it's an `ILlmService` seam, a golden dataset in JSON, a custom `IEvaluator` on Microsoft.Extensions.AI.Evaluation, and an opt-in test category, all built on a real product, in production. Done right, evals turn *\"I think this AI feature is fine\"* into *\"I can prove it, and I'll know the moment it stops being true.\"* That's the difference between shipping AI and gambling with it.\n\n*TextStack is a reader that helps you finish the dense technical book you keep quitting; it builds every modern AI primitive (observability, evals, RAG, agents) as a real production feature on .NET. 
Try it at textstack.app, or read the code at github.com/mrviduus/textstack.*", "url": "https://wpnews.pro/news/ai-evals-part-5-from-a-number-to-a-gate-evals-in-ci-and-production", "canonical_source": "https://dev.to/mrviduus/ai-evals-part-5-from-a-number-to-a-gate-evals-in-ci-and-production-1j33", "published_at": "2026-06-17 17:43:25+00:00", "updated_at": "2026-06-17 17:51:37.630764+00:00", "lang": "en", "topics": ["ai-products", "ai-tools", "developer-tools", "mlops", "large-language-models"], "entities": ["TextStack", "Microsoft.Extensions.AI.Evaluation", "OpenAI", ".NET", "IEvaluator", "ILlmService"], "alternates": {"html": "https://wpnews.pro/news/ai-evals-part-5-from-a-number-to-a-gate-evals-in-ci-and-production", "markdown": "https://wpnews.pro/news/ai-evals-part-5-from-a-number-to-a-gate-evals-in-ci-and-production.md", "text": "https://wpnews.pro/news/ai-evals-part-5-from-a-number-to-a-gate-evals-in-ci-and-production.txt", "jsonld": "https://wpnews.pro/news/ai-evals-part-5-from-a-number-to-a-gate-evals-in-ci-and-production.jsonld"}}