Source: dev.to · Topic: large-language-models

LLM Evaluation in Production: Building the Eval Pipeline That Runs on Every Deploy

A developer built an evaluation pipeline for LLM-based RAG systems that runs on every deploy to detect drift and hallucinations. The pipeline uses RAGAS with LLM-as-judge to measure faithfulness and answer relevance on production traffic, while context precision and answer correctness run in CI. The system samples 5% of live traffic for async evaluation and alerts on drops in 7-day rolling scores.

2 min read · Published Jun 17, 2026

Everyone ships the RAG system. Almost nobody ships the eval system that tells them when the RAG system starts lying.

You updated the embedding model. Tweaked the system prompt. Swapped the re-ranker. Metrics look fine. Three weeks later, support tickets arrive — the system is drawing inferences the source documents never made. No alarm fired. No test failed. The system drifted silently.

This is not a model quality problem. It is an evaluation infrastructure problem.

Faithfulness — of the claims in the response, what fraction are directly supported by the retrieved context? Your primary hallucination guard. Does not require ground truth.

Answer Relevance — how directly does the response address the user's question? Catches the "technically correct but useless" failure mode. Does not require ground truth.

Context Precision — of the retrieved chunks, what fraction were actually relevant? Requires ground truth. Belongs in offline CI eval.

Answer Correctness — how factually accurate vs the reference answer? Most expensive, requires curated ground truth. Pre-deploy regression suite only.
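As a rough illustration of the faithfulness definition above, the score is the number of supported claims divided by the total number of claims. The sketch below assumes an LLM judge has already split the response into claims and labelled each one; the `ClaimVerdict` type and the handling of empty responses are hypothetical choices, not part of RAGAS.

```python
from dataclasses import dataclass


@dataclass
class ClaimVerdict:
    """One claim extracted from a response, with the judge's verdict (hypothetical shape)."""
    claim: str
    supported: bool  # True if the retrieved context directly supports the claim


def faithfulness_score(verdicts: list[ClaimVerdict]) -> float:
    """Return the fraction of claims that the retrieved context supports."""
    if not verdicts:
        # Assumption: a response with no checkable claims counts as fully faithful.
        return 1.0
    supported = sum(1 for v in verdicts if v.supported)
    return supported / len(verdicts)


verdicts = [
    ClaimVerdict("The refund window is 30 days.", True),
    ClaimVerdict("Refunds are issued to the original card.", True),
    ClaimVerdict("Refunds are processed within one hour.", False),
    ClaimVerdict("A receipt is required.", True),
]
print(faithfulness_score(verdicts))  # 3 of 4 claims supported
```

No reference answer appears anywhere in this calculation, which is why the metric can run on live traffic.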

Operational rule: Faithfulness and Answer Relevance run on every deploy and on sampled production traffic. Context Precision and Answer Correctness run in CI against the golden dataset.

RAGAS uses an LLM to evaluate LLM output — the only practical way to evaluate semantic quality at scale.

Pitfalls to manage:

Calibrate against human labels using Cohen's Kappa on 50-100 examples. Below 0.4 means your judge prompt needs revision.
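The calibration check can be sketched in a few lines. This is a minimal, dependency-free implementation of Cohen's Kappa for two raters (human and judge) over the same examples; the label values and the `needs_revision` helper are illustrative assumptions, and the 0.4 threshold is the one stated above.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Cohen's Kappa: agreement between two raters, corrected for chance agreement."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Observed agreement: share of examples where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement if the raters labelled independently at their own base rates.
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(human_counts[k] * judge_counts[k] for k in human_counts) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined (0/0).
        # Assumption: report this as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


def needs_revision(kappa: float, threshold: float = 0.4) -> bool:
    """Flag the judge prompt for revision when agreement is below the threshold."""
    return kappa < threshold


human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(human_labels, judge_labels)
print(kappa, needs_revision(kappa))
```

Four examples are used here only to keep the sketch short; the 50-100 labelled examples recommended above give a far more stable estimate.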

The eval pipeline triggers on every PR that touches RAG code, prompts, or model configuration.

Cost: ~$0.50-$2.00 per full eval run at Claude Sonnet pricing. On PRs, run only faithfulness + relevance (cheapest). Full suite runs nightly.
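One way to express the trigger and the PR-versus-nightly split is a small gate script that CI calls with the list of changed files. Everything named here is an assumption for illustration: the path patterns, the trigger names, and the metric identifiers would need to match the actual repository and eval harness.

```python
from fnmatch import fnmatch

# Hypothetical path patterns; adjust to the repository's real layout.
WATCHED_PATTERNS = ["rag/*", "prompts/*", "config/models*.yaml"]

# Metrics that need no ground truth are the cheapest to run.
CHEAP_METRICS = ["faithfulness", "answer_relevance"]
FULL_SUITE = CHEAP_METRICS + ["context_precision", "answer_correctness"]


def should_run_eval(changed_files: list[str]) -> bool:
    """True if any changed file touches RAG code, prompts, or model configuration."""
    return any(
        fnmatch(path, pattern)
        for path in changed_files
        for pattern in WATCHED_PATTERNS
    )


def metrics_for(trigger: str) -> list[str]:
    """Pick the metric set for a run: cheap metrics on PRs, full suite nightly."""
    if trigger == "pull_request":
        return list(CHEAP_METRICS)
    if trigger == "nightly":
        return list(FULL_SUITE)
    raise ValueError(f"unknown trigger: {trigger}")


print(should_run_eval(["rag/retriever/rerank.py", "README.md"]))
print(metrics_for("pull_request"))
```

Keeping the gate in plain code rather than in CI-specific path filters makes it easy to unit-test and to reuse across CI providers.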

CI catches regressions from code changes. Production sampling catches drift from corpus staleness, query distribution shift, and model behavior changes.

Sample 5% of live traffic for async evaluation. Never evaluate synchronously: judge calls add 2-5 s per request. Track 7-day rolling averages of faithfulness and answer relevance, and alert when either drops more than 0.05 below its monthly baseline.
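The sampling and alerting rules can be sketched as two small functions. This assumes each request carries a stable ID and that daily mean scores are already stored somewhere; the function names and the hash-based sampling scheme are illustrative choices, and the hand-off to a background worker is deliberately left out.

```python
import hashlib
from statistics import mean

SAMPLE_RATE = 0.05  # evaluate roughly 5% of live traffic
ALERT_DROP = 0.05   # alert when the rolling score falls this far below baseline


def is_sampled(request_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministically select about `rate` of requests by hashing the request ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def should_alert(
    last_7_days: list[float],
    monthly_baseline: float,
    max_drop: float = ALERT_DROP,
) -> bool:
    """True if the 7-day rolling mean sits more than `max_drop` below the baseline."""
    if not last_7_days:
        return False  # no data yet, so nothing to compare
    return monthly_baseline - mean(last_7_days) > max_drop


# Example: faithfulness averaged 0.80 this week against a 0.90 monthly baseline.
print(should_alert([0.80] * 7, monthly_baseline=0.90))
```

Hashing the request ID instead of calling a random number generator means the same request is always in or out of the sample, which makes evaluated traces reproducible when debugging an alert.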

LLM systems do not have stable, deterministic behavior. They drift through corpus changes, model updates, prompt evolution, and query distribution shift. Evaluation is not a checkpoint — it is continuous infrastructure.

Build the eval system before you need it. By the time you need it, it is already too late — you will be debugging a production quality regression with no historical baseline and no automated detection.

This is a summary of my deep dive into LLM evaluation infrastructure. The full article covers the complete eval stack with implementation examples:

👉 LLM Evaluation in Production — Full Article

