LLM Evaluation in Production: Building the Eval Pipeline That Runs on Every Deploy

Source: dev.to · Published: 2026-06-17T13:22Z · Topic: large-language-models

A developer built an evaluation pipeline for LLM-based RAG systems that runs on every deploy to detect drift and hallucinations. The pipeline uses RAGAS with LLM-as-judge to measure faithfulness and answer relevance on production traffic, while context precision and answer correctness run in CI. The system samples 5% of live traffic for async evaluation and alerts on drops in 7-day rolling scores.

Everyone ships the RAG system. Almost nobody ships the eval system that tells them when the RAG system starts lying.

You updated the embedding model. Tweaked the system prompt. Swapped the re-ranker. Metrics look fine. Three weeks later, support tickets arrive — the system is drawing inferences the source documents never made. No alarm fired. No test failed. The system drifted silently.

This is not a model quality problem. It is an evaluation infrastructure problem.

Faithfulness: of the claims in the response, what fraction are directly supported by the retrieved context? This is your primary hallucination guard. It does not require ground truth.

Answer Relevance: how directly does the response address the user's question? It catches the "technically correct but useless" failure mode and does not require ground truth.

Context Precision: of the retrieved chunks, what fraction were actually relevant? It requires ground truth, so it belongs in offline CI eval.

Answer Correctness: how factually accurate is the response compared with the reference answer? It is the most expensive metric and requires curated ground truth, so it runs only in the pre-deploy regression suite.

Operational rule: Faithfulness and Answer Relevance run on every deploy and on sampled production traffic. Context Precision and Answer Correctness run in CI against the golden dataset.
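The faithfulness definition above reduces to a simple ratio once the response has been split into claims and each claim has a supported/unsupported verdict. The sketch below illustrates only that ratio. It is not the RAGAS API, and `is_supported` and `substring_judge` are hypothetical names: in a real pipeline the verdict would come from an LLM judge, not a substring check.

```python
from typing import Callable, Sequence


def faithfulness(
    claims: Sequence[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of claims that the judge marks as supported by the context."""
    if not claims:
        raise ValueError("no claims to score")
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def substring_judge(claim: str, context: str) -> bool:
    # Stand-in judge for illustration only (assumption): a production
    # pipeline would ask an LLM whether the context supports the claim.
    return claim.lower() in context.lower()


context = "The API rate limit is 100 requests per minute. Keys expire after 90 days."
claims = [
    "the api rate limit is 100 requests per minute",
    "keys expire after 90 days",
    "keys can be renewed automatically",  # not stated anywhere in the context
]
score = faithfulness(claims, context, substring_judge)
print(f"faithfulness = {score:.2f}")
```

Two of the three claims appear in the context, so the third claim is what pulls the score below 1.0. No reference answer is involved, which is why this metric can run on live traffic.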

RAGAS uses an LLM to evaluate LLM output — the only practical way to evaluate semantic quality at scale.

Pitfall to manage: the judge is itself an LLM, so its scores need validating. Calibrate it against human labels using Cohen's Kappa on 50-100 examples. A kappa below 0.4 means your judge prompt needs revision.
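For the calibration step, Cohen's Kappa compares observed agreement between the judge and the human labels with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). A minimal pure-Python sketch follows. The label lists are made-up illustration data (a real calibration set would hold the 50-100 examples mentioned above), and the names `human` and `judge` are assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels only: 1 = "faithful", 0 = "not faithful".
human = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
judge = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]

kappa = cohens_kappa(human, judge)
needs_revision = kappa < 0.4  # threshold taken from the text above
print(f"kappa = {kappa:.2f}, revise judge prompt: {needs_revision}")
```

Here the raters agree on 8 of 10 items and chance agreement is 0.5, so kappa is about 0.6 and the judge prompt would pass the 0.4 threshold.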

The eval pipeline triggers on every PR that touches RAG code, prompts, or model configuration.
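A minimal sketch of that trigger condition, assuming a hypothetical repository layout: the glob patterns in `EVAL_TRIGGER_PATTERNS` and the function name `should_run_eval` are illustrative and not taken from the original pipeline. Most CI systems can express the same filter natively, and this only shows the logic.

```python
from fnmatch import fnmatchcase
from typing import Iterable, Sequence

# Hypothetical paths (assumption): adjust to wherever RAG code, prompts,
# and model configuration live in your repository.
EVAL_TRIGGER_PATTERNS = ("rag/*", "prompts/*", "config/models/*")


def should_run_eval(
    changed_paths: Iterable[str],
    patterns: Sequence[str] = EVAL_TRIGGER_PATTERNS,
) -> bool:
    """Return True if any changed file matches an eval-relevant pattern."""
    # fnmatchcase's "*" also matches "/" so "rag/*" covers nested files.
    return any(
        fnmatchcase(path, pattern)
        for path in changed_paths
        for pattern in patterns
    )


print(should_run_eval(["rag/retriever/rerank.py", "README.md"]))
print(should_run_eval(["docs/intro.md"]))
```

The first call matches `rag/*` and would run the eval job, while the second touches only documentation and would skip it.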

Cost: ~$0.50-$2.00 per full eval run at Claude Sonnet pricing. On PRs, run only faithfulness + relevance (cheapest). Full suite runs nightly.

CI catches regressions from code changes. Production sampling catches drift from corpus staleness, query distribution shift, and model behavior changes.

Sample 5% of live traffic for async evaluation. Never evaluate synchronously: judge calls add 2-5s per request. Track 7-day rolling faithfulness and answer relevance, and alert when either drops more than 0.05 below the monthly baseline.
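The sampling and alerting rules can be sketched in a few lines. This is an illustration under stated assumptions, not the original implementation: hashing the request ID is one possible way to pick a stable 5% sample, the function names are hypothetical, and the daily scores are made-up numbers. Computing the monthly baseline and running the judge asynchronously are left out.

```python
import hashlib
from statistics import mean
from typing import Sequence


def in_sample(request_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


def rolling_mean(daily_scores: Sequence[float], window: int = 7) -> float:
    """Mean of the most recent `window` daily scores."""
    if len(daily_scores) < window:
        raise ValueError(f"need at least {window} daily scores")
    return mean(daily_scores[-window:])


def should_alert(
    daily_scores: Sequence[float],
    baseline: float,
    max_drop: float = 0.05,
    window: int = 7,
) -> bool:
    """True when the rolling mean sits more than `max_drop` below the baseline."""
    return baseline - rolling_mean(daily_scores, window) > max_drop


# Illustrative daily faithfulness scores (made-up data).
baseline = 0.90
degraded = [0.86, 0.85, 0.84, 0.83, 0.84, 0.82, 0.84]  # rolling mean about 0.84
healthy = [0.89] * 7

print(should_alert(degraded, baseline))
print(should_alert(healthy, baseline))
```

Hashing rather than calling a random number generator means the same request is always in or out of the sample, which makes evaluation results reproducible when a trace is replayed.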

LLM systems do not have stable, deterministic behavior. They drift through corpus changes, model updates, prompt evolution, and query distribution shift. Evaluation is not a checkpoint — it is continuous infrastructure.

Build the eval system before you need it. By the time you need it, it is already too late — you will be debugging a production quality regression with no historical baseline and no automated detection.

This is a summary of my deep dive into LLM evaluation infrastructure. The full article covers the complete eval stack with implementation examples.

Read original on dev.to → dev.to/aloknecessary/llm-evaluation-in-productio…
