{"slug": "llm-evaluation-in-production-building-the-eval-pipeline-that-runs-on-every", "title": "LLM Evaluation in Production: Building the Eval Pipeline That Runs on Every Deploy", "summary": "A developer built an evaluation pipeline for LLM-based RAG systems that runs on every deploy to detect drift and hallucinations. The pipeline uses RAGAS with LLM-as-judge to measure faithfulness and answer relevance on production traffic, while context precision and answer correctness run in CI. The system samples 5% of live traffic for async evaluation and alerts on drops in 7-day rolling scores.", "body_md": "Everyone ships the RAG system. Almost nobody ships the eval system that tells them when the RAG system starts lying.\n\nYou updated the embedding model. Tweaked the system prompt. Swapped the re-ranker. Metrics look fine. Three weeks later, support tickets arrive — the system is drawing inferences the source documents never made. No alarm fired. No test failed. The system drifted silently.\n\nThis is not a model quality problem. It is an evaluation infrastructure problem.\n\n**Faithfulness** — of the claims in the response, what fraction are directly supported by the retrieved context? Your primary hallucination guard. Does not require ground truth.\n\n**Answer Relevance** — how directly does the response address the user's question? Catches the \"technically correct but useless\" failure mode. Does not require ground truth.\n\n**Context Precision** — of the retrieved chunks, what fraction were actually relevant? Requires ground truth. Belongs in offline CI eval.\n\n**Answer Correctness** — how factually accurate vs the reference answer? Most expensive, requires curated ground truth. Pre-deploy regression suite only.\n\n**Operational rule:** Faithfulness and Answer Relevance run on every deploy and on sampled production traffic. 
Context Precision and Answer Correctness run in CI against the golden dataset.\n\nRAGAS uses an LLM to evaluate LLM output, which is the only practical way to evaluate semantic quality at scale.\n\nPitfalls to manage:\n\nCalibrate the judge against human labels using Cohen's Kappa on 50-100 examples. A kappa below 0.4 means your judge prompt needs revision.\n\nThe eval pipeline triggers on every PR touching RAG code, prompts, or model configuration.\n\nCost: ~$0.50-$2.00 per full eval run at Claude Sonnet pricing. On PRs, run only faithfulness + relevance (cheapest). The full suite runs nightly.\n\nCI catches regressions from code changes. Production sampling catches drift from corpus staleness, query distribution shift, and model behavior changes.\n\nSample 5% of live traffic for async evaluation. Never evaluate synchronously: judge calls add 2-5s per request. Track 7-day rolling faithfulness and answer relevance. Alert when either score drops more than 0.05 below its monthly baseline.\n\nLLM systems do not have stable, deterministic behavior. They drift through corpus changes, model updates, prompt evolution, and query distribution shift. Evaluation is not a checkpoint; it is continuous infrastructure.\n\nBuild the eval system before you need it. By the time you need it, it is already too late: you will be debugging a production quality regression with no historical baseline and no automated detection.\n\nThis is a summary of my deep dive into LLM evaluation infrastructure. 
The full article covers the complete eval stack with implementation examples:\n\n**👉 LLM Evaluation in Production — Full Article**", "url": "https://wpnews.pro/news/llm-evaluation-in-production-building-the-eval-pipeline-that-runs-on-every", "canonical_source": "https://dev.to/aloknecessary/llm-evaluation-in-production-building-the-eval-pipeline-that-runs-on-every-deploy-5eki", "published_at": "2026-06-17 13:22:36+00:00", "updated_at": "2026-06-17 13:51:57.610573+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "machine-learning", "ai-agents", "developer-tools"], "entities": ["RAGAS", "Claude Sonnet", "Cohen's Kappa"], "alternates": {"html": "https://wpnews.pro/news/llm-evaluation-in-production-building-the-eval-pipeline-that-runs-on-every", "markdown": "https://wpnews.pro/news/llm-evaluation-in-production-building-the-eval-pipeline-that-runs-on-every.md", "text": "https://wpnews.pro/news/llm-evaluation-in-production-building-the-eval-pipeline-that-runs-on-every.txt", "jsonld": "https://wpnews.pro/news/llm-evaluation-in-production-building-the-eval-pipeline-that-runs-on-every.jsonld"}}