LLM benchmarks are answering someone else's question

LLM benchmarks like MMLU and HumanEval are irrelevant for most businesses building AI products, as they measure generic performance rather than specific system tasks. Teams should instead build custom evaluations (evals) tied to their own workloads, failure modes, and golden test sets to catch regressions and silent failures that benchmarks miss.

Fight Evils with Evals

Benchmarks measure benchmarks. Your system needs its own measures.

Every new model arrives wearing a tuxedo of benchmarks. MMLU: 92.4%. HumanEval: 87.2%. LLeMU: 88.7%. MATH: 73.6%. AGI: 127%.

Yet, for 99% of businesses building process & product with AI, none of it matters.

What matters? How are YOUR workloads doing? Getting better or worse? The only sane way to know that is to write Evals: tests for LLMs that reflect the specific tasks, data, and failure modes of your system.

The benchmarks are not lying. They are answering someone else’s question.

What “Vibes-Based Evaluation” Actually Costs

The standard approach: ship a model change, watch the complaint channels, roll back if the room gets loud. That misses almost everything interesting:

- You only catch loud failures. Users who get a confidently wrong answer and don’t realize it? Silent. Users who get a worse answer and abandon the feature? Silent. Support tickets and error rates capture only a fraction of quality regression.
- You can’t distinguish regressions from improvements. If the new model is better at task A and worse at task B, complaints about B look identical to generic “the AI got worse” feedback. You don’t know what to fix.
- You’re using your users as test infrastructure. They didn’t sign up for that.

The Eval Spectrum and Where Most Teams Get It Wrong

Evaluation approaches sit on a spectrum from “fast but flimsy” to “expensive but valid.”

LLM-as-judge is the current darling: ask a powerful model to grade another model’s outputs. Fast, scalable, cheap. The problem: it bakes in the grader model’s biases, can be gamed, and creates a circular dependency. If you use GPT-5 to grade GPT-5’s outputs, you’re measuring something like “how much does GPT-5 agree with GPT-5.” That’s not nothing, but it’s not what you think.

Human eval is the gold standard everyone tries to skip. Getting humans to evaluate outputs is expensive, slow, inconsistent across evaluators, and annoying to schedule.
But it is the only thing that validates whether your system is useful to real humans.

Task-specific automated checks are where most teams should spend more time. They are not glamorous, but they are fast, deterministic, and tied to what matters in your system.

What Actually Works

1. Define Failure Before You Ship

Before changing a model or prompt, write down what bad looks like. Specifically. Not “the output should be accurate.” That’s not a test. More like:

- Structured JSON output must parse without errors
- All citations in the response must appear verbatim in the retrieved context
- Responses must not mention competitor product names
- SQL queries must be syntactically valid and reference only tables that exist in the schema
- Sentiment classification must not flip from positive to negative more than 3% of the time on the existing test set

You can check these programmatically. No judge model required.

[Listing: Eval harness: deterministic checks]

2. Build a Golden Set From Your Worst Days

Your best evaluation data is the embarrassing stuff: the outputs that made someone file a ticket, screenshot a hallucination, or quietly stop using the feature.

Every time a user reports a bad output, flags a hallucination, or you notice a failure manually, add it to your golden set: the input, the context, and the correct behavior. Keep 50-100 cases and run them on every model change.

This feels manual at first. After six months, you have a test suite no public benchmark can game, because every case came from your own failure history.

[Listing: Golden case shape]

3. Regression Testing, Not Just Acceptance Testing

Most teams run evals only when considering a model change. That’s acceptance testing: “is this new thing good enough?” You also need regression testing: “did this break something that used to work?”

Run your golden set on every prompt change, not just model changes.
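The deterministic checks from step 1 and the golden set from step 2 can be sketched roughly as follows. This is an illustration under stated assumptions, not the original harness: `GoldenCase`, `run_golden_set`, and `BANNED_TERMS` are hypothetical names, the `generate` callable stands in for whatever calls your model, and the citation check assumes your output is JSON with a `citations` list.

```python
import json
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical competitor names; replace with the ones that matter to you.
BANNED_TERMS = ["AcmeRival", "OtherCorp"]

# Every check takes (model_output, retrieved_context) and returns pass/fail.
Check = Callable[[str, str], bool]


def valid_json(output: str, context: str) -> bool:
    """Structured output must parse without errors."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def citations_verbatim(output: str, context: str) -> bool:
    """Every citation must appear verbatim in the retrieved context.

    Assumes (hypothetically) the output is a JSON object with a
    "citations" list of quoted passages.
    """
    try:
        citations = json.loads(output).get("citations", [])
    except (json.JSONDecodeError, AttributeError):
        return False
    return all(isinstance(c, str) and c in context for c in citations)


def no_banned_terms(output: str, context: str) -> bool:
    """Output must not mention a banned term (case-insensitive, whole word)."""
    return not any(
        re.search(rf"\b{re.escape(term)}\b", output, re.IGNORECASE)
        for term in BANNED_TERMS
    )


@dataclass
class GoldenCase:
    case_id: str         # e.g. the ticket that reported the failure
    input_text: str      # what the user asked
    context: str         # what retrieval returned at the time
    checks: list[Check]  # the "correct behavior", expressed as checks


def run_golden_set(
    cases: list[GoldenCase], generate: Callable[[str, str], str]
) -> dict[str, list[str]]:
    """Return, per case, the names of the checks that failed (empty = pass)."""
    results: dict[str, list[str]] = {}
    for case in cases:
        output = generate(case.input_text, case.context)
        results[case.case_id] = [
            check.__name__ for check in case.checks
            if not check(output, case.context)
        ]
    return results
```

Because every check is a plain function, the same set can run in CI on each prompt or model change without a judge model in the loop.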
A prompt that was working fine can silently degrade when you add a new tool, change a RAG retrieval strategy, or update your context template. You won’t know without a baseline. Tools like Langfuse (https://langfuse.com/) attach eval scores to production traces so regression shows up in dashboards, not just in incident reports.

[Listing: Eval harness: baseline vs candidate comparison]

If a candidate regresses on known failures, the upgrade conversation gets wonderfully specific: which cases improved, which cases broke, and whether the trade is worth it.

4. Use LLM-as-Judge for Exactly One Thing

LLM-as-judge is useful for open-ended outputs where there is no deterministic right answer: “is this response helpful?”, “does this summary preserve the key points?”, “is this explanation right for a beginner?” Use it there. Don’t use it for deterministic answers.

When you do use it, make the grading rubric explicit:

[Listing: Eval harness: rubric-based judge]

An explicit rubric reduces evaluator variance, gives you interpretable output, and makes it easier to audit when the judge is wrong. Libraries like Autoevals (https://github.com/braintrustdata/autoevals) and Braintrust (https://www.braintrust.dev/) ship prebuilt rubrics for common tasks, worth stealing before writing your own from scratch.

Tools Worth Knowing

You don’t have to build all of this from scratch. Several tools have made serious progress on the eval infrastructure problem:

Braintrust: Full eval platform with experiment tracking, dataset management, and scoring functions. Organizes eval runs by prompt, model, and deployment so you can diff quality over time, not just across releases. Pairs well with their open-source library, Autoevals (https://github.com/braintrustdata/autoevals), which ships prebuilt model-graded scoring functions for common tasks (factual accuracy, helpfulness, toxicity, semantic similarity).

Langfuse: Open-source LLM observability that sits between your app and your models.
Traces every call, attaches eval scores (human or automated) to individual spans, and surfaces quality trends over production traffic. Good choice if you want observability and evals in the same tool rather than a separate eval harness.

Evalite: TypeScript-native eval framework by Matt Pocock. Low ceremony: define a task, define a scorer, run it in your existing test setup. Targets teams who want evals that feel like unit tests rather than a separate ML experiment platform.

promptfoo: CLI-first eval runner focused on prompt comparison and red-teaming. Easy to configure via YAML, integrates with most model providers, and has built-in support for detecting prompt injection and other adversarial inputs.

deepeval: Python eval framework with a large library of built-in metrics (G-Eval, RAG faithfulness, answer relevancy, hallucination detection). Useful for RAG pipelines where you want specific grading for retrieval quality, not just generation quality.

The right tool depends on your stack and where you’re starting from. What matters more than the choice of framework is the discipline of running evals at all: consistently, on every significant change.

The Uncomfortable Part

Most teams skip this because it asks an irritating question early: what would “good” look like here? That is genuinely hard for a new AI feature. It is also non-optional if you care about reliability.

Teams that ship trustworthy AI are doing the same thing they’d do for any critical code path: define expected behavior, test it, and run those tests continuously.

The benchmarks are not lying. They are answering someone else’s question. Stop reading them as product roadmaps and start writing tests that match your system.

Your users will notice before your dashboards do. Build the test suite first.
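
As a starting point for the regression testing described in step 3, here is a minimal sketch of a baseline vs. candidate comparison. It assumes each run of the golden set has already been reduced to a per-case pass/fail map; `Comparison` and `compare_runs` are hypothetical names for illustration, not part of any tool mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Comparison:
    fixed: list[str]          # failed on baseline, pass on candidate
    broken: list[str]         # passed on baseline, fail on candidate
    still_passing: list[str]
    still_failing: list[str]


def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> Comparison:
    """Diff per-case pass/fail results from two runs of the same golden set."""
    mismatched = baseline.keys() ^ candidate.keys()
    if mismatched:
        # Comparing different case sets would hide regressions, so refuse.
        raise ValueError(f"runs cover different cases: {sorted(mismatched)}")

    result = Comparison([], [], [], [])
    for case_id in sorted(baseline):
        before, after = baseline[case_id], candidate[case_id]
        if after and not before:
            result.fixed.append(case_id)
        elif before and not after:
            result.broken.append(case_id)
        elif after:
            result.still_passing.append(case_id)
        else:
            result.still_failing.append(case_id)
    return result


# Example with made-up case IDs: one case improves, one regresses.
baseline_run = {"ticket-101": False, "ticket-102": True, "ticket-103": True}
candidate_run = {"ticket-101": True, "ticket-102": False, "ticket-103": True}
report = compare_runs(baseline_run, candidate_run)
print("fixed:", report.fixed)
print("broken:", report.broken)
```

A non-empty `broken` list is the signal to block or discuss the change; `fixed` tells you what you would be trading it for.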